CSV Enhancement - Id generation strategy
Currently, the Id generation strategy in the CSV parser is -
- Get all the fields marked as an 'x-osdu-natural key', concatenate their values, and Base64-encode the result
- If the schema doesn't have any 'natural key' fields, let the storage service generate the Id
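As a rough illustration of the first step, the concatenate-and-encode logic might look like the sketch below. The function name `encode_natural_key`, the empty separator, and the standard Base64 alphabet are assumptions made for the example; the actual parser may join or encode the values differently.

```python
import base64


def encode_natural_key(row, natural_key_fields, separator=""):
    """Concatenate the natural-key values of one CSV row and Base64-encode them.

    `separator` is an assumption: the description above only says the
    values are concatenated, not how they are joined.
    """
    joined = separator.join(str(row[field]) for field in natural_key_fields)
    return base64.b64encode(joined.encode("utf-8")).decode("ascii")


# Hypothetical row and natural-key field names, for illustration only.
row = {"WellName": "A-1", "Field": "North", "Depth": "1200"}
encoded_id = encode_natural_key(row, ["WellName", "Field"])
```

Because the encoded value depends only on the natural-key values, the same row always yields the same encoded Id.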
However, some CSV files contain a column called 'id' that uniquely identifies each row in the file. In such cases, it would be beneficial for the Id generation strategy to incorporate the value in that column. This would make searching for a record much easier, as the end user would already know the id of the record. Another problem is that when the same file is ingested multiple times, each ingestion creates the records again (with a different id randomly generated by the storage service). An id derived from the 'id' column would stay the same across ingestions.
The proposed format for id generation could be as follows:
- Check if the schema has natural keys defined. If yes, store the record with id - tenant:type:location:{encodedId}
- Else, check if the file has an 'id' column. If yes, use it and store the record with id - tenant:type:location:{id}
- If neither of the above conditions is true, let the storage service handle the id generation.
Example of a schema with no OSDU natural keys - https://community.opengroup.org/osdu/platform/system/schema-service/-/blob/master/deployments/shared-schemas/osdu/master-data/Wellbore.1.0.0.json
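The proposed fallback order could be sketched roughly as follows. This is only an illustration: `build_record_id`, its parameters, and the convention of returning `None` to mean "let the storage service generate the id" are all assumptions for the example, not existing parser code, and the prefix is passed in as the literal placeholder tenant:type:location.

```python
import base64
from typing import Mapping, Optional, Sequence


def build_record_id(
    row: Mapping[str, str],
    natural_key_fields: Sequence[str],
    prefix: str,
) -> Optional[str]:
    """Return a record id for one CSV row, or None to defer to storage."""
    # 1. Schema defines natural keys: use the Base64-encoded concatenation.
    if natural_key_fields:
        joined = "".join(str(row[field]) for field in natural_key_fields)
        encoded = base64.b64encode(joined.encode("utf-8")).decode("ascii")
        return f"{prefix}:{encoded}"

    # 2. No natural keys, but the file has a non-empty 'id' column: use it as-is.
    row_id = row.get("id")
    if row_id not in (None, ""):
        return f"{prefix}:{row_id}"

    # 3. Neither applies: the caller lets the storage service generate the id.
    return None


# Hypothetical rows, for illustration only.
prefix = "tenant:type:location"
with_id = build_record_id({"id": "42", "name": "A-1"}, [], prefix)
without_id = build_record_id({"name": "A-1"}, [], prefix)
```

With this ordering, a file ingested twice produces the same id for a given row in the first two cases, and only the third case still depends on an id generated by the storage service.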