Data Ingestion issues
https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues
2022-03-29T11:23:38Z

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/62
Airflow log shows wrong "number of records"
2022-03-29T11:23:38Z
Debasis Chatterjee

Excerpt from Airflow log -
[2021-11-15 12:35:49,137] {pod_launcher.py:149} INFO - 2021-11-15 12:35:49.068 INFO 1 --- [ main] o.o.o.c.p.i.service.IBMIngestionService : Total records in File are = 4
The CSV has 4 rows, but one is the header row, so there are only 3 rows of actual data.
Suggest we change the message suitably.
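A minimal sketch of the distinction, assuming the first row of the file is always the header. The function name `count_rows` and the sample data are hypothetical; they are not taken from the CSV parser code or from the attached file.

```python
import csv
import io
from typing import Tuple


def count_rows(csv_text: str) -> Tuple[int, int]:
    """Return (total_rows, data_rows), treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    total = len(rows)
    # An empty file has no header and no data rows.
    data = max(total - 1, 0)
    return total, data


# Hypothetical sample: one header row followed by three data rows.
sample = "id,name,depth\n1,A,100\n2,B,200\n3,C,300\n"
total, data = count_rows(sample)
print(f"Total rows in file = {total} (1 header row, {data} data records)")
```

A log message that reports both numbers, or only the data row count, would avoid the ambiguity described above.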
Airflow log -
[CSV-Ingestion-custom-Airflow-log-for-IBM-DC.txt](/uploads/333be1c85de2d89bfc12cf7e5de26c3b/CSV-Ingestion-custom-Airflow-log-for-IBM-DC.txt)
Data file used for the run in IBM, R3M9 Preship environment.
[IBM_sample_CSV-DC.csv](/uploads/1b8e0c74d10bc3fc47526d467b751f92/IBM_sample_CSV-DC.csv)
M12 - Release 0.15

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/58
CSV: Ability to define relationships based on multiple keys
2021-10-27T09:54:20Z
Fernando Nahu Cantera Rubio

Ingestion workflow improvement
The CSV ingestor will allow us to define a key composed of multiple attributes to search for a related object.
M10 - Release 0.13
Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/44
As part of GSM feature, Implement status publishing method in AWS
2022-07-20T20:47:15Z
Mahesh Daksha

This is as per the GSM requirement to be implemented in each CSP. This issue has been created for the AWS team to implement the publish method to publish the status events in the message queue.
M12 - Release 0.15

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/43
As part of GSM feature, Implement status publishing method in IBM
2022-04-08T04:25:06Z
Mahesh Daksha

This is as per the GSM requirement to be implemented in each CSP. This issue has been created for the IBM team to implement the publish method to publish the status events in the message queue.
M9 - Release 0.12

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/35
CSV Enhancement - Id generation strategy
2021-07-08T02:08:41Z
Smitha Manjunath

Currently, the Id generation strategy in the CSV parser is -
<ul>
<li> Get all the fields marked as an 'x-osdu-natural key', concatenate them, and take the Base64 encoding of the result </li>
<li> If the schema doesn't have any 'natural key' fields, then let the storage service generate the Id </li>
</ul>
However, some CSV files contain a column called 'id' that uniquely identifies each row in the file.
In such cases, it would be beneficial for the id generation strategy to incorporate the value in that column.
This would make searching for a record much easier, as the end user would already know what the id of their record would be.
Another problem is that when we ingest the <b> same file multiple times </b>, records are created again with each ingestion (each with a different id, randomly generated by the storage service).
The proposed format for id generation could be as follows:
<ol>
<li> Check if the schema has natural keys defined. If yes, store the record with id - tenant:type:location:{encodedId} </li>
<li> Else, check if the file has an 'id' column. If yes, use it and store the record with id - tenant:type:location:{id} </li>
<li> If neither of the above conditions is true, let the storage service handle the id generation. </li>
</ol>
</ol>
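The proposed order of checks could be sketched roughly as follows. This is only an illustration, not the parser's implementation: the function name `build_record_id`, the `prefix` argument, and the sample row are hypothetical, and the sketch assumes the natural key values are concatenated without a separator and encoded with standard Base64, neither of which is specified above.

```python
import base64
from typing import Dict, List, Optional


def build_record_id(row: Dict[str, str],
                    natural_keys: List[str],
                    prefix: str) -> Optional[str]:
    """Return a record id, or None when the storage service should generate it."""
    # 1. Natural keys defined in the schema: concatenate and Base64-encode.
    #    (Assumption: no separator, standard Base64 alphabet.)
    if natural_keys:
        joined = "".join(str(row[key]) for key in natural_keys)
        encoded = base64.b64encode(joined.encode("utf-8")).decode("ascii")
        return f"{prefix}:{encoded}"
    # 2. No natural keys, but the file has a non-empty 'id' column: use it.
    row_id = row.get("id")
    if row_id:
        return f"{prefix}:{row_id}"
    # 3. Neither condition holds: leave id generation to the storage service.
    return None


# Hypothetical row and prefix, for illustration only.
prefix = "tenant:type:location"
row = {"id": "42", "name": "W1", "field": "F"}
print(build_record_id(row, ["name", "field"], prefix))
print(build_record_id(row, [], prefix))
print(build_record_id({"name": "W1"}, [], prefix))
```

Because the first two branches derive the id only from the row contents, ingesting the same file again would produce the same ids rather than new random ones.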
Example of schema with no osdu natural keys - https://community.opengroup.org/osdu/platform/system/schema-service/-/blob/master/deployments/shared-schemas/osdu/master-data/Wellbore.1.0.0.json

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/28
CSV - Automatically create schema based on contents of CSV File
2021-06-24T12:26:01Z
Todd Dixon

Given a CSV file for which no schema has been defined, automatically create a new schema based on the contents of the CSV file using the Schema Service.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/27
CSV - Publish events as per Producing Status Messages ADR
2021-06-24T12:40:20Z
Todd Dixon

As per the [Producing Status Messages ADR](https://community.opengroup.org/osdu/platform/system/home/-/issues/80) that was approved last week, we will enhance the CSV Ingestor DAG to publish events accordingly.
Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/38
Refactor reusable Java logic from Ingestion DAGs to common Java ingestion library
2021-06-14T16:48:47Z
Alan Henson

Ingestion DAGs have functionality baked into the Python code that should be refactored into an OSDU Python ingestion library.
To deliver on this issue, determine what functionality should be refactored from the Manifest, Energistics, and CSV DAGs and create the corresponding issues to capture that work.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/4
CSV Parser
2021-06-24T12:42:56Z
Stephen Whitley (Invited Expert)

- [x] Ability to parse a CSV file.
- [x] Validate the structure of the CSV file against a configured CSV schema. Column header validation, data validation based on n rows. Schema Service integration.
- [x] Validation of processing multiple CSV files concurrently via multiple DAGs.
- [ ] Identify the size of CSV which will be supported in one DAG run.
M1 - Release 0.1
Todd Dixon, Swapnil