The process manifest task of the Osdu_ingest DAG calls the Storage Service for the Dataset Id in the WorkProductComponent regardless of the outcome of referential-integrity validation, resulting in the creation of a new record version
It has been observed that the process manifest task of the Osdu_ingest DAG calls the Storage Service for the DatasetId in the WorkProductComponent regardless of the outcome of referential-integrity validation, resulting in the creation of a new record version. In addition, in dataload_r3.py the FileId written into the WorkProductComponent has the form file_id:file_version. As a result, the following challenges are encountered.
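A minimal sketch of the expected behavior (not the actual osdu_ingest code; the function and parameter names here are hypothetical): the Storage Service call for a WPC's datasets should be gated on the referential-integrity result, so a failed validation does not create a new record version.

```python
def process_wpc(wpc: dict, storage_calls: list, integrity_ok: bool) -> bool:
    """Push the WPC's dataset records to Storage only when validation passed.

    `storage_calls` is a stand-in for calls to the Storage Service client;
    `integrity_ok` is the outcome of the referential-integrity validation.
    """
    if not integrity_ok:
        # Skip the Storage Service call entirely, so no new record
        # version is created for a WPC that failed validation.
        return False
    for dataset_id in wpc.get("data", {}).get("Datasets", []):
        storage_calls.append(dataset_id)  # stand-in for storage_client.put_record(...)
    return True
```

Under the current behavior described above, the equivalent of the `if not integrity_ok` guard is effectively missing, so the Storage Service is called either way.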
- If the WorkProduct and WorkProductComponent are not processed by the Airflow Manifest Ingestion DAG because referential-integrity validation failed, the file source information used in the first attempt cannot be reused to reprocess the manifest: the file version in the file source JSON is no longer the latest, so referential-integrity validation fails again when it is reused. As a result, ingestion of the WPC is tightly coupled with the upload of Datasets, which generates the File Source information that the dataload_r3.py script uses to replace the surrogate key in the WPC manifest.
- open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers_1_1_0/*.json use the same dataset, as do open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs'/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs_1_1_0'/*.json. The same is true of the manifests for Volve. Because of the current behavior of the DAG and dataload_r3.py described above, the same File Source information generated from the upload of Datasets cannot be reused. As a workaround, the dataset files under s3://osdu-seismic-test-data/r1/data/provided/markers/ are copied into two different directories, markers and markers_1_1_0, so that the files are uploaded separately and generate unique FileIds. The same is done for well logs.
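The staleness described above can be illustrated with a small sketch (hypothetical helper names; only the file_id:file_version format comes from dataload_r3.py): because the reference written into the WPC embeds a specific version, it stops matching as soon as an unconditional Storage call bumps the record to a new version.

```python
def wpc_file_reference(file_id: str, file_version: int) -> str:
    # Versioned reference as dataload_r3.py writes it into the WPC manifest.
    return f"{file_id}:{file_version}"

def reference_is_current(reference: str, latest_version: int) -> bool:
    # Referential-integrity check passes only if the reference names the
    # latest record version held by the Storage Service.
    _, _, version = reference.rpartition(":")
    return int(version) == latest_version
```

This is why re-running the manifest with the first attempt's File Source information fails, and why the workaround of duplicating the dataset directories (to force fresh uploads and fresh FileIds) is currently needed.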