Manifest Ingestion by Reference - point to a large set of identical files
Discussed with Jean Francois Rainaud recently.
Such a collection has identical manifests for different records, e.g. the 5000 TNO wellbores.
It is feasible to make use of the File Collection type of Dataset: https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/dataset/FileCollection.Generic.1.0.0.md
The program can then point to a single Dataset record that is a file collection and handle the processing of all 5000 records.
Thus there can be two alternatives for the new program (Manifest Ingestion by Reference) – one with a large (concatenated) JSON file, and the other with a "collection".
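To make the two alternatives concrete, here is a minimal sketch in Python. The manifest layout and kind strings below are illustrative assumptions, not the actual Manifest or FileCollection.Generic schema (see the linked schema definition for the real structure); the point is only the size difference between the two shapes:

```python
# Illustrative sketch only: the kind strings and field names are
# assumptions for demonstration, not the real OSDU manifest schema.

def build_reference_manifest(collection_record_id: str) -> dict:
    """Alternative 1: a small manifest that points at one
    FileCollection dataset record instead of embedding 5000 entries."""
    return {
        "kind": "osdu:wks:Manifest:1.0.0",  # assumed kind
        "Data": {
            # One dataset record stands in for the whole collection.
            "Datasets": [collection_record_id],
        },
    }

def build_concatenated_manifest(record_ids: list[str]) -> dict:
    """Alternative 2: one large manifest that lists every record."""
    return {
        "kind": "osdu:wks:Manifest:1.0.0",  # assumed kind
        "Data": {"Datasets": list(record_ids)},
    }

ref = build_reference_manifest(
    "opendes:dataset--FileCollection.Generic:tno-wellbores")  # hypothetical ID
big = build_concatenated_manifest(
    [f"opendes:dataset--File.Generic:wb-{i}" for i in range(5000)])
print(len(ref["Data"]["Datasets"]), len(big["Data"]["Datasets"]))  # 1 5000
```

Either shape would let a single DAG run cover all 5000 records, instead of 5000 separate runs.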
This thought is actually triggered by user feedback (see below).
Manifest Ingestion Issues:
- While ingesting a batch of files, the files are picked up by the script, which invokes the DAG. a. The DAG is limited to 32 concurrent runs, so when the Python script triggers 100 files, only 32 are processed at a time; as each run finishes, the next file is picked up. b. During concurrent runs, some of the DAGs fail, yet Airflow still reports success. This is the pain area: the unsuccessful file cannot be identified unless the customer reports that it did not ingest properly.
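One way to work around issue (b) is for the triggering script to verify the terminal state of each individual DAG run rather than trusting the overall response. The sketch below injects the state lookup as a function so it is self-contained; in practice it would query Airflow's stable REST API (e.g. `GET /api/v1/dags/{dag_id}/dagRuns/{run_id}` and read the `state` field). The run IDs and states are simulated:

```python
from typing import Callable, Iterable

def find_failed_runs(run_ids: Iterable[str],
                     get_state: Callable[[str], str]) -> list[str]:
    """Return the IDs of runs whose terminal state is not 'success'.

    `get_state` maps a run ID to its state string; injecting it keeps
    this sketch testable without a live Airflow instance.
    """
    return [run_id for run_id in run_ids if get_state(run_id) != "success"]

# Simulated per-run states standing in for real API responses.
states = {"run-1": "success", "run-2": "failed", "run-3": "success"}
failed = find_failed_runs(states, states.get)
print(failed)  # ['run-2']
```

Running such a check after each batch would surface the unsuccessful files immediately, instead of waiting for a customer report.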
- The TNO dataset (~5000 wells) takes almost 2-3 hrs to ingest, and we are concerned about how many days it will take to ingest massive volumes (TBs) of data. Therefore, the performance of ingestion needs to improve further.
Regards, Jegan (Accenture)