# Data Ingestion issues
https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues

## Manifest by reference - Use dataset service to move the manifest to the storage area
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/69 · Ben Lasscock · 2022-01-18

Manifest by reference requires a method to move the manifest from the landing zone to temporary storage on the platform. We propose using the dataset service for this, with the expectation that the dataset service can move the file and return a signed URL for use by the ingestion workflow. This URL will be communicated to the workflow service (by a POST request).

## Manifest by reference - Create an operator to push/pull manifests based on record id
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/70 · Ben Lasscock · 2021-06-30

In the proposed system, only the manifest "record id" will be propagated through Airflow using XCom.
In this issue, develop a method to obtain a manifest using the dataset service given a "record id" (which might be a signed URL to a file).

## Need longer waiting time within EDS
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/6 · Bruce Jin · 2023-09-22

During EDS ingestion, the EDS_ingest DAG waits 60 seconds for the manifest ingestion DAG to ramp up. Sometimes 60 seconds is not enough for the ingestion DAG to update a task_status when the data being processed is very large. As a result, the EDS_ingest DAG fails even though the manifest ingestion actually succeeded.
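An adjustable wait, as opposed to the hard-coded 60 seconds, might be sketched as below. This is illustrative only: the `get_status` callable, function name, and parameters are hypothetical stand-ins, not the actual osdu-airflow-lib API.

```python
import time

def wait_for_task_status(get_status, timeout_s=60.0, poll_interval_s=5.0):
    """Poll `get_status` until it returns a non-None task status or the
    timeout elapses.

    `get_status` is a caller-supplied callable (e.g. wrapping the workflow
    status endpoint). Returns the status, or None if the wait timed out.
    Making `timeout_s` a parameter (rather than a constant) lets large
    ingestions opt into a longer ramp-up window.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status is not None:
            return status
        time.sleep(poll_interval_s)
    return None
```

With a configurable `timeout_s`, a deployment processing very large manifests could raise the limit without patching the DAG code.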
We highly recommend extending the waiting time, or making it adjustable.

Milestone: M21 - Release 0.24 · Nisha Thakran, Priyanka Bhongade, Bruce Jin

## ADR: Change Handling Invalid Inputs in Data Processing Operators
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/10 · Yan Sushchynski (EPAM) · 2024-03-05
## Context
In manifest-based data ingestion pipelines, data processing operators handle a Manifest: a structured file containing metadata about the reference data, master data, and work product data to be ingested. Each record in a manifest corresponds to a specific data asset lined up for ingestion. An essential aspect of a manifest-based system is the interconnections and relationships among the records within the manifest. Therefore, if a single record fails validation, it calls the integrity of the entire manifest's data into question, because of potentially significant dependencies between records.
In the prior implementation, the operators simply marked an invalid record as 'skipped'. The processing continued with the rest of the valid records in the manifest. The 'skipped' records, although not passed down for further processing, were not completely discarded. The IDs of these records were maintained and made accessible for downstream tasks for possible error handling, debugging, or auditing purposes.
However, the skipped records often carry significant context, dependencies, or linkage associated with the remaining records. They can also be indicative of broader data quality issues. By skipping the invalid records but continuing with the rest of the process, we risked ignoring potentially serious issues and introducing discrepancies and inaccuracies downstream.
Thus, to maintain data assurance and avoid potential inconsistencies, we decided to halt the processing of the entire manifest if an invalid record was identified. It was a shift from previously allowing partial (and potentially out-of-context) data handling to enforcing complete and accurate data processing within the pipeline.
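The cross-record dependencies described above are visible in the general shape of a load manifest. The sketch below is illustrative only (the IDs are invented and the structure is simplified, not schema-complete): a work product component references both a master-data record and a dataset, so if either referenced record is invalid, the component cannot be ingested meaningfully either.

```python
# Simplified manifest following the general ReferenceData / MasterData / Data
# layout of load manifests; IDs and kinds are illustrative placeholders.
manifest = {
    "ReferenceData": [
        {"id": "osdu:reference-data--UnitOfMeasure:m",
         "kind": "osdu:wks:reference-data--UnitOfMeasure:1.0.0"},
    ],
    "MasterData": [
        {"id": "osdu:master-data--Wellbore:wb-1",
         "kind": "osdu:wks:master-data--Wellbore:1.0.0"},
    ],
    "Data": {
        "Datasets": [
            {"id": "osdu:dataset--File.Generic:log-1",
             "kind": "osdu:wks:dataset--File.Generic:1.0.0"},
        ],
        "WorkProductComponents": [
            {
                "id": "osdu:work-product-component--WellLog:wl-1",
                "kind": "osdu:wks:work-product-component--WellLog:1.0.0",
                # Cross-record links: this component depends on a master-data
                # record and a dataset elsewhere in the same manifest.
                "data": {
                    "WellboreID": "osdu:master-data--Wellbore:wb-1:",
                    "Datasets": ["osdu:dataset--File.Generic:log-1:"],
                },
            },
        ],
    },
}
```

Dropping only the wellbore record from this manifest would leave the well log component pointing at a record that was never ingested, which is exactly the partial-processing risk the decision below addresses.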
## Decision
We decided to revise the approach to handling invalid records:
1. **Manifest-Based Input Processing** - When processing manifests, if an operator comes across an invalid record, the entire pipeline is halted and the processing of that particular manifest is skipped entirely. This approach promotes the processing of _completely valid Manifests only_.
1. **Skipped Records Handling** - Internally, the operator logs the IDs of the invalid records in XCom for tracking purposes. However, these record IDs are _not transferred_ to downstream tasks.
1. **Error Generation for Invalid Records** - The occurrence of an invalid record in a manifest triggers an exception from the operator, effectively halting the entire pipeline. This immediate stop provides a clear signal that an error has occurred during the processing of the specific manifest.
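The three points above could be sketched, independently of any concrete Airflow operator, roughly as follows. The helper names (`is_valid`, `xcom_push`) and the exception type are hypothetical; in a real operator `xcom_push` would be the task instance's XCom push and the exception would derive from Airflow's exception hierarchy.

```python
class InvalidManifestError(Exception):
    """Raised when a manifest contains at least one invalid record."""

def process_manifest(records, is_valid, xcom_push):
    """Fail-fast manifest handling, per the decision above.

    - Validates every record with the caller-supplied `is_valid` predicate.
    - Logs the IDs of invalid records via `xcom_push` for tracking only;
      the IDs are not handed to downstream tasks.
    - Raises InvalidManifestError if any record is invalid, halting the
      pipeline; otherwise returns the records for further processing.
    """
    invalid_ids = [r["id"] for r in records if not is_valid(r)]
    if invalid_ids:
        xcom_push("skipped_ids", invalid_ids)  # recorded for auditing/debugging
        raise InvalidManifestError(f"Manifest rejected: invalid records {invalid_ids}")
    return records
```

Raising rather than filtering is the key design choice: downstream tasks either receive a fully valid record set or never run at all.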
## Consequences
The updated approach significantly enhances data quality and integrity by ensuring that each manifest processed is completely valid. It also improves error transparency in the system since the entire pipeline is halted as soon as an invalid record is detected, preventing potentially compromised data from propagating through the pipeline.
Additionally, it allows anomalies to be identified at an early stage, which can be particularly beneficial for large-scale data processing, where issues are more difficult to debug retrospectively.

Milestone: M23 - Release 0.26 · Debasis Chatterjee, Chad Leong