ADR: Change the Handling of Invalid Inputs in Data Processing Operators
Context
In manifest-based data ingestion pipelines, data processing operators handle a Manifest: a structured file containing metadata about the reference, master, and work-product data to be ingested. Each record in a manifest corresponds to a specific data asset queued for ingestion. An essential aspect of a manifest-based system is that the records within a manifest are interconnected. Therefore, if a single record fails validation, the integrity of the entire manifest is in question, because other records may depend on the failed one.
In the prior implementation, the operators simply marked an invalid record as 'skipped'. The processing continued with the rest of the valid records in the manifest. The 'skipped' records, although not passed down for further processing, were not completely discarded. The IDs of these records were maintained and made accessible for downstream tasks for possible error handling, debugging, or auditing purposes.
However, the skipped records often carry significant context, dependencies, or links to the remaining records. They can also indicate broader data quality issues. By skipping the invalid records while continuing with the rest of the process, we risked ignoring potentially serious issues and introducing discrepancies and inaccuracies downstream.
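The risk described above can be illustrated with a minimal sketch of the prior skip-and-continue behaviour. The record layout (`id`, `kind`, `refs`), the validity rule, and the function names are all hypothetical and chosen only for illustration; they are not taken from the actual operators.

```python
from typing import Any

Record = dict[str, Any]


def is_valid(record: Record) -> bool:
    # Hypothetical rule: a record is valid when it has a non-empty "kind".
    return bool(record.get("kind"))


def process_with_skipping(records: list[Record]) -> tuple[list[Record], list[str]]:
    """Prior behaviour: keep valid records, remember the IDs of skipped ones."""
    valid: list[Record] = []
    skipped_ids: list[str] = []
    for record in records:
        if is_valid(record):
            valid.append(record)
        else:
            skipped_ids.append(record["id"])
    return valid, skipped_ids


def dangling_references(valid: list[Record], skipped_ids: list[str]) -> dict[str, list[str]]:
    """Map each surviving record ID to the skipped record IDs it still refers to."""
    skipped = set(skipped_ids)
    result: dict[str, list[str]] = {}
    for record in valid:
        broken = sorted(skipped & set(record.get("refs", [])))
        if broken:
            result[record["id"]] = broken
    return result


manifest = [
    {"id": "wp-1", "kind": "work-product", "refs": ["md-1"]},
    {"id": "md-1", "kind": "", "refs": []},  # fails validation
]
valid, skipped_ids = process_with_skipping(manifest)
print(dangling_references(valid, skipped_ids))
```

In this sketch the work-product record passes validation and continues through the pipeline even though the master-data record it refers to was skipped, which is the kind of out-of-context processing the prior approach allowed.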
Thus, to maintain data assurance and avoid potential inconsistencies, we decided to halt the processing of the entire manifest if an invalid record was identified. It was a shift from previously allowing partial (and potentially out-of-context) data handling to enforcing complete and accurate data processing within the pipeline.
Decision
We decided to revise this approach for handling invalid records:
- Manifest-Based Input Processing: When an operator encounters an invalid record while processing a manifest, the entire pipeline is halted and that manifest is not processed at all. Only completely valid manifests are processed.
- Skipped Records Handling: Internally, the operator records the IDs of the invalid records in XCom for tracking purposes. However, these record IDs are not passed on to downstream tasks.
- Error Generation for Invalid Records: An invalid record in a manifest causes the operator to raise an exception, which halts the entire pipeline. This immediate stop gives a clear signal that an error occurred while processing that specific manifest.
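The three points above can be sketched as follows. This is a minimal illustration in plain Python, not the operators' actual code: `InvalidManifestError`, `process_manifest`, the validity rule, and the `audit_store` dictionary (a stand-in for the XCom entry) are all hypothetical names. The sketch also assumes that all invalid IDs are collected before the exception is raised; the real operators may stop at the first invalid record.

```python
from typing import Any

Record = dict[str, Any]


class InvalidManifestError(Exception):
    """Raised when a manifest contains at least one invalid record."""

    def __init__(self, invalid_ids: list[str]) -> None:
        self.invalid_ids = list(invalid_ids)
        super().__init__(f"Manifest rejected; invalid record IDs: {self.invalid_ids}")


def is_valid(record: Record) -> bool:
    # Hypothetical rule: a record is valid when it has a non-empty "kind".
    return bool(record.get("kind"))


def process_manifest(records: list[Record], audit_store: dict[str, Any]) -> list[Record]:
    """Return the records only if every one of them is valid.

    `audit_store` stands in for the place where invalid IDs are kept for
    tracking; nothing is returned to downstream steps on failure.
    """
    invalid_ids = [record["id"] for record in records if not is_valid(record)]
    if invalid_ids:
        audit_store["invalid_record_ids"] = invalid_ids  # kept for tracking only
        raise InvalidManifestError(invalid_ids)  # halts processing of the manifest
    return list(records)


audit_store: dict[str, Any] = {}
try:
    process_manifest([{"id": "a", "kind": "master"}, {"id": "b", "kind": ""}], audit_store)
except InvalidManifestError as error:
    print(error)
```

Because the failure is an exception rather than a return value, a caller cannot accidentally continue with a partially valid manifest; it has to handle the error explicitly or let the run fail.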
Consequences
The updated approach protects data quality and integrity by ensuring that every manifest that is processed is completely valid. It also makes errors more visible: the entire pipeline is halted as soon as an invalid record is detected, which prevents potentially compromised data from propagating through the pipeline.
Additionally, it allows anomalies to be identified at an early stage, which can be particularly beneficial for large-scale data processing, where issues are more difficult to debug retrospectively.