Redundant steps executed: validate schemas and ensure referential integrity
Expected behavior
The validate schemas and ensure referential integrity operations should only need to be executed once per manifest
Observed behavior
In the top-level DAG definition, you can see “validate schema” and “ensure integrity” operators are executed as part of the DAG:
branch_is_batch_op >> validate_schema_operator >> ensure_integrity_op >> process_single_manifest_file >> update_status_finished_op
But then diving deeper into the process_single_manifest_file
operator, it ALSO validates schemas and ensures referential integrity, resulting in redundant API calls.
This problem will go unnoticed for small workloads, but for larger workloads the increased latency will start to quickly add up. Using the TNO and Volve sample ingestion dataset as an example, there are about 24,000 manifest files. If this redundancy adds just 2 extra API calls per manifest ( one for schema validation, one for referential integrity checks ), and each API request takes 250 milliseconds, then this would increase the overall ingestion time by:
24,000 manifests * ( .25 seconds redundancy * 2 requests ) / 60 seconds per minute = 200 minutes, or 3.3 hours.
As the ingestion workload size increases, this redundancy becomes a non-trivial amount of time. Naturally, your mileage may vary! I'm sure we'll see different latency results for different networks, different customers, different Cloud Providers, etcetera. I'm confident customer experiences will be improved by reducing this latency.
Some proposed solutions
- remove the duplicated referential integrity call, the results of this operation aren't used anyway
- change the way the manifest is obtained for the
process_single_manifest_file
operator, allowing the removal of the duplicated schema validation call. We could use Airflow mechanisms ( Xcoms, variables, etcetera ) to reuse the manifest from thevalidate_schema_operator
, but that might affect the atomicity of the operator.
What do you think?