Redundant steps executed: validate schemas and ensure referential integrity

Expected behavior

The validate schemas and ensure referential integrity operations should only need to be executed once per manifest

Observed behavior

In the top-level DAG definition, you can see “validate schema” and “ensure integrity” operators are executed as part of the DAG:

branch_is_batch_op >> validate_schema_operator >> ensure_integrity_op >> process_single_manifest_file >> update_status_finished_op

But then diving deeper into the process_single_manifest_file operator, it ALSO validates schemas and ensures referential integrity, resulting in redundant API calls.

This problem will go unnoticed for small workloads, but for larger workloads the increased latency will start to quickly add up. Using the TNO and Volve sample ingestion dataset as an example, there are about 24,000 manifest files. If this redundancy adds just 2 extra API calls per manifest ( one for schema validation, one for referential integrity checks ), and each API request takes 250 milliseconds, then this would increase the overall ingestion time by:

24,000 manifests * ( .25 seconds redundancy * 2 requests ) / 60 seconds per minute = 200 minutes, or 3.3 hours.

As the ingestion workload size increases, this redundancy becomes a non-trivial amount of time. Naturally, your mileage may vary! I'm sure we'll see different latency results for different networks, different customers, different Cloud Providers, etcetera. I'm confident customer experiences will be improved by reducing this latency.

Some proposed solutions

remove the duplicated referential integrity call, the results of this operation aren't used anyway
change the way the manifest is obtained for the process_single_manifest_file operator, allowing the removal of the duplicated schema validation call. We could use Airflow mechanisms ( Xcoms, variables, etcetera ) to reuse the manifest from the validate_schema_operator, but that might affect the atomicity of the operator.

What do you think?

Edited Mar 24, 2021 by Brady Spiva [AWS]