Please explore Apache Airflow limitations to finalize design for R3
There have been a few concerns raised about Airflow as a generic DAG/orchestration solution for data flow in OSDU. It would be good to capture these issues here and to respond with observations/solutions so the decision can be properly captured.
- Airflow is available as a cloud-native managed service only in GCP, which can make it cumbersome to host in other CSPs, where managing the infrastructure becomes a platform/operator responsibility, unlike with PaaS solutions.
- With Airflow, it will be quite hard to isolate workflows from one another, because they all run within the same execution environment. As OSDU approaches "OSDU SaaS" and OSDU for smaller operators, where it may be hosted by an SI or CSP, this can make multi-tenant deployments challenging.
- Airflow DAGs are Python-only, while some parsers and libraries may be written in Java or C++. As a comparison, something like Argo, which is Kubernetes-based, could allow worksteps in different languages/environments, since each workstep becomes a separate container instance rather than a Python script.
- Airflow apparently has an execution delay between tasks - it is unclear if this is a framework limitation or specific experience of a setup, but perhaps worth capturing to analyze.
- Similarly there are concerns about temporary state/data and an intermediary persistence to hold across DAG worksteps. Beyond what can be held in memory, does Airflow provide a persistable temporary cache for such state?
- Is the Airflow DSL cumbersome to author for ingestion/enrichment workflow providers (ISVs, SIs, operators)? Compared with YAML or other alternatives, is it a good choice?
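
To make the language and intermediate-state concerns above more concrete, here is a minimal sketch in plain Python (deliberately not using the Airflow API) of one pattern that could be evaluated: a workstep persists its intermediate data to storage and passes only the location on, and the next workstep hands that location to an external parser process. The function names, the sample payload, and the stand-in parser are all hypothetical; a real setup would presumably use shared or object storage rather than a local temporary directory, and the stand-in would be replaced by the actual Java or C++ binary. For context, Airflow's own mechanism for exchanging values between tasks is XCom, which is stored in the metadata database and intended for small values, so passing a reference rather than the payload is one option worth examining.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def extract_step(workdir: Path) -> str:
    """First workstep: persist intermediate data and return only its location."""
    staged = workdir / "extracted.json"
    # Made-up sample payload, for illustration only.
    staged.write_text(json.dumps({"wells": ["A-1", "B-2", "C-3"]}))
    # Only this small string needs to travel between worksteps.
    return str(staged)


def parse_step(staged_path: str, parser_cmd: Sequence[str]) -> dict:
    """Second workstep: hand the staged file to an external parser process."""
    result = subprocess.run(
        list(parser_cmd) + [staged_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the parser exits non-zero
    )
    return json.loads(result.stdout)


# Hypothetical stand-in for a Java or C++ parser binary: a child process that
# reads the staged file given as its first argument and prints JSON to stdout.
STAND_IN_PARSER = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(open(sys.argv[1])); "
    "print(json.dumps({'count': len(d['wells'])}))",
]


def run_pipeline() -> dict:
    """Run both worksteps in order, sharing state through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        staged_path = extract_step(Path(tmp))
        return parse_step(staged_path, STAND_IN_PARSER)


if __name__ == "__main__":
    print(run_pipeline())
```

Whether this pattern is acceptable for OSDU depends on the answers to the questions above, in particular on how well process- or container-level invocation from Airflow tasks addresses isolation and scheduling latency.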
Once the elaboration work is complete, kindly capture this as a LADR for the Data flow project. Thanks for the advice on these issues.