Airflow: Performance design review
R3 ingestion development work uncovered multiple performance issues with Airflow 1.10.x. Options for optimization range from changes to the infrastructure that manages Airflow to adopting an approach other than Airflow. Engage the Enterprise Architecture team to review the existing Workflow Service design using Airflow and determine which of the considerations below should be pursued.
There are near-term and longer-term considerations. Near-term assumes R3M5/R3M6 development efforts. Longer-term provides space for new architectural considerations, such as cloud-native implementations with standardized workflows for write-once run-anywhere capabilities.
Near-Term
- Update the Airflow infrastructure to favor always-available Airflow instances, minimizing the lag between ingestion initiation and ingestion start (cost is secondary, though cost-optimized profiles are valid)
- Configure Airflow within the infrastructure as always-on vs. spin-up-on-demand. This approach increases cost but improves performance as it minimizes the delay in initiating a workflow.
- Introduce a throttling mechanism for workflow run requests so that Airflow is not overwhelmed to the point of failure by large numbers of requests (this also needs to account for the Storage Service max-records limit of 500)
- Understand what scaling capabilities the CSPs have implemented and whether those are captured as best practices
- Determine SLAs for workflows in terms of parallelism, CPU and memory consumption, etc.
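
As a starting point for discussion of the throttling item above, here is a minimal sketch of one possible approach: a token bucket that caps the rate of workflow run requests, plus a helper that splits records into batches no larger than the Storage Service limit of 500. The names `TokenBucket`, `chunk_records`, and `MAX_RECORDS_PER_CALL` are hypothetical and not part of the Workflow Service or Airflow. The rate and burst values are placeholders, and where the throttle would sit (in front of the Workflow Service API, or in the service itself) is an open design question.

```python
import threading
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Assumption: the Storage Service accepts at most 500 records per call,
# as stated in the near-term list above.
MAX_RECORDS_PER_CALL = 500


class TokenBucket:
    """Hypothetical throttle: about `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if a workflow run request may proceed right now."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


def chunk_records(records: Sequence[T],
                  size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield list(records[start:start + size])
```

A caller would check `try_acquire()` before triggering a workflow run and, when it returns `False`, either queue the request or reject it so the client can retry later. Which of those behaviors is appropriate depends on the SLAs still to be determined.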
**Longer-Term** (will break out into a separate issue)
- A migration to Airflow 2.x should be considered
- Identify infrastructure updates that could support better scalability
- Determine SLAs for workflows in terms of parallelism, CPU and memory consumption, etc.
- Consider additional data processing capabilities (e.g., Apache Spark or Apache Beam)