Move to Airflow 2.0 ADR
Moving to Airflow 2.0
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Decision
This decision authorizes porting the ingestion workflow and associated DAGs (see below) to support Airflow 2.0, deprecating support for Airflow 1.10.x after a transitional period.
Deprecation strategy
The existing experimental API, although deprecated and disabled by default, is still available in Airflow 2.0 at:
/api/experimental/
To restore these APIs while migrating to the stable REST API, set the enable_experimental_api option in the [api] section of the Airflow configuration to True.
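As a minimal sketch of this setting, the Python snippet below sets the option in an `airflow.cfg`-style fragment using the standard-library `configparser`. The helper name `enable_experimental_api` and the sample `auth_backend` line are illustrative assumptions, not part of Airflow or of this ADR. In a real deployment the option would more likely be edited directly in `airflow.cfg` or supplied through the `AIRFLOW__API__ENABLE_EXPERIMENTAL_API` environment variable.

```python
import configparser
import io


def enable_experimental_api(cfg_text: str) -> str:
    """Return cfg_text with [api] enable_experimental_api set to True.

    Hypothetical helper for illustration only; it is not part of Airflow.
    Note that configparser drops comment lines when writing the result.
    """
    # Airflow config values may contain '%' characters, so interpolation is disabled.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("api"):
        parser.add_section("api")
    parser.set("api", "enable_experimental_api", "True")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Sample input fragment (assumed for the example).
updated = enable_experimental_api(
    "[api]\nauth_backend = airflow.api.auth.backend.deny_all\n"
)
print(updated)
```

The existing options in the `[api]` section are preserved; only the one flag is added or overwritten.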
A deprecation strategy will be implemented, giving providers a transitional period to port to Airflow 2.0. During this period, common code will run on Airflow 2.0 by default; however, configuration (possibly environment variables) will allow code written for 1.10.x to be supported.
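One way such a switch could look on the client side is sketched below: an environment variable selects between the experimental endpoint used with 1.10.x and the stable REST endpoint of Airflow 2.0 when building a "trigger DAG run" request. The variable name `OSDU_AIRFLOW_LEGACY_API`, the function name, the base URL, and the DAG id are all assumptions made for this example; the ADR does not define them. Authentication is omitted, and the request is only built, not sent.

```python
import json
import os
import urllib.request

# Hypothetical environment variable; the ADR only says that configuration
# (possibly environment variables) will select 1.10.x behavior.
LEGACY_FLAG = "OSDU_AIRFLOW_LEGACY_API"


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build (but do not send) a request that triggers a DAG run."""
    legacy = os.environ.get(LEGACY_FLAG, "false").lower() == "true"
    if legacy:
        # Experimental API, as used with Airflow 1.10.x.
        path = f"/api/experimental/dags/{dag_id}/dag_runs"
    else:
        # Stable REST API, available from Airflow 2.0 (the default here).
        path = f"/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request("http://localhost:8080", "example_dag", {"run_id": "demo"})
print(request.get_method(), request.full_url)
```

Defaulting to the stable endpoint mirrors the intent that common code runs on Airflow 2.0 unless a provider explicitly opts into the legacy path.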
A guide detailing the backward compatibility changes can be found here.
Dependencies
The following task list (provided by EPAM) gives an overview of the dependencies and level of effort required to implement the move.
# | Task | Estimate | Assigned to | Ticket |
---|---|---|---|---|
1 | Install Airflow 2.0 to all environments | | All CSPs | |
2 | Airflow 2.0 required DAG code changes | | | |
2.1 | Manifest-based ingestion (Python) | 5 days | (GCP) | issue |
2.2 | WITSML parser (Python) | 5 days | (GCP) | issue |
2.3 | SEGY -> OpenVDS | 5 days | (GCP) | |
2.4 | SEGY -> ZGY | 5 days | Seismic | |
2.5 | CSV Parser | 5 days | (GCP) | issue |
3 | Workflow Services | | | |
3.1 | Common Code (Java) | 10 days | | issue |
3.2 | AWS | | | |
3.3 | Azure | | | |
Motivation
The release of Airflow 2.0 is a significant upgrade over previous versions and includes improvements and new features that support our goals for the ingestion workflow (see link).
In the context of our goals for progressing the OSDU ingestion project:
Ease of onboarding developers
The "experimental" REST API used with Airflow <2.0 is being deprecated in favor of a new, comprehensive "stable" REST API supported by Airflow >=2.0. Moving to the new Airflow should ensure that the code created by the OSDU enjoys greater support online and is easier for new developers to adopt and extend.
The new Airflow 2.0 TaskFlow API simplifies passing information between tasks in a DAG. This feature does not solve the performance problems related to passing large manifests through the workflow; that is the focus of a separate effort, "Manifest by reference". An example of the new TaskFlow API can be found here.
Latency
One of the major features of Airflow 2.0 is a new highly available, low-latency scheduler.
Measurements made during the OSDU Airflow 2.0 PoC (conducted by EPAM) found that, with Airflow 1.10.14, the latency between tasks (that is, between periods of productive work) could be as much as 30 seconds; with equivalent code running on Airflow 2.0, this overhead was reduced to 5 seconds.
Airflow versions before 1.10.15 would, by default, allow the creation of only one DAG per second, potentially creating latency issues. Although this was solved in release 1.10.15, the dependency on a specific 1.10.x release has created variance in the behavior of the ingestion workflow across providers; moving to Airflow >=2.0 will remove it. A complete list of bug fixes and improvements through to the current release, Airflow 2.1.0, can be found here.
Throughput and scalability
In the current version of Airflow, the scheduler has been found to fail silently once max_active_runs_per_dag (configuration default is 20) is exceeded. This creates variance in the behavior of the ingestion workflow depending on the specific configuration of the OSDU platform provider. During the Airflow 2.0 PoC, this problem was found to be resolved.
[A] user can now launch additional "replicas" of the Scheduler to increase the throughput of their Airflow Deployment.
The option to run multiple schedulers allows providers to scale the ingestion workflow by provisioning more resources. It also removes a single point of failure (when two or more schedulers are used), making the system more resilient.
Project Risks
Risk Category | Risk Description | Likelihood | Impact | Comments |
---|---|---|---|---|
NA | NA | NA | NA | NA |
Organizational Management
Name | Project Role | Time Zone |
---|---|---|
Kateryna | GCP | CST |
Kishore | Azure | IST |
Shrikant | IBM | IST |
Greg Wibben | AWS | CST |
Ben | Manifest | CST |
Chad | Data Loading | CET |
Fernando | CSV | |
Sacha | Seismic | |