File structure updates. Airflow pluggable approach
Context and Scope
The existing code base has a few disadvantages:
- there is no standardized approach to common modules that live close to the DAGs
- it is not modular, i.e. the code base can't be split into independent modules
This ADR proposes a vision of how to make the project more pluggable, i.e. it is an attempt to standardize the code base and the way vendor modules are included.
The Decision section covers the following topics:
- Airflow project structure update
- How to plug in local Python packages
- Caveats about Airflow internals
Furthermore, the proposal implies two flows of improvement:
- Strategic (late R3, post R3)
  - Multiple APIs for deployments
  - Operators (reusable components)
  Libraries developed by a number of vendors are hosted on the platform. DAGs are composed, for instance, within a UI and sent to an API endpoint to be processed.
- Immediate needs (R3)
  - Single endpoint / approach for module code deployment
The second case is described in the proposal below.
The proposed approach will allow the following:
- Each vendor can keep their code in a separate repository
- Vendors can contribute to core functionality
- Vendors' ingestion extensions will be in a separate Git repository
Following the steps above, each vendor can develop its own extensions separately and deliver them when needed.
The repositories can take the following representation:
/IngestionDAGs.git   # ingestion core functionality
/Vendor1.git
/Vendor2.git
Some caveats follow:
- Extension repositories must follow the proposed code structure (see the Code structure update section)
- There is a list of supported libraries that should be kept up to date by the Operator. Library versions should be documented by the CSPs
Code structure update
Our proposal is to split the current code base according to the following structure:
src/
├── dags/
│   ├── commons/
│   │   └── common_utils.py   # for instance, common functions to prepare DAG params/constants
│   ├── vendor_1/
│   │   ├── libs/
│   │   │   └── utils.py      # the vendor utilities/functions
│   │   └── dag.py            # the vendor DAGs here
│   └── vendor_2/
│       ├── libs/
│       ...
├── plugins/
│   ├── commons/
│   │   └── common_utils.py   # for instance, common functions to prepare operators params/constants
│   ├── vendor_1/
│   │   ├── libs/
│   │   │   └── utils.py      # the vendor utilities/functions
│   │   ├── operators/        # the vendor operators here
│   │   ├── hooks/            # the vendor hooks here
│   │   └── ...
│   └── vendor_2/
│       ...
tests/
│   └── module (or vendor)
└── requirements.txt
Let's look at the structure in more detail.
All the code will be split into module or vendor folders. Each folder will contain separate libs and dags folders. The dags folder can hold DAG files as well as sub-folders with DAGs. The libs folder can hold utility modules, etc.
The tests folder will hold unit and integration tests, split by module or vendor.
The plugins folder will be split by module or vendor too. Files in this directory have to follow the Airflow Plugins convention. We propose the following approach:
...
plugins/
└── vendor_1/
    ├── commons
    │   └── vendor_utils.py
    ├── operators
    │   └── vendor_operator.py
    ├── hooks
    │   └── vendor_hook.py
    ├── macros
    ├── ...
    └── __init__.py
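A convention like this is easier to keep when it can be checked automatically. The sketch below is a hypothetical helper and is not part of Airflow. The set of required entries (operators, hooks, __init__.py) is an assumption drawn from the layout above, and the names REQUIRED_ENTRIES and missing_layout_entries are illustrative only.

```python
import tempfile
from pathlib import Path

# Assumption: the entries every vendor folder under plugins/ is expected to have.
REQUIRED_ENTRIES = ("__init__.py", "operators", "hooks")


def missing_layout_entries(vendor_dir: Path) -> list:
    """Return the required entries that are absent from a vendor plugin folder."""
    return [name for name in REQUIRED_ENTRIES if not (vendor_dir / name).exists()]


# Usage example on a throwaway directory that deliberately lacks hooks/.
demo_vendor = Path(tempfile.mkdtemp()) / "plugins" / "vendor_1"
(demo_vendor / "operators").mkdir(parents=True)
(demo_vendor / "__init__.py").write_text("")

missing = missing_layout_entries(demo_vendor)
print(missing)
```

A check of this kind could run in a vendor repository's CI before an extension is delivered, under the assumption that the required entries are agreed on in the README.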
Using the Airflow Plugins Mechanism
Airflow has a built-in plugins system that requires defining AirflowPlugin subclasses. This, however, overcomplicates the issue and confuses many people. Airflow is even considering deprecating the use of the Plugins mechanism for hooks and operators going forward.
(!) According to the document, the Plugins mechanism must still be used, but only for plugins that make changes to the webserver UI.
How it works:
Let’s assume you have an Airflow Home directory with the following structure.
(!) We will assume that the vendor name is vnd
vnd/
├── commons
└── dags
    └── vnd_dag.py
plugins/
└── vnd/
    ├── operators
    │   └── vnd_operator.py
    ├── hooks
    │   └── vnd_hook.py
    ├── sensors
    │   └── vnd_sensor.py
    └── __init__.py
The vnd_dag wants to use vnd_operator and vnd_sensor, and vnd_operator in turn wants to use vnd_hook. When Airflow is running, it adds DAGS_FOLDER, PLUGINS_FOLDER, and config/ to sys.path (the Python import path, not the shell PATH), so any Python files in those folders can be imported. From our vnd_dag.py file, we can therefore simply use:
from vnd.operators.vnd_operator import MyOperator
from vnd.sensors.vnd_sensor import MySensor
Since the plugins directory in the bucket root was added to sys.path, the imports above start from the vendor module name.
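The import mechanics can be reproduced without Airflow. The following is a minimal sketch, assuming that Airflow's behaviour here amounts to putting the plugins folder on sys.path. The temporary directory, the helper build_demo_plugins, and the plain stand-in classes MyHook and MyOperator (which do not inherit from any Airflow base class) are hypothetical and exist only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in sources: plain classes instead of real Airflow base classes,
# so the sketch runs without Airflow installed.
HOOK_SRC = '''
class MyHook:
    def get_conn(self):
        return "vnd-connection"
'''

OPERATOR_SRC = '''
from vnd.hooks.vnd_hook import MyHook  # resolved via the plugins folder on sys.path


class MyOperator:
    def execute(self):
        return MyHook().get_conn()
'''


def build_demo_plugins(root: Path) -> Path:
    """Create plugins/vnd/{hooks,operators} under root and return the plugins folder."""
    plugins = root / "plugins"
    for package, module, source in (
        ("hooks", "vnd_hook.py", HOOK_SRC),
        ("operators", "vnd_operator.py", OPERATOR_SRC),
    ):
        package_dir = plugins / "vnd" / package
        package_dir.mkdir(parents=True)
        (package_dir / "__init__.py").write_text("")
        (package_dir / module).write_text(source)
    (plugins / "vnd" / "__init__.py").write_text("")
    return plugins


demo_root = Path(tempfile.mkdtemp())
plugins_folder = build_demo_plugins(demo_root)

# Assumption: this mimics what Airflow does with the plugins folder at startup.
sys.path.insert(0, str(plugins_folder))
importlib.invalidate_caches()

# The import starts from the vendor name because plugins/ itself is on sys.path.
from vnd.operators.vnd_operator import MyOperator

result = MyOperator().execute()
print(result)
```

In a real deployment the operator and hook would subclass the corresponding Airflow base classes; the import paths would stay the same.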
(!) Due to Airflow internals, it is strongly not recommended to put many files into dags/commons or plugins/commons. We recommend installing that code as a package with pip instead.
Some vendors provided their own parsers, but these could not simply be plugged in and run. There were many questions about where to put the parsers and how to import and use them in operators. In the absence of a common approach and documentation, external modules can cause runtime errors.
- An MR with the updated code base has to be created
- README.md has to contain information about the structure and conventions.