Issue created Sep 27, 2020 by Siarhei Khaletski (EPAM)

File structure updates. Airflow pluggable approach

Change Type:

  • Feature
  • Bugfix
  • Refactoring

Context and Scope

The existing code base has a few disadvantages:

  1. there is no standardized approach to keeping common modules close to the DAGs
  2. it is not modular, i.e. the code base cannot be split into independent modules

This ADR proposes a vision of how to make the project more pluggable, i.e. it is an attempt to standardize the code base and the way vendor modules are included.

Within the Decision section, the following topics will be covered:

  1. Airflow project structure update
  2. How to plug in local Python packages
  3. Caveats about Airflow internals

Furthermore, it should be noted that the proposal implies two streams of improvement:

  1. Strategic (late R3, post R3)
  • Multiple APIs for deployments
    • Operators (reusable components)
    • DAGs
    • Libs

Libraries developed by a number of vendors are hosted on the platform. DAGs are composed, for instance within a UI, and sent to an API endpoint to be processed.

  2. Immediate needs (R3)
  • Single endpoint / approach for module code deployment
    • DAGs
    • Plugins

The second case is described in the proposal below.

Decision

Vendors contribution

The proposed approach will allow the following:

  1. Each vendor can keep their code in a separate repository
  2. Vendors can contribute to the core functionality
  3. Vendor ingestion extensions will live in separate Git repositories

Following the steps above, each vendor can develop their own extensions separately and deliver them when needed.

The repositories can take the following representation:

/IngestionDAGs.git #ingestion core functionality
/Vendor1.git
/Vendor2.git

Some caveats follow:

  1. Extension repositories must follow the proposed code structure (see below)
  2. There is a list of supported libraries that should be updated by the Operator. Library versions should be documented by CSPs

Code structure update

Our proposal is to split the current code base according to the following structure:

src/
├── dags/
│   ├── commons/
│       └── common_utils.py #for instance common functions to prepare DAG params/constants 
│   ├── vendor_1/
│   │   ├── libs/ 
│   │       └── utils.py # the vendor utilities/functions      
│   │   └── dag.py # the vendor DAGs here
│   └── vendor_2/
│       ├── libs/     
│       ...
├── plugins/
│   ├── commons/
│       └── common_utils.py #for instance common functions to prepare operators params/constants 
│   ├── vendor_1/
│   │   ├── libs/ 
│   │       └── utils.py # the vendor utilities/functions      
│   │   ├── operators/ # the vendor operators here
│   │   ├── hooks/ # the vendor hooks here
│   │   └── ... 
│   └── vendor_2/    
│       ...
tests/
│   └── module (or vendor)
└── requirements.txt

Let's take a deeper look at the structure.

All the code will be split into module or vendor folders. Each such folder will contain separate libs and dags folders. The dags folder can hold DAG files as well as sub-folders with DAGs. The libs folder holds utility modules, etc.
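
For illustration, a vendor DAG file under this layout might look like the following. This is only a sketch: the DAG id, schedule, and the get_default_args helper are hypothetical, and the DummyOperator import path assumes Airflow 1.10.x.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

from vendor_1.libs.utils import get_default_args  # hypothetical helper from libs/

# dags/vendor_1/dag.py -- vendor_1.libs resolves because Airflow adds
# DAGS_FOLDER to sys.path (see "How it works" below)
with DAG(
    dag_id="vendor_1_ingestion",  # hypothetical DAG id
    default_args=get_default_args(),
    start_date=datetime(2020, 9, 1),
    schedule_interval=None,
) as dag:
    start = DummyOperator(task_id="start")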

The tests folder will hold unit and integration tests, split by module or vendor.

The plugins folder will be split by module or vendor folders too. Files in this directory have to follow the Airflow plugins convention. We propose the following approach:

...
plugins/
└── vendor_1/
    ├── commons
        └── vendor_utils.py
    ├── operators
        └── vendor_operator.py
    ├── hooks
        └── vendor_hook.py 
    ├── macros
    ├── ...
    └── __init__.py
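
As an illustration of what such a file could contain, here is a minimal sketch of a vendor hook; the class name, connection id, and send method are hypothetical, and the BaseHook import path assumes Airflow 1.10.x.

# plugins/vendor_1/hooks/vendor_hook.py
from airflow.hooks.base_hook import BaseHook

class VendorHook(BaseHook):
    """Hypothetical hook that talks to a vendor service."""

    def __init__(self, vendor_conn_id="vendor_default"):
        self.vendor_conn_id = vendor_conn_id

    def send(self, payload):
        conn = self.get_connection(self.vendor_conn_id)  # reads an Airflow Connection
        # ... call the vendor service using conn.host, conn.login, etc.
        return payload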

Use of the Airflow Plugins Mechanism

Airflow has a built-in plugins system that requires creating AirflowPlugin instances. However, this overcomplicates the issue and leads to confusion for many people. Airflow is even considering deprecating the plugins mechanism for hooks and operators going forward.

(!) According to the documentation, the plugins mechanism should still be used only for plugins that make changes to the webserver UI.
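
For reference, registering components through the built-in mechanism looks roughly like this (a sketch; the plugin class and the operator it registers are hypothetical):

from airflow.plugins_manager import AirflowPlugin

from vnd.operators.vnd_operator import VndOperator  # hypothetical vendor operator

class VndPlugin(AirflowPlugin):
    name = "vnd_plugin"
    operators = [VndOperator]

With the approach proposed here, this boilerplate is unnecessary for operators and hooks, since they are imported as plain Python modules.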

How it works:

Let’s assume you have an Airflow Home directory with the following structure.

(!) We will assume that vendor name is vnd

vnd/
  ├── commons
  └── dags
    └── vnd_dag.py
plugins/
└── vnd/
    ├── operators
        └── vnd_operator.py
    ├── hooks
        └── vnd_hook.py
    ├── sensors
        └── vnd_sensor.py
    └── __init__.py

The vnd_dag wants to use vnd_operator and vnd_sensor, and vnd_operator wants to use vnd_hook. When Airflow is running, it adds DAGS_FOLDER, PLUGINS_FOLDER, and config/ to sys.path, so any Python files in those folders should be importable. So from our vnd_dag.py file, we can simply use:

from vnd.operators.vnd_operator import VndOperator
from vnd.sensors.vnd_sensor import VndSensor

Since the plugins directory at the bucket root was added to sys.path, the imports above start from the vendor module name.
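
Likewise, vnd_operator.py can import the hook as a plain Python module. A minimal sketch, assuming hypothetical VndHook and VndOperator classes and an Airflow 1.10.x import path:

# plugins/vnd/operators/vnd_operator.py
from airflow.models import BaseOperator

from vnd.hooks.vnd_hook import VndHook  # resolved via sys.path, as above

class VndOperator(BaseOperator):
    def __init__(self, payload=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.payload = payload

    def execute(self, context):
        # delegate the actual work to the hook
        return VndHook().send(self.payload)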

(!) Due to Airflow internals, it is strongly not recommended to put many files into dags/commons or plugins/commons. We recommend installing such shared code as a package via pip.
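
A minimal sketch of how such shared code could be packaged (the package name and version are hypothetical):

# setup.py for the shared utilities
from setuptools import find_packages, setup

setup(
    name="ingestion-commons",
    version="0.1.0",
    packages=find_packages(),
)

The package can then be installed into the Airflow environment with pip install . and imported from both DAGs and plugins, without relying on the DAGS_FOLDER/PLUGINS_FOLDER scanning described above.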

Rationale

Some vendors provided their own parsers, which were hard to simply plug in and run. There were a lot of questions about where to put the parsers and how to import and use them from operators. In the absence of a common approach and documentation, external modules can cause runtime errors.

Consequences

  1. An MR with the updated code base has to be created
  2. README.md has to contain information about the structure and conventions.
Edited Nov 10, 2020 by Dmitriy Rudko