ADR: Keep Only DAG Files in the DAG Folder
Introduction:
The purpose of this ADR is to review and approve the proposed changes to the Directed Acyclic Graph (DAG) structure and the relocation of Python files to the osdu_airflow package. The proposed changes aim to improve DAG performance and to avoid unnecessary import issues.
DAG files of the CSP:
- src/dags/eds_scheduler/eds_scheduler_dag.py
- src/dags/eds_ingest/src_dags_fetch_ingest_scheduler_dag.py
Purpose of Restructuring:
- Mitigate Potential Import Issues: Organizing the Python files into a coherent package such as osdu_airflow gives the DAGs and related modules a single, stable import path. Import statements then reflect the package structure rather than the layout of the DAG folder, which reduces the likelihood of import errors and improves the overall stability of the system.
- Enhance Performance: Once the non-DAG Python files live in a package such as osdu_airflow, the DAG folder contains only DAG files, so the Airflow scheduler parses and schedules only those files during each run. The scheduler no longer has to examine the supporting modules, which reduces unnecessary processing time and makes scheduling more efficient.
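The parsing overhead described above can be made visible with a small inventory script. This is only a sketch: it assumes a content check loosely modeled on Airflow's default safe-mode DAG discovery (a Python file is a parse candidate when it mentions both "dag" and "airflow", case-insensitively). The function names and the throwaway folder contents are hypothetical and are not part of the EDS code.

```python
import tempfile
from pathlib import Path


def looks_like_dag_file(path: Path) -> bool:
    """Rough heuristic: the file content mentions both 'dag' and 'airflow'."""
    content = path.read_bytes().lower()
    return b"dag" in content and b"airflow" in content


def split_dag_folder(dag_folder: Path) -> tuple[list[str], list[str]]:
    """Return (likely DAG files, other Python files) as relative POSIX paths."""
    dag_files: list[str] = []
    other_files: list[str] = []
    for path in sorted(dag_folder.rglob("*.py")):
        rel = path.relative_to(dag_folder).as_posix()
        if looks_like_dag_file(path):
            dag_files.append(rel)
        else:
            other_files.append(rel)
    return dag_files, other_files


def demo() -> tuple[list[str], list[str]]:
    """Build a throwaway folder holding one DAG file and one helper module."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "eds_scheduler").mkdir()
        # Hypothetical file contents, used only to exercise the heuristic.
        (root / "eds_scheduler" / "eds_scheduler_dag.py").write_text(
            "from airflow import DAG\n"
        )
        (root / "eds_scheduler" / "eds_email_automation.py").write_text(
            "class EmailAutomation:\n    pass\n"
        )
        return split_dag_folder(root)


if __name__ == "__main__":
    dags, others = demo()
    print("DAG candidates:", dags)
    print("Candidates to move into the package:", others)
```

Files in the second list are the ones this ADR proposes moving into osdu_airflow. A helper module that happens to mention both strings would still land in the first list, so the output is a starting point for review rather than a definitive classification.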
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Scope
The scope of this ADR includes the following:
- Restructuring the Directed Acyclic Graph (DAG) folder so that it contains only DAG files.
- Moving the Python files associated with the DAGs to the osdu_airflow package and accessing them from there.
Current DAG Structure:
To-Be Structure:
DAG FOLDER:
Python Package structure in the OSDU_AIRFLOW:
Implementation:
- Keep all the Python packages in the osdu_airflow repository within the folder structure.
- Create a package registry for the osdu_airflow library using the CI/CD pipeline:
  - Create a branch in the repository (https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/tree/master/).
  - Push the code to the branch.
  - The CI/CD pipeline runs and creates the dev package registry.
- Within the dev environment, refer to the package registry version under the required libraries from the Python Package Index (PyPI). E.g.:
Technical Changes Required:
- Update Import Statements:
  - Modify the import statements to import the required Python files from the osdu_airflow package instead of the previous directory structure.
    E.g.: from osdu_airflow.eds.eds_scheduler.eds_email_automation import EmailAutomation
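One way to find the import statements that still need updating is a small static check over each DAG file. The sketch below is illustrative only: it assumes the legacy imports start with the DAG sub-folder names (eds_scheduler, eds_ingest), which is a guess based on the paths listed earlier, and the function name is hypothetical.

```python
import ast

# Assumption: legacy imports begin with the old DAG sub-folder names.
LEGACY_ROOTS = {"eds_scheduler", "eds_ingest"}


def find_legacy_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for absolute imports using a legacy root."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports (level > 0) are skipped in this sketch.
            modules = [node.module] if node.level == 0 else []
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        for module in modules:
            if module.split(".")[0] in LEGACY_ROOTS:
                hits.append((node.lineno, module))
    return sorted(hits)


# Hypothetical "before" line and the "after" line given in this ADR.
OLD_SOURCE = "from eds_scheduler.eds_email_automation import EmailAutomation\n"
NEW_SOURCE = (
    "from osdu_airflow.eds.eds_scheduler.eds_email_automation "
    "import EmailAutomation\n"
)

if __name__ == "__main__":
    print(find_legacy_imports(OLD_SOURCE))
    print(find_legacy_imports(NEW_SOURCE))
```

Running a check like this across the DAG folder after the move would list any DAG file that still points at the old directory structure.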
Changes/Impacts of Restructuring:
- Code Split: EDS code other than the DAG files will be moved from the current repository (https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/tree/master/src/dags) to the new one (https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/tree/master/). Within open forum
- Deployment of the module will be challenging because adding the package registry version to the dev environment is currently a manual job. The existing CI/CD pipeline may need changes to automate the deployment.
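As a rough illustration of what one automated step might look like, the sketch below updates a version pin in requirements-style text. It is not the existing pipeline: the package name osdu-airflow, the version numbers, and the assumption that the dev environment reads pins in "name==version" form are all placeholders chosen for the example.

```python
import re


def pin_package(requirements: str, package: str, version: str) -> str:
    """Replace an existing 'package==...' pin, or append one if none exists."""
    pattern = re.compile(rf"^{re.escape(package)}\s*==.*$", re.MULTILINE)
    pinned = f"{package}=={version}"
    if pattern.search(requirements):
        # A callable replacement avoids escape handling in the version string.
        return pattern.sub(lambda _match: pinned, requirements)
    separator = "" if not requirements or requirements.endswith("\n") else "\n"
    return f"{requirements}{separator}{pinned}\n"


if __name__ == "__main__":
    # Placeholder package name and versions, for illustration only.
    current = "requests==2.31.0\nosdu-airflow==0.1.0\n"
    print(pin_package(current, "osdu-airflow", "0.2.0"))
```

A pipeline job could call a helper like this with the version it has just published and then apply the result to the dev environment; how that final step works depends on the hosting platform and is outside this sketch.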