## Overview ##
This is a living page that will grow over time. Its intent is to offer the community recommendations and best practices for developing Ingestion DAGs ([Directed Acyclic Graphs](https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#dags)). There is often more than one right answer, so please treat this page as a set of recommendations for consideration rather than a prescription of what you should do. Ultimately, your use cases should drive your development and deployment approach.
## Airflow Considerations ##
At the time of writing, [Apache Airflow](https://airflow.apache.org/) 1.10.x is the designated workflow engine for OSDU. Airflow lets DAG authors write DAGs once and have them run on any OSDU implementation. This section covers some of the steps you can take to optimize Airflow; the Airflow documentation also offers its own optimization guidance - see [here](https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-can-my-airflow-dag-run-faster).
- If possible, consider upgrading to Airflow 2.0. Airflow 2.0 addresses lessons learned from Airflow 1.x, such as support for multiple schedulers, which enables high availability.
- Configure your Airflow infrastructure to remain running at all times (rather than scaling down to zero Airflow instances when no jobs are running). This minimizes the startup delay when a new request arrives and the Airflow infrastructure would otherwise have to spin up.
- Increase the configured payload size for the webserver, since the DAG run configuration passed in with each request is persisted in the `dag_run` table.
- Increase the maximum XCom size by swapping out the backend database. See [here](https://marclamberti.com/blog/airflow-xcom/) for additional XCom options.
- Increase the `max_active_runs_per_dag` parameter (see more details [here](https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-can-my-airflow-dag-run-faster)).
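
Concurrency can be raised in `airflow.cfg` or via the equivalent environment variable; the value below is an illustrative starting point, not a recommendation, and the right number depends on your workload and infrastructure.

```ini
[core]
# Allow more concurrent runs of each DAG (the default is 16).
# Equivalent environment variable: AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
max_active_runs_per_dag = 32
```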
## DAG Development ##
**Kubernetes Pod Operators**
Airflow is a workflow orchestrator - not a data processing engine (e.g., Spark, Flink, or Beam). As such, its runtime is not designed to scale dynamically with large jobs. While Airflow does support executing workflows entirely within its own runtime resources, it is recommended to use the [Kubernetes Pod Operator](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html) to execute the processing logic your DAG needs. Workflow orchestration still occurs within the Airflow runtime, but the actual processing happens outside it, ensuring that your operator logic does not overwhelm the Airflow instance.
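
A minimal sketch of this pattern, assuming an Airflow 1.10.x deployment with Kubernetes support; the namespace, image name, and arguments are placeholders, not OSDU conventions:

```python
# Sketch: delegate heavy processing to a Kubernetes pod so Airflow
# stays a pure orchestrator. Requires an Airflow 1.10.x environment.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(dag_id="ingest_example",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:

    process = KubernetesPodOperator(
        task_id="process_manifest",
        name="process-manifest",
        namespace="airflow",                             # placeholder namespace
        image="example.registry/osdu/processor:latest",  # placeholder image
        arguments=["--manifest", "{{ dag_run.conf.get('manifest_id', '') }}"],
        get_logs=True,                 # stream pod logs into the task log
        is_delete_operator_pod=True,   # clean up the pod when the task finishes
    )
```

The pod's resource requests and limits can then be sized for the processing job independently of the Airflow workers.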
**Batching**
Ingestion is ultimately about storing data in the OSDU<sup>TM</sup> data platform. Some APIs have built-in protection mechanisms to prevent overloading the system. The Storage Service is one such service: its API allows at most [500 records per call](https://community.opengroup.org/osdu/platform/system/storage/-/blob/master/storage-core/src/main/java/org/opengroup/osdu/storage/api/RecordApi.java#L77) to store data. When writing operators that interact with the Storage Service, account for this limitation and batch your writes accordingly.
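
Batching can be sketched as follows; `put_records` stands in for whatever client call your operator uses to invoke the Storage Service (it is a hypothetical name, not an OSDU SDK function):

```python
# Sketch of batching writes to the Storage Service. The 500-record
# cap comes from the Storage Service's RecordApi.
STORAGE_MAX_RECORDS_PER_CALL = 500  # enforced by the Storage Service API


def batch(records, size=STORAGE_MAX_RECORDS_PER_CALL):
    """Yield successive chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def store_records(records, put_records):
    """Write `records` via `put_records` (hypothetical client call),
    one batch of at most 500 at a time. Returns the number of calls made."""
    calls = 0
    for chunk in batch(records):
        put_records(chunk)  # e.g., an authenticated PUT to the Storage Service
        calls += 1
    return calls
```

Failed batches can then be retried individually without re-sending the whole payload.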
**Logging and Reporting**
At the time of writing, there is no unified, platform-wide approach to logging and reporting for asynchronous processes. In most cases, the provider has an underlying logging framework that supports aggregation and dashboarding. Airflow also has a web UI that provides access to the logs generated during a DAG run. However, neither of these solutions is user-friendly: in Airflow, each DAG operator, or task, has its own logging area, so for a DAG with multiple operators a user has to inspect the logs for each step to understand what happened. To overcome this limitation, the short-term solution is to write a summary logging outcome to XCom, enabling users to view the results in one place regardless of where in the DAG the activity occurred.
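
One way to build such a summary is sketched below; the function name and dictionary keys are illustrative, not an OSDU standard:

```python
# Sketch: collapse per-task outcomes into one summary dict suitable for
# XCom, so users can read a single value instead of per-task logs.
def build_ingestion_summary(task_results):
    """`task_results` maps task_id -> {"records": int, "errors": [str, ...]}.

    Returns one DAG-level outcome aggregating all tasks.
    """
    errors = []
    records = 0
    for task_id, result in task_results.items():
        records += result.get("records", 0)
        errors.extend("%s: %s" % (task_id, e) for e in result.get("errors", []))
    return {
        "status": "failed" if errors else "success",
        "records_stored": records,
        "errors": errors,
    }

# In the DAG's final task, the summary would then be published, e.g.:
#   ti.xcom_push(key="ingestion_summary", value=build_ingestion_summary(results))
```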
**Standardization**
Where possible, place business logic code with reusable characteristics in a common DAG library so other community members can use it. Also consider whether the functionality you're putting in a DAG operator would be better implemented as an OSDU Data Platform service.
There are also recommended version standards for development. See this [ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/74).