## DAG Development ##

**Lightweight DAG**

The Airflow scheduler parses every DAG file at each heartbeat and executes all top-level code in the file. Keep the top level of a DAG file lightweight and delegate heavy computation to the DAG's operators, which run only when their tasks are executed.

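
The difference can be illustrated without Airflow itself. In this minimal pure-Python sketch (`build_reference_data` is a hypothetical expensive step), code at module level runs every time the scheduler imports the file, while code handed to an operator as a callable runs only at task execution time:

```python
import time


def build_reference_data():
    """Hypothetical expensive step, e.g. a slow API call or a file scan."""
    time.sleep(0.01)  # stand-in for the heavy work
    return {"loaded": True}


# Bad: module-level call, executed on every scheduler parse of the DAG file.
# reference_data = build_reference_data()

# Good: pass the callable to an operator so it runs only when the task does,
# e.g. PythonOperator(task_id="load", python_callable=build_reference_data).
result = build_reference_data()  # what the operator would do at run time
```
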

**Lightweight XCom**

Every XCom push writes a message to the Airflow metadata database, with serialization and deserialization of the data, which adds overhead. Prefer using XComs to pass a path (or other reference) to data kept in external storage rather than passing large payloads between tasks.

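
A minimal sketch of the pattern, with the XCom exchange simulated by plain return values and a local temp directory standing in for shared storage (all names are illustrative):

```python
import json
import tempfile
from pathlib import Path


def produce(records):
    """Upstream task: write the large payload to shared storage and return
    only its path; this small string is what would go into XCom."""
    path = Path(tempfile.mkdtemp()) / "records.json"
    path.write_text(json.dumps(records))
    return str(path)


def consume(path):
    """Downstream task: pull the path from XCom and load the data itself."""
    return json.loads(Path(path).read_text())


reloaded = consume(produce([{"id": 1}, {"id": 2}]))
```
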

**Kubernetes Pod Operators**

Airflow is a workflow orchestrator, not a data processing engine (e.g., Spark, Flink, Beam). As such, its runtime is not designed to dynamically scale with large jobs. While Airflow does support executing workflows entirely within its runtime resources, it is recommended to use the [Kubernetes Pod Operator](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html) to execute the processing logic your DAG needs. The workflow orchestration still occurs within the Airflow runtime, but the actual processing runs outside it, ensuring that your operator logic does not overwhelm the Airflow instance.

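
A configuration sketch of the pattern, assuming the `apache-airflow-providers-cncf-kubernetes` package is installed; the image name, namespace, and DAG id are placeholders, and on Airflow 1.10.x the operator lives at `airflow.contrib.operators.kubernetes_pod_operator` instead:

```python
from datetime import datetime

from airflow import DAG
# Import path varies by provider version; newer releases use
# airflow.providers.cncf.kubernetes.operators.pod.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="heavy_processing",           # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    process = KubernetesPodOperator(
        task_id="process_data",
        name="process-data",
        namespace="airflow",                    # assumed namespace
        image="my-registry/processor:latest",   # hypothetical image
        cmds=["python", "process.py"],
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod after the run
    )
```

The pod carries the processing workload on the cluster's resources, while the Airflow task merely launches it and streams its logs.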

**Logging**

At the time of writing, there was no unified platform approach to logging and reporting for asynchronous processes. In most cases, the provider has an underlying logging framework that supports aggregation and dashboarding. Airflow also has a console UI that provides access to the logs generated during the DAG run. However, neither of these solutions is user-friendly: each DAG operator, or task, has its own logging area, so for a DAG with multiple operators a user has to look into the logs for each step to understand what happened. To overcome this limitation, the short-term solution is to write a summary logging outcome to XCom, enabling users to view the results in one place regardless of where in the DAG the activity occurred.

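
One way to sketch that summary (pure Python; in a real DAG a final task would push the dict via `ti.xcom_push`, and the helper name is illustrative):

```python
def summarize_run(step_results):
    """Aggregate per-operator outcomes into a single summary dict that a
    final task could push to XCom for one-stop inspection."""
    failed = [s for s in step_results if not s["ok"]]
    return {
        "succeeded": [s["step"] for s in step_results if s["ok"]],
        "failed": [s["step"] for s in failed],
        "errors": {s["step"]: s.get("error") for s in failed},
    }


summary = summarize_run([
    {"step": "validate", "ok": True},
    {"step": "enrich", "ok": False, "error": "schema not found"},
])
# summary groups every operator's outcome in one place
```
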

**Caching**

Use caching tools to cache responses from the Schema and Search services, to avoid requesting the same data repeatedly.

Plain Python `@lru_cache` has limits here: every task runs in its own process and does not share memory with other tasks, so the cache cannot be shared between parallel tasks. Consider a dedicated caching tool (e.g., Redis) to share cached values across tasks.

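
The in-process behaviour is easy to demonstrate (the service call is simulated by a counter; the kind string is illustrative). The limitation is that each Airflow task process starts with its own cold cache, which is what an external store such as Redis avoids:

```python
from functools import lru_cache

calls = {"n": 0}


@lru_cache(maxsize=128)
def fetch_schema(kind):
    """Stand-in for a request to the Schema service."""
    calls["n"] += 1
    return {"kind": kind}


fetch_schema("osdu:wks:dataset--File.Generic:1.0.0")
fetch_schema("osdu:wks:dataset--File.Generic:1.0.0")  # served from cache
# calls["n"] is 1 here: within one process the repeat call is free, but a
# second task process would start with an empty cache and hit the service again.
```
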

**Standardization**

Where possible, place business logic with reusable characteristics in a common DAG library for other community members to use. Also, consider whether the functionality you're placing in a DAG operator is better off as an OSDU Data Platform service.


There are also recommended versioning standards for development; see this [ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/74).


**Useful External References**

* Astronomer guide for Airflow DAG authoring - [DAG Writing Best Practices in Apache Airflow](https://www.astronomer.io/guides/dag-best-practices)
