Airflow is a workflow orchestrator, not a data processing engine (such as Spark, Flink, or Beam). As such, its runtime is not designed to scale dynamically with large jobs. While Airflow does support executing workflows entirely within its own runtime resources, it is recommended to use the [Kubernetes Pod Operator](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html) to execute the processing logic your DAG needs. Workflow orchestration still occurs within the Airflow runtime, but the actual processing runs outside it, ensuring that your operator logic does not overwhelm the Airflow instance.
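As a minimal sketch of this pattern, the DAG below delegates its processing step to a pod via `KubernetesPodOperator`. It assumes the `apache-airflow-providers-cncf-kubernetes` package is installed; the DAG id, namespace, and container image are placeholders, not names from the platform.

```python
from datetime import datetime

from airflow import DAG
# Import path for the cncf.kubernetes provider; older provider/Airflow
# versions expose this operator under a different module path.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="manifest_processing",            # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,                  # triggered by the Workflow Service
    catchup=False,
) as dag:
    # The heavy lifting runs in its own pod, outside the Airflow workers;
    # Airflow only orchestrates and collects the pod's logs.
    process = KubernetesPodOperator(
        task_id="process_manifest",
        name="process-manifest",
        namespace="airflow",                                  # placeholder
        image="example.registry/manifest-processor:latest",   # placeholder
        get_logs=True,
    )
```

Only the orchestration state (task status, retries, logs) stays in Airflow; the pod's resource requests can be sized for the job without affecting the scheduler or workers.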
**Passing by Reference**
Testing has indicated that passing payloads to Airflow by reference offers better performance than passing payloads by value. Using Manifest Ingestion as an example: rather than passing the contents of the manifest to the Workflow Service, first store the manifest content in OSDU (e.g., using the Dataset Service), then pass the record id to the Workflow Service, which passes it to your DAG, which in turn reads the manifest content. This minimizes the overhead of serializing/deserializing content via XCom.
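The contrast can be sketched as follows. The helper names (`store_manifest`, `build_trigger_payload`) and the record id format are illustrative stand-ins for the actual Dataset Service and Workflow Service clients, not platform APIs.

```python
def store_manifest(manifest: dict) -> str:
    """Store the manifest via the Dataset Service and return its record id.

    Stubbed here: a real implementation would call the Dataset Service API.
    """
    return "opendes:dataset--File.Generic:manifest-123"  # placeholder id


def build_trigger_payload_by_value(manifest: dict) -> dict:
    # Anti-pattern: the entire manifest travels through the Workflow
    # Service and XCom, paying serialization cost at every hop.
    return {"executionContext": {"manifest": manifest}}


def build_trigger_payload_by_reference(manifest: dict) -> dict:
    # Preferred: store the manifest first, then pass only the record id.
    # The DAG fetches the content itself when it actually needs it.
    record_id = store_manifest(manifest)
    return {"executionContext": {"manifest_record_id": record_id}}
```

The by-reference payload stays a few bytes regardless of manifest size, so XCom never has to serialize the full document.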
**Batching**
Ingestion is ultimately about storing data in the OSDU<sup>TM</sup> data platform. Some APIs have built-in protection mechanisms to prevent overloading the system. The Storage Service is one such service: its API accepts at most [500 records per call](https://community.opengroup.org/osdu/platform/system/storage/-/blob/master/storage-core/src/main/java/org/opengroup/osdu/storage/api/RecordApi.java#L77). When writing operators that interact with the Storage Service, keep this limitation in mind and batch your writes accordingly.
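A simple chunking helper is enough to respect this limit. The sketch below assumes a `storage_put` callable wrapping the Storage Service's record-put endpoint; that wrapper name is hypothetical, while the 500-record limit comes from the API linked above.

```python
from typing import Callable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # Storage Service per-request record limit


def batched(records: List[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Yield successive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def put_records_in_batches(
    storage_put: Callable[[List[dict]], None],
    records: List[dict],
) -> None:
    # Issue one Storage Service call per batch instead of one oversized
    # call that the API would reject.
    for batch in batched(records):
        storage_put(batch)
```

For example, 1,200 records become three calls of 500, 500, and 200 records.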