Orchestration and Ingestion Services workflow engine
OpenDES requires an ingestion service to ingest data and enrich its metadata. The Orchestration and Ingestion flow:
- Provides an ingestion flow which unifies OSDU R1 and DELFI Data Ecosystem requirements
- Refactors the orchestration policy from the DELFI Data Ecosystem application code into a workflow engine
- Under review
Context & Scope
We need a mechanism to support agile ingestion workflow development which supports different data types and enrichment requirements, and satisfies the following criteria:
- Handle metadata ingestion without loading the actual data
- Provide support of sequential and parallel tasks execution
- Have state persistence
- Support both sync and async operations
- Compensation logic (try/catch)
The Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run, rather than size of the data each task is processing.
- Have an admin/ops dashboard
- Provide resume option for a failed job
- Is available on each of the OSDU cloud providers
Given all that, it's suggested to use Airflow as the main engine, since it's open-source, can be deployed into k8s cluster, has a large community, and is on each of the OSDU cloud providers.
The scope of this decision is to agree on the implementation
When to revisit
Tradeoff Analysis - Input to decision
Without a workflow engine, workflow policy must be built into the services themselves, as it is in the DELFI Data Ecosystem; this has implications for development agility.
Without a cross-platform workflow engine, there will be different implementations of workflow policy and the actual tasks themselves (which is where the real E&P value/knowledge is embedded); this also has implications for OSDU development agility.
Alternatives and implications
The Orchestration and Ingestion flow proposes and trials an Apache Airflow approach.
At a high-level, the fundamental choice is between a solution which is as cloud native as possible on each of the cloud providers, and a solution which is agnostic of the infrastructure in which its hosted and orchestrated. As described in the system principles it is inevitable that conflicts will exist amongst many of the principles. The intent is to make informed trade-off decisions, on a service by service basis, on when to use a portable solution vs. accommodating solutions that are optimized for cost, performance and reliability, and when possible, select an Open Source API that has Cloud optimized implementations as a means of balancing portability and optimization.
Apache Airflow is an example of such an Open Source solution.
- Spark: not well suited to thin services
- Apache nifi: more ETL focused
- BPEL and BPMN: sparse open-source support
- Zeebe BPMN-based Workflow Engine for Microservices Orchestration; version 0.20, the first “production-ready” release. Once a node goes down, the entire cluster goes onto crash loop and never came up itself.
- Argo: YAML rather than Python-based, proposed by Raj
Decision criteria and tradeoffs
Manifest-only ingestion support
Airflow supports ingestion of a manifest without data, eg. to enrich already ingested data.
- Manifest-based file ingestion and single-file ingestion are in scope for R2
- Implementing a DAG for ingesting a manifest without data is outside R2 scope
As noted above, the Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run, rather than size of the data each task is processing.
Apache Airflow runs on K8s.
- Adobe Experience Platform is scaling to 1000(s) of concurrent Airflow workflows.
There are a number of Executors from which each cloud provider can choose; Kubernetes Executor dynamically delegates work and resources; for each and every task that needs to run, the Executor talks to the Kubernetes API to dynamically launch an additional Pod, each with its own Scheduler and Webserver, which it terminates when that task is completed:
- In times of high traffic, you can scale up
- In times of low traffic, you can scale to near-zero
- For each individual task, you can configure:
- Memory allocation
- Service accounts
- Airflow image
In addition, each cloud vendor may have their own managed service capability for scaling; capabilities will depend on the vendor.
Cloud native support
There are two aspects to this:
- Support of the Ingestion services and DAGS: this will be the same support model as for the rest of OSDU
- Support of Apache Airflow: in Google's case there is cloud native support for Airflow; AWS provides support through Bitnami, as does Microsoft Azure.
Integration with other cloud native provider capabilities
The Ingestion SDK uses a similar model to the rest of OSDU to provide cloud-provider SPIs to hook in Azure Storage, S3, GCS, IBM Object Storage, etc.
Authorization: provider IDP integration
The Ingestion services authorization is provided in the same ways as other OSDU services. Authorization today is very basic in OSDU; this may evolve as OSDU adopts more of the Data Ecosystem authorization model. There's also an outstanding decision around auth for long-running services, and we have the opportunity to shape that.
Airflow web console supports LDAP providers such as Active Directory for authentication.
Observability and audit
Airflow supports StatsD for metrics
- MS has contributed to the Application Insights StatsD backend, but it appears not to be active.
- The Workflow /getStatus endpoint supports tracking the status of a workflow instance.
- File Service status tracking is available through the audit endpoint (code here); this enables the DAG or client to compare the list of files uploaded to the list of files submitted and take appropriate action for a missing file; the action may differ depending on the file type, so this isn't expressly addressed in the service.
- The Stale Jobs Scheduler is intended to identify jobs which haven't finished; this DAG is out of scope for R2.
- Airflow provides pipeline tracking dashboards.
- The statuses of DAG sub-processes can be bubbled up.
- You can also define Airflow trigger rules to take actions such as retry when a task fails.