Orchestration and Ingestion Services workflow engine

OpenDES requires an ingestion service to ingest data and enrich its metadata. The Orchestration and Ingestion flow:

Provides an ingestion flow which unifies OSDU R1 and DELFI Data Ecosystem requirements
Refactors the orchestration policy from the DELFI Data Ecosystem application code into a workflow engine

Status

Context & Scope

We need a mechanism to support agile ingestion workflow development which supports different data types and enrichment requirements, and satisfies the following criteria:

Functional:

Handle metadata ingestion without loading the actual data
Provide support of sequential and parallel tasks execution
Have state persistence
Support both sync and async operations
Compensation logic (try/catch)

Scaling

The Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run, rather than size of the data each task is processing.

Operational:

Have an admin/ops dashboard
Provide resume option for a failed job
Is available on each of the OSDU cloud providers

Given all that, it's suggested to use Airflow as the main engine, since it's open-source, can be deployed into k8s cluster, has a large community, and is on each of the OSDU cloud providers.

The scope of this decision is to agree on the implementation

Decision

Rationale

Consequences

When to revisit

Tradeoff Analysis - Input to decision

Without a workflow engine, workflow policy must be built into the services themselves, as it is in the DELFI Data Ecosystem; this has implications for development agility.

Without a cross-platform workflow engine, there will be different implementations of workflow policy and the actual tasks themselves (which is where the real E&P value/knowledge is embedded); this also has implications for OSDU development agility.

Alternatives and implications

The Orchestration and Ingestion flow proposes and trials an Apache Airflow approach.

At a high-level, the fundamental choice is between a solution which is as cloud native as possible on each of the cloud providers, and a solution which is agnostic of the infrastructure in which its hosted and orchestrated. As described in the system principles it is inevitable that conflicts will exist amongst many of the principles. The intent is to make informed trade-off decisions, on a service by service basis, on when to use a portable solution vs. accommodating solutions that are optimized for cost, performance and reliability, and when possible, select an Open Source API that has Cloud optimized implementations as a means of balancing portability and optimization.

Apache Airflow is an example of such an Open Source solution.

Open-source alternatives

Spark: not well suited to thin services
Apache nifi: more ETL focused
BPEL and BPMN: sparse open-source support
- Zeebe BPMN-based Workflow Engine for Microservices Orchestration; version 0.20, the first “production-ready” release. Once a node goes down, the entire cluster goes onto crash loop and never came up itself.
Argo: YAML rather than Python-based, proposed by Raj

Decision criteria and tradeoffs

Manifest-only ingestion support

Airflow supports ingestion of a manifest without data, eg. to enrich already ingested data.

Manifest-based file ingestion and single-file ingestion are in scope for R2
Implementing a DAG for ingesting a manifest without data is outside R2 scope

Scalability

As noted above, the Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run, rather than size of the data each task is processing.

Apache Airflow runs on K8s.

Adobe Experience Platform is scaling to 1000(s) of concurrent Airflow workflows.

There are a number of Executors from which each cloud provider can choose; Kubernetes Executor dynamically delegates work and resources; for each and every task that needs to run, the Executor talks to the Kubernetes API to dynamically launch an additional Pod, each with its own Scheduler and Webserver, which it terminates when that task is completed:

In times of high traffic, you can scale up
In times of low traffic, you can scale to near-zero
For each individual task, you can configure:
- Memory allocation
- Service accounts
- Airflow image

In addition, each cloud vendor may have their own managed service capability for scaling; capabilities will depend on the vendor.

Cloud native support

There are two aspects to this:

Support of the Ingestion services and DAGS: this will be the same support model as for the rest of OSDU
Support of Apache Airflow: in Google's case there is cloud native support for Airflow; AWS provides support through Bitnami, as does Microsoft Azure.

Integration with other cloud native provider capabilities

The Ingestion SDK uses a similar model to the rest of OSDU to provide cloud-provider SPIs to hook in Azure Storage, S3, GCS, IBM Object Storage, etc.

Authorization: provider IDP integration

The Ingestion services authorization is provided in the same ways as other OSDU services. Authorization today is very basic in OSDU; this may evolve as OSDU adopts more of the Data Ecosystem authorization model. There's also an outstanding decision around auth for long-running services, and we have the opportunity to shape that.

Airflow web console supports LDAP providers such as Active Directory for authentication.

Observability and audit

Airflow supports StatsD for metrics
- MS has contributed to the Application Insights StatsD backend, but it appears not to be active.
The Workflow /getStatus endpoint supports tracking the status of a workflow instance.
File Service status tracking is available through the audit endpoint (code here); this enables the DAG or client to compare the list of files uploaded to the list of files submitted and take appropriate action for a missing file; the action may differ depending on the file type, so this isn't expressly addressed in the service.
The Stale Jobs Scheduler is intended to identify jobs which haven't finished; this DAG is out of scope for R2.
Airflow provides pipeline tracking dashboards.
The statuses of DAG sub-processes can be bubbled up.
You can also define Airflow trigger rules to take actions such as retry when a task fails.

Decision timeline

Edited Jun 01, 2020 by Ferris Argyle

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information