
Orchestration and Ingestion Services workflow engine

OpenDES requires an ingestion service to ingest data and enrich its metadata. The Orchestration and Ingestion flow:

  • Provides an ingestion flow which unifies OSDU R1 and DELFI Data Ecosystem requirements
  • Refactors the orchestration policy from the DELFI Data Ecosystem application code into a workflow engine

Status

  • Proposed
  • Trialing
  • Under review
  • Approved
  • Retired

Context & Scope

We need a mechanism to support agile ingestion workflow development that handles different data types and enrichment requirements and satisfies the following criteria:

Functional:

  • Handle metadata ingestion without loading the actual data
  • Provide support for sequential and parallel task execution
  • Have state persistence
  • Support both sync and async operations
  • Provide compensation logic (try/catch)

Scaling

The Ingestion flow acts as a scheduler/orchestrator: actual data movement and compute are a function of the tasks/services it orchestrates, so scaling is a function of the number of tasks it can run rather than the size of the data each task processes.

Operational:

  • Have an admin/ops dashboard
  • Provide resume option for a failed job
  • Be available on each of the OSDU cloud providers

Non-functional:

  • Performance: TBD
  • Scalability: TBD
  • Security: TBD

Given all that, it's suggested to use Airflow as the main engine, since it's open source, can be deployed into a Kubernetes (k8s) cluster, has a large community, and is available on each of the OSDU cloud providers.

The scope of this decision is to agree on the implementation.

Decision

We agreed to move forward with Airflow to support the OSDU Ingestion Framework.

Rationale

We need to move forward with understanding and implementing OSDU ingestion and enrichment workflows. The trade-off analysis to date has focused on the suitability and supportability of various workflow technologies by cloud providers, but has been under-informed by the nature of:

  • what these workflows look like
  • the scope of re-usability for parser & enrichment services
  • the scope of re-usability for the workflow declaration language itself

Selecting a system will allow us to detail workflows and remove a lot of the ambiguity about the above.
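
For illustration, the following is a minimal sketch of what such a workflow declaration could look like as an Airflow DAG, written against 1.10-era imports; the DAG id, task names, and callables are placeholders, not the actual OSDU DAGs.

```python
# Purely illustrative sketch of a sequential ingest-and-enrich workflow as an
# Airflow DAG (1.10-era imports). Task names and callables are placeholders,
# not the actual OSDU R2 DAG definitions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def validate_manifest():
    """Placeholder: validate the incoming manifest against its schema."""


def ingest_metadata():
    """Placeholder: persist metadata records via the Storage service."""


def enrich_metadata():
    """Placeholder: enrich the newly ingested records."""


with DAG(
    dag_id="osdu_ingest_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # triggered on demand rather than on a schedule
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate_manifest", python_callable=validate_manifest)
    ingest = PythonOperator(task_id="ingest_metadata", python_callable=ingest_metadata)
    enrich = PythonOperator(task_id="enrich_metadata", python_callable=enrich_metadata)

    # Sequential execution; parallel fan-out would list several downstream tasks instead.
    validate >> ingest >> enrich
```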

Consequences

Choosing an Open Source technology such as Airflow will likely satisfy functional requirements for OSDU; however, it will create integration and support costs for the CSPs.

When to revisit

End of R3 development, when we have more concrete information about the functional and non-functional suitability of the technology, as well as more information regarding where technology abstractions can be introduced.



Tradeoff Analysis - Input to decision

Without a workflow engine, workflow policy must be built into the services themselves, as it is in the DELFI Data Ecosystem; this has implications for development agility.

Without a cross-platform workflow engine, there will be different implementations of workflow policy and the actual tasks themselves (which is where the real E&P value/knowledge is embedded); this also has implications for OSDU development agility.

Alternatives and implications

The Orchestration and Ingestion flow proposes and trials an Apache Airflow approach.

At a high level, the fundamental choice is between a solution that is as cloud native as possible on each of the cloud providers and a solution that is agnostic of the infrastructure in which it is hosted and orchestrated. As described in the system principles, it is inevitable that conflicts will exist amongst many of the principles. The intent is to make informed trade-off decisions, on a service-by-service basis, about when to use a portable solution vs. when to accommodate solutions that are optimized for cost, performance, and reliability, and, where possible, to select an Open Source API that has cloud-optimized implementations as a means of balancing portability and optimization.

Apache Airflow is an example of such an Open Source solution.

Open-source alternatives

Decision criteria and tradeoffs

Manifest-only ingestion support

Airflow supports ingestion of a manifest without data, e.g. to enrich already-ingested data.

  • Manifest-based file ingestion and single-file ingestion are in scope for R2
  • Implementing a DAG for ingesting a manifest without data is outside R2 scope
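
As a hedged sketch of manifest-only ingestion, a DAG run can be handed a manifest through the trigger payload (dag_run.conf) and enrich already-ingested records without moving any data; the "manifest"/"records" payload shape below is an assumption, not the OSDU manifest schema.

```python
# Illustrative only: a DAG that reads a manifest from the trigger payload
# (dag_run.conf) and enriches already-ingested records without moving data.
# The "manifest"/"records" payload shape is an assumption, not the OSDU schema.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def enrich_from_manifest(**context):
    # dag_run.conf carries whatever JSON the caller supplied when triggering the run.
    manifest = (context["dag_run"].conf or {}).get("manifest", {})
    for record in manifest.get("records", []):
        print("would enrich record", record.get("id"))


with DAG(
    dag_id="osdu_manifest_only_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="enrich_from_manifest",
        python_callable=enrich_from_manifest,
        provide_context=True,  # needed on Airflow 1.10.x to receive the context kwargs
    )
```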

Scalability

As noted above, the Ingestion flow acts as a scheduler/orchestrator, so scaling is a function of the number of tasks it can run rather than the size of the data each task processes.

Apache Airflow runs on K8s.

There are a number of Executors from which each cloud provider can choose. The Kubernetes Executor dynamically delegates work and resources: for each task that needs to run, it talks to the Kubernetes API to launch a dedicated worker Pod, which it terminates when the task is completed:

  • In times of high traffic, you can scale up
  • In times of low traffic, you can scale to near-zero
  • For each individual task, you can configure:
    • Memory allocation
    • Service accounts
    • Airflow image
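
A minimal sketch of such per-task overrides, assuming the pod_override form of executor_config available in newer Airflow releases; the resource requests, service-account name, and worker image are placeholders.

```python
# Illustrative per-task overrides for the Kubernetes Executor via executor_config
# (inside a DAG definition). The memory/CPU requests, service account, and worker
# image are placeholders, not OSDU settings.
from kubernetes.client import models as k8s

from airflow.operators.python_operator import PythonOperator

enrich = PythonOperator(
    task_id="enrich_metadata",
    python_callable=lambda: None,  # stand-in for the real enrichment callable
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                service_account_name="ingestion-sa",  # per-task service account
                containers=[
                    k8s.V1Container(
                        name="base",  # the worker container Airflow launches for the task
                        image="example/airflow-worker:latest",  # per-task Airflow image
                        resources=k8s.V1ResourceRequirements(
                            requests={"memory": "1Gi", "cpu": "500m"},  # per-task allocation
                        ),
                    )
                ],
            )
        )
    },
)
```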

In addition, each cloud vendor may have its own managed-service capability for scaling; capabilities will depend on the vendor.

Cloud native support

There are two aspects to this:

  • Support of the Ingestion services and DAGs: this will be the same support model as for the rest of OSDU
  • Support of Apache Airflow: in Google's case there is cloud-native support for Airflow (Cloud Composer); AWS provides support through Bitnami; Airflow is supported on Azure through containerization (Bitnami is not recommended). More on the Airflow deployment on Azure here

Integration with other cloud native provider capabilities

The Ingestion SDK uses a similar model to the rest of OSDU to provide cloud-provider SPIs to hook in Azure Storage, S3, GCS, IBM Object Storage, etc.
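
As a purely illustrative sketch of that SPI pattern (the interface and method names below are hypothetical, not the Ingestion SDK's actual contract), each cloud provider would supply its own implementation of a small provider-neutral interface:

```python
# Purely illustrative: the kind of storage SPI a cloud provider could implement.
# The class and method names are hypothetical, not the Ingestion SDK's API.
from abc import ABC, abstractmethod


class BlobStoreSPI(ABC):
    """Provider-neutral hook for object storage (Azure Storage, S3, GCS, ...)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the object stored at `path`."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store `data` at `path`."""


class S3BlobStore(BlobStoreSPI):
    """Hypothetical AWS implementation wired in through provider configuration."""

    def __init__(self, bucket: str):
        self.bucket = bucket

    def read(self, path: str) -> bytes:
        raise NotImplementedError("sketch only: call boto3 here")

    def write(self, path: str, data: bytes) -> None:
        raise NotImplementedError("sketch only: call boto3 here")
```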

Authorization: provider IDP integration

Authorization for the Ingestion services is provided in the same way as for other OSDU services. Authorization today is very basic in OSDU; this may evolve as OSDU adopts more of the Data Ecosystem authorization model. There is also an outstanding decision around auth for long-running services, and we have the opportunity to shape that. (We also need a facility for IDP integration into AAD on Azure, as it is not supported natively or through configuration.)

The Airflow web console supports LDAP providers such as Active Directory for authentication, but only through code integration.
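
As a hedged sketch of that integration, the RBAC web UI's Flask AppBuilder security layer can be pointed at an LDAP/Active Directory server from webserver_config.py; the server address, bind credentials, attribute, and role below are placeholders.

```python
# Illustrative webserver_config.py fragment for LDAP/Active Directory auth on
# the Airflow RBAC UI (Flask AppBuilder). Host, bind DN, password, and search
# base are placeholders; an Azure AD integration would still need extra work.
from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ad.example.com"            # placeholder AD endpoint
AUTH_LDAP_SEARCH = "dc=example,dc=com"                # placeholder search base
AUTH_LDAP_BIND_USER = "cn=airflow,dc=example,dc=com"  # placeholder bind DN
AUTH_LDAP_BIND_PASSWORD = "change-me"                 # placeholder credential
AUTH_LDAP_UID_FIELD = "sAMAccountName"                # typical AD username attribute
AUTH_USER_REGISTRATION = True                         # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Viewer"                # placeholder default role
```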

Observability and audit

  • Airflow supports StatsD for metrics
  • The Workflow /getStatus endpoint supports tracking the status of a workflow instance.
  • File Service status tracking is available through the audit endpoint (code here); this enables the DAG or client to compare the list of files uploaded to the list of files submitted and take appropriate action for a missing file; the action may differ depending on the file type, so this isn't expressly addressed in the service.
  • The Stale Jobs Scheduler is intended to identify jobs which haven't finished; this DAG is out of scope for R2.
  • Airflow provides pipeline tracking dashboards.
  • The statuses of DAG sub-processes can be bubbled up.
  • You can also define Airflow trigger rules to take actions such as retry when a task fails.
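
A minimal sketch of the kind of failure handling the last two bullets describe, combining per-task retries with a trigger-rule-driven follow-up task; the DAG id, task names, and callables are placeholders.

```python
# Illustrative only: retries plus a trigger-rule-driven follow-up step.
# Task ids and callables are placeholders, not the OSDU DAG definitions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def ingest_metadata():
    """Placeholder for the real ingestion step."""


def notify_failure():
    """Placeholder: mark the workflow run as failed / raise an alert."""


with DAG(
    dag_id="osdu_failure_handling_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="ingest_metadata",
        python_callable=ingest_metadata,
        retries=3,                            # retry transient failures
        retry_delay=timedelta(minutes=5),
    )
    on_failure = PythonOperator(
        task_id="notify_failure",
        python_callable=notify_failure,
        trigger_rule=TriggerRule.ONE_FAILED,  # runs only if an upstream task failed
    )
    ingest >> on_failure
```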

Decision timeline

R2
