Data Workflow as a Core Service
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
Today, the ingestion framework consists of two services: Ingestion and Ingestion Workflow. The Ingestion Workflow service was originally intended to be a simple wrapper around Airflow. However, Airflow will likely be used for both ingestion and enrichment data operations, so its domain is not limited to ingestion. In addition, things like a UserID should not be passed as a fixed parameter, as this creates application security (AppSec) issues. The proposal is to create a generic Data Workflow service as part of the Core Services: a true, simple wrapper around Airflow that does not prescribe parameters such as UserID.
To be clear, this is a very minor change to the service's name and parameters, made so that the service aligns more closely with the intent of being a simple wrapper around Airflow.
Going forward, a Data Workflow may be invoked directly (user action) or triggered by OSDU data platform events (enrichment use cases). There is no reason a single Data Workflow service cannot serve both ingestion and enrichment use cases.
For a more detailed description, please see the Webex recordings.
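As a rough illustration of the intent, the sketch below shows what a generic trigger call might look like, with the caller's identity carried by the bearer token rather than a UserID parameter. This is a minimal sketch only; the base URL, endpoint path, header names, and payload fields are assumptions, not the actual service contract (the code in GitLab defines that).

    # Hypothetical sketch of triggering a generic Data Workflow run.
    # Endpoint path, payload fields, and names below are assumptions for
    # illustration only; the service simply wraps Airflow's DAG-trigger call.
    import requests

    OSDU_BASE = "https://osdu.example.com"   # assumed base URL
    TOKEN = "<bearer-token>"                 # identity comes from the token,
                                             # not from a UserID parameter

    def start_workflow(workflow_name, execution_context):
        """Ask the Data Workflow service to start the named DAG."""
        response = requests.post(
            f"{OSDU_BASE}/api/data-workflow/v1/startWorkflow",   # assumed path
            headers={
                "Authorization": f"Bearer {TOKEN}",
                "data-partition-id": "opendes",                  # assumed partition
            },
            json={
                "workflowName": workflow_name,          # generic: no UserID here
                "executionContext": execution_context,  # DAG-specific parameters
            },
        )
        response.raise_for_status()
        return response.json()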
Key Use Cases
CSV
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the CSV file using the cloud provider's SDK for its object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the CSV data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the CSV file as an input parameter; this will execute the “CSV ingestion DAG”. That DAG is not there yet, but, with a little help from SLB, I think we can get it there (a rough flow is sketched after this list)
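A minimal end-to-end sketch of the CSV flow above follows. Every endpoint path, field name, and the DAG name are assumptions for illustration; the File DMS, Dataset Registry, and Data Workflow code in GitLab defines the actual contracts.

    # Hypothetical end-to-end sketch of the CSV use case described above.
    import requests

    OSDU_BASE = "https://osdu.example.com"   # assumed base URL
    HEADERS = {
        "Authorization": "Bearer <token>",
        "data-partition-id": "opendes",      # assumed partition
    }

    # 1. Ask the File DMS for an upload location (e.g. a signed URL).
    upload = requests.get(
        f"{OSDU_BASE}/api/file/v1/getLocation", headers=HEADERS).json()

    # 2. Upload the CSV bytes to the cloud provider's object storage.
    #    (A signed-URL PUT stands in for the provider SDK used in practice.)
    with open("wells.csv", "rb") as f:
        requests.put(upload["SignedURL"], data=f)

    # 3. Register a Dataset for the uploaded file and keep the record ID.
    dataset = requests.put(
        f"{OSDU_BASE}/api/dataset-registry/v1/registerDataset",
        headers=HEADERS,
        json={"datasetRegistries": [{"resourceLocation": upload["FileID"]}]},
    ).json()
    record_id = dataset["datasetRegistries"][0]["id"]

    # 4. Trigger the "CSV ingestion DAG" through the Data Workflow service,
    #    passing only the Dataset Registry Record ID in the execution context.
    requests.post(
        f"{OSDU_BASE}/api/data-workflow/v1/startWorkflow",
        headers=HEADERS,
        json={"workflowName": "csv_ingestion",
              "executionContext": {"datasetRegistryRecordId": record_id}},
    )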
WITSML
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the WITSML file using the cloud provider's SDK for its object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the WITSML data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the WITSML file as an input parameter; this will execute the “WITSML ingestion DAG”. That DAG is not there yet, but, with a little help from Energistics, I think we can get it there (a skeleton is sketched after this list)
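Since the WITSML ingestion DAG does not exist yet, the following is a hypothetical Airflow skeleton of what it might look like once triggered by the Data Workflow service. The DAG ID, operator, and execution-context field are placeholders, not a design commitment.

    # Hypothetical skeleton of the not-yet-written "WITSML ingestion DAG".
    # The real DAG would be built with Energistics' input and the OSDU
    # Data Operators once available.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def parse_witsml(**context):
        # Placeholder: fetch the dataset by its Dataset Registry Record ID
        # (passed in the workflow's execution context) and parse the WITSML file.
        record_id = context["dag_run"].conf.get("datasetRegistryRecordId")
        print(f"Would parse WITSML dataset {record_id} here")

    with DAG(
        dag_id="witsml_ingestion",
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,      # triggered by the Data Workflow service
        catchup=False,
    ) as dag:
        PythonOperator(task_id="parse_witsml", python_callable=parse_witsml)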
SegY
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the SegY file using the cloud provider's SDK for its object storage
- Register the Dataset (for the SegY File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- Store the Manifest file (describing the SegY file) using the cloud provider's SDK for its object storage
- Register the Dataset (for the Manifest File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the SegY data using the Data Workflow service (code available in GitLab), which will execute the “SegY Manifest ingestion DAG”; we already have this DAG, based on the EPAM data loader script (the trigger call is sketched after this list)
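The trigger call for this case might look like the sketch below; compared with the CSV and WITSML cases, the execution context carries two Dataset Registry Record IDs, one for the SegY file and one for its manifest. The endpoint, field names, and workflow name are again assumptions for illustration.

    # Hypothetical sketch of the SegY trigger call: the execution context
    # carries the record IDs of both the SegY file and its manifest.
    import requests

    requests.post(
        "https://osdu.example.com/api/data-workflow/v1/startWorkflow",  # assumed
        headers={"Authorization": "Bearer <token>",
                 "data-partition-id": "opendes"},
        json={
            "workflowName": "segy_manifest_ingestion",   # EPAM-based DAG
            "executionContext": {
                "segyDatasetRegistryRecordId": "<segy-record-id>",
                "manifestDatasetRegistryRecordId": "<manifest-record-id>",
            },
        },
    )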
Decision
- Adopt this new Data Workflow service (code is already on GitLab)
- Deprecate the Ingestion Workflow service
- Update the Ingestion service, at some point, to integrate with the Data Workflow service.
Rationale
The biggest reason to make this change is to better focus on and manage the life-cycles of the DAGs and Data Operators that will ultimately be deployed on Airflow within the OSDU data platform.
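To make "Data Operator" concrete, the following is a hypothetical example of the kind of small, reusable Airflow operator whose versioning and deployment the Data Workflow framework would manage; the class name and behaviour are illustrative only.

    # Hypothetical example of a "Data Operator": a reusable Airflow operator
    # whose life-cycle the Data Workflow framework would manage.
    from airflow.models.baseoperator import BaseOperator

    class UpdateIngestionStatusOperator(BaseOperator):
        """Report a workflow run's status back to an (assumed) OSDU status endpoint."""

        def __init__(self, status, **kwargs):
            super().__init__(**kwargs)
            self.status = status

        def execute(self, context):
            run_id = context["dag_run"].run_id
            # Placeholder: call the status API here instead of just logging.
            self.log.info("Run %s reached status %s", run_id, self.status)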
Consequences
We cannot have an API that requires passing a UserID; this must be addressed. We also need to shift our focus to industrializing the management of DAGs and Data Operators. If we do not focus this effort by establishing a Data Workflow framework based on a single Airflow management service, we will never be able to properly manage the life-cycles of DAGs or Data Operators. This is a huge risk and, if not addressed, could erode the OSDU data platform value proposition.
Tradeoff Analysis - Input to decision
Decision criteria and tradeoffs
The need for managing the life-cycle of DAGs and Data Operators should be clear. This is a very simple approach intended to put the management of data workflow components on the right trajectory.
Decision timeline
The longer it takes to shift the focus, the more technical debt we will incur.