Data Workflow as a Core Service
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
Today, the ingestion framework consists of two services: Ingestion and Ingestion Workflow. The Ingestion Workflow service was originally intended to be a simple wrapper around Airflow. However, Airflow will likely be used for both ingestion and enrichment data operations, so its domain is not limited to ingestion. In addition, things like a UserID should not be passed as a fixed parameter, as this creates application security (AppSec) issues. The proposal is to create a generic Data Workflow service as part of the Core Services: a true, simple wrapper around Airflow that does not prescribe parameters such as UserID.
To be clear, this is a very minor change to the service's name and parameters, made so that the service aligns more closely with the intent of being a simple wrapper around Airflow.
Going forward, a Data Workflow may be invoked directly (user action) or triggered by OSDU data platform events (enrichment use cases). There is no reason a single Data Workflow service cannot serve both ingestion and enrichment use cases.
For a more detailed description, please see the Webex recordings.
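As a rough illustration of the intent, the sketch below shows what a generic trigger call might look like, with the caller's identity carried by the bearer token rather than a UserID parameter. This is a minimal sketch only; the base URL, endpoint path, header names, and payload fields are assumptions, not the actual service contract (the code in GitLab defines that).

    # Hypothetical sketch of triggering a generic Data Workflow run.
    # Endpoint path, payload fields, and names below are assumptions for
    # illustration only; the service simply wraps Airflow's DAG-trigger call.
    import requests

    OSDU_BASE = "https://osdu.example.com"   # assumed base URL
    TOKEN = "<bearer-token>"                 # identity comes from the token,
                                             # not from a UserID parameter

    def start_workflow(workflow_name, execution_context):
        """Ask the Data Workflow service to start the named DAG."""
        response = requests.post(
            f"{OSDU_BASE}/api/data-workflow/v1/startWorkflow",   # assumed path
            headers={
                "Authorization": f"Bearer {TOKEN}",
                "data-partition-id": "opendes",                  # assumed partition
            },
            json={
                "workflowName": workflow_name,          # generic: no UserID here
                "executionContext": execution_context,  # DAG-specific parameters
            },
        )
        response.raise_for_status()
        return response.json()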
Key Use Cases
CSV
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the CSV file using the cloud provider's SDK for its object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the CSV data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the CSV file as an input parameter; this will execute the “CSV ingestion DAG”. That DAG is not there yet, but, with a little help from SLB, I think we can get it there (a rough flow is sketched after this list)
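A minimal end-to-end sketch of the CSV flow above follows. Every endpoint path, field name, and the DAG name are assumptions for illustration; the File DMS, Dataset Registry, and Data Workflow code in GitLab defines the actual contracts.

    # Hypothetical end-to-end sketch of the CSV use case described above.
    import requests

    OSDU_BASE = "https://osdu.example.com"   # assumed base URL
    HEADERS = {
        "Authorization": "Bearer <token>",
        "data-partition-id": "opendes",      # assumed partition
    }

    # 1. Ask the File DMS for an upload location (e.g. a signed URL).
    upload = requests.get(
        f"{OSDU_BASE}/api/file/v1/getLocation", headers=HEADERS).json()

    # 2. Upload the CSV bytes to the cloud provider's object storage.
    #    (A signed-URL PUT stands in for the provider SDK used in practice.)
    with open("wells.csv", "rb") as f:
        requests.put(upload["SignedURL"], data=f)

    # 3. Register a Dataset for the uploaded file and keep the record ID.
    dataset = requests.put(
        f"{OSDU_BASE}/api/dataset-registry/v1/registerDataset",
        headers=HEADERS,
        json={"datasetRegistries": [{"resourceLocation": upload["FileID"]}]},
    ).json()
    record_id = dataset["datasetRegistries"][0]["id"]

    # 4. Trigger the "CSV ingestion DAG" through the Data Workflow service,
    #    passing only the Dataset Registry Record ID in the execution context.
    requests.post(
        f"{OSDU_BASE}/api/data-workflow/v1/startWorkflow",
        headers=HEADERS,
        json={"workflowName": "csv_ingestion",
              "executionContext": {"datasetRegistryRecordId": record_id}},
    )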
WITSML
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the WITSML file using the cloud provider's SDK for its object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the WITSML data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the WITSML file as an input parameter; this will execute the “WITSML ingestion DAG”. That DAG is not there yet, but, with a little help from Energistics, I think we can get it there (a skeleton is sketched after this list)
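Since the WITSML ingestion DAG does not exist yet, the following is a hypothetical Airflow skeleton of what it might look like once triggered by the Data Workflow service. The DAG ID, operator, and execution-context field are placeholders, not a design commitment.

    # Hypothetical skeleton of the not-yet-written "WITSML ingestion DAG".
    # The real DAG would be built with Energistics' input and the OSDU
    # Data Operators once available.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def parse_witsml(**context):
        # Placeholder: fetch the dataset by its Dataset Registry Record ID
        # (passed in the workflow's execution context) and parse the WITSML file.
        record_id = context["dag_run"].conf.get("datasetRegistryRecordId")
        print(f"Would parse WITSML dataset {record_id} here")

    with DAG(
        dag_id="witsml_ingestion",
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,      # triggered by the Data Workflow service
        catchup=False,
    ) as dag:
        PythonOperator(task_id="parse_witsml", python_callable=parse_witsml)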
SegY
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the SegY file using the cloud provider's SDK for its object storage
- Register the Dataset (for the SegY File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- Store the Manifest file (describing the SegY file) using the cloud provider's SDK for its object storage
- Register the Dataset (for the Manifest File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the SegY data using the Data Workflow service (code available in GitLab), which will execute the “SegY Manifest ingestion DAG”; we already have this DAG, based on the EPAM data loader script (the trigger call is sketched after this list)
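The trigger call for this case might look like the sketch below; compared with the CSV and WITSML cases, the execution context carries two Dataset Registry Record IDs, one for the SegY file and one for its manifest. The endpoint, field names, and workflow name are again assumptions for illustration.

    # Hypothetical sketch of the SegY trigger call: the execution context
    # carries the record IDs of both the SegY file and its manifest.
    import requests

    requests.post(
        "https://osdu.example.com/api/data-workflow/v1/startWorkflow",  # assumed
        headers={"Authorization": "Bearer <token>",
                 "data-partition-id": "opendes"},
        json={
            "workflowName": "segy_manifest_ingestion",   # EPAM-based DAG
            "executionContext": {
                "segyDatasetRegistryRecordId": "<segy-record-id>",
                "manifestDatasetRegistryRecordId": "<manifest-record-id>",
            },
        },
    )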
Decision
- Adopt this new Data Workflow service (code is already on GitLab)
- Deprecate the Ingestion Workflow service
- Update the Ingestion service, at some point, to integrate with the Data Workflow service.
Rationale
The biggest reason to make this change is to better focus on and manage the life-cycles of the DAGs and Data Operators that will ultimately be deployed on Airflow within the OSDU data platform.
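To make "Data Operator" concrete, the following is a hypothetical example of the kind of small, reusable Airflow operator whose versioning and deployment the Data Workflow framework would manage; the class name and behaviour are illustrative only.

    # Hypothetical example of a "Data Operator": a reusable Airflow operator
    # whose life-cycle the Data Workflow framework would manage.
    from airflow.models.baseoperator import BaseOperator

    class UpdateIngestionStatusOperator(BaseOperator):
        """Report a workflow run's status back to an (assumed) OSDU status endpoint."""

        def __init__(self, status, **kwargs):
            super().__init__(**kwargs)
            self.status = status

        def execute(self, context):
            run_id = context["dag_run"].run_id
            # Placeholder: call the status API here instead of just logging.
            self.log.info("Run %s reached status %s", run_id, self.status)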
Consequences
We cannot have an API that requires passing a UserID; this must be addressed. We also need to shift our focus to industrializing the management of DAGs and Data Operators. If we do not focus this effort by establishing a Data Workflow framework based on a single Airflow management service, we will never be able to properly manage the life-cycles of DAGs or Data Operators. This is a huge risk and, if not addressed, could erode the OSDU data platform value proposition.
Tradeoff Analysis - Input to decision
Decision criteria and tradeoffs
The need for managing the life-cycle of DAGs and Data Operators should be clear. This is a very simple approach intended to put the management of data workflow components on the right trajectory.
Decision timeline
The longer it takes to shift the focus, the more technical debt we will incur.