Data Workflow Service
Architectural principles often stay at a high, abstract level, leaving a gap between them and the actual implementation details. This document attempts to bridge that gap by providing a set of Lightweight Architecture Decision Records (LADRs) that are simple to follow and can be implemented by the developers on a given team or project.
Decision Title
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The Data Workflow service is similar in concept to the Ingestion Workflow service, but extends the workflow abstraction to encompass many use cases rather than just ingestion.
- The “Start Workflow” endpoint no longer requires mapping a WorkflowType and DataType to a DAG, or mapping a user to a workflow. The WorkflowType/DataType-to-DAG mapping used by the Ingestion Workflow service makes it harder to tell which DAG will actually be run, and it required directly editing data to create an ingestion strategy, which is difficult to set up and manage. It also scoped workflows to the user level, so workflows could not be managed across users. Here is the change in request bodies:
From:
{
  "WorkflowType": "ingest",
  "DataType": "opaque",
  "Context": {}
}
To:
{
  "dagName": "dag_name",
  "inputParameters": {
    "datasetRegistryIds": []
  }
}
This new body makes it possible to execute workflows directly by passing in whatever inputs the DAG needs, including dataset registry ids to use with the Dataset service. It does require that the application kicking off a workflow know the DAG name corresponding to its workflow, but this makes it easier for developers to know and control what is actually happening with workflows. A call sketch is shown after this list.
- While building out the External Data Services (EDS) functionality, we found that Airflow’s scheduling of DAGs has two shortcomings:
- You can only have one schedule per DAG.
- Schedules on DAGs cannot change without modifying the code and re-deploying those DAGs.
Therefore, we need an OSDU service to leverage advanced scheduling capabilities, wrapping Airflow’s DAGs in a cloud provider’s scheduling technology. In the attached swagger doc, the following endpoints have been added to support scheduling:
* Create Workflow Schedule
* Get Workflow Schedules
* Delete Workflow Schedule
* List Workflow Schedules
The schedule will use cron formatting to ensure a common standard is followed.
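For illustration, here is a minimal sketch of kicking off a workflow with the new request body. The base URL, the /startWorkflow route, and the header names used here are assumptions for the example; the authoritative contract is the attached swagger.

import requests

# Assumed values for illustration only; the real base URL, route, and headers
# come from the attached swagger and the deployment environment.
DATA_WORKFLOW_URL = "https://osdu.example.com/api/data-workflow/v1"

def start_workflow(dag_name, input_parameters, access_token, data_partition_id):
    """Kick off a DAG directly by name, passing whatever inputs it needs."""
    body = {
        "dagName": dag_name,
        "inputParameters": input_parameters,
    }
    headers = {
        "Authorization": f"Bearer {access_token}",
        "data-partition-id": data_partition_id,
        "Content-Type": "application/json",
    }
    # /startWorkflow is an assumed route; check the swagger for the actual path.
    response = requests.post(f"{DATA_WORKFLOW_URL}/startWorkflow",
                             json=body, headers=headers)
    response.raise_for_status()
    return response.json()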
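Along the same lines, a hedged sketch of the Create Workflow Schedule call, reusing the assumed base URL and headers above. The /workflowSchedule route and the field names scheduleName, cronExpression, and workflowStartRequest are illustrative assumptions; the actual schema is defined in the attached swagger.

def create_workflow_schedule(schedule_name, cron_expression, dag_name,
                             input_parameters, access_token, data_partition_id):
    """Wrap a DAG run in a recurring schedule defined by a cron expression."""
    body = {
        # Field names are illustrative, not the confirmed contract.
        "scheduleName": schedule_name,
        "cronExpression": cron_expression,  # e.g. "0 2 * * *" runs daily at 02:00
        "workflowStartRequest": {
            "dagName": dag_name,
            "inputParameters": input_parameters,
        },
    }
    headers = {
        "Authorization": f"Bearer {access_token}",
        "data-partition-id": data_partition_id,
        "Content-Type": "application/json",
    }
    response = requests.post(f"{DATA_WORKFLOW_URL}/workflowSchedule",
                             json=body, headers=headers)
    response.raise_for_status()
    return response.json()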
Decision
Replace the Ingestion Workflow service with the Data Workflow service.
Rationale
* Eliminate the obfuscated mapping of WorkflowType and DataType to a DAG
* Eliminate the need to directly edit data to map WorkflowType and DataType to a DAG
* Eliminate the need to tie a user’s id to a workflow
* Clearly define how kicking off a workflow in OSDU works
* Introduce more robust scheduling functionality on OSDU workflows
Consequences
Consumers of the existing Ingestion Workflow service will need to migrate to the Data Workflow service’s API. Developers will need to be aware of DAG deployments so that they can explicitly call them from their applications.
When to revisit
Tradeoff Analysis - Input to decision
Customers want the ability to create dynamic workflows in OSDU and to run them reliably in production. Additionally, EDS use cases require dynamic scheduling that is not available in Airflow or the Ingestion Workflow service.
Decision timeline
Decision ready to be made
Examples:
Start Workflow inputParameters example for a CSV file:
{
  "dagName": "csv_parser",
  "inputParameters": {
    "datasetRegistryIds": [
      "opendes:doc:0d6810c6206f40699e39277b23166f77"
    ],
    "headers": ["time & date", "some_prop1", "some_prop2"],
    "units": ["srn:osdu:unit:date", "srn:osdu:unit:some1", "srn:osdu:unit:some2"]
  }
}
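As a usage sketch, the same example body could be passed to the start_workflow helper sketched earlier; the token and partition values below are placeholders.

csv_input_parameters = {
    "datasetRegistryIds": ["opendes:doc:0d6810c6206f40699e39277b23166f77"],
    "headers": ["time & date", "some_prop1", "some_prop2"],
    "units": ["srn:osdu:unit:date", "srn:osdu:unit:some1", "srn:osdu:unit:some2"],
}
run = start_workflow("csv_parser", csv_input_parameters,
                     access_token="<access token>", data_partition_id="opendes")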
Attachments: Swagger: DataWorkflow.swagger.yaml