# Data Ingestion issues
https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues

---

## [The workflow service should provide outcomes and error messages produced by workflow instances](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/88)
*Alan Henson, 2022-08-08*

The Workflow Service is capable of launching registered workflows. Those workflows presently run inside Airflow, which is likely implemented using containers. As these workflows run, they log information to the Airflow log files. They might also implement a logging strategy that sends logs to the underlying CSP's logging framework. However, there is no programmatic way to fetch the outcomes of a workflow.
We need a uniform way to allow external processes to query information about a workflow beyond its execution status. Things that could be fetched via a run id, or some other mechanism, might include:
- Log messages
- Error messages (including partial failures)
- Workflow status (beyond the execution status - workflows might have their own managed state)
- Workflow results (e.g., records processed or stored [including record IDs], activities performed, etc.)
- Other details
This solution might require a broader perspective at the core service level, or it might be a solution unique to the Workflow Service. Either way, we need to move beyond Airflow logs for a better user experience.
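As a strawman, the payload returned for a run id might look something like the sketch below (hypothetical; `WorkflowRunOutcome` and its fields are illustrative only, not an agreed contract):

```java
import java.util.List;
import java.util.Map;

// Hypothetical response model for a "workflow outcome" query; the actual
// contract is exactly what this issue should define.
public class WorkflowRunOutcome {
    private String runId;
    private String workflowName;
    private String status;               // e.g. SUCCESS, FAILED, PARTIAL_SUCCESS
    private List<String> logMessages;    // log lines captured for this run
    private List<String> errorMessages;  // errors, including partial failures
    private List<String> recordIds;      // records processed or stored by the run
    private Map<String, Object> details; // other workflow-specific results
}
```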
Additionally, as part of this story, the Workflow Service should consider statuses that indicate partial failures - meaning part of the job was successful, but part of the job also failed (even if it failed gracefully).

---

## [Parsing error at AirflowWorkflowEngineServiceImpl class](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/87)
*Bhushan Rade, 2021-02-11*

Issue: a parsing [error](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/workflow-core/src/main/java/org/opengroup/osdu/workflow/service/AirflowWorkflowEngineServiceImpl.java#L86) occurs while converting the Airflow response JSON string into the [TriggerWorkflowResponse.java](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/workflow-core/src/main/java/org/opengroup/osdu/workflow/model/ClientResponse.java) class.
Reason: [at the time](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/workflow-core/src/main/java/org/opengroup/osdu/workflow/service/AirflowWorkflowEngineServiceImpl.java#L86) of parsing, a JSON string matching the TriggerWorkflowResponse.java format is expected, but [this line](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/workflow-core/src/main/java/org/opengroup/osdu/workflow/service/AirflowWorkflowEngineServiceImpl.java#L141) collects the Airflow response in the wrong format.
Proposed change: replace [line 141](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/workflow-core/src/main/java/org/opengroup/osdu/workflow/service/AirflowWorkflowEngineServiceImpl.java#L141) of AirflowWorkflowEngineServiceImpl.java with the following:
```java
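// Drain the HTTP entity as its raw String body so it can be parsed into
// TriggerWorkflowResponse (the format the caller expects), rather than the
// wrongly-typed entity collected today.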
.responseBody(response.getEntity(String.class));
```

---

## [Human-readable Reference Values in Manifests](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/86)
*Keith Wall, 2021-03-11*

R3 requires that reference values substitute %20 for any spaces in the value when used in a manifest, for example: "osdu:reference-data--CoordinateReferenceSystem:WGS%2084:". Previous versions of OSDU did not require the substitution. It adds work and makes QC more difficult.
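For illustration, the substitution amounts to percent-encoding the spaces before composing the record id (a minimal sketch, not prescribed by R3):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

String value = "WGS 84";
// URLEncoder uses '+' for spaces (form encoding), so swap in %20 explicitly.
String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
String id = "osdu:reference-data--CoordinateReferenceSystem:" + encoded + ":";
// id -> "osdu:reference-data--CoordinateReferenceSystem:WGS%2084:"
```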
---

## [Notification Service: assess capabilities to support workflow events](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/75)
*Alan Henson, 2021-03-03*

Assess with the core services team whether the Notification Service design and implementation for R3 is suitable for Workflow-triggered events.
Example workflow notification use cases:
- Publish workflow state with supporting payloads to trigger async downstream processing (see the sketch below)
- Indicate that errors occurred
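If the existing Notification Service proves suitable, the published message might look roughly like this (a hypothetical shape for discussion; none of the field names below are agreed):

```java
import java.util.List;
import java.util.Map;

// Hypothetical event published when a workflow changes state; downstream
// consumers subscribe to trigger async processing or to react to errors.
public class WorkflowStateChangedEvent {
    private String runId;
    private String workflowName;
    private String state;                // e.g. RUNNING, FINISHED, FAILED
    private List<String> errors;         // populated when errors occurred
    private Map<String, Object> payload; // supporting data for downstream consumers
}
```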
---

## [Need common AirFlow GitLab repo directory structure](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/21)
*Jana Schey, 2021-09-07*

Progress is hindered by the lack of a common repository structure for Airflow components, that is, code for DAGs, operators, parsers called by operators, and libraries shared by the parsers.
Currently, each cloud provider is creating its own repo, which is not synchronized with the common publicly available one. This makes it impossible to test changes made by either CSP or Energistics developers without propagating changes manually.

---

## [Support for Workflow Roles - Currently Leveraging Storage Roles](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/41)
*Matt Wise, 2020-11-18*
## Status
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
## Context & Scope
The ingestion-workflow service is currently using the common model StorageRole for authorization.
In code the following is observed:
```java
import org.opengroup.osdu.core.common.model.storage.StorageRole;
...
@PreAuthorize("@authorizationFilter.hasPermission('" + StorageRole.CREATOR + "')")
public GetStatusResponse getWorkflowStatus(@RequestBody GetStatusRequest request) {...}
@PreAuthorize("@authorizationFilter.hasPermission('" + StorageRole.CREATOR + "')")
public UpdateStatusResponse updateWorkflowStatus(@RequestBody UpdateStatusRequest request) {...}
```
Note that StorageRole.* is used for auth.
## Decision
A new role model called WorkflowRole should be created and used to assign privileges.
Sample code:
```java
public final class WorkflowRole {
public static final String VIEWER = "service.workflow.viewer";
public static final String CREATOR = "service.workflow.creator";
public static final String ADMIN = "service.workflow.admin";
}
```
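The controller annotations quoted under Context & Scope would then switch from StorageRole to the new constants, along the lines of the sketch below (the endpoint-to-role mapping and the import package are assumptions, not part of the proposal):

```java
import org.opengroup.osdu.core.common.model.workflow.WorkflowRole; // assumed package

@PreAuthorize("@authorizationFilter.hasPermission('" + WorkflowRole.VIEWER + "')")
public GetStatusResponse getWorkflowStatus(@RequestBody GetStatusRequest request) {...}

@PreAuthorize("@authorizationFilter.hasPermission('" + WorkflowRole.CREATOR + "')")
public UpdateStatusResponse updateWorkflowStatus(@RequestBody UpdateStatusRequest request) {...}
```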
## Rationale
Each individual core service should have separate roles so that entitlements can be granted to users at a finer granularity.
## Consequences
Need to change Core Common and Entitlements Service? Need Groups Support?

---

## [Ingestion Job Notification](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/33)
*Kateryna Kurach (EPAM), 2021-03-10*

This user story covers the following business scenarios:
- Ability to produce a notification regarding the status of an Ingestion Job execution and/or its parts
- Ability to notify a user via email, text, etc. (requirements have to be created here)
This functionality can be achieved by implementing a solution using the Notification Service described here: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/16, or by implementing a totally different approach.

---

## [Cloud Datasource Support - MSFT Azure](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/26)
*Kateryna Kurach (EPAM), 2022-07-27*
Pre-condition:
OSDU instance is deployed to MSFT Azure.
Cloud Storage on this diagram is a storage location outside the OSDU instance. It is located in the same MSFT Azure Cloud as the OSDU instance. The MSFT team has to create a connector for the Ingestion framework that enables the transfer and ingestion of raw files from the Cloud Storage into an OSDU instance. It is up to the MSFT team to decide what types of Azure Data Services should be supported in R3 scope (Azure Files, Azure Tables, etc.).
Acceptance Criteria:
1. Files are put into Azure Datasource
2. It is possible to configure an Ingestion workflow that will source files from the Cloud Datasource and store them in the OSDU File Storage.

---

## [\[Parsers\] Develop LAS 2.0 - Well Log Parser](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/23)
*Kateryna Kurach (EPAM), 2021-06-23*

---

## [\[Parsers\] Develop LAS Well Log Parser](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/22)
*Kateryna Kurach (EPAM), 2021-06-23*

---

## [Missing clear test strategy and test framework on Airflow DAG development](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/1)
*Wei Sun, 2021-06-14*

We need clarity on the DAG test framework (unit tests and integration tests) to ensure code quality and that no DAG is broken.

---

## [\[Data Flow/Ingestion\] Integration with centralized/global status monitoring](https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/5)
*Todd Dixon, 2021-02-11*

M1 - Release 0.1

---

## [Orchestration and Ingestion Services workflow engine](https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/23)
*Ferris Argyle, 2020-08-28*

OpenDES requires an ingestion service to ingest data and enrich its metadata. The [Orchestration and Ingestion flow](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R2-Ingestion-Workflow-Orchestration-non-Spike):
* Provides an ingestion flow which unifies OSDU R1 and DELFI Data Ecosystem requirements
* Refactors the orchestration policy from the DELFI Data Ecosystem application code into a workflow engine
## Status
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
## Context & Scope
We need a mechanism to support agile ingestion workflow development which supports different data types and enrichment requirements, and satisfies the following criteria:
### Functional:
* Handle metadata ingestion without loading the actual data
* Provide support of sequential and parallel tasks execution
* Have state persistence
* Support both sync and async operations
* Support compensation logic (try/catch); see the sketch after this list
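To make these criteria concrete, the following is a minimal, engine-agnostic Java sketch (illustrative only; in OSDU the workflow engine itself provides these guarantees) of sequential and parallel task execution with try/catch compensation:

```java
import java.util.concurrent.CompletableFuture;

public class IngestionFlowSketch {
    public static void main(String[] args) {
        try {
            validateManifest();                                   // sequential step
            CompletableFuture<Void> parse =
                    CompletableFuture.runAsync(IngestionFlowSketch::parseFiles);
            CompletableFuture<Void> enrich =
                    CompletableFuture.runAsync(IngestionFlowSketch::enrichMetadata);
            CompletableFuture.allOf(parse, enrich).join();        // parallel steps
            persistRecords();                                     // sequential step
        } catch (Exception e) {
            compensate();                                         // compensation logic (try/catch)
        }
    }

    static void validateManifest() { /* ... */ }
    static void parseFiles() { /* ... */ }
    static void enrichMetadata() { /* ... */ }
    static void persistRecords() { /* ... */ }
    static void compensate() { /* clean up partially ingested state */ }
}
```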
#### Scaling
The Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run rather than the size of the data each task is processing.
### Operational:
* Have an admin/ops dashboard
* Provide resume option for a failed job
* Is available on each of the OSDU cloud providers
### Non-functional:
* Performance
* Scalability
* Security
* TBD
Given all that, it's suggested to use Airflow as the main engine, since it's open-source, can be deployed into a k8s cluster, has a large community, and is available on each of the OSDU cloud providers.
The scope of this decision is to agree on the [implementation](/osdu/documentation/-/wikis/OSDU-(C)/Roadmap/OSDU-Release-2-Details/R2-Orchestration-and-Ingestion-non-Spike).
## Decision
We agreed to move forward with Airflow to support the OSDU Ingestion Framework.
## Rationale
We need to move forward with understanding and implementing OSDU ingestion and enrichment workflows. The trade-off analysis to date has focused on the suitability and supportability of various workflow technologies by cloud providers, but has been under-informed by the nature of:
- what these workflows look like
- the scope of re-usability for parser & enrichment services
- the scope of re-usability for the workflow declaration language itself
Selecting a system will allow us to detail workflows and remove a lot of the ambiguity about the above.
## Consequences
Choosing an Open Source technology such as Airflow will likely satisfy functional requirements for OSDU; however, it will create integration and support costs for the CSPs.
## When to revisit
End of R3 development, when we have more concrete information about the functional and non-functional suitability of the technology, as well as more information regarding where technology abstractions can be introduced.
---
# Tradeoff Analysis - Input to decision
Without a workflow engine, workflow policy must be built into the services themselves, as it is in the DELFI Data Ecosystem; this has implications for development agility.
Without a cross-platform workflow engine, there will be different implementations of workflow policy and the actual tasks themselves (which is where the real E&P value/knowledge is embedded); this also has implications for OSDU development agility.
## Alternatives and implications
The [Orchestration and Ingestion flow](/osdu/documentation/-/wikis/OSDU-(C)/Roadmap/OSDU-Release-2-Details/R2-Orchestration-and-Ingestion-non-Spike) proposes and trials an Apache Airflow approach.
At a high level, the fundamental choice is between a solution which is as cloud native as possible on each of the cloud providers, and a solution which is agnostic of the infrastructure in which it's hosted and orchestrated. As described in the [system principles](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Reference-Architecture/Principles/System-Principles), it is inevitable that conflicts will exist amongst many of the principles. The intent is to make informed trade-off decisions, on a service-by-service basis, on when to use a portable solution vs. accommodating solutions that are optimized for cost, performance, and reliability, and, when possible, to select an Open Source API that has cloud-optimized implementations as a means of balancing portability and optimization.
Apache Airflow is an example of such an Open Source solution.
### Open-source alternatives
* Spark: not well suited to thin services
* Apache NiFi: more ETL-focused
* BPEL and BPMN: sparse open-source support
* Zeebe: BPMN-based workflow engine for microservices orchestration; version 0.20 is the first “production-ready” release. [Once a node goes down, the entire cluster goes into a crash loop and never comes back up by itself](https://github.com/zeebe-io/zeebe/issues/3600).
* Argo: YAML rather than Python-based, proposed by Raj and [evaluated by Microsoft](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R3-Plug-And-Play-Workflow-For-Ingestion)
## Decision criteria and tradeoffs
### Manifest-only ingestion support
Airflow supports ingestion of a manifest without data, e.g. to enrich already-ingested data.
* Manifest-based file ingestion and single-file ingestion are in scope for R2
* Implementing a DAG for ingesting a manifest without data is outside R2 scope
### Scalability
As noted above, the Ingestion flow is acting as a scheduler/orchestrator where actual data movement and compute are a function of the tasks/services that it is orchestrating, so scaling is a function of the number of tasks it can run rather than the size of the data each task is processing.
Apache Airflow runs on K8s.
* Adobe Experience Platform is [scaling to 1000(s) of concurrent Airflow workflows](https://airflow.apache.org/use-cases/adobe/).
There are a number of Executors from which each cloud provider can choose. The Kubernetes Executor dynamically delegates work and resources: for each task that needs to run, the Executor talks to the Kubernetes API to dynamically launch an additional Pod (each with its own Scheduler and Webserver), which it terminates when that task is completed:
* In times of high traffic, you can scale up
* In times of low traffic, you can scale to near-zero
* For each individual task, you can configure:
* Memory allocation
* Service accounts
* Airflow image
In addition, each cloud vendor may have its own managed service capability for scaling; capabilities will depend on the vendor.
### Cloud native support
There are two aspects to this:
* Support of the Ingestion services and DAGs: this will be the same support model as for the rest of OSDU
* Support of Apache Airflow: in Google's case there is cloud native support for Airflow; AWS [provides support through Bitnami](https://aws.amazon.com/marketplace/pp/Bitnami-Apache-Airflow-Scheduler-Container-Solutio/B07WFKMCZ3); Airflow is supported on Azure through containerization (Bitnami is not recommended). More on the Airflow deployment on Azure [here](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R3-Plug-And-Play-Workflow-For-Ingestion)
### Integration with other cloud native provider capabilities
The Ingestion SDK uses a similar model to the rest of OSDU to provide cloud-provider SPIs to hook in Azure Storage, S3, GCS, IBM Object Storage, etc.
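As an illustration of that SPI model, a provider hook might look roughly like the following (hypothetical sketch; `ObjectStorageDriver` and its methods are not an actual OSDU interface):

```java
import java.io.InputStream;

// Hypothetical cloud-provider SPI; each CSP implementation (Azure Storage,
// S3, GCS, IBM Object Storage) would be selected at deployment time.
public interface ObjectStorageDriver {
    boolean exists(String bucket, String path);
    InputStream read(String bucket, String path);
    void write(String bucket, String path, InputStream content);
}
```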
### Authorization: provider IDP integration
The Ingestion services authorization is provided in the same ways as other OSDU services. Authorization today is very basic in OSDU; this may evolve as OSDU adopts more of the Data Ecosystem authorization model.
There's also an outstanding decision around auth for long-running services, and we have the opportunity to shape that. (We also need a facility for IDP integration into AAD on Azure, as it's not supported natively or through configuration.)
The Airflow web console [supports LDAP providers such as Active Directory](https://airflow.apache.org/docs/stable/security.html#ldap) for authentication, but only through code integration.
### Observability and audit
* [Airflow supports StatsD for metrics](https://airflow.apache.org/docs/stable/metrics.html)
* MS has contributed to the [Application Insights StatsD backend](https://github.com/microsoft/ApplicationInsights-statsd), but it appears not to be active.
* The Workflow [/getStatus endpoint](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R2-Ingestion-Workflow-Orchestration-non-Spike#getstatus) supports tracking the status of a workflow instance (see the polling sketch after this list).
* File Service status tracking is available through the [audit endpoint](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R2-Ingestion-Workflow-Orchestration-non-Spike#getfilelist-internal-not-created-for-osdu-r2) (code [here](https://github.com/google/framework-for-osdu/blob/master/osdu-r2/os-file/file-core/src/main/java/org/opengroup/osdu/file/service/FileListService.java)); this enables the DAG or client to compare the list of files uploaded to the list of files submitted and take appropriate action for a missing file; the action may differ depending on the file type, so this isn't expressly addressed in the service.
* The [Stale Jobs Scheduler](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Ingestion-and-Enrichment-Detail/R2-Ingestion-Workflow-Orchestration-non-Spike#stale-jobs-scheduler) is intended to identify jobs which haven't finished; this DAG is out of scope for R2.
* Airflow provides [pipeline tracking dashboards](https://airflow.apache.org/docs/stable/ui.html).
* The statuses of DAG sub-processes can be bubbled up.
* You can also define Airflow [trigger rules](https://airflow.apache.org/docs/stable/concepts.html#concepts-trigger-rule) to take actions such as retry when a task fails.
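For example, a client could poll the Workflow /getStatus endpoint until a run reaches a terminal state. The sketch below is a rough illustration only: the base URL is a placeholder, and the request and response shapes are assumptions based on the GetStatusRequest/GetStatusResponse models that appear elsewhere in these issues:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WorkflowStatusPoller {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://osdu.example.com/api/workflow/getStatus")) // placeholder URL
                .header("Authorization", "Bearer <token>")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"WorkflowID\":\"wf-123\"}")) // assumed body
                .build();
        while (true) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
            if (!response.body().contains("RUNNING")) break; // crude check; parse the JSON in real code
            Thread.sleep(5_000);                             // back off between polls
        }
    }
}
```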
## Decision timeline
R2