# Ingestion Workflow issues
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues

# [#83] [Validation] Dataset file or file collection has been already ingested into OSDU before ingesting its metadata
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/83 · 2021-02-26 · Kateryna Kurach (EPAM)

Validation whether a Dataset file or file collection has already been ingested into OSDU before ingesting its metadata.

Scope: Dataset

The logic for this check differs slightly depending on the type of the Dataset (File or File Collection):

- For the File type (the “AbstractFileSourceInfo” schema is used: https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/abstract/AbstractFileSourceInfo.1.0.0.json), validate that the “FileSource” parameter exists.
- For the File Collection type (the “AbstractFileCollection” schema: https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/abstract/AbstractFileCollection.1.0.0.json), perform the following validation steps:
  - Step 1: Does “IndexFilePath” exist?
    - If yes -> validation passes
    - If no -> proceed to Step 2
  - Step 2: For each file in the collection, check whether the “FileSource” parameter exists.
    - If yes -> validation passes
    - If no -> validation fails

If validation fails, reject the whole WPC this Dataset belongs to.
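
A minimal sketch of these checks, assuming the relevant parameters surface as plain dictionary keys (the exact JSON paths inside AbstractFileSourceInfo / AbstractFileCollection, and the `files` argument shape, are assumptions for illustration; only the parameter names come from the issue):

```python
def validate_file(file_source_info: dict) -> bool:
    # File type: validation passes only if "FileSource" is present.
    return "FileSource" in file_source_info

def validate_file_collection(collection: dict, files: list) -> bool:
    # Step 1: if "IndexFilePath" exists, validation passes immediately.
    if "IndexFilePath" in collection:
        return True
    # Step 2: otherwise every file in the collection must carry "FileSource".
    return all("FileSource" in f for f in files)
```

If either function returns False, the owning WPC is rejected as described above.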

# [#82] [Validation] Referential integrity between Datasets and WPC
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/82 · 2021-02-26 · Kateryna Kurach (EPAM)

Validation of referential integrity between Datasets and WPC.

Scope: Dataset, WPC

This step is needed to validate that we don’t ingest any WPCs with references to non-existing Datasets and we don’t ingest any orphan Datasets.

- All Ids (surrogate or real) of the datasets specified in the WPC “Datasets” array should correspond to the ids (surrogate or real) of records in the Manifest “Datasets” array. WPC Resources that fail this validation should be rejected.
- All Ids (surrogate or real) of the datasets specified in the Manifest “Datasets” array should be present in some WPC “Datasets” array of the WP. Dataset Resources that fail this validation should be rejected.
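
A hedged sketch of both rules; `manifest` is assumed to be the parsed Manifest dictionary, and the key layout (`Datasets`, `WorkProductComponents`, `wpc["data"]["Datasets"]`) is an illustrative assumption:

```python
def check_referential_integrity(manifest: dict):
    # Ids (surrogate or real) of all Datasets present in the Manifest.
    dataset_ids = {d["id"] for d in manifest.get("Datasets", [])}
    rejected_wpcs, referenced = [], set()
    for wpc in manifest.get("WorkProductComponents", []):
        refs = set(wpc["data"]["Datasets"])
        if not refs <= dataset_ids:
            rejected_wpcs.append(wpc)  # references a non-existing Dataset
        referenced |= refs
    # Datasets referenced by no WPC of the WP are orphans and are rejected.
    orphan_datasets = [d for d in manifest.get("Datasets", [])
                       if d["id"] not in referenced]
    return rejected_wpcs, orphan_datasets
```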

# [#81] Ability to replace surrogate-key ids before storing resource to Storage
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/81 · 2021-03-05 · Kateryna Kurach (EPAM)

The Airflow DAG will be able to replace a resource “Id” in surrogate-key format with a system-generated “Id” during ingestion.

Some details on the logic:

- Master and Reference data: replace the “id” field in the corresponding schema.
- WP ingestion:
  - Store each Dataset and obtain its system-generated id. The DAG should then replace:
    - the Dataset id in the “Datasets” array of the Manifest schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/Manifest.1.0.0.json)
    - the id values in the “Datasets” array of the WPC schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json)
  - Store each WPC and obtain its system-generated id. The DAG should then replace:
    - the “Id” value in the GenericWorkProductComponent schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json)
    - the WPC id in the “Components” array of the GenericWorkProduct schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProduct.1.0.0.json)
  - Store each Artefact; its “Id” value should replace the “ResourceId” property in the “Artefacts” array of the GenericWorkProductComponent schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json)
  - Store the WP; its “Id” value should replace the “Id” property in the GenericWorkProduct schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProduct.1.0.0.json)
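
The traversal above could look roughly like the following sketch; `store_record` is a hypothetical stand-in for the Storage service call, and the dictionary layout is assumed for illustration:

```python
def replace_surrogate_ids(manifest: dict, store_record) -> None:
    id_map = {}  # surrogate id -> system-generated id
    # Store Datasets first and remember their system-generated ids.
    for dataset in manifest["Datasets"]:
        surrogate = dataset["id"]
        dataset["id"] = store_record(dataset)
        id_map[surrogate] = dataset["id"]
    # Rewrite Dataset references in each WPC, then store the WPC.
    for wpc in manifest["WorkProductComponents"]:
        wpc["data"]["Datasets"] = [id_map[i] for i in wpc["data"]["Datasets"]]
        surrogate = wpc["id"]
        wpc["id"] = store_record(wpc)  # Artefact "ResourceId" values would be
        id_map[surrogate] = wpc["id"]  # rewritten the same way once stored
    # Finally rewrite WPC references in the WP and store it.
    wp = manifest["WorkProduct"]
    wp["data"]["Components"] = [id_map[i] for i in wp["data"]["Components"]]
    wp["id"] = store_record(wp)
```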

# [#80] Remove CSP dependencies in the main ingestion DAG (osdu_ingest)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/80 · 2021-02-11 · Kateryna Kurach (EPAM)

We need to make sure that all CSP dependencies are removed and osdu_ingest is cloud-agnostic (with the exception of the authentication module).

# [#78] Workflow Service: Update Python version
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/78 · 2021-01-26 · Alan Henson

Per this [ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/74), all Airflow and Python runtime environments across the 4 CSPs should be uniform.
AWS is on Python 3.8 but needs to revert to 3.6.x.

# [#77] Airflow Performance / Load testing
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/77 · 2021-03-23 · Kateryna Kurach (EPAM)

In conversations with Data Loading (Michaël, Ash, and others), we identified a need to develop an approach to determining performance requirements for the workflow service. Concerns have been raised, based on implementation experience, that Airflow will not properly scale under anticipated data loading demands.

I've expanded this issue to include representation from Data Loading, all 4 CSPs, and @Jane from EA. We should begin addressing this for M5 or shortly thereafter.
Initial discussions have identified two potential areas for improvement:
- Configure Airflow within the infrastructure as always-on vs. spin-up-on-demand. This approach increases cost but improves performance as it minimizes the delay in initiating a workflow.
- Introduce a throttling mechanism for workflow run requests to ensure Airflow is not overwhelmed to the point of failure by large numbers of requests (a minimal sketch follows this list).
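
As a purely illustrative sketch of the second idea (not an agreed design), a request-side throttle could be as simple as a bounded semaphore in front of the trigger call:

```python
import threading

class WorkflowRunThrottle:
    """Reject workflow-run requests beyond max_concurrent in flight."""
    def __init__(self, max_concurrent: int = 50):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_submit(self, trigger_fn, *args, **kwargs):
        # Fail fast (e.g. HTTP 429) instead of queueing until Airflow tips over.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Too many concurrent workflow runs")
        try:
            return trigger_fn(*args, **kwargs)  # e.g. the Airflow trigger call
        finally:
            self._slots.release()
```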

There are likely other performance improvements to consider. We will update this description as those are discussed.

Participants: Jane McConnell, Ash Sathyaseelan, Kishore Battula, Kateryna Kurach (EPAM), Matt Wise, Alan Henson

# [#76] Address Workflow Service updates per ADR #71
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/76 · 2021-02-10 · Alan Henson

The Workflow Service ADR contained multiple changes to the Workflow Service endpoints. One of the highest-priority items is to register and trigger a workflow by name. This issue addresses these two proposed changes of the ADR.
ADR: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/71
Spec: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/refactoring_workflow/docs/api/openapi.workflow.yaml
APIs covered by this issue:
- [POST] /v1/workflow
- [POST] /v1/workflow/{workflow_name}/workflowRun
- [PUT] /v1/workflow/{workflow_name}/workflowRun/{runId}
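
A hedged usage sketch of the register and trigger-by-name calls; the host, token, and payload fields are placeholders to be checked against the spec linked above, not confirmed values:

```python
import requests

BASE = "https://<osdu-host>/api/workflow"  # placeholder
HEADERS = {"Authorization": "Bearer <token>",
           "data-partition-id": "<partition>"}

# Register a workflow by name.
requests.post(f"{BASE}/v1/workflow", headers=HEADERS, json={
    "workflowName": "osdu_ingest",
    "description": "Manifest-based ingestion DAG",
    "registrationInstructions": {"dagName": "osdu_ingest"},
})

# Trigger a run of the registered workflow by name; the response is expected
# to carry a runId usable with the PUT status-update endpoint.
run = requests.post(f"{BASE}/v1/workflow/osdu_ingest/workflowRun",
                    headers=HEADERS, json={"executionContext": {}})
print(run.json())
```
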
Implementation complete:
- [X] AWS
- [X] GCP/EPAM
- [X] IBM
- [X] Microsoft

# [#74] ADR: Workflow Service Environment Standardization
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/74 · 2021-03-23 · Alan Henson

## Context
Providing consistent workflow runtime environments enables DAGs (Directed Acyclic Graphs) to be written once and run across any standardized workflow service environment. There are some differences in the Workflow Service environments built for R3, so we must agree on the versions of the major components of the Workflow Service to achieve standardization.
## Scope
- All Workflow Service implementations should operate with the same `major.minor` version of Airflow.
- All Workflow Service implementations should operate with the same `major.minor` Python version within Airflow.
- All Workflow Service DAG Operators should be authored to run with the same `major.minor` Python version within Airflow.
## Decision
Standardize on the following Workflow Service component versions:

| Component | Version |
| --------- | ------- |
| Airflow | 1.10.x |
| Airflow Python Runtime | 3.6.x |
| DAG Operator Python Development Version | 3.6.x |
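
Not part of the ADR, but a DAG or CI step could assert the standardized versions to surface environment drift early; a minimal sketch:

```python
import sys
import airflow

# Fail loudly if the runtime drifts from the standardized versions.
assert sys.version_info[:2] == (3, 6), f"Python {sys.version} is not 3.6.x"
assert airflow.__version__.startswith("1.10."), \
    f"Airflow {airflow.__version__} is not 1.10.x"
```
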
## Rationale
- Workflows (DAGs) written against the standard will be portable to all standardized Workflow Service runtime environments.
## Consequences
- Workflow Service implementers may have to change Airflow and Python versions and re-test developed workflows (DAGs).

# [#72] Seek information pertaining to workflow, DAG, DAG Operator, and runtime environments from all CSPs and data workflow teams
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/72 · 2021-01-20 · Alan Henson

Request made of CSP and data workflow development teams (CSV, EDS, Energistics/WITSML, Manifest):

Per today’s daily dev standup discussion, I’m requesting information regarding the environment of your workflow service, Airflow implementation, DAGs, and DAG Operators. Please fill out this table and send it back to me. I will aggregate and share with the group. We will use this as a baseline to address the next steps in unifying workflow environments, to ensure the DAGs you and your teams are writing will run across all four CSP platforms. This effort will also drive discussions for standardization.

Given some teams are on holiday through Monday of next week, please target getting this to me by next Wednesday, Jan 20th. I will remind you in the daily dev standups. If you have follow-up questions, please let me know.

For the CSV, EDS, and Energistics/WITSML teams, please disregard the questions on Airflow as I know you depend on the CSP implementation for that answer. Please address the DAG Operator and container questions where possible.

# [#71] ADR: Workflow Service - R3 Improvements
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/71 · 2021-04-15 · Dmitriy Rudko

## Context
While working with different streams, we identified several critical design issues with the Workflow service that need to be addressed in R3:
* The Workflow service is not just an `abstraction` over the orchestration engine (Airflow) but also contains OSDU-specific logic (`DataType`, `WorkflowType`, `UserType`). This logic should be moved to the Ingestion Service.
* The Workflow Service does not respect Data Partitions; users can potentially trigger any Workflow in the system.
* There is no functionality to register a new Workflow.
## Scope
- Add functionality to register new Workflows
- Add support for Data Partitions
- Remove OSDU-specific workflow functionality (`DataType`, `WorkflowType`, `UserType`) from the Workflow Service.
- Allow OSDU clients to trigger registered Workflows directly, without the Ingestion Service.
- Update the API to reflect the [Google REST API Design Guide](https://cloud.google.com/apis/design). Please see the [OpenAPI Spec](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/refactoring_workflow/docs/api/openapi.workflow.yaml) for details.
## Decision
- Accept API changes as a part of R3
- Accept Workflow > Core changes as a part of R3
- Deprecate the existing Workflow API (startWorkflow, etc.)
## Rationale
- Registration of workflows is required for E2E R3 Ingestion
- API spec is on critical path for CSV Ingestion
## Consequences
- Most of the Core logic changes will be implemented by GCP
- Will require support from CSPs, as the SPI layer will be touched.
## When to revisit
- Post R3
## Technical details:
![R3_Workflow_-_L3__Target](/uploads/75f02f3ec73ee85a95bb668dc7426df2/R3_Workflow_-_L3__Target.png)
![R3_Workflow_-_L4__Target](/uploads/03429b8474b61049b4327ae920969374/R3_Workflow_-_L4__Target.png)
### SPI Layer:
- `IWorkflowEngineService` - **Has default implementation.** Abstraction over the orchestration engine. By default we have an implementation for Airflow.
- `IWorkflowManagerService` - **Has default implementation.** Implements CRUD over the Workflow entity.
- `IWorkflowRunService` - **Has default implementation.** Implements CRUD over the Workflow Run entity.
- `IWorkflowMetadataRepository` - **Should be implemented by the CSP!** Repository for the Workflow entity.
- `IWorkflowRunRepository` - **Should be implemented by the CSP!** Repository for the Workflow Run entity.
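
An illustrative sketch of this split, rendered in Python for brevity (the actual service code may use another language); only the interface names above come from the ADR:

```python
from abc import ABC, abstractmethod

class IWorkflowMetadataRepository(ABC):
    """Persistence for the Workflow entity; each CSP supplies its own."""
    @abstractmethod
    def save(self, workflow: dict) -> dict: ...
    @abstractmethod
    def get(self, workflow_name: str) -> dict: ...

class WorkflowManagerService:
    """Default implementation: CRUD over Workflow, persistence delegated."""
    def __init__(self, repo: IWorkflowMetadataRepository):
        self._repo = repo

    def register_workflow(self, workflow: dict) -> dict:
        return self._repo.save(workflow)
```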

Milestone: M1 - Release 0.1

# [#70] Upgrading getSignedUrl api to use newly created containers for data storage
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/70 · 2021-01-12 · Aalekh Jain

**Overview**:

Currently we only have a single container where the data is stored for all the dag runs. Data sharing across tasks is done by generating SAS tokens at the container level. This gives any dag run access to the data of every other dag run as well, which is the major motivation for changing the existing infrastructure.

This will be handled by creating a storage account in which a new container is created every time a dag run is triggered. The SAS token will then be generated for the newly created container (dedicated to storing the data for this new dag run), thereby restricting access.

The behaviour will be to create new containers in the storage account when the getSignedUrl API is hit, and to generate the SAS tokens corresponding to the newly created containers.
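
A minimal sketch (not the actual service code) of this behaviour with azure-storage-blob v12; the uuid-based container name anticipates query 2 below, and the connection string, key, and expiry are placeholders:

```python
import uuid
from datetime import datetime, timedelta
from azure.storage.blob import (BlobServiceClient, ContainerSasPermissions,
                                generate_container_sas)

CONN_STR = "<storage-account-connection-string>"  # placeholder
ACCOUNT_KEY = "<storage-account-key>"             # placeholder

def get_signed_url_for_run(workflow_id: str, run_id: str) -> str:
    service = BlobServiceClient.from_connection_string(CONN_STR)
    # workflowId may contain characters that are invalid in container names,
    # so a uuid (which could be persisted against workflowId/runId) is used.
    container_name = str(uuid.uuid4())
    service.create_container(container_name)
    sas = generate_container_sas(
        account_name=service.account_name,
        container_name=container_name,
        account_key=ACCOUNT_KEY,
        permission=ContainerSasPermissions(read=True, write=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=24),
    )
    return (f"https://{service.account_name}.blob.core.windows.net/"
            f"{container_name}?{sas}")
```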

The following queries need to be resolved:

1. Currently new containers are created on the fly when we hit the getSignedUrl endpoint. What should the behaviour be if a request is repeated for the same workflowId and runId? (Of course it doesn't make sense to create a new container again.)
2. We can't use the workflowId as the name of the container (it contains special characters that are not allowed in container names). How should we go about this? Should we create the container with the runId or a uuid? Or should we map the workflowId to a new uuid, store that mapping in a database, and then create the container with this new uuid?

Notes:
1. The implementation for container creation currently resides in the service code (ingestion). Going forward, once we finalize the architectural changes, this will be moved to the azure core lib (blobStore).

cc: @kibattul

# [#68] Backup and Restore support
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/68 · 2021-06-16 · Kateryna Kurach (EPAM)

# [#67] [Master and Reference Data] Frame of Reference Handling
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/67 · 2021-06-16 · Kateryna Kurach (EPAM)

Reqs for Frame of Reference handling will be added later.

# [#66] [Master and Reference Data] Reference Data Ingestion process
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/66 · 2020-10-22 · Kateryna Kurach (EPAM) · Due: 2020-11-10

Reference Data manifest-based ingestion.

# [#65] [Master and Reference Data] Master Data Ingestion process
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/65 · 2020-10-22 · Kateryna Kurach (EPAM) · Due: 2020-11-10

Master Data manifest-based ingestion.

# [#64] [Parsers] Integrate developed SEG-Y - Seismic Parser into Ingestion Framework
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/64 · 2021-06-16 · Kateryna Kurach (EPAM)

This user story is created to track the effort needed to integrate the developed SEG-Y Seismic Parser (developed by the CGI team or some other party) into the Ingestion Framework.

# [#62] [Parsers] Integrate developed WITSML Parser into Ingestion Framework
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/62 · 2021-06-16 · Kateryna Kurach (EPAM)

A WITSML Parser has been developed by Energistics. This user story is to track the effort needed to integrate this WITSML Parser into the Ingestion Framework.

Energistics will support 4 WITSML data types: Well Log, Directional Survey, Tubulars, Well Markers. Is it 1 parser or 4 parsers?

# [#61] [Parsers] Pluggable logic of duplicated files detection - HASH calculation
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/61 · 2021-06-16 · Kateryna Kurach (EPAM)

According to the OSDU Reference Architecture, duplicated files / data should not be saved in OSDU.

A HASH code should be calculated for each ingested file / WPC, and this value should be stored in the schema. It seems that it makes sense to implement this logic on the parser side: each parser developer will calculate the HASH code for the ingested WPC.
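
A possible parser-side implementation; SHA-256 over the raw bytes is an assumption here, since the issue does not prescribe an algorithm:

```python
import hashlib

def file_hash(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Stream the file so arbitrarily large datasets hash in constant memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()  # stored on the WPC / file record in the schema
```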

# [#60] [Validation] Manifest Validation - check WPC Id for uniqueness
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/60 · 2021-06-16 · Kateryna Kurach (EPAM)

WPC Id should be unique.

2 cases should be covered:
- If a user provided an Id, it should be unique
- If the user didn't provide an Id, the system should generate a unique Id

# [#59] [Validation] [Master and Reference Data] Manifest Validation - check Master Data records
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/59 · 2021-01-29 · Kateryna Kurach (EPAM)

Validation that Master data values (SRNs) point to existing records.