Ingestion Workflow issues
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues

**Issue #97: DELETE /v1/workflow/{workflow_name} (deleteWorkflowById) fails to delete workflows that have been executed**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/97
Monalisa Srivastava · Last updated 2022-08-23

DELETE /v1/workflow/{workflow_name} (deleteWorkflowById) fails to delete a workflow that has been executed; a newly created workflow that has not been executed is deleted successfully.
Actual Result: we get the following error:

```json
{
  "timestamp": 1615385317197,
  "status": 404,
  "error": "Not Found",
  "message": "Workflow: csv_OneStep_wf doesn't exist",
  "path": "/api/workflow/v1/workflow/csv_OneStep_wf"
}
```
Expected Result: The workflow should be deleted successfully.
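The expected delete semantics can be sketched with an in-memory model. `WorkflowStore`, its fields, and its methods are hypothetical illustrations, not the service's actual implementation; the point is that a workflow with existing run records should have its run history removed and then be deleted, rather than surfacing a 404:

```python
class WorkflowStore:
    """Hypothetical sketch of the expected delete semantics."""

    def __init__(self):
        self.workflows = {}   # workflow_name -> metadata
        self.runs = {}        # workflow_name -> list of run records

    def create(self, name):
        self.workflows[name] = {"workflowName": name}

    def trigger(self, name, run_id):
        self.runs.setdefault(name, []).append({"runId": run_id})

    def delete(self, name):
        # A workflow that has been executed must still be deletable:
        # remove its run history first, then the registration itself.
        if name not in self.workflows:
            raise KeyError(f"Workflow: {name} doesn't exist")  # the 404 case
        self.runs.pop(name, None)
        del self.workflows[name]


store = WorkflowStore()
store.create("csv_OneStep_wf")
store.trigger("csv_OneStep_wf", "run-1")   # workflow has now been executed
store.delete("csv_OneStep_wf")             # should succeed, not 404
print("csv_OneStep_wf" in store.workflows)  # False
```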
Also note that although the API path uses {workflow_name}, the description says deleteWorkflowById; this should be corrected.

**Issue #94: Airflow: Performance design review**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/94
Alan Henson · Last updated 2022-08-23

R3 ingestion development work uncovered multiple performance issues with Airflow 1.10.x. Considerations for optimization range from the infrastructure for managing Airflow to approaches other than Airflow. Engage the Enterprise Architecture team to review the existing Workflow Service design using Airflow and determine if:
There are near-term and longer-term considerations. Near-term assumes R3M5/R3M6 development efforts. Longer-term provides space for new architectural considerations, such as cloud-native implementations with standardized workflows for write-once run-anywhere capabilities.
**Near-Term**
- Update Airflow infra to optimize always-available Airflow instances to minimize the lag between ingestion initiation and ingestion start (cost is secondary, though cost-optimized profiles are valid)
- Configure Airflow within the infrastructure as always-on vs. spin-up-on-demand. This approach increases cost but improves performance as it minimizes the delay in initiating a workflow.
- Introduce a throttling mechanism for workflow run requests to ensure Airflow is not overwhelmed to the point of failure with large numbers of requests (this also needs to consider the Storage Service max-records limit of 500)
- Understand what scaling capabilities the CSPs have implemented and whether those are captured as best practices
- Determine SLAs for workflows in terms of parallelism, CPU and memory consumption, etc.
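The throttling idea above can be sketched as a concurrency cap plus request batching. The class and function names here are illustrative, not part of the service; the 500-records-per-request figure is the Storage Service limit mentioned in the bullet above:

```python
import threading

STORAGE_MAX_RECORDS = 500   # Storage Service max records per request

def chunk_records(records, size=STORAGE_MAX_RECORDS):
    """Split a record list into Storage-Service-sized batches."""
    return [records[i:i + size] for i in range(0, len(records), size)]

class WorkflowThrottle:
    """Cap the number of workflow runs submitted to Airflow at once."""

    def __init__(self, max_concurrent_runs):
        self._slots = threading.BoundedSemaphore(max_concurrent_runs)

    def submit(self, trigger_fn):
        # Reject (rather than queue) when Airflow is already saturated,
        # so the scheduler is never pushed to the point of failure.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("429: too many concurrent workflow runs")
        try:
            return trigger_fn()
        finally:
            self._slots.release()

throttle = WorkflowThrottle(max_concurrent_runs=2)
print(throttle.submit(lambda: "triggered"))           # triggered
print(len(chunk_records(list(range(1200)))))          # 3 batches: 500 + 500 + 200
```

Rejecting with an HTTP-429-style error keeps back-pressure visible to callers, which matters for bulk data loading clients that can retry.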
**Longer-Term** (will break out into separate issue)
- A migration to Airflow 2.x should be considered
- What infrastructure updates could be made to support better scalability
- Determine SLAs for workflows in terms of parallelism, CPU and memory consumption, etc.
- Consider additional data processing capabilities (e.g., Apache Spark or Apache Beam)

**Issue #132: Add a version of Airflow into an endpoint 'info' for Workflow Service [GONRG-3777]**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/132
Kateryna Kurach (EPAM) · Last updated 2021-12-15

Add a version of Airflow into an endpoint 'info' for Workflow Service.
Add v1 into /api/workflow/info
Expected path:
{workflow}
/api/workflow/v1/info
Milestone: M10 - Release 0.13

**Issue #44: CSV Ingestion - Horizon 1 - Workflow Service Tasks**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/44
Stephen Whitley (Invited Expert) · Last updated 2021-06-16
- [x] Create an endpoint to create a DAG by passing a .py file.
- [ ] Ability to validate a DAG for syntactic issues: check for valid Airflow constructs; check for cyclicity in DAGs.
- [ ] Ability to save the .py file (DAG) in the Airflow /dag mount.
- [ ] Ability to check whether a DAG is successfully registered in Airflow.
- [ ] Ability to restore the old DAG in case of a DAG update.
- [ ] Ability to delete a DAG.
- [ ] Ability to view an Airflow DAG.
- [ ] Ability to trigger multiple DAGs.
- [x] Ability to trigger a DAG.
- [ ] Ability to stop a DAG run.
- [ ] Ability to pause/unpause a DAG.
- [ ] Ability to get previous executions of a DAG.
- [ ] Ability to get details of a DAG run.
- [ ] Ability to clear and re-run a failed DAG from the point where it failed.
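The "check for cyclicity in DAGs" item above can be sketched with standard depth-first-search cycle detection over task dependencies. The `deps` mapping and function are illustrative, not the service's actual validator:

```python
def has_cycle(deps):
    """deps: task_id -> list of downstream task_ids. Returns True if cyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2            # unvisited / on stack / done
    color = {task: WHITE for task in deps}

    def visit(task):
        color[task] = GRAY
        for nxt in deps.get(task, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[task] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(deps))

# A valid ingestion-style chain has no cycle; a back edge creates one.
print(has_cycle({"extract": ["transform"], "transform": ["load"], "load": []}))  # False
print(has_cycle({"extract": ["transform"], "transform": ["extract"]}))           # True
```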
**Issue #98: GET /v1/workflow/{workflow_name}/workflowRun (getAllRunInstances) doesn't respond correctly**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/98
Monalisa Srivastava · Last updated 2021-04-09

GET /v1/workflow/{workflow_name}/workflowRun (getAllRunInstances) requires params; even with a blank JSON body it returns 200 OK, but with no details in the response.
The following filters are also missing:

```java
String prefix = (String) params.get("prefix");
String startDate = (String) params.get("startDate");
String endDate = (String) params.get("endDate");
String limit = (String) params.get("limit");
```
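A sketch of how the four missing filters (prefix, startDate, endDate, limit) might be applied to run instances; the run-record fields used here are hypothetical, not the service's actual record shape:

```python
def filter_runs(runs, prefix=None, start_date=None, end_date=None, limit=None):
    """Apply the prefix/startDate/endDate/limit filters to run instances.

    `runs` is a list of dicts with hypothetical 'runId' and
    'startTimeStamp' fields.
    """
    out = [r for r in runs
           if (prefix is None or r["runId"].startswith(prefix))
           and (start_date is None or r["startTimeStamp"] >= start_date)
           and (end_date is None or r["startTimeStamp"] <= end_date)]
    return out[:int(limit)] if limit is not None else out

runs = [
    {"runId": "csv-001", "startTimeStamp": 100},
    {"runId": "csv-002", "startTimeStamp": 200},
    {"runId": "las-001", "startTimeStamp": 300},
]
print(filter_runs(runs, prefix="csv"))               # the two csv-* runs
print(filter_runs(runs, start_date=150, limit="1"))  # just csv-002
```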
**Issue #77: Airflow Performance / Load testing**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/77
Kateryna Kurach (EPAM) · Last updated 2021-03-23

In conversations with Data Loading (Michaël, Ash, and others), we identified a need to develop an approach to determining performance requirements for the workflow service. Concerns have been raised, based on implementation experience, that Airflow will not properly scale under anticipated data loading demands.

I've expanded this issue to include representation from Data Loading, all 4 CSPs, and @Jane from EA. We should begin addressing this for M5 or shortly thereafter.
Initial discussions have identified two potential areas for improvement:
- Configure Airflow within the infrastructure as always-on vs. spin-up-on-demand. This approach increases cost but improves performance as it minimizes the delay in initiating a workflow.
- Introduce a throttling mechanism for workflow run requests to ensure Airflow is not overwhelmed to the point of failure with large numbers of requests
There are likely other performance improvements to consider. We will update this description as those are discussed.

**Issue #81: Ability to replace surrogate-key ids before storing resource to Storage**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/81
Kateryna Kurach (EPAM) · Last updated 2021-03-05

The Airflow DAG will be able to replace a resource “Id” parameter in surrogate-key format with a system-generated “Id” during ingestion.
Some details on the logic:

Master and Reference data: replacement of the “id” field in the corresponding schema.

WP ingestion:

The Dataset should be stored and its system-generated id obtained. The DAG should replace:
- the Dataset id in the “Datasets” array in the Manifest schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/Manifest.1.0.0.json)
- the id values in the “Datasets” array in the WPC schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json)

The WPC should be stored and its system-generated id obtained. The DAG should replace:
- the “Id” value in the GenericWorkProductComponent schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json)
- the WPC id in the “Components” array in the GenericWorkProduct schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProduct.1.0.0.json)

The Artefact should be stored; the “Id” value should be replaced in the “ResourceId” property in the “Artefacts” array in the GenericWorkProductComponent schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProductComponent.1.0.0.json).
The WP should be stored; the “Id” value should be replaced in the “Id” property in the GenericWorkProduct schema (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/manifest/GenericWorkProduct.1.0.0.json).
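The replacement logic above can be sketched as a walk over a simplified manifest. The dict shapes, key names, and id formats below are illustrative stand-ins for the real Manifest/WPC/WP records, and `id_map` stands for the surrogate-to-system id mapping returned by the Storage Service:

```python
def replace_surrogate_ids(manifest, id_map):
    """Rewrite surrogate-key ids with system-generated ids (in place)."""
    def remap(value):
        return id_map.get(value, value)

    for dataset in manifest.get("Datasets", []):
        dataset["id"] = remap(dataset["id"])
    for wpc in manifest.get("WorkProductComponents", []):
        wpc["id"] = remap(wpc["id"])
        wpc["data"]["Datasets"] = [remap(d) for d in wpc["data"]["Datasets"]]
    wp = manifest.get("WorkProduct")
    if wp:
        wp["data"]["Components"] = [remap(c) for c in wp["data"]["Components"]]
    return manifest

manifest = {
    "Datasets": [{"id": "surrogate-key:ds-1"}],
    "WorkProductComponents": [{
        "id": "surrogate-key:wpc-1",
        "data": {"Datasets": ["surrogate-key:ds-1"]},
    }],
    "WorkProduct": {"data": {"Components": ["surrogate-key:wpc-1"]}},
}
id_map = {
    "surrogate-key:ds-1": "opendes:dataset--File.Generic:123",
    "surrogate-key:wpc-1": "opendes:work-product-component--WellLog:456",
}
replace_surrogate_ids(manifest, id_map)
print(manifest["WorkProduct"]["data"]["Components"])
# ['opendes:work-product-component--WellLog:456']
```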
**Issue #83: [Validation] Dataset file or file collection has already been ingested into OSDU before ingesting its metadata**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/83
Kateryna Kurach (EPAM) · Last updated 2021-02-26

Validate whether a Dataset file or file collection has already been ingested into OSDU before ingesting its metadata.
Scope: Dataset
The logic for this check differs slightly depending on the type of the Dataset (File or File Collection):
For the File type (schema “AbstractFileSourceInfo” is used: https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/abstract/AbstractFileSourceInfo.1.0.0.json), validation should check that the “FileSource” parameter exists.

For the File Collection type (schema “AbstractFileCollection”: https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Authoring/abstract/AbstractFileCollection.1.0.0.json), the following validation steps should be performed:

Step 1: Does “IndexFilePath” exist?
- If yes -> validation passes
- If no -> proceed to Step 2

Step 2: For each file in the collection, check whether the “FileSource” parameter exists.
- If yes -> validation passes
- If no -> validation fails
If validation fails, reject the whole WPC this Dataset belongs to.
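The two-step check can be sketched as follows. The record shape used here (a `type` discriminator, a flat `data` dict, a `Files` list) is a hypothetical simplification of the AbstractFileSourceInfo / AbstractFileCollection schemas, not the actual record layout:

```python
def validate_dataset(dataset):
    """Return True if the Dataset passes the source-info checks above."""
    data = dataset["data"]
    if dataset["type"] == "File":
        # File type: the "FileSource" parameter must exist.
        return bool(data.get("FileSource"))
    # File Collection type, Step 1: "IndexFilePath" present -> pass.
    if data.get("IndexFilePath"):
        return True
    # Step 2: every file in the collection needs a "FileSource".
    files = data.get("Files", [])
    return bool(files) and all(f.get("FileSource") for f in files)

print(validate_dataset({"type": "File",
                        "data": {"FileSource": "/r1/data/file.csv"}}))   # True
print(validate_dataset({"type": "FileCollection",
                        "data": {"Files": [{}]}}))                        # False
```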
**Issue #82: [Validation] Referential integrity between Datasets and WPC**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/82
Kateryna Kurach (EPAM) · Last updated 2021-02-26

Validation of referential integrity between Datasets and WPC.
Scope: Dataset, WPC
This step is needed to validate that we don’t ingest any WPCs with references to non-existing Datasets and we don’t ingest any orphan Datasets.
All Ids (surrogate or real ids) of the datasets specified in the WPC “Datasets” array should correspond to the ids (surrogate or real) of records in the Manifest “Datasets” array.
WPC Resources that fail this validation should be rejected.
All Ids (surrogate or real ids) of the datasets specified in the Manifest “Datasets” array should be present in any WPC “Datasets” array of the WP.
Dataset Resources that fail this validation should be rejected.
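The two referential-integrity rules above can be sketched as set comparisons; the manifest shape here is a simplified stand-in for the real Manifest records:

```python
def check_referential_integrity(manifest):
    """Return (bad_wpcs, orphan_datasets) per the two rules above.

    Dataset dicts carry 'id'; WPC dicts carry 'id' and a 'Datasets'
    list of ids (surrogate or real).
    """
    dataset_ids = {d["id"] for d in manifest.get("Datasets", [])}
    referenced = set()
    bad_wpcs = []
    for wpc in manifest.get("WorkProductComponents", []):
        refs = set(wpc["data"]["Datasets"])
        if not refs <= dataset_ids:        # reference to a missing Dataset
            bad_wpcs.append(wpc["id"])
        referenced |= refs
    orphan_datasets = sorted(dataset_ids - referenced)  # not used by any WPC
    return bad_wpcs, orphan_datasets

manifest = {
    "Datasets": [{"id": "ds-1"}, {"id": "ds-orphan"}],
    "WorkProductComponents": [
        {"id": "wpc-ok", "data": {"Datasets": ["ds-1"]}},
        {"id": "wpc-bad", "data": {"Datasets": ["ds-missing"]}},
    ],
}
print(check_referential_integrity(manifest))  # (['wpc-bad'], ['ds-orphan'])
```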
**Issue #80: Remove CSP dependencies in the main ingestion DAG (osdu_ingest)**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/80
Kateryna Kurach (EPAM) · Last updated 2021-02-11

We need to make sure that all CSP dependencies are removed and osdu_ingest is cloud-agnostic (with the exception of the authentication module).

**Issue #76: Address Workflow Service updates per ADR #71**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/76
Alan Henson · Last updated 2021-02-10

The Workflow Service ADR contained multiple changes to the Workflow Service endpoints. One of the highest-priority items is to register and trigger a workflow by name. This issue addresses these two proposed changes of the ADR.
ADR: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/71
Spec: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/refactoring_workflow/docs/api/openapi.workflow.yaml
APIs covered by this issue:
- [POST] /v1/workflow
- [POST] /v1/workflow/{workflow_name}/workflowRun
- [PUT] /v1/workflow/{workflow_name}/workflowRun/{runId}
Implementation complete:
- [X] AWS
- [X] GCP/EPAM
- [X] IBM
- [X] Microsoft

**Issue #7: Ingestion Workflow - Evaluation of Architectural Elements**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/7
Meena Rathinavel · Last updated 2021-02-09 · Milestone: M1 - Release 0.1

**Issue #72: Seek information pertaining to workflow, DAG, DAG Operator, and runtime environments from all CSPs and data workflow teams**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/72
Alan Henson · Last updated 2021-01-20

Request made of CSP and data workflow development teams (CSV, EDS, Energistics/WITSML, Manifest):
Per today’s daily dev standup discussion, I’m requesting information regarding the environment information for your workflow service, Airflow implementation, DAGs, and DAG Operators. Please fill out this table and send it back to me. I will aggregate and share with the group. We will use this as a baseline to address the next steps in unifying workflow environments to ensure the DAGs you and your teams are writing will run across all four CSP platforms. This effort will also drive discussions for standardization.
Given some teams are on holiday through Monday of next week, please target getting this to me by next Wednesday, Jan 20th. I will remind you in the daily dev standups. If you have follow-up questions, please let me know.
For the CSV, EDS, and Energistics/WITSML teams, please disregard the questions on Airflow, as I know you depend on the CSP implementation for that answer. Please address the DAG Operator and container questions where possible.

**Issue #2: Evaluate Ingestion Framework Implementation on Azure**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/2
Dania Kodeih (Microsoft) · Last updated 2020-09-01

**Issue #40: Issues with POMs in the repo (circular dependency from Core to Test-Core; POM dependencies are structured differently than other services)**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/40
Matt Wise · Last updated 2020-08-20

The POMs in this service are structured differently than other services. In other services, the parent POM contains almost no dependencies and allows the Core & Test-Core POMs to specify dependencies individually.
In addition, the Test project is tightly coupled to the build of the Core, creating a circular dependency.
In the root POM, the following is observed:
```xml
<modules>
<module>workflow-core</module>
<module>provider/workflow-azure</module>
<module>provider/workflow-gcp</module>
<!-- <module>provider/workflow-ibm</module> Fix: Missing classes-->
<module>provider/workflow-gcp-datastore</module>
<module>testing/workflow-test-core</module>
</modules>
```
Note that the module `testing/workflow-test-core` is referenced in the modules list. The test modules should know about the core modules, but not the other way around.
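One possible direction is sketched below: drop the test module from the parent's aggregated module list so it no longer participates in the Core build. This is a sketch of the intent only, not a verified fix; as noted below, the Core-to-Test coupling must be removed first, otherwise the build fails when the module is dropped.

```xml
<modules>
    <module>workflow-core</module>
    <module>provider/workflow-azure</module>
    <module>provider/workflow-gcp</module>
    <!-- <module>provider/workflow-ibm</module> Fix: Missing classes-->
    <module>provider/workflow-gcp-datastore</module>
    <!-- testing/workflow-test-core would be built separately once the
         Core -> Test-Core coupling is removed -->
</modules>
```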
If the test module is removed from the build list, the project fails to compile successfully.

**Issue #42: FOSSA NOTICE out of date**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/42
David Diederich <d.diederich@opengroup.org> · Last updated 2020-08-20 · Milestone: M1 - Release 0.1

As of ad2f1ffa, the FOSSA NOTICE file is out of date ([Job Output](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/jobs/41208)). This may be related to the recent upgrade in FOSSA version -- osdu/platform/ci-cd-pipelines!40.