Data Ingestion issueshttps://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues2023-11-20T17:58:20Zhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/15SEG-Y sdpath is read from the FileCollectionPath attribute instead of combini...2023-11-20T17:58:20ZSacha BrantsSEG-Y sdpath is read from the FileCollectionPath attribute instead of combining with FileSource## Data definition
Here are the definitions of `FileCollectionPath` and `FileSource` from [FileCollection.SEGY Data Properties](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/dataset/FileCollection.SEGY.1.0.0.md#4-table-of-filecollectionsegy-data-properties-section-abstractfilecollection)
|Cumulative Name|Description|
|---|---|
|data.DatasetProperties.FileCollectionPath|The mandatory path to the file collection. A FileCollectionPath should represent **folder level** access to a set of files.|
|data.DatasetProperties.FileSourceInfos\[\].FileSource|The location of the file. It can be a relative path. The actual access is provided via the File Service. When used in context of a FileCollection (dataset--FileCollection\*) **FileSource is a relative path** from the FileCollectionPath. It can be used by consumers to pull an individual file if they so choose by concatenating the FileCollectionPath with the FileSource. This property is required.|
## Implementation
The DAG is currently reading the `FileCollectionPath` only, assuming that it is the full `sd://` path to the SEG-Y file in Seismic DDMS.
The two relevant lines of code are in [segy_open_vds_conversion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/blob/master/osdu_airflow/operators/segy_open_vds_conversion.py#L57) and [base_metadata.py](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/blob/master/osdu_ingestion/libs/segy_conversion_metadata/base_metadata.py#L132).
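For illustration, a minimal Python sketch (a hypothetical helper, not the actual DAG code) of how the `sd://` path could be resolved from both attributes, along the lines of the fix suggested in the next section:

```python
# Hypothetical helper illustrating the suggested resolution logic; names and
# structure are assumptions, the real change would live in the DAG code linked above.
def resolve_segy_sdpath(dataset_properties: dict) -> str:
    """Resolve the sd:// path of the SEG-Y file from a FileCollection.SEGY record."""
    collection_path = dataset_properties["FileCollectionPath"]
    file_source_infos = dataset_properties.get("FileSourceInfos") or []
    if file_source_infos:
        file_source = file_source_infos[0].get("FileSource", "")
        if file_source.startswith("sd://"):
            # Already a full Seismic DDMS path: use it as is.
            return file_source
        if file_source.startswith("./"):
            # Relative path: concatenate with the collection folder.
            return collection_path.rstrip("/") + "/" + file_source[2:]
    # Fallback on the current behavior: FileCollectionPath is the full path.
    return collection_path


props = {
    "FileCollectionPath": "sd://tenant/subproject/folder",
    "FileSourceInfos": [{"FileSource": "./cube.segy"}],
}
assert resolve_segy_sdpath(props) == "sd://tenant/subproject/folder/cube.segy"
```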
## Suggested fix
- Read `FileCollection.SEGY`
- if the `FileSourceInfos` array has a `FileSource` that is relative (`startswith("./")`), then concatenate it with `FileCollectionPath`
- if the `FileSourceInfos` array has a `FileSource` that is a full path (`startswith("sd://")`), then use it as is
- fall back on the current behavior (use `FileCollectionPath`)
- Write `FileCollection.Bluware.OpenVDS`
- Modify the created record to add `data.DatasetProperties.FileSourceInfos[].FileSource`M20 - Release 0.23Deepa KumariDeepa Kumarihttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/71ADR: Workflow Service - R3 Improvements2021-04-15T12:58:17ZDmitriy RudkoADR: Workflow Service - R3 Improvements## Context
While working on different streams, we identified several critical design issues with the Workflow service that need to be addressed in R3:
* The Workflow service is not just an `abstraction` over the orchestration engine (Airflow) but also contains OSDU-specific logic (`DataType`, `WorkflowType`, `UserType`). This logic should be moved to the Ingestion Service.
* The Workflow Service does not respect Data Partitions. Users can potentially trigger any Workflow in the system.
* There is no functionality to register a new Workflow.
## Scope
- Add functionality to register new Workflows
- Add support for Data Partitions
- Remove OSDU-specific workflow functionality (`DataType`, `WorkflowType`, `UserType`) from the Workflow Service.
- Allow OSDU clients to trigger registered Workflows directly, without the Ingestion Service.
- Update the API to reflect the [Google REST API Design Guide](https://cloud.google.com/apis/design). Please see the [OpenAPI Spec](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/refactoring_workflow/docs/api/openapi.workflow.yaml) for details.
## Decision
- Accept API changes as a part of R3
- Accept Workflow > Core changes as a part of R3
- Deprecate the existing Workflow API (startWorkflow, etc.)
## Rationale
- Registration of workflows is required for E2E R3 Ingestion
- The API spec is on the critical path for CSV Ingestion
## Consequences
- Most of the Core logic changes will be implemented by GCP
- Will require support from CSPs, as the SPI layer will be touched.
## When to revisit
- Post R3
## Technical details:
![R3_Workflow_-_L3__Target](/uploads/75f02f3ec73ee85a95bb668dc7426df2/R3_Workflow_-_L3__Target.png)
![R3_Workflow_-_L4__Target](/uploads/03429b8474b61049b4327ae920969374/R3_Workflow_-_L4__Target.png)
### SPI Layer:
- `IWorkflowEngineService` - **Has default implementation.** Abstraction over the orchestration engine. By default, an implementation for Airflow is provided.
- `IWorkflowManagerService` - **Has default implementation.** Implements CRUD over the Workflow entity.
- `IWorkflowRunService` - **Has default implementation.** Implements CRUD over the Workflow Run entity.
- `IWorkflowMetadataRepository` - Should be implemented by the CSP! Repository for the Workflow entity.
- `IWorkflowRunRepository` - Should be implemented by the CSP! Repository for the Workflow Run entity.M1 - Release 0.1Dmitriy RudkoDmitriy Rudkohttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/16Dataset creation fails if there's no default legal tag on Seismic DDMS sub-pr...2023-11-20T17:58:08ZSacha BrantsDataset creation fails if there's no default legal tag on Seismic DDMS sub-projectTo reproduce this issue, create a sub-project without a default legal tag and run the DAG on a SEG-Y dataset in that sub-project.
The creation of the VDS dataset will fail.
The dataset creation should reuse the legal tag set on the SEG-Y dataset as it can differ from the default legal tag set in the sub-project.M20 - Release 0.23Deepa KumariDeepa Kumarihttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/158A custom header 'x-user-id' is used in core part2023-11-08T19:54:10ZRiabokon Stanislav(EPAM)[GCP]A custom header 'x-user-id' is used in core partI wanted to bring to your attention an issue that was identified by our GC Team while they were in the process of addressing https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/157.
org.opengroup.osdu.workflow.service.WorkflowRunServiceImpl#addUserId
```java
private Map<String, Object> addUserId(String workflowName, TriggerWorkflowRequest request) {
final Map<String, Object> executionContext = request.getExecutionContext();
if (executionContext.get(KEY_USER_ID) != null) {
String errorMessage = String.format("Request to trigger workflow with name %s failed because execution context contains reserved key 'userId'", workflowName);
throw new AppException(400, "Failed to trigger workflow run", errorMessage);
}
String userId = dpsHeaders.getUserId();
log.debug("putting user id: " + userId + " in execution context");
executionContext.put(KEY_USER_ID, userId);
return executionContext;
}
```
The current logic relies on a custom header that is primarily intended for use at an infrastructural level, as outlined in https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/52. The GC team approved an ADR with the understanding that this custom header would not be utilized within the core codebase.
However, as indicated in https://community.opengroup.org/osdu/platform/deployment-and-operations/helm-charts-azure/-/merge_requests/366, a header named 'x-user-id' is populated with data from 'x-on-behalf-of' using a specific rule. This mechanism aligns with the requirements of the CSP provider but may not be entirely suitable for the Core Part of the Workflow Service.
```lua
if (jwt_authn[msft_issuer]["appid"] == serviceAccountClientId and on_behalf_of_header ~= nil and on_behalf_of_header ~= '') then
request_handle:headers():add("x-user-id", request_handle:headers():get("x-on-behalf-of"))
else
request_handle:headers():add("x-user-id", jwt_authn[msft_issuer]["appid"])
end
```
This logic introduces **three key issues**:
- The core part of the Workflow service depends on a custom CSP header to populate the execution context, which may not be in alignment with the intended architecture.
- The Workflow service may not operate correctly without Istio and the accompanying special rule, potentially limiting its usability.
- There is a security concern in that 'x-user-id' is not currently validated on the backend side, allowing any user to set it and act under another identity.
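A rough sketch of such a spoofed request (Python, for illustration only; the trigger route and header names are assumptions inferred from this issue, see the test case described just below):

```python
# Hypothetical reproduction of the third issue: a caller authorized as user A sets
# 'x-user-id' to user B, and the workflow run is attributed to user B because the
# header is not validated on the backend side. Host and route are placeholders.
import requests

BASE_URL = "https://osdu.example.com/api/workflow/v1"  # placeholder host


def trigger_as_other_user(token: str, workflow_name: str, spoofed_user: str):
    headers = {
        "Authorization": f"Bearer {token}",   # valid token of user A
        "data-partition-id": "opendes",
        "x-user-id": spoofed_user,            # identity of user B, accepted as-is
        "Content-Type": "application/json",
    }
    return requests.post(
        f"{BASE_URL}/workflow/{workflow_name}/workflowRun",
        json={"executionContext": {}},
        headers=headers,
        timeout=30,
    )
```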
_As for the third problem_, consider the following test case:
1. A user was authorized within Workflow Service.
1. This user uses 'x-user-id' with the name of another user, resulting in the triggering of a workflow under the identity of a different user.https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/74ADR: Workflow Service Environment Standardization2021-03-23T11:45:21ZAlan HensonADR: Workflow Service Environment Standardization## Context
Providing consistent workflow runtime environments enables DAGs (Directed Acyclic Graphs) to be written once and run across any standardized workflow service environment. There are some differences in the Workflow Service environments built for R3, so we must agree on the versions of the major components of the Workflow Service to achieve standardization.
## Scope
- All Workflow Service implementations should operate with the same `major.minor` version of Airflow.
- All Workflow Service implementations should operate with the same `major.minor` Python version within Airflow.
- All Workflow Service DAG Operators should be authored to run with the same `major.minor` Python version within Airflow.
## Decision
Standardize on the following Workflow Service component versions
| Component | Version |
| --------- | ------- |
| Airflow | 1.10.x |
| Airflow Python Runtime | 3.6.x |
| DAG Operator Python Development Version | 3.6.x |
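For illustration, a minimal sketch (not part of this ADR) of a guard a DAG module could run at import time to check that it is executing on the standardized runtime from the table above:

```python
# Hypothetical runtime check; the expected versions are taken from the table above.
import sys

import airflow


def running_on_standard_runtime() -> bool:
    """Return True if the interpreter and Airflow match the agreed standard."""
    python_ok = sys.version_info[:2] == (3, 6)
    airflow_ok = airflow.__version__.startswith("1.10.")
    if not (python_ok and airflow_ok):
        print(
            "Non-standard runtime: Python %s, Airflow %s"
            % (".".join(map(str, sys.version_info[:3])), airflow.__version__)
        )
    return python_ok and airflow_ok


running_on_standard_runtime()
```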
## Rationale
- Workflows (DAGs) written against the standard will be portable to all standardized Workflow Service runtime environments.
## Consequences
- Workflow Service implementers may have to change Airflow and Python versions and re-test developed workflows (DAGs)Alan HensonAlan Hensonhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/159ADR: Implement Airflow facade endpoint2024-01-08T10:10:33ZRiabokon Stanislav(EPAM)[GCP]ADR: Implement Airflow facade endpoint# Context
OSDU Platform uses Apache Airflow for orchestration of various data ingestion and processing jobs.
# Problem statement
Currently the OSDU Airflow component does not support data isolation for multi-tenant deployments. The Airflow Administrative UI is available to all users and makes it possible to observe the processing data of all existing tenants, which may cause data leaks and security issues.
# Proposal of the solution
It is proposed to introduce a facade that will replace the Airflow admin UI and will collect job execution information (namely the resulting XCom variables) in a tenant-specific way via the Airflow REST API. To do this, we need to add a new endpoint in the Workflow service API, which will collect the details of the DAG run using the existing Airflow REST API v2.
The new API endpoint /v1/workflow/{workflow_name}/workflowRun/{runId}/lastInfo should implement the following business logic:
![image-2023-10-18_17-48-20](/uploads/44f53a3de410b8dff0276b127387f29a/image-2023-10-18_17-48-20.png)
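The steps are listed below; for illustration, a rough Python sketch of the corresponding Airflow REST API call sequence (the base URL, credentials, and response handling are assumptions; the actual implementation would live in the Workflow service):

```python
# Sketch of the proposed lastInfo logic against the Airflow REST API endpoints
# referenced in the steps below. Placeholder base URL and credentials.
import requests

AIRFLOW_API = "https://airflow.example.com/api/v1"  # placeholder
AUTH = ("user", "password")                         # placeholder credentials


def last_run_info(dag_id: str, dag_run_id: str) -> dict:
    base = f"{AIRFLOW_API}/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances"
    task_instances = requests.get(base, auth=AUTH, timeout=30).json()["task_instances"]
    # Select the task instance that finished last.
    last = max(task_instances, key=lambda ti: ti["end_date"] or "")
    task_id = last["task_id"]
    # List the xcom entry keys of that task instance and fetch each value.
    entries = requests.get(f"{base}/{task_id}/xcomEntries", auth=AUTH, timeout=30).json()
    xcom = {}
    for entry in entries.get("xcom_entries", []):
        value = requests.get(
            f"{base}/{task_id}/xcomEntries/{entry['key']}", auth=AUTH, timeout=30
        ).json()
        xcom[entry["key"]] = value.get("value")
    # Combine the task instance details with the xcom values map.
    return {"taskInstance": last, "xcom": xcom}
```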
- Get the internal workflow entity with getWorkflowRunByName and check whether submittedBy corresponds to the user submitted in the header; otherwise return 401 NOT_AUTHORIZED
- Get the list of all task instances with /dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances, where dag_id is workflow_name and dag_run_id is runId
- Select the task instance with the maximal end_date
- With the task_id of the selected task instance, get the list of xcom entry keys from /dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/xcomEntries
- Obtain the xcom values by these keys using /dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/xcomEntries/{xcom_key}
- Return the task instance details from step 3 combined with the xcom values map in a single JSON responseM23 - Release 0.26Rustam Lotsmanenko (EPAM)rustam_lotsmanenko@epam.comRiabokon Stanislav(EPAM)[GCP]Andrei Dalhikh [EPAM/GC]Rustam Lotsmanenko (EPAM)rustam_lotsmanenko@epam.comhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/81While performing the same workflow using the same data to ingest reference da...2023-05-24T15:35:07ZKamlesh TodaiWhile performing the same workflow using the same data to ingest reference data type, I see that the new record ids are not getting created each time, Instead the version is getting incrementedWhen I try to ingest data where the entity type is master, I see a new record id getting created each time I run the workflow, even though the collection and data file being used are the same.
To me this is the expected behavior.
When I try to do the same with entity type reference, I see that the ids being generated are the same (not new) and only a new version is getting generated.
So, for example, if I get the count of records for the entity type before and after the ingestion, the count remains the same except for the first time (in my case the count goes up by 4 the first time and then stays the same, as my data has 4 records).
So I modified the file to have one more record (total 5). When I ran the workflow again with one additional record, I saw the count going up by 1 and not by 5.
Before (inserting 5 records)
{
"results": [
{
"id": "opendes:reference-data--ContractorType:LineClearing"
}
],
"aggregations": [
{
"key": "osdu:wks:reference-data--ContractorType:1.0.0",
"count": 9
}
],
"totalCount": 9
}
After (inserting 5 records)
{
"results": [
{
"id": "opendes:reference-data--ContractorType:LineClearing"
}
],
"aggregations": [
{
"key": "osdu:wks:reference-data--ContractorType:1.0.0",
"count": 10
}
],
"totalCount": 10
}
Before (again inserting 5 records)
{
"results": [
{
"id": "opendes:reference-data--ContractorType:LineClearing"
}
],
"aggregations": [
{
"key": "osdu:wks:reference-data--ContractorType:1.0.0",
"count": 10
}
],
"totalCount": 10
}
After (again inserting 5 records)
{
"results": [
{
"id": "opendes:reference-data--ContractorType:LineClearing"
}
],
"aggregations": [
{
"key": "osdu:wks:reference-data--ContractorType:1.0.0",
"count": 10
}
],
"totalCount": 10
}
[CSVWorkflow__CI-CD_v2.0-ReferenceData.postman_collection.json](/uploads/3d0652f4e89be166a525d7ff18731c2d/CSVWorkflow__CI-CD_v2.0-ReferenceData.postman_collection.json)
[ReferenceData.csv](/uploads/a04f12cbeaacb300c4d23788032518ae/ReferenceData.csv) (with 5 records)
The environment file can be obtained from
https://community.opengroup.org/osdu/platform/pre-shipping/-/tree/main/R3-M16/QA_Artifacts_M16/envFilesAndCollections/envFiles
OR
https://community.opengroup.org/osdu/platform/testing/-/tree/master/Postman%20Collection/00_CICD_Setup_Environment
@tdixon @debasisc @chadM18 - Release 0.21https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/5Adding headers to Put the file on Dataset Service2023-02-13T08:59:19ZJayesh BagulAdding headers to Put the file on Dataset Service**Context:**
The goal of this issue is to put the Azure manifest JSON files into the storage service. Currently, the header for this request is not considered in [**osdu-airflow-lib**](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib). The call from [`_put_file_on_dataset_service`](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/blob/master/osdu_airflow/operators/mixins/ReceivingContextMixin.py#L226), which calls the [**osdu_api**](https://community.opengroup.org/osdu/platform/system/sdks/common-python-sdk/-/blob/master/osdu_api/clients/base_client.py#L194), is making the ingestion process fail.
```
<Error>
<Code>MissingRequiredHeader</Code>
<Message>An HTTP header that's mandatory for this request is not specified.
RequestId:bdf6dae2-701e-004f-2529-367fcb000000
Time:2023-02-01T10:37:20.2256774Z</Message>
<HeaderName>x-ms-blob-type</HeaderName>
</Error>
```
The "x-ms-blob-type" header is used in the Azure Blob storage service to specify the type of blob that is being uploaded. It ensures that the correct type of blob is being uploaded.
The accepted values for the "x-ms-blob-type" header is "BlockBlob".
Call from _osdu-airflow-lib_:
`put_result = dataset_dms_client.make_request(method=HttpMethod.PUT, url=signed_url, data=file_content, no_auth=True)`
**Proposal:**
• The dataset service should be able to pass the headers while calling OSDU_API
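For illustration, a minimal sketch of the proposal, reusing the variables from the call shown above (whether `make_request` accepts an extra headers argument is an assumption; the osdu_api client may need to be extended accordingly):

```python
# Hypothetical change: pass the blob-type header through to the PUT on the signed URL.
# Import path as used elsewhere in osdu_airflow (assumption).
from osdu_api.model.http_method import HttpMethod

extra_headers = {"x-ms-blob-type": "BlockBlob"}  # required by Azure Blob Storage for this PUT

put_result = dataset_dms_client.make_request(
    method=HttpMethod.PUT,
    url=signed_url,
    data=file_content,
    no_auth=True,
    headers=extra_headers,  # hypothetical parameter carrying provider-specific headers
)
```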
cc: @Srinivasan_Narayanan @chad @valentin.gauthier @Yan_Sushchynski @nursheikhM16 - Release 0.19Jayesh BagulJayesh Bagulhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/109Manifest ingestion by Reference - error while running DAG for first time2023-02-13T08:56:59ZNaveen RamachandraiahManifest ingestion by Reference - error while running DAG for first timeTeam,
For Azure, we are trying to implement the Manifest by Reference feature but are getting issues while running the DAG. The error log and a screenshot of the DAG graph are attached. Please help. [DAG_-error.log](/uploads/80013a342d3e6d5fbfd843fcf27c0707/DAG_-error.log)![DAG-_tree](/uploads/e09e4f01cfb6b1a4166a4df2efa83e4d/DAG-_tree.png)M16 - Release 0.19Jayesh BagulJayesh Bagulhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/14VDS converter does not upload dag files to storage account.2022-12-12T20:08:24ZNur SheikhVDS converter does not upload dag files to storage account.On checking the pipeline logs, it was found that the DAG files are not being uploaded to the Storage account. https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/jobs/1268817 .shivani karipeshivani karipehttps://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/4Dataset API "getRetrievalInstructions" returns 500 error code when issuing a...2023-03-29T11:54:12ZNaveen RamachandraiahDataset API "getRetrievalInstructions" returns 500 error code when issuing a request.In Glab, Validation and Preshipping environments, a 500 error is returned when issuing a valid request to getRetrievalInstructions (see request/responses get_retrieval_instructions_500_unknown_validation_env.txt) whereas the other endpoint returns a correct retrieval_instructions_200_validation_env.txt)
Also attached is glab dataset_get_retrieval_information_500_error.txt for the error.
Request: **getRetrievalInstructions** returns 500 error code
```
curl --location --request GET 'https://osdu-demo.msft-osdu-test.org/api/dataset/v1/getRetrievalInstructions?id=opendes:dataset-File.Generic:67be95dcfdb54343aec877fbb52bdd89' \
--header 'Data-Partition-Id: opendes' \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: application/json'
```
Response:
```
{"code":500,"reason":"Server error.","message":"An unknown error has occurred."}
```
Request: **retrievalinstructions** endpoint returns 200 Success Response
```
curl --location --request GET 'https://osdu-demo.msft-osdu-test.org/api/dataset/v1/retrievalInstructions?id=opendes:dataset-File.Generic:67be95dcfdb54343aec877fbb52bdd89' \
--header 'Data-Partition-Id: opendes' \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: application/json'
```
Response:
```
{
"providerKey": "AZURE",
"datasets": [
{
"datasetRegistryId": "opendes:dataset--File.Generic:67be95dcfdb54343aec877fbb52bdd89",
"retrievalProperties":
{ "signedUrl": "https://osdumvpdp1demoesyxdata.blob.core.windows.net/file-persistent-area/osdu-user%2F1662482672814-2022-09-06-16-44-32-814%2F25b93bc2553e48b3832b4e8dd20a646d?sv=2020-08-04&se=2022-09-13T23%3A42%3A25Z&sr=b&sp=r&sig=YAMtXJ2MPPsYov1cm3uEySlx%2FlomITbMEN%2BoL7tzd5c%3D", "createdBy": "osdu-user" }
}
]
}
```
**Analysis**
The Dataset API (/api/dataset/v1/**getRetrievalInstructions**) calls the File API DMSHandler (api/file/v2/files/retrievalInstructions) based on the request kind File.Generic:
[GetDatasetRetrievalInstructionsResponse] ---> [RetrievalInstructionsResponse]
The File API response type (RetrievalInstructionsResponse) cannot be deserialized into the Dataset API response type (GetDatasetRetrievalInstructionsResponse), which causes a null pointer exception.
**Note:** The Dataset API (/api/dataset/v1/**retrievalInstructions**) endpoint response type is the same as the File API response type 'RetrievalInstructionsResponse', hence this endpoint returns a 200 success response for the same request.
**Solution**
To generate 'GetDatasetRetrievalInstructionsResponse' from 'RetrievalInstructionsResponse' instead of directly deserializing it.M15 - Release 0.18Nur SheikhNur Sheikhhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-zgy-conversion/-/issues/21Clean up old azure dags templates and expired devops scripts2022-09-20T11:36:08ZVadzim KulybaClean up old azure dags templates and expired devops scriptsM14 - Release 0.17Vadzim KulybaVadzim Kulybahttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/13SegY to oVDS conversion in sdstore2023-03-20T11:46:40ZChad LeongSegY to oVDS conversion in sdstore## Introduction
For oVDS conversion, the converter supports native cloud location and sdstore as long as the location of the segy is provided in the segy_file.
The practice is for all segy/oVDS/oZGY files to be stored in the sdstore so that applications can access the seismic data via the sdms API.
All CSPs should ensure the following implementation of SEG-Y to oVDS conversion in the sdstore. Here are some examples of the expected implementation of the workflow from IBM, GCP, and Azure.
Example working procedure - postman collection from IBM: https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M12/IBM-M12/M12-IBM_ODI_R3_v2.0.1_SEGY-to-Open_VDS_Conversion_Collection.postman_collection.json
Example working procedure - postman collection from GCP: https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M12/GCP-M12/OpenVDS_SSDMS_to_SSDMS_conversion_CI-CD.postman_collection.json
## Status
- [x] AWS - M16
- [x] IBM - Working
- [x] GCP - Working
- [x] Azure - Added in M13
## Discrepancies observed in AWS
```json
{
"executionContext": {
"data-partition-id": "{{data_partition_id}}",
"url_connection": "Region={{vdsUrlConnectionStringRegion}};AccessKeyId={{vdsUrlConnectionStringAccessKeyId}};SecretKey={{vdsUrlConnectionStringSecretAccessKey}};SessionToken={{vdsUrlConnectionStringSessionToken}};Expiration={{vdsUrlConnectionStringExpiration}}",
"input_connection": "Region={{vdsInputConnectionStringRegion}};AccessKeyId={{vdsInputConnectionStringAccessKeyId}};SecretKey={{vdsInputConnectionStringSecretAccessKey}};SessionToken={{vdsInputConnectionStringSessionToken}};Expiration={{vdsInputConnectionStringExpiration}}",
"segy_file": "{{fileSource}}",
"url": ""
}
}
```
You can see that the executionContext is missing several keys, such as:
```
"work_product_id": "{{work-product-id}}",
"file_record_id": "{{file-record-id}}",
"persistent_id": "{{vds_id}}",
"id_token": "{{id_token}}"
```
These ids are needed to correctly fetch the seismic parameters (e.g. inline, crossline, etc.) to perform the oVDS conversion.
The expected implementation is observed in IBM / GCP / Azure [M13]:
```json
{
"executionContext": {
"Payload": {
"AppKey": "test-app",
"data-partition-id": "{{data-partition-id}}"
},
"vds_url": "{{test_vds_url}}",
"work_product_id": "{{work-product-id}}",
"file_record_id": "{{file-record-id}}",
"persistent_id": "{{vds_id}}",
"id_token": "{{id_token}}"
}
}
```M16 - Release 0.19Okoun-Ola Fabien HouetoOkoun-Ola Fabien Houetohttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/144WhiteSource update2022-08-23T21:24:04ZMaksim MalkovWhiteSource updateUpdate `core` and `azure` modules according to WS reports.M12 - Release 0.15Maksim MalkovMaksim Malkovhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/75WhiteSource update2022-08-23T21:24:10ZMaksim MalkovWhiteSource updateUpdate `core` and `azure` modules according to WS reports.M12 - Release 0.15Maksim MalkovMaksim Malkovhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/105Performance testing in R3 M11 - Need to determine the maximum size of the pa...2024-03-20T15:49:55ZKamlesh TodaiPerformance testing in R3 M11 - Need to determine the maximum size of the payload allowed during ingestion using Osdu_ingest DAGAll,
For R3M11, I performed the performance load testing using the Osdu_ingest DAG running in Airflow v2.0
The environment I used was IBM Pre-ship R3 M11. Here is the summary:
As expected, we can see that when batch_upload is used, the time required to ingest the data goes down (performance gain).
Some observations of the process used.
There is a difference in the python scripts that are used to generate the payload for Ingestion and batch_upload.
The python script that generates the payload for ingestion generates records of kind “opendes:wks:master-data--Organisation:1.0.0”. So when a user specifies 5 records, it generates 5 records of kind Organisation.
The python script that generates the payload for batch_upload generates records of kind "osdu:wks:master-data--Organisation:1.0.0" and "osdu:wks:reference-data--ContractorType:1.0.0". So when a user specifies 5 records, it generates records of both kinds, Organisation and ContractorType, and is therefore actually generating twice the number of records specified.
At present, to establish the performance benchmark, we are using the number of records, probably because it is convenient to tell users that ingesting a certain number of wells, for example, takes x amount of time.
But the well record size may vary from one user environment to another, and hence performance numbers derived using a number of records may not hold true in all situations.
How much one can ingest in one job at one time is based on the size of the payload in KB. So I think we should use the payload size in KB to establish the benchmark. The number of records that can fit in the payload would depend on the size of the records.
I have done the testing in the IBM environment, but the test for 50000 records in batch_upload seems to be failing in all the environments.
I do not know where the size limit is coming from: the REST API, the network, Airflow, or the DAG implementation.
Nor do I know whether that size is configurable.
It is important for us to understand where that limitation is coming from and whether it is a hard limit or a configurable limit.
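If such a limit is identified, the payload generator could honor it along these lines (a rough sketch; the 1000 KB cap is an arbitrary placeholder, since the actual limit and where it is enforced are exactly what this issue asks about):

```python
# Sketch: group records into payload batches whose serialized size stays under a KB limit,
# so the generator can emit multiple payload files instead of one oversized payload.
import json


def split_payloads(records, max_kb=1000):
    batches, current = [], []
    for record in records:
        candidate = current + [record]
        size_kb = len(json.dumps(candidate).encode("utf-8")) / 1024
        if current and size_kb > max_kb:
            batches.append(current)   # close the batch that still fits the limit
            current = [record]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```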
The python script file should honor that limit and generate data/payload files (multiple) containing a correct number of records to avoid failures.https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/133Update dependencies accoriding WhiteSource reports[SLB]2022-05-10T08:03:36ZMaksim MalkovUpdate dependencies accoriding WhiteSource reports[SLB]This is just a regular update raised by the WhiteSource check we have conducted on the SLB side.
Dependency updates for:
* root pom
* core module pom
* azure module pomM10 - Release 0.13Maksim MalkovMaksim Malkovhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/3Updates for azure provider file handing2022-01-10T07:55:49ZVadzim KulybaUpdates for azure provider file handingUploading file request on Azure provider has a specific headerVadzim KulybaVadzim Kulybahttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/97"Broken DAG for manifest ingestion.2022-01-23T10:58:41ZAbhijeet Sawant"Broken DAG for manifest ingestion.Airflow UI showing import error- "Broken DAG: [/opt/airflow/dags/manifest_ingestion_dags.zip] No module named 'osdu_ingestion.libs.auth'"
Image deployed - repository: msosdu.azurecr.io/airflow-docker-image tag: v0.10![airflow_import_error](/uploads/ecc51fc6ba34574bc852832ee0349177/airflow_import_error.JPG)
Continuous alerts are getting triggered.Kishore BattulaKishore Battulahttps://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/9E2E Tests for SEGY-OpenVDS Parser - MSFT2022-08-30T15:20:34ZChris ZhangE2E Tests for SEGY-OpenVDS Parser - MSFTThis is to track the MSFT team's work for E2E Tests for SEGY-ZGY Parser.
Related to issue 2: https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/2M10 - Release 0.13Vadzim KulybaKrishnan GanesanVadzim Kulyba