# Data Ingestion issues
https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues

---
**Fetch-and-Ingest - authentication uses flow type value of "RefreshTokenKeyName" although valid value is "RefreshToken"**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/issues/56 · Debasis Chatterjee · 2024-01-08

When I used a value of RefreshToken, I got this error:

```
[2024-01-02, 13:41:17 UTC] {token_generator.py:43} INFO - OAuth Flow type : RefreshToken Not supported!
```

But this value is valid per this page:
https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/reference-data/OAuth2FlowType.1.0.0.md
I had to work around it by using a value of "RefreshTokenKeyName".
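For illustration, a minimal sketch (hypothetical names, not the actual token_generator.py logic) of how the flow-type lookup could accept both the OAuth2FlowType reference-data code and the legacy key name:

```python
# Hypothetical sketch only -- not the actual token_generator.py code.
# Maps OAuth2FlowType reference-data codes and legacy key names to one
# canonical flow identifier, so "RefreshToken" would also be accepted.
SUPPORTED_FLOW_ALIASES = {
    "RefreshToken": "refresh_token",         # code from OAuth2FlowType.1.0.0
    "RefreshTokenKeyName": "refresh_token",  # legacy value the DAG accepts today
}

def resolve_flow_type(raw_value: str) -> str:
    """Return the canonical flow type, or raise for unknown values."""
    try:
        return SUPPORTED_FLOW_ALIASES[raw_value.strip()]
    except KeyError:
        raise ValueError(f"OAuth Flow type : {raw_value} Not supported!")
```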
cc @priyankabhongade

Milestone: M23 - Release 0.26

---
**Segy to VDS conversion creates non-compliant work product component**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/12 · Ivan Medeiros Monteiro · 2024-01-08

When conversion from SEGY to VDS is triggered, the file collection for the new VDS is created and associated with the work product component (SeismicTraceData) as an artefact. However, the role of the artefact is specified by the property "RoleId" instead of "RoleID", which makes the record fail schema validation.
Example of an error message: `data.Artefacts.0.RoleId: Additional property not allowed`
The schema definition of this association can be seen here: https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/work-product-component/SeismicTraceData.1.5.1.md
And the implementation of this association can be seen here: https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/blob/master/osdu_ingestion/libs/segy_conversion_metadata/base_metadata.py?ref_type=heads#L130

Milestone: M22 - Release 0.25
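A small self-contained reproduction of the failure mode, using the `jsonschema` library that `osdu-ingestion-lib` already depends on (the schema below is a simplified stand-in, not the full SeismicTraceData 1.5.1 schema):

```python
from jsonschema import validate, ValidationError

# Simplified stand-in for the Artefacts item schema: only "RoleID" is allowed.
artefact_schema = {
    "type": "object",
    "properties": {"RoleID": {"type": "string"}},
    "additionalProperties": False,
}

try:
    # The converter emits "RoleId" (wrong casing), so validation fails.
    validate(instance={"RoleId": "osdu:reference-data--ArtefactRole:VDS:"},
             schema=artefact_schema)
except ValidationError as exc:
    print(exc.message)  # Additional properties are not allowed ('RoleId' was unexpected)

# The correctly cased property passes validation.
validate(instance={"RoleID": "osdu:reference-data--ArtefactRole:VDS:"},
         schema=artefact_schema)
```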
---
**EDS Naturalization has a circular import**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/11 · Bruce Jin · 2023-12-19

The files `osdu_airflow/eds/eds_naturalization/signed_url_details/abstract/environment_factory.py` and `osdu_airflow/eds/eds_naturalization/signed_url_details/concrete/operator_environment_factory.py` try to import each other, which disables the component.
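A sketch of the usual fix pattern (module names are stand-ins, not the actual osdu-airflow-lib files): defer one side of the cycle into the function body so both modules can finish loading before the import resolves:

```python
# a.py  (stands in for abstract/environment_factory.py)
class EnvironmentFactory:
    def create(self):
        # Deferred import: resolved at call time, after both modules
        # have finished loading, which breaks the import cycle.
        from b import OperatorEnvironmentFactory
        return OperatorEnvironmentFactory()

# b.py  (stands in for concrete/operator_environment_factory.py)
from a import EnvironmentFactory  # top-level import is now safe

class OperatorEnvironmentFactory(EnvironmentFactory):
    pass
```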
Milestone: M22 - Release 0.25

---
**Side effect to ingest configuration files of EDS DMS**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/eds-dms/-/issues/18 · Riabokon Stanislav (EPAM) [GCP] · 2023-11-23

The GC Team has identified an issue. According to the architectural design of this service, the procedure involves creating configuration files within the Storage Service. Subsequently, new records are indexed by the Indexer Service and placed into Elasticsearch. As a result, these records become discoverable through the Search Service.
Current architecture:
![image](/uploads/82a2a551574d5772a2105b64fcf27950/image.png)
`https://community.gcp.gnrg-osdu.projects.epam.com/api/search/v2/query`
Body:
```
{
"kind": "osdu:wks:reference-data--SecuritySchemeType:1.0.0"
}
```
response:
```
{
"results": [
{
"data": {
"AttributionPublication": null,
"InactiveIndicator": null,
"Description": "An open and industry-standard protocol for authorization",
"ResourceLifecycleStatus": null,
"ResourceCurationStatus": null,
"TechnicalAssuranceID": null,
"Code": "OAuth2",
"Source": "SecuritySchemeType.1.0.0.xlsx",
"Name": "OAuth 2.0",
"AttributionAuthority": "OSDU",
"ResourceHomeRegionID": null,
"VirtualProperties.DefaultName": "OAuth 2.0",
"AttributionRevision": null,
"ResourceSecurityClassification": null,
"ID": "OAuth2",
"ExistenceKind": null
},
"kind": "osdu:wks:reference-data--SecuritySchemeType:1.0.0",
"source": "wks",
"acl": {
"viewers": [
"data.default.viewers@osdu.group"
],
"owners": [
"data.default.owners@osdu.group"
]
},
"type": "reference-data--SecuritySchemeType",
"version": 1697963580525660,
"tags": {
"normalizedKind": "osdu:wks:reference-data--SecuritySchemeType:1"
},
"modifyUser": "osdu-community-sa-airflow@nice-etching-277309.iam.gserviceaccount.com",
"modifyTime": "2023-10-22T08:33:00.665Z",
"createTime": "2022-09-30T10:26:21.248Z",
"authority": "osdu",
"namespace": "osdu:wks",
"legal": {
"legaltags": [
"osdu-demo-legaltag"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"createUser": "osdu-community-sa-airflow@nice-etching-277309.iam.gserviceaccount.com",
"id": "osdu:reference-data--SecuritySchemeType:OAuth2"
},
{
"data": {
"AttributionPublication": null,
"InactiveIndicator": null,
"Description": "Requests are authenticated using an access key, such as a JSON Web Token, in the request header.",
"ResourceLifecycleStatus": null,
"ResourceCurationStatus": null,
"TechnicalAssuranceID": null,
"Code": "Bearer",
"Source": "SecuritySchemeType.1.0.0.xlsx",
"Name": "Bearer Token",
"AttributionAuthority": "OSDU",
"ResourceHomeRegionID": null,
"VirtualProperties.DefaultName": "Bearer Token",
"AttributionRevision": null,
"ResourceSecurityClassification": null,
"ID": "Bearer",
"ExistenceKind": null
},
"kind": "osdu:wks:reference-data--SecuritySchemeType:1.0.0",
"source": "wks",
"acl": {
"viewers": [
"data.default.viewers@osdu.group"
],
"owners": [
"data.default.owners@osdu.group"
]
},
"type": "reference-data--SecuritySchemeType",
"version": 1697963580525660,
"tags": {
"normalizedKind": "osdu:wks:reference-data--SecuritySchemeType:1"
},
"modifyUser": "osdu-community-sa-airflow@nice-etching-277309.iam.gserviceaccount.com",
"modifyTime": "2023-10-22T08:33:00.665Z",
"createTime": "2022-09-30T10:28:21.843Z",
"authority": "osdu",
"namespace": "osdu:wks",
"legal": {
"legaltags": [
"osdu-demo-legaltag"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"createUser": "osdu-community-sa-airflow@nice-etching-277309.iam.gserviceaccount.com",
"id": "osdu:reference-data--SecuritySchemeType:Bearer"
}
],
"aggregations": null,
"totalCount": 2
}
```
It appears there may be a potential security concern within the EDS Service architecture.

---
**ADR: Implement Airflow facade endpoint**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/159 · Riabokon Stanislav (EPAM) [GCP] · 2024-01-08
# Context
OSDU Platform uses Apache Airflow for orchestration of various data ingestion and processing jobs.
# Problem statement
Currently the OSDU Airflow component does not support data isolation for multi-tenant deployments. The Airflow administrative UI is available to all users and makes it possible to observe the processing data of all existing tenants, which may cause data leaks and security issues.
# Proposal of the solution
It is proposed to introduce a facade that will replace the Airflow admin UI and collect job execution information (namely the resulting XCom variables) in a tenant-specific way via the Airflow REST API. To do this, we need to add a new endpoint to the Workflow service API which collects the details of the DAG run using the existing Airflow REST API v2.
The new API endpoint `/v1/workflow/{workflow_name}/workflowRun/{runId}/lastInfo` should implement the following business logic (see the sketch after this list):
![image-2023-10-18_17-48-20](/uploads/44f53a3de410b8dff0276b127387f29a/image-2023-10-18_17-48-20.png)
- Get the internal workflow entity with getWorkflowRunByName and check whether submittedBy corresponds to the user submitted in the header; otherwise return 401 NOT_AUTHORIZED
- Get the list of all task instances with `/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances`, where dag_id is workflow_name and dag_run_id is runId
- Select the task instance with the maximal end_date
- With the task_id of the selected task instance, get the list of xcom entry keys from `/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/xcomEntries`
- Obtain the xcom values by their keys using `/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/xcomEntries/{xcom_key}`
- Return the task instance details from step 3 combined with the xcom values map in a single JSON response
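A minimal sketch of that flow against Airflow's stable REST API, in Python with `requests` for brevity (the real endpoint would live in the Java Workflow service; the base URL, credentials, and error handling here are placeholders):

```python
import requests

AIRFLOW = "http://airflow-webserver:8080/api/v1"  # placeholder base URL
AUTH = ("user", "pass")                           # placeholder credentials

def last_info(dag_id: str, dag_run_id: str) -> dict:
    """Collect details of the most recently finished task plus its XCom values."""
    base = f"{AIRFLOW}/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances"
    task_instances = requests.get(base, auth=AUTH).json()["task_instances"]
    # Step 3: the task instance with the maximal end_date.
    last = max(task_instances, key=lambda ti: ti["end_date"] or "")
    entries_url = f"{base}/{last['task_id']}/xcomEntries"
    keys = [e["key"] for e in requests.get(entries_url, auth=AUTH).json()["xcom_entries"]]
    xcoms = {
        key: requests.get(f"{entries_url}/{key}", auth=AUTH).json().get("value")
        for key in keys
    }
    # Single JSON response: task instance details combined with the XCom map.
    return {**last, "xcom": xcoms}
```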
Milestone: M23 - Release 0.26 · Assignee: Rustam Lotsmanenko (EPAM)

---
**A custom header 'x-user-id' is used in core part**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/158 · Riabokon Stanislav (EPAM) [GCP] · 2023-11-08

I wanted to bring to your attention an issue that was identified by our GC Team while they were in the process of addressing https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/157.
org.opengroup.osdu.workflow.service.WorkflowRunServiceImpl#addUserId
```
private Map<String, Object> addUserId(String workflowName, TriggerWorkflowRequest request) {
final Map<String, Object> executionContext = request.getExecutionContext();
if (executionContext.get(KEY_USER_ID) != null) {
String errorMessage = String.format("Request to trigger workflow with name %s failed because execution context contains reserved key 'userId'", workflowName);
throw new AppException(400, "Failed to trigger workflow run", errorMessage);
}
String userId = dpsHeaders.getUserId();
log.debug("putting user id: " + userId + " in execution context");
executionContext.put(KEY_USER_ID, userId);
return executionContext;
}
```
The current logic relies on a custom header that is primarily intended for use at an infrastructural level, as outlined in https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/52. The GC team approved an ADR with the understanding that this custom header would not be utilized within the core codebase.
However, as indicated in https://community.opengroup.org/osdu/platform/deployment-and-operations/helm-charts-azure/-/merge_requests/366, a header named 'x-user-id' is populated with data from 'x-on-behalf-of' using a specific rule. This mechanism aligns with the requirements of the CSP provider but may not be entirely suitable for the Core Part of the Workflow Service.
```
if (jwt_authn[msft_issuer]["appid"] == serviceAccountClientId and on_behalf_of_header ~= nil and on_behalf_of_header ~= '') then
request_handle:headers():add("x-user-id", request_handle:headers():get("x-on-behalf-of"))
else
request_handle:headers():add("x-user-id", jwt_authn[msft_issuer]["appid"])
end
```
This logic introduces **three key issues**:
- The core part of the Workflow service depends on a custom CSP header to populate the execution context, which may not be in alignment with the intended architecture.
- The Workflow service may not operate correctly without Istio and the accompanying special rule, potentially limiting its usability.
- There is a security concern in that 'x-user-id' is currently not validated on the backend side, allowing any user to set it for potentially vested interests.
_As for the third problem_, there is the test case:
1. A user was authorized within Workflow Service.
1. This user uses 'x-user-id' with the name of another user, resulting in the triggering of a workflow under the identity of a different user.
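One possible backend-side guard against that test case (a sketch in Python for brevity; the service itself is Java, and all names here are hypothetical): trust 'x-user-id' only when the verified token belongs to the trusted service account, otherwise derive the user from the token itself.

```python
# Sketch of the missing validation, with hypothetical names.
def resolve_user_id(verified_claims: dict, headers: dict, service_account_id: str) -> str:
    """Trust 'x-user-id' only when the verified token belongs to the
    trusted service account; otherwise derive the user from the token."""
    token_subject = verified_claims["appid"]
    claimed = headers.get("x-user-id")
    if claimed and claimed != token_subject and token_subject != service_account_id:
        raise PermissionError("x-user-id does not match the authenticated principal")
    return claimed or token_subject
```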
---
**Pass workflow user ID to the Airflow as part of payload**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/157 · Riabokon Stanislav (EPAM) [GCP] · 2023-11-08

This issue was discovered by the GC Team when the QA Team was testing the platform.
It revolves around triggering workflows and the addition of the User ID into the execution context through the 'x-user-id' header.
Upon further investigation, we came across the MR https://community.opengroup.org/osdu/platform/deployment-and-operations/helm-charts-azure/-/merge_requests/366, which appears to implement this logic with a dependency at the infrastructural level.
However, we have to add some kind of validation or additional logic before using a 'user' header in core logic. This adjustment is essential because we might want to use the service without a service mesh or similar infrastructure.
org.opengroup.osdu.workflow.service.WorkflowRunServiceImpl#addUserId
```
private Map<String, Object> addUserId(String workflowName, TriggerWorkflowRequest request) {
final Map<String, Object> executionContext = request.getExecutionContext();
if (executionContext.get(KEY_USER_ID) != null) {
String errorMessage = String.format("Request to trigger workflow with name %s failed because execution context contains reserved key 'userId'", workflowName);
throw new AppException(400, "Failed to trigger workflow run", errorMessage);
}
String userId = dpsHeaders.getUserId();
log.debug("putting user id: " + userId + " in execution context");
executionContext.put(KEY_USER_ID, userId);
return executionContext;
}
```

Milestone: M21 - Release 0.24

---
**Static analyzer fails in EDS operators**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/7 · Yan Sushchynski (EPAM) · 2023-09-20

Job [#2183134](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/jobs/2183134) failed for b2fb09af463e7d884562ea6ae89535a11eaa552f.

Could I ask you to have a look at the failed job?

Milestone: M21 - Release 0.24 · Assignee: Ashish Saxena

---
**Upgrade json schema version to support Airflow constraint file**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/11 · Guillaume Caillet · 2023-08-30
Airflow added Python "constraints" files a while ago: https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html#constraints-files
These files lock the `jsonschema` version, a library used in `osdu-ingestion-lib` (usually to 4.x; see for example this constraints file for the latest Airflow version: https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt).
But this creates an issue with the current `setup.py` file, which requires a very specific version (3.2.0), so the pip resolver can't find a compatible version:
```
osdu-ingestion 0.23.0rc479+c8d6c217 depends on jsonschema==3.2.0
The user requested (constraint) jsonschema==4.17.3
```
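A plausible fix (assuming the library's `jsonschema` usage has no 4.x API breakage, which would need verifying) is to relax the pin in `setup.py` to a range that overlaps the Airflow constraints:

```python
# setup.py (excerpt) -- hypothetical relaxed pin, to be verified against
# osdu-ingestion-lib's actual jsonschema usage before adopting.
from setuptools import setup

setup(
    name="osdu-ingestion",
    install_requires=[
        # Accept any 3.2+ or 4.x release, so pip can satisfy both this
        # library and Airflow's constraint file (e.g. jsonschema==4.17.3).
        "jsonschema>=3.2.0,<5",
    ],
)
```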
---
**Workflow Run API - requires dataPartitionId in body as well as header**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/154 · Surabhi Seth · 2023-10-26

API: Workflow Service API > Workflow Run `/workflow/{workflow_name}/workflowRun`

This service takes data-partition-id as part of the headers as well as in the payload body: `{ "executionContext": { "id": "string", "dataPartitionId": "string" }, "runId": "string" }` (note the duplicated dataPartitionId).
![MicrosoftTeams-image__5_](/uploads/5e8d61cdc1316019ab905597094525b9/MicrosoftTeams-image__5_.png)

Issue: requesting dataPartitionId in the payload body is redundant and inconsistent with the implementation of all other OSDU APIs (where data-partition-id is taken from the header).
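For illustration, a minimal trigger call (base URL, token, and run values are placeholders) showing that the partition already travels in the header, which is what makes the body field redundant:

```python
import requests

# Placeholder base URL and token.
url = "https://osdu.example.com/api/workflow/v1/workflow/Osdu_ingest/workflowRun"
headers = {
    "Authorization": "Bearer <token>",
    "data-partition-id": "opendes",  # the partition is already supplied here
    "Content-Type": "application/json",
}
# Today the same partition has to be repeated inside executionContext,
# which this issue argues should not be required.
body = {"executionContext": {"id": "run-001", "dataPartitionId": "opendes"},
        "runId": "run-001"}

resp = requests.post(url, headers=headers, json=body)
print(resp.status_code, resp.json())
```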
Ref: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/master/docs/api/openapi.workflow.yaml?plain=0

Assignee: Chad Leong
---
**Versal Spatial Data Ingestion (While Ingesting the data, getting Spatial Coordinate block as Empty)**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/issues/32 · Selva Kumar Senathipathy · 2023-10-05

As part of the Versal OSDU integration, spatial coordinate blocks are inserted as empty blocks into the OSDU target system.
While going through the Airflow code, we made the observation below.
We found that in FetchAndIngest the record-cleaning process removes the coordinates when the coordinates contain a nested list. It seems the cleaning process supports only the Point type of geometry, but Versal has MultiLineString and MultiPolygon geometries with nested lists of coordinates.
For a nested list, e.g. [[-0.7484, 61.4182], [-0.9396, 61.4893]], the method _iterate_list returns an empty list. Please find below snapshots of the methods where we think the coordinate values are removed when the coordinates are in the form of a nested list (see also the sketch after the images).
**Repo link:** https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/blob/master/osdu_airflow/eds/eds_ingest/clean_records.py
![Air_Flow_clean_process](/uploads/8f41d0527d37d578f1395d4af5d1993b/Air_Flow_clean_process.jpg)
![Air_Flow_List_Iterate](/uploads/aec8664b4a47290f61f7425025d8dab4/Air_Flow_List_Iterate.jpg)
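A self-contained sketch of the suspected failure mode (a hypothetical simplification, not the actual clean_records.py code): a list cleaner that keeps only scalar items silently empties nested coordinate lists, while a recursive variant preserves them:

```python
def clean_list_flat(values: list) -> list:
    """Keeps only scalars -- a nested point like [-0.74, 61.41] is dropped,
    so MultiLineString/MultiPolygon coordinates come back empty."""
    return [v for v in values if isinstance(v, (int, float, str))]

def clean_list_recursive(values: list) -> list:
    """Recurses into nested lists so multi-geometry coordinates survive."""
    cleaned = []
    for v in values:
        if isinstance(v, list):
            cleaned.append(clean_list_recursive(v))
        elif v is not None:
            cleaned.append(v)
    return cleaned

coords = [[-0.7484, 61.4182], [-0.9396, 61.4893]]
print(clean_list_flat(coords))       # [] -- the reported symptom
print(clean_list_recursive(coords))  # [[-0.7484, 61.4182], [-0.9396, 61.4893]]
```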
Milestone: M20 - Release 0.23 · Assignee: Priyanka Bhongade

---
**Capture OpenVDS library version number in DAG name or some such suitable place**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/18 · Debasis Chatterjee · 2024-03-26

Consider exposing this information prominently, over and above showing it in the Airflow log.
cc @chad, @Keith_Wall

Assignee: Deepa Kumari

---
**Manifest ingestion fail on non-file-based datasets**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/10 · Laurent Deny · 2023-04-26
The validation script validate_file_source.py rejects datasets that do not match the hardcoded types:
* FILE = ":dataset--File."
* FILE_COLLECTION = ":dataset--FileCollection."
* EDS_FILE = ":dataset--ConnectedSource."
Other dataset types that are not associated with files should not be checked for any of the file attributes such as `FileSource`. For example, the following manifest, which contains an ETP dataset, should bypass the file-associated tests (a type-check sketch follows the manifest).
```json
{
"kind": "osdu:wks:Manifest:1.0.0",
"Data": {
"Datasets": [
{
"acl": {
"viewers": [
"data.default.viewers@opendes.contoso.com"
],
"owners": [
"data.default.owners@opendes.contoso.com"
]
},
"kind": "osdu:wks:dataset--ETPDataspace:1.0.0",
"legal": {
"legaltags": [
"opendes-ReservoirDDMS-Legal-Tag"
],
"otherRelevantDataCountries": [
"US",
"UK"
]
},
"createTime": "2023-03-21T16:33:19.651Z",
"modifyTime": "2023-03-21T16:33:19.651Z",
"id": "opendes:dataset--ETPDataspace:M16_Demo-Volve_Reservoir",
"version": 1,
"data": {
"ExistenceKind": "opendes:reference-data--ExistenceKind:Actual:",
"DatasetProperties": {
"URI": "eml:///dataspace('M16_Demo/Volve_Reservoir')"
},
"Name": "M16_Demo/Volve_Reservoir"
}
}
]
}
}
```
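A sketch of a more permissive check (hypothetical helper, not the actual validate_file_source.py code): only datasets whose kind matches one of the file-based markers get the `FileSource` validation; everything else, such as `dataset--ETPDataspace`, is passed through untouched.

```python
# Hardcoded markers quoted in the issue.
FILE = ":dataset--File."
FILE_COLLECTION = ":dataset--FileCollection."
EDS_FILE = ":dataset--ConnectedSource."
FILE_BASED_MARKERS = (FILE, FILE_COLLECTION, EDS_FILE)

def is_file_based(dataset: dict) -> bool:
    """True only for kinds that actually carry file attributes."""
    return any(marker in dataset.get("kind", "") for marker in FILE_BASED_MARKERS)

def validate_datasets(datasets: list) -> None:
    for ds in datasets:
        if not is_file_based(ds):
            continue  # e.g. dataset--ETPDataspace: skip file checks entirely
        # Hypothetical location of the file attributes being validated.
        if not ds.get("data", {}).get("DatasetProperties", {}).get("FileSourceInfo"):
            raise ValueError(f"{ds.get('id')}: missing FileSource information")
```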
Milestone: M17 - Release 0.20 · Assignee: Chad Leong

---
**field schema-id replaced by {{data-partition-id}}**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/9 · li shuangqi · 2023-05-22

Using "{{data-partition-id}}" to replace the schema-id "{{data-partition-id}}:wks:AbstractWPCGroupType:1.0.0" is a bug: the first section of a schema-id is the authority (OSDU). An error is reported during schema validation when we use a different partition.

`field.replace("{{data-partition-id}}", self.context.data_partition_id)`
```python
SURROGATE_KEYS_PATHS = [
    ("definitions", "{{data-partition-id}}:wks:AbstractWPCGroupType:1.0.0", "properties", "Datasets",
     "items"),
    ("definitions", "{{data-partition-id}}:wks:AbstractWPCGroupType:1.0.0", "properties", "Artefacts",
     "items", "properties", "ResourceID"),
    ("properties", "data", "allOf", 1, "properties", "Components", "items"),
]
```

Milestone: M18 - Release 0.21

---
**Refactor DAG related code**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/64 · Yan Sushchynski (EPAM) · 2023-04-04

### Introduction
There is DAG related code that is executed in the container during a DAG run. The code is [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/blob/master/energistics/src/witsml_parser/main.py) and [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/blob/master/energistics/src/witsml_parser/energistics/libs/create_energistics_manifest.py). And this code looks messy and outdated, and requires some refactoring.
### What should be done?
1. Update the code to make it work with the most recent `osdu-*` Python libs. The dependencies are here https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/blob/master/build/requirements.txt
2. Delete deprecated functionality of processing files by `preload_file_path` [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/blob/master/energistics/src/witsml_parser/energistics/libs/create_energistics_manifest.py#L314).
3. Add the static-analysis step in the CI/CD.
4. Add possibility to pass the user's access/id token to the DAG
5. Common refactoring, because the code is messy now (a lot of "ifs" and too many lines of code in a single function)

Milestone: M17 - Release 0.20 · Assignee: Vadzim Kulyba

---
**Adding headers to Put the file on Dataset Service**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/issues/5 · Jayesh Bagul · 2023-02-13

**Context:**
The goal of this issue is to put the Azure manifest JSON files into the storage service. Currently the header for this request is not considered in [**osdu-airflow-lib**](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib). The call from [`_put_file_on_dataset_service`](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/blob/master/osdu_airflow/operators/mixins/ReceivingContextMixin.py#L226), which calls the [**osdu_api**](https://community.opengroup.org/osdu/platform/system/sdks/common-python-sdk/-/blob/master/osdu_api/clients/base_client.py#L194), is making the ingestion process fail.
```
<Error>
<Code>MissingRequiredHeader</Code>
<Message>An HTTP header that's mandatory for this request is not specified.
RequestId:bdf6dae2-701e-004f-2529-367fcb000000
Time:2023-02-01T10:37:20.2256774Z</Message>
<HeaderName>x-ms-blob-type</HeaderName>
</Error>
```
The "x-ms-blob-type" header is used in the Azure Blob storage service to specify the type of blob that is being uploaded. It ensures that the correct type of blob is being uploaded.
The accepted value for the "x-ms-blob-type" header is "BlockBlob".
Call from _osdu-airflow-lib_:
`put_result = dataset_dms_client.make_request(method=HttpMethod.PUT, url=signed_url, data=file_content, no_auth=True)`
**Proposal:**
- The dataset service should be able to pass the headers while calling osdu_api (see the sketch below)
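For illustration, a minimal sketch of the PUT with the mandatory Azure header, using plain `requests` (the actual fix would need the header to flow through `osdu_api`'s `make_request`, whose exact parameters I am not assuming here):

```python
import requests

def put_file_on_signed_url(signed_url: str, file_content: bytes) -> None:
    """Upload to the signed URL with the header Azure Blob storage
    requires for a Put Blob operation on a block blob."""
    resp = requests.put(
        signed_url,
        data=file_content,
        headers={"x-ms-blob-type": "BlockBlob"},  # mandatory for Azure block blobs
    )
    resp.raise_for_status()
```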
cc: @Srinivasan_Narayanan @chad @valentin.gauthier @Yan_Sushchynski @nursheikh

Milestone: M16 - Release 0.19 · Assignee: Jayesh Bagul

---
**Misleading log statements**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/150 · Maksim Malkov · 2022-12-12
The Workflow service searches for a triggered workflow first in the provided data partition. A system workflow like CSV would not be available in a data partition; in such cases the service logs "workflow not found".
Next, the same workflow is searched for in the system DB, where it is found, and processing completes.
But these logs create the impression that some workflow is not found by the Workflow service, when actually there is no such issue.

Milestone: M16 - Release 0.19

---
**EDS DMS - Schema changes**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/eds-dms/-/issues/5 · Jeyakumar Devarajulu · 2023-02-16
Some of the attributes in the ConnectedSourceRegistryEntry and ConnectedSourceDataJob schemas have been changed to adhere to the OSDU naming convention and standard before the M13 release.
EDS DMS uses SecuritySchemes to connect to external systems.
Below are a few changes to the ConnectedSourceRegistrySchema attributes. If any of the attributes below are used in the EDS DMS data model, please change them.
The attribute ClientIdKeyName will hold an Azure Key Vault key; using that key, the value should be fetched from Azure Key Vault via the Secret Service. The Secret Service implementation is already part of the EDS DMS code.
Note: if an attribute name contains KeyName, its value is stored in Azure Key Vault and should be fetched using the Secret Service, e.g. ClientSecretKeyName.
| Old Name | New Name |
| ------ | -------- |
| Type | TypeID |
| FlowType | FlowTypeID |
| callbackUrl | CallbackUrl |
| authorizationUrl | AuthorizationUrl |
| ScopesKey | ScopesKeyName |
| ClientSecretKey | ClientSecretKeyName |
| ClientID | ClientIDKeyName |
| RefreshTokenKey | RefreshTokenKeyName |
| AccessTokenKey | AccessTokenKeyName |
| APIKeyKey | APIKeyKeyName |
| UsernameKey | UsernameKeyName |
| PasswordKey | PasswordKeyName |

Milestone: M15 - Release 0.18 · Assignee: Thulasi Dass Subramanian

---
**Error diagnostics - need to improve significantly**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/79 · Debasis Chatterjee · 2022-12-13
You may start off by checking here:
https://community.opengroup.org/osdu/platform/pre-shipping/-/tree/main/R3-M14/AWS-M14/Ingestion%20DAG%20CSV
For each and every problem, I did not get a suitable clue from the error log.
1. Problem in data: ELEVATION has a non-numeric value.
2. Problem in schema: TVD, Latitude, Longitude missed "type=string".
3. At times when the file is missing (incorrect sequence in the collection), it gives a fatal error instead of saying clearly "Unable to get the CSV file".
This caused a situation where the record gets created and we can see all properties from the Storage service, but none from the Search service.
Nearly impossible to figure out for an average Data Loader (user).
Next, imagine we are ingesting 1000 rows from a source CSV and problems occur in row 253 and row 455.
The user's expectation is that the CSV ingestion program should pinpoint and clearly indicate the row number and the type of problem which caused the failure (see the sketch below).
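A sketch of the kind of row-level reporting being asked for (hypothetical validator, not the CSV parser's actual code): collect every failing row with its row number and reason instead of failing fatally on the first problem:

```python
import csv
import io

def validate_rows(csv_text: str) -> list[tuple[int, str]]:
    """Return (row_number, problem) pairs, e.g. a non-numeric ELEVATION."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        try:
            float(row["ELEVATION"])
        except (KeyError, ValueError):
            errors.append((row_number,
                           f"ELEVATION is not numeric: {row.get('ELEVATION')!r}"))
    return errors

sample = "WELL,ELEVATION\nA-1,123.4\nA-2,n/a\n"
print(validate_rows(sample))  # [(3, "ELEVATION is not numeric: 'n/a'")]
```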
cc @chad, @tdixon

---
**Manifest by reference : error while DAG run**
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108 · Devdatta Santra · 2022-11-11

**While running the manifest by reference DAG, we are getting the following error in "validate_manifest_schema_task".**
```
[2022-10-13 08:56:52,287] {standard_task_runner.py:76} INFO - Running: ['***', 'tasks', 'run', 'Osdu_ingest_by_reference', 'validate_manifest_schema_task', '2022-10-13T08:56:41.095723+00:00', '--job-id', '13024', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/osdu-ingest-by-reference-r3.py', '--cfg-path', '/tmp/tmpv4ta88jt', '--error-file', '/tmp/tmpkxyyt6ok']
[2022-10-13 08:56:52,288] {standard_task_runner.py:77} INFO - Job 13024: Subtask validate_manifest_schema_task
[2022-10-13 08:56:52,390] {logging_mixin.py:104} INFO - Running <TaskInstance: Osdu_ingest_by_reference.validate_manifest_schema_task 2022-10-13T08:56:41.095723+00:00 [running]> on host ***-worker-0.***-worker.osdu.svc.cluster.local
[2022-10-13 08:56:52,509] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=Osdu_ingest_by_reference
AIRFLOW_CTX_TASK_ID=validate_manifest_schema_task
AIRFLOW_CTX_EXECUTION_DATE=2022-10-13T08:56:41.095723+00:00
AIRFLOW_CTX_DAG_RUN_ID=83247382-218b-44b5-b1c1-0b921ee67dd6
[2022-10-13 08:57:04,974] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/validate_manifest_schema_by_reference.py", line 110, in execute
manifest_data = self._get_manifest_data_by_reference(context=context,
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/mixins/ReceivingContextMixin.py", line 105, in _get_manifest_data_by_reference
retrieval_content_url = retrieval.json()["delivery"][0]["retrievalProperties"]["signedUrl"]
KeyError: 'delivery'
[2022-10-13 08:57:04,977] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=Osdu_ingest_by_reference, task_id=validate_manifest_schema_task, execution_date=20221013T085641, start_date=20221013T085652, end_date=20221013T085704
```
It would be very helpful to get any resolution regarding this.
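The traceback points at an unguarded lookup of `delivery` in the Dataset service retrieval response. A defensive sketch (hypothetical handling, not the actual ReceivingContextMixin code) that would surface the real service error instead of a bare KeyError:

```python
# Hypothetical guard around the retrieval response.
def extract_signed_url(retrieval_json: dict) -> str:
    delivery = retrieval_json.get("delivery")
    if not delivery:
        # Bubble up whatever the Dataset service actually returned
        # (e.g. the "Schema is not present" 404 seen below).
        raise RuntimeError(f"Unexpected retrieval response: {retrieval_json}")
    return delivery[0]["retrievalProperties"]["signedUrl"]
```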
======================================
Updates about the new errors encountered:
1) `AttributeError: 'dict' object has no attribute 'to_JSON'`, as mentioned in the comment below:
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108#note_159282
2) "Schema is not present" error from Dataset service while running the DAG
```
2022-10-19 12:03:14.191 DEBUG 1 --- [nio-8080-exec-1] .m.m.a.ExceptionHandlerExceptionResolver : Using @ExceptionHandler org.opengroup.osdu.dataset.util.GlobalExceptionMapper#handleAppException(AppException)
2022-10-19 12:03:14.193 WARN 1 --- [nio-8080-exec-1] o.o.o.c.common.logging.DefaultLogWriter : dataset-registry.app: Schema is not present
AppException(error=AppError(code=404, reason=Schema Service: get 'opendes:wks:dataset--File.Generic:1.0.0', message=Schema is not present, errors=null, debuggingInfo=null, originalException=null), originalException=null)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.validateDatasets(DatasetRegistryServiceImpl.java:233)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.createOrUpdateDatasetRegistry(DatasetRegistryServiceImpl.java:112)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi.createOrUpdateDatasetRegistry(DatasetRegistryApi.java:66)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$FastClassBySpringCGLIB$$774ab2c5.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:123)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$EnhancerBySpringCGLIB$$649af8f9.createOrUpdateDatasetRegistry(<generated>)
```

Assignee: Valentin Gauthier