Data Ingestion issues — https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues

---

**ADR: Keep Only DAG Files in the DAG Folder** (Nisha Thakran, updated 2023-08-17)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/issues/27
**Introduction**:
The purpose of this ADR is to review and approve the proposed changes to the Directed Acyclic Graph (DAG) structure and the relocation of Python files to the osdu_airflow package. The objectives of the proposed changes are to enhance the performance of the DAGs and to avoid unnecessary import issues.
The affected DAG files of the CSP:
- `src/dags/eds_scheduler/eds_scheduler_dag.py`
- `src/dags/eds_ingest/src_dags_fetch_ingest_scheduler_dag.py`
**Purpose of Restructuring:**
- Mitigate potential import issues: by reframing the DAG structure and organizing the Python files into a coherent package like osdu_airflow, potential import issues can be mitigated. Import statements in the DAGs and related modules then accurately reflect the new directory structure, reducing the likelihood of import errors and improving the overall stability of the system.
- Enhance performance: when the Python files are organized within a specific package such as osdu_airflow, the Airflow scheduler parses and schedules only the actual DAG files during each run, instead of parsing every other file in the folder, reducing unnecessary processing time and enhancing the overall efficiency of the scheduling process.
**Status**
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
**Scope**
The scope of this ADR includes the following scenarios:
- Reframing the Directed Acyclic Graph (DAG) structure: only keeping DAG files in the DAG folder.
- Moving the Python files associated with the DAGs into the osdu_airflow package, from which they are then imported.
**Current DAG Structure:**
![image](/uploads/847d7e4e8ab86395fa1a5f9c71c510bd/image.png)
**To-Be Structure:**
**DAG FOLDER:**
![image](/uploads/0d93c189db83e516f0a5da268aa9606b/image.png)
Python package structure in osdu_airflow:
![image](/uploads/57f8c6419dc09d513dea844f3a0a9137/image.png)
**Implementation:**
- Keep all the Python packages in the osdu_airflow repository within the folder structure.
- Create a package registry for the osdu_airflow library using the CI/CD pipeline.
- Create a branch in the repository (https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/tree/master/).
- Push the code to the branch.
- The CI/CD pipeline will run and create the dev package registry.
![image](/uploads/e3a5dd7f5c2c1aa70ee7678972c442c3/image.png)
- Within the dev environment, refer to the package registry version under the required libraries from the Python Package Index (PyPI), e.g. (an illustrative snippet follows the image):
Eg: ![image](/uploads/fa29421ceb5f60f1397c018ef3750d83/image.png)
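For illustration only, referencing the published package from a requirements file might look like the sketch below. The GitLab PyPI-compatible index path pattern is standard GitLab behavior, but the project ID and version are placeholders, not actual values.

```
# requirements sketch: <project-id> and <version> are placeholders for the
# osdu-airflow-lib project's GitLab project ID and the published version.
--extra-index-url https://community.opengroup.org/api/v4/projects/<project-id>/packages/pypi/simple
osdu-airflow==<version>
```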
**Technical Changes Required:**
- Update import statements: modify the import statements to import the required Python files from the osdu_airflow package instead of the previous directory structure (see the sketch below).
  Eg: `from osdu_airflow.eds.eds_scheduler.eds_email_automation import EmailAutomation`
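For illustration, a DAG file left behind in the DAG folder might reduce to a thin definition like the sketch below. The module path comes from the example above; the DAG id, schedule, and task wiring are assumptions, not the actual EDS code.

```python
# eds_scheduler_dag.py: the only kind of file that remains in the DAG folder.
# Task logic now lives in the osdu_airflow package installed from the package
# registry, so the scheduler only parses this thin definition on each run.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Imported from the osdu_airflow package instead of a sibling src/ directory.
from osdu_airflow.eds.eds_scheduler.eds_email_automation import EmailAutomation

with DAG(
    dag_id="eds_scheduler",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    send_email = PythonOperator(
        task_id="eds_email_automation",
        python_callable=EmailAutomation().run,  # assumed entry point
    )
```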
**Changes/Impacts of Restructuring:**
- Code split: the EDS code, except for the DAG files, will be moved from the current repository (https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/tree/master/src/dags) to a new one (https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/tree/master/) within the open forum.
- Deployment of the module will be challenging: adding the package registry version to the dev environment is currently a manual job, so some changes to the existing CI/CD pipeline may be needed to automate the deployment.

Milestone: M19 - Release 0.22. Assignee: Yan Sushchynski (EPAM)

---

**Manifest ingestion fail on non-file-based datasets** (Laurent Deny, updated 2023-04-26)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-ingestion-lib/-/issues/10
The validation script validate_file_source.py rejects datasets that do not match the hardcoded types:
* FILE = ":dataset--File."
* FILE_COLLECTION = ":dataset--FileCollection."
* EDS_FILE = ":dataset--ConnectedSource."
Other dataset types that are not associated with files should not be checked for any of the file attributes such as `FileSource` (a sketch of such a type gate follows). For example, the manifest below contains an ETP dataset and should bypass the file-associated checks.
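A minimal sketch of the requested type gate, reusing the hardcoded markers listed above. The function names and the `FileSource` lookup path are illustrative, not the actual validate_file_source.py code.

```python
# Markers copied from the issue description above.
FILE = ":dataset--File."
FILE_COLLECTION = ":dataset--FileCollection."
EDS_FILE = ":dataset--ConnectedSource."

FILE_BASED_MARKERS = (FILE, FILE_COLLECTION, EDS_FILE)


def is_file_based(kind: str) -> bool:
    """Return True only for dataset kinds that carry file attributes."""
    return any(marker in kind for marker in FILE_BASED_MARKERS)


def validate_dataset(dataset: dict) -> None:
    # Skip FileSource checks for non-file datasets such as
    # osdu:wks:dataset--ETPDataspace:1.0.0 instead of rejecting them.
    if not is_file_based(dataset["kind"]):
        return
    props = dataset["data"]["DatasetProperties"]
    if not props.get("FileSourceInfo", {}).get("FileSource"):
        raise ValueError("FileSource must be populated for file-based datasets")
```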
```json
{
"kind": "osdu:wks:Manifest:1.0.0",
"Data": {
"Datasets": [
{
"acl": {
"viewers": [
"data.default.viewers@opendes.contoso.com"
],
"owners": [
"data.default.owners@opendes.contoso.com"
]
},
"kind": "osdu:wks:dataset--ETPDataspace:1.0.0",
"legal": {
"legaltags": [
"opendes-ReservoirDDMS-Legal-Tag"
],
"otherRelevantDataCountries": [
"US",
"UK"
]
},
"createTime": "2023-03-21T16:33:19.651Z",
"modifyTime": "2023-03-21T16:33:19.651Z",
"id": "opendes:dataset--ETPDataspace:M16_Demo-Volve_Reservoir",
"version": 1,
"data": {
"ExistenceKind": "opendes:reference-data--ExistenceKind:Actual:",
"DatasetProperties": {
"URI": "eml:///dataspace('M16_Demo/Volve_Reservoir')"
},
"Name": "M16_Demo/Volve_Reservoir"
}
}
]
}
}
```

Milestone: M17 - Release 0.20. Assignee: Chad Leong

---

**Manifest by reference : error while DAG run** (Devdatta Santra, updated 2022-11-11)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108
**While running the manifest by reference DAG, we are getting the following error in "validate_manifest_schema_task".**
```
[2022-10-13 08:56:52,287] {standard_task_runner.py:76} INFO - Running: ['***', 'tasks', 'run', 'Osdu_ingest_by_reference', 'validate_manifest_schema_task', '2022-10-13T08:56:41.095723+00:00', '--job-id', '13024', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/osdu-ingest-by-reference-r3.py', '--cfg-path', '/tmp/tmpv4ta88jt', '--error-file', '/tmp/tmpkxyyt6ok']
[2022-10-13 08:56:52,288] {standard_task_runner.py:77} INFO - Job 13024: Subtask validate_manifest_schema_task
[2022-10-13 08:56:52,390] {logging_mixin.py:104} INFO - Running <TaskInstance: Osdu_ingest_by_reference.validate_manifest_schema_task 2022-10-13T08:56:41.095723+00:00 [running]> on host ***-worker-0.***-worker.osdu.svc.cluster.local
[2022-10-13 08:56:52,509] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=Osdu_ingest_by_reference
AIRFLOW_CTX_TASK_ID=validate_manifest_schema_task
AIRFLOW_CTX_EXECUTION_DATE=2022-10-13T08:56:41.095723+00:00
AIRFLOW_CTX_DAG_RUN_ID=83247382-218b-44b5-b1c1-0b921ee67dd6
[2022-10-13 08:57:04,974] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/validate_manifest_schema_by_reference.py", line 110, in execute
manifest_data = self._get_manifest_data_by_reference(context=context,
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/mixins/ReceivingContextMixin.py", line 105, in _get_manifest_data_by_reference
retrieval_content_url = retrieval.json()["delivery"][0]["retrievalProperties"]["signedUrl"]
KeyError: 'delivery'
[2022-10-13 08:57:04,977] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=Osdu_ingest_by_reference, task_id=validate_manifest_schema_task, execution_date=20221013T085641, start_date=20221013T085652, end_date=20221013T085704
```
It would be very helpful to get any resolution regarding this.
======================================
Updates about the new errors encountered:
1) `AttributeError: 'dict' object has no attribute 'to_JSON'` - as mentioned in the comment below:
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108#note_159282
2) "Schema is not present" error from Dataset service while running the DAG
```
2022-10-19 12:03:14.191 DEBUG 1 --- [nio-8080-exec-1] .m.m.a.ExceptionHandlerExceptionResolver : Using @ExceptionHandler org.opengroup.osdu.dataset.util.GlobalExceptionMapper#handleAppException(AppException)
2022-10-19 12:03:14.193 WARN 1 --- [nio-8080-exec-1] o.o.o.c.common.logging.DefaultLogWriter : dataset-registry.app: Schema is not present
AppException(error=AppError(code=404, reason=Schema Service: get 'opendes:wks:dataset--File.Generic:1.0.0', message=Schema is not present, errors=null, debuggingInfo=null, originalException=null), originalException=null)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.validateDatasets(DatasetRegistryServiceImpl.java:233)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.createOrUpdateDatasetRegistry(DatasetRegistryServiceImpl.java:112)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi.createOrUpdateDatasetRegistry(DatasetRegistryApi.java:66)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$FastClassBySpringCGLIB$$774ab2c5.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:123)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$EnhancerBySpringCGLIB$$649af8f9.createOrUpdateDatasetRegistry(<generated>)
```

Assignee: Valentin Gauthier

---

**Bulk Dataset record IDs** (Augustin Pilard Zen, updated 2022-10-19)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/prodml-parser/-/issues/4

Be able to trigger the parsing given a list of dataset record IDs.

---

**Explore/Research on apache parquet data format storage mechanism** (Ashutosh Kumar, updated 2022-10-31)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/8

We need to explore the Apache Parquet data format so that we can convert the data retrieved from the OPC UA server into Parquet format.

Assignee: Ashutosh Kumar

---

**Collect values of Energyvue server for a minute to process the data** (Ashutosh Kumar, updated 2022-10-31)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/7

1: Connect with the Energyvue server using the Milo SDK
2: Fetch nodes and their respective values
3: Collect these values for one minute and view the data to process further

Assignee: Ashutosh Kumar

---

**Connect with OPC UA server and read node values using eclipse milo** (Ashutosh Kumar, updated 2022-10-31)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/6

Connect to the EnergyVue server using the Milo SDK and try to read node values.

Assignee: Ashutosh Kumar

---

**Connect Energyvue server using OPC UA client (UAExpert and OPC UA browser)** (Ashutosh Kumar, updated 2022-07-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/5

Connect and see the files/folders structure of the Energyvue server after connecting using:
1: OPC UAExpert
2: Prosys OPC UA browser

Assignee: Ashutosh Kumar

---

**Write sample code to connect to Energyvue server using eclipse milo** (Ashutosh Kumar, updated 2022-10-31)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/4

Write sample code to connect to the Energyvue server using Eclipse Milo.
The OPC server endpoint is:
opc.tcp://demo.energyvue.com:62546/EnergyVue/OpcServer
Preferred security is shown below and should be automatically adopted by the server if you support it.
Mode: Sign & Encrypt
Policy: Aes256Sha256RsaPss

Assignee: Ashutosh Kumar

---

**Write/Explore Sample code to connect to Eclipse milo server using eclipse milo sdk** (Ashutosh Kumar, updated 2022-07-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/3

Download the Eclipse Milo SDK code and try to write and execute a program to connect to the Eclipse Milo server:
opc.tcp://milo.digitalpetri.com:62541/milo
Also try to connect using:
1: Unified Automation UAExpert
2: The Eclipse Milo SDK

Assignee: Ashutosh Kumar

---

**Explore the OPC UA client and server** (Ashutosh Kumar, updated 2022-07-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/2

1: Explore the OPC UA client and server architecture.
2: Check the communication methods between them.

Milestone: M14 - Release 0.17. Assignee: Ashutosh Kumar

---

**Exploring OPC-UA open source SDK** (Chad Leong, updated 2022-07-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/opc-ua-ingestion/-/issues/1

Evaluating different OPC-UA open-source client SDK options:
1. Eclipse Milo
2. OPC UA client SDK

Assignee: Ashutosh Kumar

---

**ADR: Enabling User Context in Ingestion** (harshit aggarwal, updated 2024-02-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/52
## Status
- [ ] Proposed
- [ ] Trialing
- [ ] Under review
- [X] Approved
- [ ] Retired
## Problem Statement
Currently, Ingestion Jobs in OSDU like CSV Parser and Manifest Ingestion use a Service Account Token when calling OSDU Service APIs like Storage/Dataset, which means any Authorization checks for API Access or Data Level Access (ACL checks in the Storage service) are based on the permission level of the Service Account rather than on the User who initiated the Ingestion in the first place.
Therefore, users indirectly get the highest level of permissions in OSDU, which can be used to modify data of other users in the system (a scenario from the CSV Parser is discussed later to illustrate the issue). This problem is not specific to Ingestion: it can be true for any service that performs long-running jobs and relies on a Service Account Token. For the rest of this ADR we discuss the Ingestion scenario and related flows to highlight the problem and solution, but as noted it can apply to OSDU in general.
![image](/uploads/4164ef365abd7c19a48633da5bfe6e5e/image.png)
This is similar to the problem raised in this issue: https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/merge_requests/219
And the proposed temporary fix: https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/merge_requests/219
This ADR targets solving it holistically for similar use cases.
_**Note – In this ADR, all references to the Service Account Token mean the token used internally by OSDU services for any service-to-service calls, i.e. a kind of privileged user. We are not referring to any external service accounts that clients might use for external applications.**_
**Questions**
- _**Why can't these long-running jobs rely on User Tokens passed in request headers by the end user?**_ - Because of the long-running nature of the jobs, tokens will eventually expire and require renewal, and the system can't renew these User Tokens since it doesn't have access to user-specific credentials and auth codes.
**Scenario to Understand this Issue**
In CSV Ingestion, the IDs of the generated records are predetermined using natural keys [[Code Link](https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/blob/master/csv-parser-core/src/main/java/org/opengroup/osdu/csvparser/handler/handlers/IdHandler.java)], so there is a possibility of two users ingesting records with the same IDs.
Now say User A invoked CSV Ingestion and tried to ingest an existing record again (created by a different User B with xyz ACLs). User A, who is effectively trying to update this record, may not have access to the ACLs associated with it, but since the Ingestion Job uses the Service Account Token, ACL validations will succeed in the Storage service Create/Update Record flow (Service Accounts are part of the `users.data.root` group, which gives them access to all data in the system).
As a result, User A updates records created by User B, resulting in data loss for the original user, which is not expected behavior and is a major gap in Authorization. A similar issue can happen in Manifest Ingestion if a user tries to ingest a Manifest with the same Record IDs.
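To make the collision concrete, here is a sketch of deterministic ID generation from natural keys. It is illustrative only; the actual logic lives in the linked Java IdHandler.

```python
import hashlib


def record_id(partition: str, entity_type: str, natural_keys: list) -> str:
    """Derive a deterministic record id from natural keys (illustrative)."""
    digest = hashlib.sha256("|".join(natural_keys).encode()).hexdigest()[:16]
    return f"{partition}:{entity_type}:{digest}"


# Two users ingesting rows with the same natural keys target the same record
# id; the service-account token then lets User A overwrite User B's record.
id_user_a = record_id("opendes", "master-data--Well", ["WELL-123", "NO"])
id_user_b = record_id("opendes", "master-data--Well", ["WELL-123", "NO"])
assert id_user_a == id_user_b
```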
**Key Issue to Address**
- Any flow invoked by an external user should preserve the user context and use it for Authorization; hence any User-to-Service call and the subsequent internal Service-to-Service calls should use the user identity (context)
- All asynchronous flows like Indexer Queue, WKS, Notification etc. can still rely on the internal Service Account for Authorization, as these are background system operations and not user driven
## Proposed Solution
We can leverage the SPI Layer (Service Mesh/API Gateway), which is responsible for Authentication & Identity Resolution, in this scenario. As part of Entitlements V2 onboarding, authentication and identity resolution were extracted from the Entitlements service, and the service now expects the identity to be provided to it in the requests. The x-user-identity header is an expected header on requests into the service; it provides the identity of the user in the request and is set by the SPI Layer (Service Mesh/Gateway).
**Service Side Changes**
- A new header x-on-behalf-of will be introduced to carry the user identity (context)
- The Workflow Service will add a new user identity field (present in DPS Headers) to the Airflow `Conf` while triggering the DAG Run
- Ingestion Jobs (CSV/Manifest) will extract the newly added user identity field from the Airflow Context and set the x-on-behalf-of header on requests before calling any downstream services (see the sketch below)
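A minimal sketch of the DAG-side change, assuming a hypothetical conf key (`user_id`), a placeholder Storage URL, and a plain `requests` call; in practice this plumbing would live in shared libraries such as osdu-airflow.

```python
import requests  # illustrative HTTP client choice

OSDU_STORAGE_URL = "https://osdu.example.com/api/storage/v2/records"  # placeholder


def put_records(context: dict, payload: list, service_token: str):
    """Call Storage while preserving the initiating user's identity."""
    # The Workflow service placed the user's identity into the dag_run conf
    # when triggering the run; "user_id" is a hypothetical key name.
    user_identity = context["dag_run"].conf.get("user_id")

    headers = {
        "Authorization": f"Bearer {service_token}",
        "data-partition-id": context["dag_run"].conf.get("data_partition_id", ""),
    }
    if user_identity:
        # Header proposed by this ADR: lets the SPI layer resolve x-user-id
        # to the user who started the ingestion, not the service account.
        headers["x-on-behalf-of"] = user_identity

    return requests.put(OSDU_STORAGE_URL, json=payload, headers=headers)
```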
**Change in SPI Layer (Service Mesh)**
- If the request contains the internal Service Account Token and the x-on-behalf-of header is not empty or null, then set the x-user-id header to the value of x-on-behalf-of
- Else, set the x-user-id header using the existing logic
This preserves the user identity (context), so all API-level and data-level authorization checks are performed based on the Entitlement groups of the user rather than the Service Account.
Authentication can still be carried out using the Service Account Token, as the user was already authenticated when they triggered the Workflow API. A sketch of the resolution rule follows.
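A sketch of this resolution rule in plain Python, purely for illustration; each CSP would implement it in its own service mesh or gateway, and both helper functions here are hypothetical stubs.

```python
def is_internal_service_account(token: str) -> bool:
    """Stub: a real mesh would verify the token against the internal SA."""
    raise NotImplementedError


def identity_from_token(token: str) -> str:
    """Stub: existing behavior, extracting the identity from JWT claims."""
    raise NotImplementedError


def resolve_user_id(headers: dict, token: str) -> str:
    """Decide which identity the x-user-id header should carry."""
    on_behalf_of = headers.get("x-on-behalf-of")

    # Rule from this ADR: a trusted internal caller using the privileged
    # service account token may act on behalf of the end user.
    if is_internal_service_account(token) and on_behalf_of:
        return on_behalf_of

    # Otherwise keep the existing behavior.
    return identity_from_token(token)


# The gateway would then set headers["x-user-id"] = resolve_user_id(...)
# before forwarding the request to the OSDU service.
```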
#### Scope of Proposed Solution
- The above proposed solution is for trusted DAGs only, not for any custom private DAGs. By trusted we mean a piece of code which is reviewed and signed off by all stakeholders, like the code for OSDU services; trusted DAGs here include all the community DAGs such as CSV Parser, Manifest, SGY, VDS, WITSML etc.
- Going forward we might want to support clients bringing in their own custom private DAGs; the obvious question then is how we will enforce that those new DAGs/ingestors adhere to the proposed guidelines, including setting the x-on-behalf-of header. We will publish guidelines for developers to follow while bringing in new DAGs, but the enforcement angle is still missing and needs to be ironed out. To clearly call it out: this ADR is not focusing on handling custom private DAGs; a separate ADR and discussion will be proposed
**Advantages of Above Approach**
- ACL validations will be performed based on the user id instead of the Service Account, which resolves the elevated-permissions issue
- Service-side code changes are minimal and will be scoped to setting up request headers/payloads
- Implementation of the header logic can be taken up by each CSP as per their infrastructure
- Hardens the Authorization checks, as it also ensures that only users with appropriate API permissions are able to trigger Ingestion
- No change in service behavior expected
## Flow Diagram
### User to Service Calls and Service to Service Calls
![image](/uploads/49b9d09aa9a4abaebf3374241350bb5c/image.png)
### Asynchronous Flows
![image](/uploads/2d23fb70c1e087d5c54db5c68eb1518d/image.png)
## Security Enhancement
This is a proposed enhancement, extending the above approach to further harden the security of the system.
Currently in OSDU deployments, the internal Service Account Token can be used directly by clients to invoke OSDU APIs.
The privileged Service Account Token should be restricted to internal services only; external users shouldn't be permitted to use it. Customers can still create and use any other Service Account for external calls, e.g. for use cases like an external monitoring service; the only restriction is on the usage of the internal account for external calls. This blocks any malicious tampering with the x-on-behalf-of header from outside and also increases security in case of an accidental Service Account Token leak.
**Changes required**
- A mechanism to distinguish external calls from internal calls; this can be handled by each CSP in their own Service Mesh implementation
- Block any external calls made using the internal Service Account Token
- Details on response codes can be fleshed out later
## FAQs
**Q - Is there any security issue with sending the user identifiers in plain text in request headers during service-to-service communication?**
A - CSPs can leverage their infrastructure here and enable encryption of all traffic sent between service containers in OSDU
**Q - What is the Service Account we are referring to in the ADR?**
A - In this ADR, all references to the Service Account Token mean the token used internally by OSDU services for any service-to-service calls, i.e. a kind of privileged user. We are not referring to any external service accounts/SPNs that clients might use for external applications
**Q - Ingestors (DAGs) can execute any piece of code, and the onus is on these DAGs to set the x-on-behalf-of header with the user identity, so how do we ensure the header is indeed set correctly with the right user identity?**
A - This question is about consumption, i.e. how the proposed design changes are consumed by the DAGs. We would leverage core libraries for the implementation of any HTTP clients and header-manipulation logic in the DAGs. For instance, the osdu-airflow library is used across all the DAGs, and for the CSV parser there is the os-core-common Java library; hence all the logic should be scoped to the common libraries, and developers of new DAGs should consume these libraries only.
This still doesn't resolve the enforcement angle, i.e. how we can enforce and reject any DAGs not adhering to this design; as mentioned, this is not in scope for this ADR
**Q - Do we need any validation in the Service Mesh to ensure the x-on-behalf-of header was correctly set by ingestors?**
A - This is related to the previous question; for the scope of this ADR we assume DAGs and OSDU services are trusted and won't manipulate the headers in any wrong manner
**Q - Will this handle scenarios where an Ingestion Job may execute outside of the system Airflow cluster, for instance an external application or an external Airflow cluster?**
A - In the case of external ingestion jobs, any outside calls to OSDU services using the internal Service Account will be blocked, so the x-on-behalf-of flow won't be involved. The existing logic of JWT claims extraction will therefore be followed, and these external jobs should either use User Tokens or create a new service account for token generation; this automatically takes care of elevated privileges
**Q - Are we proposing blocking all Service Account calls from outside as part of the Security Enhancement?**
A - No, only external calls using the internal privileged Service Account will be blocked

Milestone: M18 - Release 0.21. Assignee: Om Prakash Gupta

---

**Use byte location information from Dataset (Filecollection.SegY) and use when converting SegY to oVDS (GONRG-5217)** (Debasis Chatterjee, updated 2022-12-13)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/issues/12
@Yan_Sushchynski - As previously discussed, this change will bring parity for both converters (oZgy and oVDS).
Earlier @marius explained that default values can be overridden at run time by using suitable parameters.
Quote
Example -
--header-field InlineNumber=223:4 --header-field CrosslineNumber=73:4
The appropriate trace header fields would be in the FileCollection.SEGY record, in the VectorHeaderMapping[], where possible values are indicated in the table of reference values for HeaderKeyName.1.0.0.
It is the DAG's responsibility to find the appropriate values in the FileCollection.SEGY record and use them to send the correct arguments to the SEGYImport process.
(The OpenVDS SEGYImport tool is generalized, so it doesn't know about OSDU databases, Work Products and FileCollection.SEGY records; alas, the DAG needs to facilitate.)
The SEGYImport tool can be used on 2D, 3D and 4D datasets, so the DAGs should support these potential FileCollections and VectorHeaderMappings.
Unquote
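A sketch of how a DAG might translate VectorHeaderMapping entries into SEGYImport arguments. The record shape and field names below are assumptions for illustration; the authoritative shape is defined by the FileCollection.SEGY schema.

```python
# Hypothetical, simplified record shape; the real FileCollection.SEGY schema
# defines VectorHeaderMapping[] with its own field names.
def build_header_field_args(segy_record: dict) -> list:
    """Turn header mappings into SEGYImport --header-field arguments."""
    args = []
    for mapping in segy_record["data"].get("VectorHeaderMapping", []):
        name = mapping["KeyName"]       # e.g. "InlineNumber" (illustrative key)
        byte_pos = mapping["Position"]  # e.g. 223 (illustrative key)
        width = mapping["Width"]        # e.g. 4 (illustrative key)
        args += ["--header-field", f"{name}={byte_pos}:{width}"]
    return args


# A record mapping InlineNumber to byte 223 with width 4 would yield
# ["--header-field", "InlineNumber=223:4"], matching the quoted example.
```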
Copying to @Keith_Wall , @Kateryna_Kurach for information

Milestone: M13 - Release 0.16

---

**Witsml parser - Wrong encoding for double colons - CoordinateReferenceSystem** (etienne peysson, updated 2022-08-23)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/52
Even though the DAG passes, looking at the Xcom logs we see skipped ids:
{'provide_manifest_integrity_task': [{'id': 'odesprod:work-product--WorkProduct:TEST_TRAJECTORY_002', 'kind': 'odesprod:wks:work-product--WorkProduct:1.0.0', 'reason': 'Missing parents: {SRN: odesprod:work-product-component--WellboreTrajectory:TEST_TRAJECTORY_002}'}, {'id': 'odesprod:work-product-component--WellboreTrajectory:TEST_TRAJECTORY_002', 'kind': 'odesprod:wks:work-product-component--WellboreTrajectory:1.0.0', 'reason': 'Missing parents: {SRN: odesprod:reference-data--CoordinateReferenceSystem:odesprod%3Areference-data--CoordinateReferenceSystem%3AGeodeticCRS%253A%253AEPSG%253A%253A4230%3A, SRN: odesprod:reference-data--CoordinateReferenceSystem:GeodeticCRS%3A%3AEPSG%3A%3A4230}'}]}
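Note the `%253A` sequences in the skipped IDs: `%25` is itself the percent-encoding of `%`, so these colons appear to have been URL-encoded twice. A minimal illustration:

```python
from urllib.parse import quote

crs_id = "GeodeticCRS::EPSG::4230"

once = quote(crs_id, safe="")   # 'GeodeticCRS%3A%3AEPSG%3A%3A4230'
twice = quote(once, safe="")    # 'GeodeticCRS%253A%253AEPSG%253A%253A4230'

# The skipped-id log above contains the doubly encoded form (%253A), which
# no longer matches the singly encoded reference-data record id.
print(once)
print(twice)
```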
The WITSML file is the one for the trajectory, and the field concerned is the following:
```
<Crs xsi:type="eml:GeodeticLocalAuthorityCrs">
<eml:LocalAuthorityCrsName authority="EPSG">GeodeticCRS::EPSG::4230</eml:LocalAuthorityCrsName>
</Crs>
```
As you can see, we need to fill in details about the ID of the reference data, but as it contains double colons, it does not pass.

Milestone: M10 - Release 0.13

---

**WITSML parser fails - expects parent entity with a specified version** (Kateryna Kurach (EPAM), updated 2022-01-19)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/48

I am trying to process the file (see attached). WITSML ingestion fails with the following error message:
` {'provide_manifest_integrity_task': [{'id': 'odesprod:work-product--WorkProduct:20C60DDC-D36D-4A3C-800F-504CE0B5605D', 'kind': 'odesprod:wks:work-product--WorkProduct:1.0.0', 'reason': 'Missing parents: {SRN: odesprod:work-product-component--WellboreTrajectory:20C60DDC-D36D-4A3C-800F-504CE0B5605D}'}, {'id': 'odesprod:work-product-component--WellboreTrajectory:20C60DDC-D36D-4A3C-800F-504CE0B5605D', 'kind': 'odesprod:wks:work-product-component--WellboreTrajectory:1.0.0', 'reason': 'Missing parents: {SRN: odesprod:dataset--File.WITSML:20C60DDC-D36D-4A3C-800F-504CE0B5605D:1}'}]} `
The problem is that it expects a dataset record with a version,
`odesprod:dataset--File.WITSML:20C60DDC-D36D-4A3C-800F-504CE0B5605D:1`,
but creates a dataset record without a version.

Attachment: [Trajectory.xml](/uploads/a43e17d20edcab84859f7cb49f72687c/Trajectory.xml)

Milestone: M10 - Release 0.13. Participants: Laurent Deny, etienne peysson

---

**IBM R3M8 - Failure to ingest Wellbore data from WITSML source** (Debasis Chatterjee, updated 2022-08-23)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/43
Reported by @epeysson .
From Airflow log, we see the following as reason of failure.
```
"Missing referential id: {
'opendes:reference-data--VerticalMeasurementType:TotalDepth:',
'opendes:reference-data--VerticalMeasurementPath:MeasuredDepth:',
'opendes:reference-data--VerticalMeasurementPath:TrueVerticalDepth:',
```
Input source [Etienne-Wellbore.xml](/uploads/dcc04d22a77be52599a40ef3ae6f487d/Etienne-Wellbore.xml)
After some investigation, it was determined that the OSDU reference value convention has changed to use codes instead:
- reference-data--VerticalMeasurementType:TD instead of TotalDepth
- reference-data--VerticalMeasurementPath:TVD instead of TrueVerticalDepth
- reference-data--VerticalMeasurementPath:MD instead of MeasuredDepth
One possible solution is to change the code for such changes.
Another alternative is to hold the field mapping in a configuration file, so that this kind of change is easy to handle in the future; a sketch follows.
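A sketch of what such a configuration-driven mapping might look like; the structure and helper below are hypothetical.

```python
# Hypothetical mapping, e.g. loaded from a YAML/JSON config file, from legacy
# WITSML-derived names to the current OSDU reference-data codes.
REFERENCE_VALUE_MAPPING = {
    "reference-data--VerticalMeasurementType": {"TotalDepth": "TD"},
    "reference-data--VerticalMeasurementPath": {
        "TrueVerticalDepth": "TVD",
        "MeasuredDepth": "MD",
    },
}


def map_reference_value(entity_type: str, name: str) -> str:
    """Return the current code for a legacy name, or the name unchanged."""
    return REFERENCE_VALUE_MAPPING.get(entity_type, {}).get(name, name)


# map_reference_value("reference-data--VerticalMeasurementPath", "MeasuredDepth")
# returns "MD"; a future convention change only requires editing the mapping.
```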
For other data types (Well, Marker, Log, Trajectory, Tubular), there will potentially be other mismatches too.
Copying to @janas712 for her input on the subject.
Also adding @gehrmann for his awareness.
cc - @ChrisZhang , @chad , @Keith_Wall for information

Milestone: M10 - Release 0.13. Assignee: etienne peysson

---

**GSM Integration** (Fernando Nahu Cantera Rubio, updated 2021-10-06)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/56

CSV Parser integration with GSM: we can now get the details of the failure for records as well as jobs for all the CSV Ingestion runs, with proper error messages and error codes.

Milestone: M9 - Release 0.12. Assignee: Fernando Nahu Cantera Rubio

---

**Date-Time validation causing ingestion failure** (Keith Wall, updated 2022-03-21)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/91
Date values are failing schema validation on ingestion if the dates are not in UTC, or don't contain a time-zone offset.
This is a new validation that requires date-times to conform to RFC3339. The intent is good, but it does not conform to the schemas, or to our data.
There is a large volume of data in which a date is known, but for which no time or time zone is provided. Recognizing this, the OSDU schemas only require that dates be a string.
As a large volume of data is managed that does not have time zone information, our options are either to reject all these dates, or to ingest and maintain them in their original format.
If we force a time zone change on the data by putting it into a UTC format when we really do not know the time zone, we are corrupting the data.
I have consulted the Enterprise Architecture Geomatics team and asked whether we should (1) not load dates with unknown time zones or (2) maintain the dates in as-provided form. There was complete agreement that the industry has a large volume of data with dates without time zones; we can still make use of those dates, but must not modify them by adding a default time zone.
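To illustrate the failure mode (a sketch, not the actual validator): a date-only value satisfies the schemas' string type but fails a strict RFC3339 date-time parse because it carries no time or offset.

```python
from datetime import datetime

# A date as commonly found in legacy well data: no time, no time zone.
legacy_date = "1987-06-05"

# A strict RFC3339 date-time parse (illustrative) requires time and offset.
try:
    datetime.strptime(legacy_date, "%Y-%m-%dT%H:%M:%S%z")
except ValueError:
    print("rejected: no time or time-zone offset")  # this branch is taken

# Forcing it to "1987-06-05T00:00:00Z" would pass such validation but would
# assert a UTC midnight that the source data never claimed.
```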
Please remove the date validation from ingestion.

Milestone: M9 - Release 0.12. Participants: Kishore Battula, Shrikant Garg, Spencer Sutton, suttonsp@amazon.com, Yan Sushchynski (EPAM). Assignee: Kishore Battula

---

**WIP - Airflow 2+ Adoption** (Ben Lasscock, updated 2022-03-16)
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/87
This issue is a place to track adoption of Airflow 2+ by the various CSPs.
Running Airflow 2+ with the experimental API is backward compatible with the current workflow services. However, it does provide potential performance improvements, particularly around the scheduler. Please update your current status here.
AWS - M9 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
Azure - M10 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
IBM - M9 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
GCP - M8 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)