# Manifest Ingestion DAG issues
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues

---

## Issue 109: Manifest ingestion by Reference - error while running DAG for first time
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/109
Author: Naveen Ramachandraiah · Updated: 2023-02-13
Team,
For Azure, we are trying to implement the Manifest by Reference feature but are getting errors while running the DAG. The error log and a screenshot of the DAG graph are attached; please help.
[DAG_-error.log](/uploads/80013a342d3e6d5fbfd843fcf27c0707/DAG_-error.log)
![DAG-_tree](/uploads/e09e4f01cfb6b1a4166a4df2efa83e4d/DAG-_tree.png)

Milestone: M16 - Release 0.19 · Assignee: Jayesh Bagul

---

## Issue 108: Manifest by reference - error while DAG run
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108
Author: Devdatta Santra · Updated: 2022-11-11

**While running the manifest-by-reference DAG, we get the following error in `validate_manifest_schema_task`:**
```
[2022-10-13 08:56:52,287] {standard_task_runner.py:76} INFO - Running: ['***', 'tasks', 'run', 'Osdu_ingest_by_reference', 'validate_manifest_schema_task', '2022-10-13T08:56:41.095723+00:00', '--job-id', '13024', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/osdu-ingest-by-reference-r3.py', '--cfg-path', '/tmp/tmpv4ta88jt', '--error-file', '/tmp/tmpkxyyt6ok']
[2022-10-13 08:56:52,288] {standard_task_runner.py:77} INFO - Job 13024: Subtask validate_manifest_schema_task
[2022-10-13 08:56:52,390] {logging_mixin.py:104} INFO - Running <TaskInstance: Osdu_ingest_by_reference.validate_manifest_schema_task 2022-10-13T08:56:41.095723+00:00 [running]> on host ***-worker-0.***-worker.osdu.svc.cluster.local
[2022-10-13 08:56:52,509] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=Osdu_ingest_by_reference
AIRFLOW_CTX_TASK_ID=validate_manifest_schema_task
AIRFLOW_CTX_EXECUTION_DATE=2022-10-13T08:56:41.095723+00:00
AIRFLOW_CTX_DAG_RUN_ID=83247382-218b-44b5-b1c1-0b921ee67dd6
[2022-10-13 08:57:04,974] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/validate_manifest_schema_by_reference.py", line 110, in execute
manifest_data = self._get_manifest_data_by_reference(context=context,
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/mixins/ReceivingContextMixin.py", line 105, in _get_manifest_data_by_reference
retrieval_content_url = retrieval.json()["delivery"][0]["retrievalProperties"]["signedUrl"]
KeyError: 'delivery'
[2022-10-13 08:57:04,977] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=Osdu_ingest_by_reference, task_id=validate_manifest_schema_task, execution_date=20221013T085641, start_date=20221013T085652, end_date=20221013T085704
```
It would be very helpful to get any resolution regarding this.
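The `KeyError: 'delivery'` in the traceback means the Dataset service retrieval response contained no `delivery` key at all, typically because the service returned an error body instead of the expected payload. A defensive version of that lookup might look like this sketch (`extract_signed_url` is a hypothetical helper, not the actual OSDU code):

```python
def extract_signed_url(retrieval_json):
    """Pull the signed URL out of a Dataset service retrieval response,
    failing with a readable error instead of a bare KeyError."""
    delivery = retrieval_json.get("delivery")
    if not delivery:
        # Surface the whole payload so the Airflow log shows what the service actually returned
        raise ValueError(f"Dataset service response has no 'delivery' entries: {retrieval_json}")
    props = delivery[0].get("retrievalProperties") or {}
    signed_url = props.get("signedUrl")
    if not signed_url:
        raise ValueError(f"Retrieval entry has no 'signedUrl': {delivery[0]}")
    return signed_url
```

With this shape, an error body such as `{"code": 404, ...}` would appear directly in the task log instead of an opaque `KeyError: 'delivery'`.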
======================================
Updates on further errors encountered:
1. `AttributeError: 'dict' object has no attribute 'to_JSON'`, as described in this comment:
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108#note_159282
2. A "Schema is not present" error from the Dataset service while running the DAG:
```
2022-10-19 12:03:14.191 DEBUG 1 --- [nio-8080-exec-1] .m.m.a.ExceptionHandlerExceptionResolver : Using @ExceptionHandler org.opengroup.osdu.dataset.util.GlobalExceptionMapper#handleAppException(AppException)
2022-10-19 12:03:14.193 WARN 1 --- [nio-8080-exec-1] o.o.o.c.common.logging.DefaultLogWriter : dataset-registry.app: Schema is not present
AppException(error=AppError(code=404, reason=Schema Service: get 'opendes:wks:dataset--File.Generic:1.0.0', message=Schema is not present, errors=null, debuggingInfo=null, originalException=null), originalException=null)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.validateDatasets(DatasetRegistryServiceImpl.java:233)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.createOrUpdateDatasetRegistry(DatasetRegistryServiceImpl.java:112)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi.createOrUpdateDatasetRegistry(DatasetRegistryApi.java:66)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$FastClassBySpringCGLIB$$774ab2c5.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:123)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$EnhancerBySpringCGLIB$$649af8f9.createOrUpdateDatasetRegistry(<generated>)
```

Assignee: Valentin Gauthier

---

## Issue 105: Performance testing in R3 M11 - Need to determine the maximum size of the payload allowed during ingestion using Osdu_ingest DAG
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/105
Author: Kamlesh Todai · Updated: 2024-03-20

All,
For R3 M11, I performed performance load testing using the Osdu_ingest DAG running on Airflow v2.0.
The environment used was IBM Pre-ship R3 M11. Here is the summary:
As expected, when batch_upload is used, the time required to ingest the data goes down (a performance gain).
Some observations about the process used:
There is a difference between the Python scripts used to generate the payloads for ingestion and for batch_upload.
The ingestion script generates records of kind `opendes:wks:master-data--Organisation:1.0.0`; when a user specifies 5 records, it generates 5 Organisation records.
The batch_upload script generates records of kinds `osdu:wks:master-data--Organisation:1.0.0` and `osdu:wks:reference-data--ContractorType:1.0.0`; when a user specifies 5 records, it generates records of both kinds, i.e., twice the number of records specified.
At present we establish the performance benchmark using the number of records, probably because it is convenient to tell users that ingesting, for example, a certain number of wells takes x amount of time.
But well record size varies from one user environment to another, so performance numbers derived from record counts may not hold in all situations.
How much can be ingested in one job is bounded by the size of the payload in KB, so I think we should use payload size in KB to establish the benchmark; the number of records that fit in a payload then depends on record size.
I have done the testing in the IBM environment, but the 50,000-record batch_upload test seems to fail in all environments.
I do not know where the size limit comes from (REST API, network, Airflow, DAG implementation), nor whether it is configurable.
It is important for us to understand where that limitation originates and whether it is a hard limit or a configurable one.
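If the limit turns out to be payload size, the generator script could chunk records by serialized size instead of by count. A minimal sketch (the limit value, record shape, and function name are assumptions for illustration):

```python
import json

def chunk_by_payload_size(records, max_kb=512):
    """Split records into payload chunks whose serialized JSON stays under max_kb."""
    chunks, current, current_bytes = [], [], 0
    for rec in records:
        rec_bytes = len(json.dumps(rec).encode("utf-8"))
        if current and current_bytes + rec_bytes > max_kb * 1024:
            # Current chunk is full; start a new payload file
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += rec_bytes
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own payload file, keeping every submission below the observed limit regardless of record size.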
The Python script should honor that limit and generate multiple payload files, each containing a safe number of records, to avoid failures.

---

## Issue 104: Manifest ingestion DAG is not creating master-data--Wellbore
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/104
Author: Thomas Dombrowsky · Updated: 2022-11-11

When running the manifest ingestor DAG with the attached payload, no records are inserted into storage.
The Airflow logs show no error, so it is unknown why the ingestion fails.
Expected: The manifest contains a single record. The record should be inserted into storage during the ingestion.
Expected: The Airflow logs need to be improved. Logs should show the payload that was received and what processing occurred. Any errors that prevent the ingestion of data should be fully logged.
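As a sketch of the kind of logging being asked for here, the processing step could log the received payload and refuse to succeed silently when nothing was stored (hypothetical code with a simplified manifest shape, not the actual operator; `store_record` stands in for the real Storage service call):

```python
import json
import logging

logger = logging.getLogger("manifest_ingestion")

def ingest_manifest(manifest, store_record):
    """Log the received payload and every per-record outcome; fail loudly on a silent no-op."""
    logger.info("Received manifest payload: %s", json.dumps(manifest)[:2000])
    records = manifest.get("ReferenceData", []) + manifest.get("MasterData", [])
    logger.info("Manifest contains %d records", len(records))
    processed, failed = [], []
    for rec in records:
        try:
            processed.append(store_record(rec))
        except Exception:
            logger.exception("Failed to store record %s", rec.get("id"))
            failed.append(rec.get("id"))
    logger.info("Processed ids: %s; failed ids: %s", processed, failed)
    if records and not processed:
        # A run that stored nothing should fail loudly, not exit quietly
        raise RuntimeError(f"No records ingested; failures: {failed}")
    return processed
```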
[wellbore-wks_sandbox.json](/uploads/10c1885dffa4288891cb48a13ad1e489/wellbore-wks_sandbox.json)

---

## Issue 103: Manifest Based Ingestion - Operator Performance Benchmarking for M10 Release
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/103
Author: Devendra Rawat · Updated: 2023-10-23

The purpose of this WIP issue is to plan how we are going to benchmark ingestion performance. Testing is to be performed on data examples from Reference Data, Master Data (Wells, Wellbores), Work Product Components (Trajectory), etc.
This is to gauge operator acceptance of the performance upgrade benchmarked in [issue 101](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/101). The upgrade has shown significant improvement in throughput and speed, as highlighted there.
| Manifest Type | Operator 1 | Operator 2 | Operator 3 | Operator 4 | Operator 5 | Operator 6 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| Reference Data | | | | | | |
| Master Data | | | | | | |
| Work Product Component | | | | | | |

Assignee: Devendra Rawat

---

## Issue 98: WIP - Performance Benchmarking
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/98
Author: Ben Lasscock · Updated: 2023-10-23
The purpose of this WIP issue is to plan how we are going to benchmark ingestion performance.
We need to address the performance of the ingestion mechanisms.
Expected performance in a production environment is in excess of 33k records per minute (wells, well logs, trajectories, etc.).
- [ ] @Devendra_R @npickus to connect with @todaiks & @chad if possible to confirm testing approach, timing and feedback cycles
Data examples from real use cases include wells, wellbores, trajectories, etc. (already using TNO and Volve).
- [ ] Nick to check with Chevron teams to see if there is an opportunity to schedule a test of the current manifest ingestion in a real production environment, to compare with current test rates within the Forum. The team can use the script from Jean Rainauld and test with the same synthetic data as the Forum tests.
[testing info](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/merge_requests/17)
- [ ] @debasisc to follow up with CSPs to gain alignment on CSPs testing ingestion in their environments
**Issues**
- No defined custodians & developers for manifest ingestion
- Data sets used for testing are not representative of real data (only master data)
- Testing requires close coordination with CSPs
## Load Testing & Performance
[performance changes since M6](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/101)
[load testing Issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/80)
**Old test results**
[M7](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/tree/master/TeamD_M7/ManifestLoadTesting), [M8](https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M8/Results/OSDU_LoadTesting_Results_M8_TeamD.xlsx)
### Basic Load testing
Testing the ingestion of 500-, 1,000-, and 50,000-record manifests.
Synthetic manifests are used to perform basic testing of the ingestion. Load testing is run by
pre-shipping for each release, one release in arrears, which means MX is tested during the
development cycle of M(X+1). A spreadsheet showing pass/fail status and timing per CSP is provided
by pre-shipping at the conclusion of the test.
Additional information regarding run time and latency of the Airflow scheduler can be found in the Airflow console's [Gantt Chart](https://airflow.apache.org/docs/apache-airflow/stable/ui.html). This data provides a view of where the performance bottlenecks might be.
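A minimal timing harness for such measurements, using the Workflow service trigger and status endpoints that appear elsewhere in this tracker (base URL, token handling, partition id, and the `executionContext` shape are assumptions; the `call` hook is there so the loop can be exercised without a live cluster):

```python
import json
import time
import urllib.request

def _call(method, url, token, body=None):
    """Tiny JSON-over-HTTP helper (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8") if body is not None else None,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "data-partition-id": "opendes",  # assumption: partition id
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def time_workflow_run(base_url, token, manifest, poll_s=10.0, call=_call):
    """Trigger one Osdu_ingest run and return wall-clock seconds until it
    leaves the submitted/running states."""
    start = time.monotonic()
    run = call("POST",
               f"{base_url}/api/workflow/v1/workflow/Osdu_ingest/workflowRun",
               token,
               {"executionContext": {"manifest": manifest}})
    run_id = run["runId"]
    while True:
        status = call("GET",
                      f"{base_url}/api/workflow/v1/workflow/Osdu_ingest/workflowRun/{run_id}",
                      token)["status"]
        if status not in ("submitted", "running"):
            return time.monotonic() - start
        time.sleep(poll_s)
```

Running this over increasing manifest sizes gives the scaling curve described above; the Gantt chart then shows where the time within a run is spent.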
Assets:
- [x] Basic load testing passed True/False (per release)
- [x] Timing information (scaling as a function of manifest size).
- [ ] Snapshots of the Airflow [Gantt Chart](https://airflow.apache.org/docs/apache-airflow/stable/ui.html)
### Advanced Load testing
The other items (beyond the basic) in the [load testing issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/80) give insight into the sensitivity of performance to the Airflow configuration. This may be carried out if resources are available.
## Defining standards @npickus
Today, teams are loading 8-10 million records (including validations) outside of the manifest or CSV ingestion mechanisms in ~5 hours, a rate of about 33k records per minute.
- [x] Collect x2 user stories from operators.
- [x] Define OSDU EA/community expectations.
## For applications developers (local mode) @epeysson
There is a use case for application developers to run the Airflow part of the ingestion locally (with core services accessible through a REST API). This local mode can be used to profile the performance of just the Airflow component, independent of any particular CSP.
Standalone installation instructions are [here](https://community.opengroup.org/osdu/platform/deployment-and-operations/individual-airflow).
- [ ] Complete basic load testing with standalone Airflow.

Assignee: Devendra Rawat

---

## Issue 97: "Broken DAG" for manifest ingestion
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/97
Author: Abhijeet Sawant · Updated: 2022-01-23

The Airflow UI shows an import error: "Broken DAG: [/opt/airflow/dags/manifest_ingestion_dags.zip] No module named 'osdu_ingestion.libs.auth'"
Image deployed: repository msosdu.azurecr.io/airflow-docker-image, tag v0.10. ![airflow_import_error](/uploads/ecc51fc6ba34574bc852832ee0349177/airflow_import_error.JPG)
Continuous alerts are being triggered.

Assignee: Kishore Battula

---

## Issue 96: Integration E2E Tests for manifest ingestion - IBM
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/96
Author: Chris Zhang · Updated: 2022-11-11

This is to track the IBM team's work on integration E2E tests for manifest ingestion.
Related to issue 85: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85

Milestone: M12 - Release 0.15 · Assignees: Shrikant Garg, jingdong sun

---

## Issue 95: Integration E2E Tests for manifest ingestion - MSFT
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/95
Author: Chris Zhang · Updated: 2021-11-17

This is to track the MSFT team's work on integration E2E tests for manifest ingestion.
Related to issue 85: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85

Milestone: M10 - Release 0.13 · Assignee: Krishnan Ganesan

---

## Issue 94: Integration E2E Tests for manifest ingestion - AWS
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/94
Author: Chris Zhang · Updated: 2022-08-24

This is to track the AWS team's work on integration E2E tests for manifest ingestion.
Related to issue 85: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85

Milestone: M10 - Release 0.13 · Assignee: Gustavo Urdaneta

---

## Issue 93: Airflow log reports success when failed
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/93
Author: Jan Mortensen · Updated: 2021-10-29

**Technical Context**
* Deployed version: _M8 aka release/0.11_
* DAG: _Osdu_ingest_
* Task: _process_single_manifest_file_task_
**Description**
When running a manifest ingestion we got some errors, but the specific task completed successfully. This can be confusing when trying to debug.
Attaching the last part of the log which shows both the error and the "Marking task as SUCCESS".
```
[2021-10-25 12:39:34,130] {connectionpool.py:230} DEBUG - Starting new HTTP connection (1): storage.osdu-azure.svc.cluster.local:80
[2021-10-25 12:39:34,178] {connectionpool.py:442} DEBUG - http://storage.osdu-azure.svc.cluster.local:80 "PUT /api/storage/v2/records HTTP/1.1" 400 None
[2021-10-25 12:39:34,179] {process_manifest_r3.py:131} ERROR - Request error.
[2021-10-25 12:39:34,179] {process_manifest_r3.py:132} ERROR - Response status: 400. Response content: {"code":400,"reason":"Invalid ACL","message":"Acl not match with tenant or domain"}.
[2021-10-25 12:39:34,179] {authorization.py:137} ERROR - {"code":400,"reason":"Invalid ACL","message":"Acl not match with tenant or domain"}
[2021-10-25 12:39:34,179] {single_manifest_processor.py:79} WARNING - Can't process entity SRN: opendes:reference-data--MaterialType:WTS
[2021-10-25 12:39:34,179] {single_manifest_processor.py:255} INFO - Processed ids []
[2021-10-25 12:39:34,179] {process_manifest_r3.py:173} INFO - Processed ids []
[2021-10-25 12:39:34,735] {__init__.py:62} DEBUG - Backend: None, Lineage called with inlets: [], outlets: []
[2021-10-25 12:39:35,139] {taskinstance.py:1070} INFO - Marking task as SUCCESS.dag_id=Osdu_ingest, task_id=process_single_manifest_file_task, execution_date=20211025T123739, start_date=20211025T123909, end_date=20211025T123935
[2021-10-25 12:39:35,484] {base_job.py:197} DEBUG - [heartbeat]
[2021-10-25 12:39:35,485] {local_task_job.py:102} INFO - Task exited with return code 0
```
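The task returns normally after logging "Processed ids []", so Airflow marks it SUCCESS. One way to get the expected behavior is to raise at the end of the task whenever entities were skipped, since Airflow only fails a task when the callable raises. A hypothetical sketch (not the actual operator code):

```python
class ManifestProcessingError(Exception):
    """Raised so Airflow marks the task, and hence the DAG run, as FAILED."""

def finish_processing(processed_ids, skipped_srns):
    """Summarize a manifest-processing task; raise if any entity was skipped."""
    if skipped_srns:
        raise ManifestProcessingError(
            f"{len(skipped_srns)} entities could not be processed: {skipped_srns}"
        )
    return processed_ids
```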
**Expected result**
Task and DAG-run marked as failure.

Milestone: M10 - Release 0.13

---

## Issue 91: Date-Time validation causing ingestion failure
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/91
Author: Keith Wall · Updated: 2022-03-21

Date values are failing schema validation on ingestion if the dates are not in UTC or don't contain a time-zone offset.
This is a new validation that requires date-times to conform to RFC 3339. The intent is good, but it conforms neither to the schemas nor to our data.
There is a large volume of data in which a date is known but no time or time zone is provided. Recognizing this, the OSDU schemas only require that dates be a string.
Because a large volume of managed data lacks time-zone information, our options are either to reject all these dates, or to ingest and maintain them in their original format.
If we force a time-zone change on data by putting it into a UTC format when we really do not know the time zone, we are corrupting the data.
I have consulted the Enterprise Architecture Geomatics team and asked whether we should (1) not load dates with unknown time zones, or (2) maintain the dates in as-provided form. There was complete agreement that the industry has a large volume of data with dates lacking time zones; we can still make use of those dates, but must not modify them by adding a default time zone.
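A validation policy implementing option (2) — accept as-provided date strings and warn, rather than reject, when the time zone is unknown — might look like this sketch (function and pattern names are hypothetical):

```python
import re

# Full RFC 3339 date-time, e.g. 2021-10-25T12:39:34Z or with a numeric offset
_RFC3339 = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$")
# Bare date, or date-time with no offset: date known, time zone unknown
_NO_TZ = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?)?$")

def check_date(value):
    """Return (accepted, warning). Never rewrites the value to UTC."""
    if _RFC3339.match(value):
        return True, None
    if _NO_TZ.match(value):
        return True, "time zone unknown; stored as provided"
    return False, "not a recognizable date"
```

The warning gives downstream consumers the signal that the frame of reference is unknown, without corrupting the stored value.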
Please remove the date validation from ingestion.

Milestone: M9 - Release 0.12 · Assignees: Kishore Battula, Shrikant Garg, Spencer Sutton (suttonsp@amazon.com), Yan Sushchynski (EPAM)

---

## Issue 89: Move Airflow common logic to osdu-airflow-lib project
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/89
Author: Siarhei Khaletski (EPAM) · Updated: 2021-09-27

Some DAGs have dependencies on code from the Ingestion DAGs repository.
For instance, these MRs bring updates for parser DAGs:
- https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/merge_requests/24
- https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-zgy-conversion/-/merge_requests/36
- https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/merge_requests/149
These require the `UpdateStatusOperator` class from the Ingestion DAGs project, which means the Ingestion DAGs code would have to be deployed into the environment for the DAGs in the MRs above to use it.
The real case now is the WITSML parser, where we have to add `osdu_manifest` code to use operators for the WITSML parser DAG steps.
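A minimal packaging sketch of what such a standalone library could look like, so parser DAG repos can `pip install` the operators instead of vendoring Ingestion DAGs code (the distribution name, version, and dependency pin here are all hypothetical; the real project became osdu-airflow-lib):

```toml
[project]
name = "osdu-airflow"                 # hypothetical distribution name
version = "0.0.1"                     # hypothetical version
description = "Shared Airflow operators, hooks and configs for OSDU ingestion DAGs"
dependencies = [
    "apache-airflow>=1.10",           # assumption: minimum supported Airflow
]

[tool.setuptools.packages.find]
include = ["osdu_airflow*"]           # package holding operators/, hooks/, configs/
```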
**Expects**: All Airflow-related logic (operators, hooks, etc.) can be installed into an environment independently of the Ingestion DAGs code base (using pip).

Milestone: M9 - Release 0.12 · Assignee: Siarhei Khaletski (EPAM)

---

## Issue 88: Support a policy for missing timezone information in date time data
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/88
Author: Ben Lasscock · Updated: 2022-03-16

Meta information such as units (m, ft, g/cm^3, coordinate reference frame) is required for the ingestion of data. This information is crucial for normalizing the frame of reference.
For date time information, ideally this data is in UTC or has meta information available to transform it to UTC. Currently the schema service is throwing out records that don't conform to this requirement.
However, there is a large body of data already in the environment where this information is not available, so we need to make date-time the exception and waive the requirement to provide UTC information. A counterpoint to waiving this requirement is activities like active drilling, where correct date-times in UTC are required.
1. Define under what circumstances where date time UTC conversion can be waived.
2. Create a specification for what the behavior of the ingestion application should be to support both the waiving of date time frame of reference, or enforcing it, depending on the policy defined in (1).
For example: should the ingestion provide warnings that date-time meta information isn't available, or should there be a flag or field in the record allowing the user to waive the requirement?

Assignee: Keith Wall

---

## Issue 87: WIP - Airflow 2+ Adoption
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/87
Author: Ben Lasscock · Updated: 2022-03-16

This issue is a place to track adoption of Airflow 2+ by the various CSPs.
Running Airflow 2+ with the experimental API is backward compatible with the current workflow services. It also provides potential performance improvements, particularly around the scheduler. Please update your current status here.
AWS - M9 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
Azure - M10 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
IBM - M9 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)
GCP - M8 timeline
* [ ] Not Started
* [ ] Airflow 2+ (experimental API)
* [x] Airflow 2+ (stable API)

---

## Issue 86: Enable Support for Packaged DAGs
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/86
Author: harshit aggarwal · Updated: 2021-08-26

The [ADR](https://community.opengroup.org/osdu/platform/data-flow/home/-/issues/47) for packaged DAG support has been approved, and we need to restructure the Ingestion DAGs repository to support packaged DAGs. The new structure will look like this:
**New folder structure**
```
├── osdu_manifest
│   ├── __init__.py
│   ├── libs
│   │   ├── __init__.py
│   │   └── utils.py
│   ├── operators
│   │   ├── __init__.py
│   │   └── customOperator1.py
│   ├── hooks
│   │   └── __init__.py
│   └── configs
│       └── __init__.py
└── osdu-ingest-r3.py
```
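For deployment, the package plus the top-level DAG file get zipped with the DAG at the archive root; the `manifest_ingestion_dags.zip` mentioned in issue 97 above follows this shape. A sketch of building such an archive with only the stdlib (the function name is hypothetical):

```python
import os
import zipfile

def build_packaged_dag(zip_path, dag_file, package_dir):
    """Zip a top-level DAG file plus its support package, as Airflow packaged DAGs expect."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # DAG file must sit at the root of the archive
        zf.write(dag_file, arcname=os.path.basename(dag_file))
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                # Keep paths relative to the package's parent so imports resolve inside the zip
                zf.write(full, arcname=os.path.relpath(full, os.path.dirname(package_dir)))
    return zip_path
```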
Changes to support this will include
- Restructuring the folders
- Fixing any import statements
- Minor changes to run existing tests

Milestone: M8 - Release 0.11 · Assignee: harshit aggarwal

---

## Issue 85: Integration E2E Tests for manifest ingestion (GONRG-3300) - GCP
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85
Author: Kishore Battula · Updated: 2021-11-15

Currently, manifest ingestion has no tests that invoke the workflow service to trigger the ingestion and validate it by fetching the ingested records through the Search service (or Storage service). Such tests would validate that Airflow, the workflow service, and all related services are running with the right configuration for manifest ingestion.
**Acceptance Criteria**: Add new E2E tests that validate manifest ingestion by triggering it through the workflow service.
https://community.opengroup.org/osdu/platform/data-flow/home/-/issues/49#note_58471

Milestone: M10 - Release 0.13 · Assignee: Chris Zhang

---

## Issue 82: Manifest ingestion does not show any updates in Airflow when backslash character used in JSON body
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/82
Author: Naufal Mohamed Noori · Updated: 2021-10-19

**Description**:
When a user inserts a backslash `\` into the JSON manifest body for the manifest ingestion (DAG) workflow service, the workflow run gets stuck in SUBMITTED status. There is also no trace of the runId in the Airflow log.
**Steps to reproduce:**
a) Insert the body JSON into the DAG workflow body: [With_Backslash_BodyData.json](/uploads/c2f2e8e8241df526830a73cc9ba2336a/With_Backslash_BodyData.json)
b) When the body JSON is submitted to base_url/api/workflow/v1/workflow/Osdu_ingest/workflowRun, the workflow is submitted successfully with the following response:
```json
{
    "workflowId": "dev:Osdu_ingest",
    "runId": "4327f575-e7b3-490f-a1ee-b1e2e950c2a4",
    "startTimeStamp": 1627041278115,
    "status": "submitted",
    "submittedBy": "naufal.noori@katalystdm.com"
}
```
c) After a while, checking the DAG run status shows the workflow still in submitted status, and there is no trace of the run ID in the Airflow log (this follow-up check was done after 24 hours):
_Endpoint_: base_url/api/workflow/v1/workflow/Osdu_ingest/workflowRun/4327f575-e7b3-490f-a1ee-b1e2e950c2a4
_Response_:
```json
{
    "workflowId": "dev:Osdu_ingest",
    "runId": "4327f575-e7b3-490f-a1ee-b1e2e950c2a4",
    "startTimeStamp": 1627041278115,
    "status": "submitted",
    "submittedBy": "naufal.noori@katalystdm.com"
}
```
d) When a second trial run was conducted with the `\` character removed, the workflow ran perfectly and left a trace in the Airflow log: [With_NO_Backslash_BodyData.json](/uploads/9fdbc2a59a930444feeb6bfacd1e1200/With_NO_Backslash_BodyData.json)
**Expectation**:
We expect the workflow run to fail the request with a clear and meaningful error message, e.g. "Request failed: there are disallowed special characters from line X to line Y of your JSON body."
**Reason**
It is confusing for users to have a run successfully submitted but stuck in process without any log trace whatsoever.
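The validation the reporter is asking for can be sketched without assuming anything about the Workflow service internals: a bare `\` is not a legal JSON escape, so standard JSON parsing already yields the line/column information needed for a meaningful rejection. This is a minimal illustration only; the function name `validate_manifest_body` is hypothetical, not OSDU code.

```python
import json

def validate_manifest_body(raw_body: str) -> None:
    """Reject malformed JSON up front instead of letting the run stall in SUBMITTED."""
    try:
        json.loads(raw_body)
    except json.JSONDecodeError as exc:
        # JSONDecodeError carries the position info needed for a useful message.
        raise ValueError(
            f"Request rejected: invalid JSON at line {exc.lineno}, "
            f"column {exc.colno}: {exc.msg}"
        ) from exc

# Escaped backslashes are legal JSON, so this passes silently:
validate_manifest_body('{"FileName": "C:\\\\data\\\\well.json"}')

# A bare backslash is not a legal JSON escape, so this fails fast:
try:
    validate_manifest_body('{"FileName": "C:\\data\\well.json"}')
except ValueError as err:
    print(err)
```

Running such a check before (or inside) the workflowRun endpoint would turn the silent 24-hour stall into an immediate, actionable error.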
cc @debasisc
M9 - Release 0.12

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/81
While using manifest_ingestion (Osdu_ingest), the tags field for the Wellbore data is populated in the payload, it appears that the tags field is not getting ingested
2022-08-28T18:47:20Z | Kamlesh Todai

The issue is that when the tags field for the Wellbore data is populated in the payload while ingesting the wellbore data, it appears that the tags field is not getting ingested. There are no warnings or errors in the Airflow logs regarding this, and the wellbore data is getting ingested, but the tags field is missing.
Note: When inserting Wellbore data with the tags field populated using the Storage API directly, it works fine.
The details are attached in the Word docs:
tagsFieldIngestIssue.docx contains the payload used during ingestion and the queries done to check the tags field data.
tagsFieldStorageSearch.docx contains the payload used while creating the wellbore record with the tags field using the Storage API.
The test was done on two platforms (AWS and GCP).
The DAG/run details for GCP:
{
"workflowId": "ef82cba0-0e45-4df3-91bf-4df1553102d3",
"runId": "22821aa9-82a2-4910-9e3f-d1e27addb49d",
"startTimeStamp": 1627328994856,
"endTimeStamp": 1627329575098,
"status": "finished",
"submittedBy": "kamlesh_todai@osdu-gcp.go3-nrg.projects.epam.com"
}
The DAG/run details for AWS: runId 57a9adc0-aabb-4bb9-8154-561b5c12412f
We have not tried IBM or Azure to see whether the behavior is the same or different.
[tagsFieldIngestIssue.docx](/uploads/83fbd805ec66927bb850abc683ef076b/tagsFieldIngestIssue.docx)
[tagsFieldStorageSearch.docx](/uploads/9a494e21f5d55ea9a9d337974b1eb6f7/tagsFieldStorageSearch.docx)
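Since the core claim is that `tags` survives the Storage API path but not the DAG path, a small round-trip check makes the comparison concrete. This is a sketch, not OSDU code: `tags_preserved` is a hypothetical helper comparing the ingestion payload against the record later returned by Search/Storage, and the record shapes are illustrative.

```python
def tags_preserved(payload: dict, stored: dict) -> bool:
    """True if every tag sent in the ingestion payload appears on the stored record."""
    expected = payload.get("tags", {})
    actual = stored.get("tags", {})
    return all(actual.get(key) == value for key, value in expected.items())

wellbore_payload = {
    "kind": "osdu:wks:master-data--Wellbore:1.0.0",
    "tags": {"project": "volve", "source": "manifest-ingest"},
}

# What this report observes: the record is ingested via the DAG, but tags are dropped.
stored_via_dag = {"kind": "osdu:wks:master-data--Wellbore:1.0.0"}
# Creating the record via the Storage API keeps the tags intact.
stored_via_storage_api = dict(wellbore_payload)

print(tags_preserved(wellbore_payload, stored_via_dag))          # prints: False
print(tags_preserved(wellbore_payload, stored_via_storage_api))  # prints: True
```

A check of this shape, run against the Search results in tagsFieldIngestIssue.docx, is what distinguishes "tags silently dropped" from "tags not yet indexed".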
@ChrisZhang @ethiraj @debasisc @Wibben @Kateryna_Kurach @anujgupta @manishk
Assignee: Kamlesh Todai

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/80
M7 Manifest based ingestion - Load Testing
2022-01-18T20:06:59Z | Ben Lasscock

## Definitions
Load Testing - the number of records or manifests that can be processed at a time.
## Background
Throughout M7 there have been a number of performance improvements delivered by EPAM, as well as work on resolving configuration issues, etc. We expect this has significantly improved the capacity of manifest-based ingestion, but we don't have a specific figure.
**The process of load testing should be repeatable, with the expectation it will be applied to the upcoming Airflow 2+ changes.**
## Requirements
We need the "5000 manifest test" (@debasisc @todaiks) to be re-run on the M7 release. The result should be a binary pass/fail and the wall time for executing the job. For completeness, Table 1 shows a set of recommended test cases that we believe should ultimately be automated and runnable through the QA group.
| Test | Issue | AWS | Azure | GCP | IBM |
| ----------- | ----------- | --- | ----- | --- | --- |
| the "5000 manifest" test | Our current baseline | | | | |
| 1 manifest with 5,000 records | | | | | |
| 1 manifest with 20,000 records | | | | | |
| 1 manifest with 50,000 records | Limit on the size of the request body | | | | |
| 50K manifests in multiple requests, not simultaneously | Airflow 1.x doesn't allow sending multiple requests (fixed in Airflow 2.0) | | | | |
| chunks of 50, 1000 DAG runs | 1. max_active_runs (50) limitation; 2. Workflow service limitation: Java heap error (Issue 64); 3. Storage service is limited to storing no more than 500 records/s | | | | |
| chunks of 1000 | see above | | | | |
| 50 DAG runs | | | | | |
| Launch several different DAGs simultaneously | | | | | |
| Ingest the Volve data | to promote adoption | | | | |
| Ingest the TNO data | to promote adoption | | | | |

M7 - Release 0.10 | Assignee: Chris Zhang
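The pass/fail-plus-wall-time requirement above can be captured in a small, repeatable driver. This is a sketch only: `submit` is a pluggable callable standing in for a POST to `/api/workflow/v1/workflow/Osdu_ingest/workflowRun` (shown here as a stub so nothing about the Workflow service API is assumed), and the chunking mirrors the "chunks of 50" case in Table 1.

```python
import time
from typing import Callable, Iterable, List

def run_load_test(manifests: Iterable[dict],
                  submit: Callable[[dict], str],
                  chunk_size: int = 50) -> dict:
    """Submit manifests in fixed-size chunks; report wall time and a binary pass/fail."""
    manifests = list(manifests)
    run_ids: List[str] = []
    start = time.monotonic()
    for i in range(0, len(manifests), chunk_size):
        # Chunking keeps us under max_active_runs-style limits on the scheduler.
        for manifest in manifests[i:i + chunk_size]:
            run_ids.append(submit(manifest))
    wall_time = time.monotonic() - start
    return {
        "submitted": len(run_ids),
        "passed": len(run_ids) == len(manifests),
        "wall_time_s": round(wall_time, 3),
        "run_ids": run_ids,
    }

# Stub submit() in place of the real Workflow service call, for a dry run:
fake_ids = iter(f"run-{n}" for n in range(10))
report = run_load_test([{"kind": "Manifest"}] * 10,
                       submit=lambda m: next(fake_ids),
                       chunk_size=3)
print(report["submitted"], report["passed"])  # prints: 10 True
```

A real run would replace the stub with an authenticated POST and then poll each runId's status until finished, which is what makes the result binary (all runs finished) rather than anecdotal.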