# Manifest Ingestion DAG issues
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues

## Issue 105: Performance testing in R3 M11 - Need to determine the maximum size of the payload allowed during ingestion using the Osdu_ingest DAG
Author: Kamlesh Todai | Updated: 2024-03-20

All,
For R3M11, I performed the performance load testing using the Osdu_ingest DAG running in Airflow v2.0
The environment I used was IBM Pre-ship R3 M11. Here is the summary:
As expected, we can see that when batch_upload is used the time required to ingest the data goes down (a performance gain).
Some observations on the process used:
There is a difference between the Python scripts used to generate the payloads for ingestion and for batch_upload:
- The script that generates the ingestion payload produces records of kind "opendes:wks:master-data--Organisation:1.0.0". When a user specifies 5 records, it generates 5 Organisation records.
- The script that generates the batch_upload payload produces records of kinds "osdu:wks:master-data--Organisation:1.0.0" and "osdu:wks:reference-data--ContractorType:1.0.0". When a user specifies 5 records, it generates records of both kinds, so it is actually generating twice the number of records specified.
At present we establish the performance benchmark using the number of records, probably because it is convenient to tell users that ingesting, for example, a certain number of wells takes x amount of time.
But well record size may vary from one user environment to another, so performance numbers derived from a record count may not hold in all situations.
How much one can ingest in one job is bounded by the size of the payload in KB. So I think we should use payload size in KB to establish the benchmark; the number of records that fit in a payload would then depend on the size of the records.
I have done the testing in the IBM environment, but the test for 50,000 records in batch_upload seems to fail in all environments.
I do not know where the size limit comes from (REST API, network, Airflow, or the DAG implementation), nor whether it is configurable.
It is important for us to understand where that limitation comes from and whether it is a hard limit or a configurable one.
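If the limit does turn out to be payload-size based, the generator script could split records across multiple payload files under a size budget. A minimal sketch, assuming JSON-serializable records and a hypothetical limit in KB (the real limit, wherever it lives, is still unknown):

```python
import json

def chunk_records(records, max_kb=512):
    """Split records into payload chunks whose serialized JSON size
    stays under max_kb (the limit here is a placeholder, not the real one)."""
    limit = max_kb * 1024
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for rec in records:
        blob = json.dumps(rec).encode("utf-8")
        extra = len(blob) + (2 if current else 0)  # +2 for the ", " separator
        if current and size + extra > limit:
            chunks.append(current)
            current, size = [], 2
            extra = len(blob)
        current.append(rec)
        size += extra
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be written to its own payload file, so a single oversized batch never reaches whatever layer enforces the limit.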
The Python script should honor that limit and generate multiple data/payload files, each containing a number of records that avoids failures.

## Issue 98: WIP - Performance Benchmarking
Author: Ben Lasscock | Updated: 2023-10-23

The purpose of this WIP issue is to plan how we are going to benchmark ingestion performance.
Need to address the performance of ingestion mechanisms.
Expected performance in a production environment is in excess of 33k records per minute (wells, well logs, trajectories, etc.).
- [ ] @Devendra_R @npickus to connect with @todaiks & @chad if possible to confirm testing approach, timing and feedback cycles
Data examples from real use cases include wells, wellbores, trajectories, etc.. (already using tno and volve)
- [ ] Nick to check with Chevron teams to see if there is an opportunity to schedule a test of the current manifest ingestion in a real production environment, to compare with current test rates within the Forum. The team can use the script from Jean Rainauld and test with the same synthetic data as the Forum tests.
[testing info](https://community.opengroup.org/osdu/platform/data-flow/ingestion/osdu-airflow-lib/-/merge_requests/17)
- [ ] @debasisc to follow up with CSPs to gain alignment on CSPs testing ingestion in their environments
**Issues**
- No defined custodians & developers for Manifest ingestion
- data sets used for testing not representative of real data - only master data
- testing requires close coordination with CSPs
## Load Testing & Performance
[performance changes since M6](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/101)
[load testing Issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/80)
**Old test results**
[M7](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/tree/master/TeamD_M7/ManifestLoadTesting), [M8](https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M8/Results/OSDU_LoadTesting_Results_M8_TeamD.xlsx)
### Basic Load testing
Testing the ingestion of 500-, 1,000-, and 50,000-record manifests.
Synthetic manifests are used to perform basic testing of the ingestion. Load testing is run by
pre-shipping for each release, one release in arrears, which means MX is tested during the
development cycle of M(X+1). A spreadsheet showing the pass/fail status and timing per CSP is provided
by pre-shipping at the conclusion of the test.
Additional information regarding run time and latency of the Airflow scheduler can be found using the Airflow console [Gantt Chart](https://airflow.apache.org/docs/apache-airflow/stable/ui.html). This data provides a view of where the performance bottlenecks might be.
Assets:
- [x] Basic load testing passed True/False (per release)
- [x] Timing information (scaling as a function of manifest size).
- [ ] Snapshots of the Airflow [Gantt Chart](https://airflow.apache.org/docs/apache-airflow/stable/ui.html)
### Advanced Load testing
The other items (beyond the basic) in the [load testing issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/80) give insight into the sensitivity of performance to the Airflow configuration. This may be carried out if resources are available.
## Defining standards @npickus
Today teams are loading 8-10 million records, including validations, outside of the manifest or CSV ingestion mechanisms in ~5 hours, a rate of about 33k records per minute.
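As a quick sanity check on that rate, using the upper end of the quoted volume:

```python
records = 10_000_000   # upper end of "8-10 million records"
minutes = 5 * 60       # "~5 hours"
rate = records / minutes
print(round(rate))     # roughly 33k records per minute
```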
- [x] Collect x2 user stories from operators.
- [x] Define OSDU EA/community expectations.
## For applications developers (local mode) @epeysson
There is a use case for application developers to run the Airflow part of the ingestion locally (with the core services accessible through a REST API). This local mode can be used to profile the performance of just the Airflow component in isolation.
Standalone installation instructions are [here](https://community.opengroup.org/osdu/platform/deployment-and-operations/individual-airflow).
- [ ] Complete basic load testing with standalone Airflow.

## Issue 103: Manifest Based Ingestion - Operator Performance Benchmarking for M10 Release
Author: Devendra Rawat | Updated: 2023-10-23

The purpose of this WIP issue is to plan how we will benchmark ingestion performance. Testing is to be performed on data examples from Reference Data, Master Data (Wells, Wellbores), Work Product Components (Trajectory), etc.
This is to gauge operator acceptance of the performance upgrade benchmarked in [this issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/101). The upgrade has shown significant improvement in throughput and speed, as highlighted there.
| Manifest Type | Operator 1 |Operator 2 | Operator 3 |Operator 4 | Operator 5 |Operator 6 |
| ------ | ------ | ------| ------ | ------| ------ | ------|
| Reference Data | | | | | | |
| Master Data | | | | | | |
| Work Product Component | | | | | | |

## Issue 65: Move to Airflow 2.0 ADR
Author: Ben Lasscock | Updated: 2023-10-23

# Moving to Airflow 2.0
## Status
* [x] Proposed
* [ ] Trialing
* [ ] Under review
* [x] Approved
* [ ] Retired
### Decision
This decision authorizes the port of the ingestion workflow and associated DAGs (see below) to support Airflow 2.0, deprecating support for Airflow 1.10.x after a transitional period.
## Deprecation strategy
The existing [experimental](https://airflow.apache.org/docs/apache-airflow/stable/deprecated-rest-api-ref.html) API is still available in Airflow 2.0 here:\
/api/experimental/
_To restore these APIs while migrating to the stable REST API, set enable_experimental_api option in [api] section to True._
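Per the quoted note, the relevant setting would look something like this in airflow.cfg (shown for illustration; confirm the section name against your Airflow version's configuration reference):

```ini
[api]
enable_experimental_api = True
```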
A deprecation strategy will be implemented, giving providers a transitional period to port to Airflow 2.0. During this period, common code will run on Airflow 2.0 by default; however, configuration (possibly environment variables) will allow code written for 1.10.x to remain supported.
A guide detailing the backward-compatibility changes can be found [here](https://github.com/apache/airflow/blob/main/UPDATING.md).
### Dependencies
The following task list (provided by EPAM) gives an overview of the dependencies and level of effort required to implement the move.
| | Task | Estimate | Assigned to | Ticket |
| --- | --- | --- | --- | --- |
| 1. | Install Airflow 2.0 in all environments | | All CSPs | |
| 2. | Airflow 2.0 required DAG code changes | | | |
| 2.1 | Manifest-based ingestion (Python) | 5 days | (GCP) | [issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/60) |
| 2.2 | WITSML parser (Python) | 5 days | (GCP) | [issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/61) |
| 2.3 | SEGY -> OpenVDS | 5 days | (GCP) | |
| 2.4 | SEGY -> ZGY | 5 days | | [Seismic](osdu/platform/data-flow/ingestion/segy-to-zgy-conversion#4) |
| 2.5 | CSV Parser | 5 days | (GCP) | [issue](https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/33) |
| 3. | Workflow Services | | | |
| 3.1 | Common Code (Java) | 10 days | | [issue](https://community.opengroup.org/osdu/platform/data-flow/data-workflow-framework/data-workflow/-/issues/1) |
| 3.2. | AWS | | | |
| 3.3. | Azure | | | |
### Motivation
The release of Airflow 2.0 is a significant upgrade from the previous versions, and includes improvements and new features that support our goals for the ingestion workflow [see link](https://www.astronomer.io/blog/introducing-airflow-2-0).
In the context of our goals for progressing OSDU ingestion project:
**Ease of onboarding developers**
In Airflow <2.0 the "experimental" REST API is deprecated, in favor of a new, comprehensive "stable" REST API supported by Airflow >=2.0. Moving to the new Airflow should ensure that the code created by OSDU enjoys greater support online and is easier for new developers to adopt and extend.
The new Airflow 2.0 **TaskFlow API** simplifies the passing of information between tasks in a DAG. This feature does not solve performance problems related to passing large manifests through the workflow; that is the focus of another effort, "Manifest by reference". An example of the new TaskFlow API can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html).
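For illustration, a minimal TaskFlow-style DAG might look like the following (a sketch based on the linked tutorial, requiring an Airflow 2.0 installation to run; the DAG and task names here are invented, and the returned dict is passed between tasks via XCom):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False)
def example_manifest_pipeline():
    @task
    def load_manifest():
        # In a real DAG this would read the manifest payload or reference.
        return {"manifest_id": "example"}

    @task
    def validate(manifest: dict):
        # The dict returned by load_manifest arrives here via XCom.
        assert "manifest_id" in manifest

    validate(load_manifest())

example_dag = example_manifest_pipeline()
```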
**Latency**
One of the major features of Airflow 2.0 is a new high-availability, low-latency scheduler.
Measurements made during the OSDU Airflow 2.0 PoC (conducted by EPAM) found that with Airflow 1.10.14 the latency between tasks (productive work) could be as much as 30 seconds; with equivalent code running on Airflow 2.0, this overhead was reduced to 5 seconds.
There was an issue where Airflow <1.10.15 would, by default, only allow the creation of one DAG per second, potentially creating latency issues. Although this was solved in release [1.10.15](https://github.com/apache/airflow/pull/10633), the dependency on a minor version has created variance in the behavior of the ingestion workflow across providers; moving to Airflow >=2.0 will solve this. A complete list of bug fixes and improvements through the current release of Airflow 2.1.0 can be found [here](https://airflow.apache.org/docs/apache-airflow/2.0.1/dag-serialization.html).
**Throughput and scalability**
In the current version of Airflow, the scheduler has been found to fail silently once max_active_runs_per_dag (default 20) is exceeded. This creates variance in the behavior of the ingestion workflow based on the specific configuration of the OSDU platform provider. During the Airflow 2.0 PoC this problem was found to be solved.
_[A] user can now launch additional "replicas" of the Scheduler to increase the throughput of their Airflow Deployment._\
The option of additional schedulers creates the potential for providers to scale the ingestion workflow by provisioning more resources, and also removes a single point of failure (when using two or more schedulers), providing a more resilient system.
### Project Risks
| Risk Category | Risk Description | Likelihood | Impact | Comments |
| --- | --- | --- | --- | --- |
| NA | NA | NA | NA | NA |
### Organizational Management
| Name | Project Role | Time Zone |
| --- | --- | --- |
| Kateryna | GCP | CST |
| Kishore | Azure | IST |
| Shrikant | IBM | IST |
| Greg Wibben | AWS | CST |
| Ben | Manifest | CST |
| Chad | Data Loading | CET |
| Fernando | CSV | |
| Sacha | [Seismic](osdu/platform/data-flow/ingestion/segy-to-zgy-conversion#4) | |

## Issue 109: Manifest ingestion by Reference - error while running DAG for first time
Author: Naveen Ramachandraiah | Updated: 2023-02-13

Team,
For Azure, we are trying to implement the Manifest by Reference feature but are getting errors while running the DAG. The error log and a screenshot of the DAG graph are attached: [DAG_-error.log](/uploads/80013a342d3e6d5fbfd843fcf27c0707/DAG_-error.log) ![DAG-_tree](/uploads/e09e4f01cfb6b1a4166a4df2efa83e4d/DAG-_tree.png)

Milestone: M16 - Release 0.19

## Issue 96: Integration E2E Tests for manifest ingestion - IBM
Author: Chris Zhang | Updated: 2022-11-11

This is to track the IBM team's work on integration E2E tests for manifest ingestion.
Related to issue 85: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85

Milestone: M12 - Release 0.15

## Issue 104: Manifest ingestion DAG is not creating master-data--Wellbore
Author: Thomas Dombrowsky | Updated: 2022-11-11

When running the manifest ingestion DAG with the attached payload, no records are inserted into storage.
The Airflow logs show no error, so it is unknown why the ingestion fails.
Expected: The manifest contains a single record; the record should be inserted into storage during ingestion.
Expected: The Airflow logs need to be improved. Logs should show the payload that was received and what processing occurred. If there are errors that prevent the ingestion of data, these should be fully logged.
[wellbore-wks_sandbox.json](/uploads/10c1885dffa4288891cb48a13ad1e489/wellbore-wks_sandbox.json)

## Issue 108: Manifest by reference - error while DAG run
Author: Devdatta Santra | Updated: 2022-11-11

**While running the manifest by reference DAG, we are getting the following error in "validate_manifest_schema_task".**
```
[2022-10-13 08:56:52,287] {standard_task_runner.py:76} INFO - Running: ['***', 'tasks', 'run', 'Osdu_ingest_by_reference', 'validate_manifest_schema_task', '2022-10-13T08:56:41.095723+00:00', '--job-id', '13024', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/osdu-ingest-by-reference-r3.py', '--cfg-path', '/tmp/tmpv4ta88jt', '--error-file', '/tmp/tmpkxyyt6ok']
[2022-10-13 08:56:52,288] {standard_task_runner.py:77} INFO - Job 13024: Subtask validate_manifest_schema_task
[2022-10-13 08:56:52,390] {logging_mixin.py:104} INFO - Running <TaskInstance: Osdu_ingest_by_reference.validate_manifest_schema_task 2022-10-13T08:56:41.095723+00:00 [running]> on host ***-worker-0.***-worker.osdu.svc.cluster.local
[2022-10-13 08:56:52,509] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=Osdu_ingest_by_reference
AIRFLOW_CTX_TASK_ID=validate_manifest_schema_task
AIRFLOW_CTX_EXECUTION_DATE=2022-10-13T08:56:41.095723+00:00
AIRFLOW_CTX_DAG_RUN_ID=83247382-218b-44b5-b1c1-0b921ee67dd6
[2022-10-13 08:57:04,974] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/validate_manifest_schema_by_reference.py", line 110, in execute
manifest_data = self._get_manifest_data_by_reference(context=context,
File "/home/airflow/.local/lib/python3.8/site-packages/osdu_airflow/operators/mixins/ReceivingContextMixin.py", line 105, in _get_manifest_data_by_reference
retrieval_content_url = retrieval.json()["delivery"][0]["retrievalProperties"]["signedUrl"]
KeyError: 'delivery'
[2022-10-13 08:57:04,977] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=Osdu_ingest_by_reference, task_id=validate_manifest_schema_task, execution_date=20221013T085641, start_date=20221013T085652, end_date=20221013T085704
```
It would be very helpful to get any resolution regarding this.
---
**Updates about the new errors encountered:**
1) `AttributeError: 'dict' object has no attribute 'to_JSON'` - as mentioned in this comment:
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/108#note_159282
2) "Schema is not present" error from Dataset service while running the DAG
```
2022-10-19 12:03:14.191 DEBUG 1 --- [nio-8080-exec-1] .m.m.a.ExceptionHandlerExceptionResolver : Using @ExceptionHandler org.opengroup.osdu.dataset.util.GlobalExceptionMapper#handleAppException(AppException)
2022-10-19 12:03:14.193 WARN 1 --- [nio-8080-exec-1] o.o.o.c.common.logging.DefaultLogWriter : dataset-registry.app: Schema is not present
AppException(error=AppError(code=404, reason=Schema Service: get 'opendes:wks:dataset--File.Generic:1.0.0', message=Schema is not present, errors=null, debuggingInfo=null, originalException=null), originalException=null)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.validateDatasets(DatasetRegistryServiceImpl.java:233)
at org.opengroup.osdu.dataset.service.DatasetRegistryServiceImpl.createOrUpdateDatasetRegistry(DatasetRegistryServiceImpl.java:112)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi.createOrUpdateDatasetRegistry(DatasetRegistryApi.java:66)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$FastClassBySpringCGLIB$$774ab2c5.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:123)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)
at org.opengroup.osdu.dataset.api.DatasetRegistryApi$$EnhancerBySpringCGLIB$$649af8f9.createOrUpdateDatasetRegistry(<generated>)
```Valentin GauthierValentin Gauthierhttps://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/47Documentation: Best practices for ingestion DAGs2022-09-15T23:49:37ZAlan HensonDocumentation: Best practices for ingestion DAGsWe need a guide or document that offers best practices recommendations to those constructing ingestion (or enrichment) related DAGs. This document should cover things such as:
- DAG Operator composability recommendations
- Performance c...We need a guide or document that offers best practices recommendations to those constructing ingestion (or enrichment) related DAGs. This document should cover things such as:
- DAG Operator composability recommendations
- Performance considerations
- Recommended property use (see https://community.opengroup.org/osdu/documentation/-/issues/80)
- Others

## Issue 26: Documentation - Manifest Ingestion User Guide
Author: Alan Henson | Updated: 2022-09-15

## Issue 81: While using manifest ingestion (Osdu_ingest), the tags field populated in the Wellbore payload is not getting ingested
Author: Kamlesh Todai | Updated: 2022-08-28

The issue is that when the tags field for the Wellbore data is populated in the payload while ingesting the wellbore data, the tags field does not get ingested. There are no warnings or errors in the Airflow logs regarding this, and the wellbore data is ingested, but the tags field is missing.
Note: When inserting Wellbore data with the tags field populated using the Storage API directly, it works fine.
The details are attached in the Word documents:
- tagsFieldIngestIssue.docx contains the payload used during ingestion and the queries run to check the tags field data.
- tagsFieldStorageSearch.docx contains the payload used while creating the wellbore record with the tags field via the Storage API.
The test was done on two platforms (AWS and GCP).
The DAG run details for GCP:
```json
{
  "workflowId": "ef82cba0-0e45-4df3-91bf-4df1553102d3",
  "runId": "22821aa9-82a2-4910-9e3f-d1e27addb49d",
  "startTimeStamp": 1627328994856,
  "endTimeStamp": 1627329575098,
  "status": "finished",
  "submittedBy": "kamlesh_todai@osdu-gcp.go3-nrg.projects.epam.com"
}
```
The DAG run details for AWS: runId 57a9adc0-aabb-4bb9-8154-561b5c12412f
We have not tried IBM or Azure to see whether the behavior is the same or different.
[tagsFieldIngestIssue.docx](/uploads/83fbd805ec66927bb850abc683ef076b/tagsFieldIngestIssue.docx)
[tagsFieldStorageSearch.docx](/uploads/9a494e21f5d55ea9a9d337974b1eb6f7/tagsFieldStorageSearch.docx)
@ChrisZhang @ethiraj @debasisc @Wibben @Kateryna_Kurach @anujgupta @manishk

## Issue 94: Integration E2E Tests for manifest ingestion - AWS
Author: Chris Zhang | Updated: 2022-08-24

This is to track the AWS team's work on integration E2E tests for manifest ingestion.
Related to issue 85: https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/85

Milestone: M10 - Release 0.13

## Issue 54: Redundant steps executed - validate schemas and ensure referential integrity
Author: Brady Spiva [AWS] | Updated: 2022-08-23

## Expected behavior
The validate schemas and ensure referential integrity operations should only need to be executed once per manifest
## Observed behavior
In the [top-level DAG definition](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/blob/master/src/dags/osdu-ingest-r3.py#L131), you can see “validate schema” and “ensure integrity” operators are executed as part of the DAG:
`branch_is_batch_op >> validate_schema_operator >> ensure_integrity_op >> process_single_manifest_file >> update_status_finished_op`
But diving deeper into the `process_single_manifest_file` operator, it ALSO [validates schemas and ensures referential integrity](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/blob/master/src/dags/libs/processors/single_manifest_processor.py#L81), resulting in redundant API calls.
This problem will go unnoticed for small workloads, but for larger workloads the increased latency quickly adds up. Using the TNO and Volve sample ingestion dataset as an example, there are about **24,000 manifest files**. If this redundancy adds just 2 extra API calls per manifest (one for schema validation, one for referential integrity checks), and each API request takes 250 milliseconds, then the overall ingestion time increases by:
24,000 manifests * (0.25 seconds * 2 requests) / 60 seconds per minute = **200 minutes, or about 3.3 hours**.
As the ingestion workload size increases, this redundancy becomes a non-trivial amount of time. Naturally, your mileage may vary! I'm sure we'll see different latency results for different networks, different customers, different Cloud Providers, etcetera. I'm confident customer experiences will be improved by reducing this latency.
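The back-of-the-envelope estimate above can be reproduced directly:

```python
manifests = 24_000        # approximate TNO + Volve manifest count
redundant_calls = 2       # one extra schema validation + one extra referential check
latency_s = 0.25          # assumed 250 ms per API request
extra_minutes = manifests * redundant_calls * latency_s / 60
print(extra_minutes)      # 200.0 minutes, i.e. about 3.3 hours
```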
## Some proposed solutions
1) Remove the [duplicated referential integrity call](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/blob/master/src/dags/libs/processors/single_manifest_processor.py#L97); the results of this operation aren't used anyway.
2) Change the way the manifest is obtained for the `process_single_manifest_file` operator, allowing removal of the [duplicated schema validation call](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/blob/master/src/dags/libs/processors/single_manifest_processor.py#L98). We could use Airflow mechanisms (XComs, variables, etc.) to reuse the manifest from the `validate_schema_operator`, but that might affect the atomicity of the operator.
What do you think?

## Issue 48: Manifest ingestion fails with large number of wells
Author: Alan Henson | Updated: 2022-08-23

This issue is a mirror of the issue created by Pre-Shipping Team A, which can be found here: https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues/64

## Issue 42: Manifest Ingestion - Refactor to have standalone capability
Author: Alan Henson | Updated: 2022-08-23

In seeking maximum reusability, the manifest ingestion workflow should take a bottom-up approach where:
- Each DAG operator is Airflow agnostic and capable of running within a Python runtime environment without Airflow
- Each DAG operator is able to run as a script taking the appropriate inputs, performing its work, and then providing the expected outputs (interacting with OSDU services is expected)
- A DAG workflow should run end-to-end without requiring Airflow - in essence, running as a script from the command line with the correct inputs
- The DAG workflow should encapsulate the above into an Airflow workflow
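One way to realize this split (a sketch with hypothetical names, not the actual ingestion code) is to keep the core logic in plain Python with no Airflow imports, and let a thin Airflow operator wrap it:

```python
import json
import sys

# Core logic: plain Python, no Airflow imports, usable from any runtime.
def validate_manifest(manifest: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    if "kind" not in manifest:
        errors.append("missing 'kind'")
    return errors

# Script entry point: the same logic runs end-to-end from the command line,
# e.g. python validate.py '{"kind": "osdu:wks:Manifest:1.0.0"}'
if __name__ == "__main__" and len(sys.argv) > 1:
    print(validate_manifest(json.loads(sys.argv[1])))
```

An Airflow operator's `execute` method would then be a thin wrapper that unpacks its inputs and calls `validate_manifest`, keeping the core testable without Airflow.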
The workflow should be executable outside of Airflow; the Airflow components should abstract the Airflow pieces from the core workflow itself.

## Issue 41: Manifest Ingestion - Refactor syntax and validation logic into the common ingestion Python library
Author: Alan Henson | Updated: 2022-08-23

The first step in making the syntax and validation logic within the manifest ingestion DAG reusable is to refactor it to a common place. A new ingestion Python library should be created (see https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/40), and the syntax and validation logic should be refactored into that library for reuse.

## Issue 55: Bug in utils.py method `split_id` prevents WP manifest ingestion
Author: Spencer Sutton | Updated: 2022-08-23

There is a bug in the utils.py method `split_id`: it assumes that any number at the end of a record id must be a version and must be removed, even when the number is not a version but part of the actual record id. This makes it so that every WP manifest referencing any master data cannot be ingested.
**Details:**
WP manifests reference master data like this:
`{{data-partition-id}}:wks:master-data--Well:1000:`
When this gets to the referential integrity step, the method `split_id` in utils.py takes this external reference and returns:
`{{data-partition-id}}:wks:master-data--Well`
This returned value is then passed to the Search service to check for the record's existence. Search returns nothing, because the truncated value is not a valid record id. The DAG then logs a warning and never ingests the manifest because it "failed" the referential check.
The `split_id` method should return `{{data-partition-id}}:wks:master-data--Well:1000`, just as it does for reference data records. The problem is found on these lines:
![image](/uploads/ec995d5b73ee360f3a614f36c3dc0283/image.png)
The code assumes that any run of digits at the end of a record id must be a version number, regardless of where those digits sit in the id. **This line of code needs to change to allow digits at the end of record ids.**
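A minimal sketch of the corrected splitting behavior the issue asks for (a hypothetical reconstruction; the actual utils.py code is only visible in the screenshots, and this sketch assumes the trailing colon survives until `split_id` runs rather than being stripped earlier):

```python
def split_id(record_id: str):
    """Split a manifest record reference into (id, version).

    Sketch only -- the real utils.py implementation differs.
    A trailing colon (e.g. "opendes:wks:master-data--Well:1000:")
    means "latest version", so everything before the final colon is
    the record id and there is no version segment to strip.
    """
    if record_id.endswith(":"):
        return record_id.rstrip(":"), None
    head, sep, tail = record_id.rpartition(":")
    # Only treat the final segment as a version if it is purely numeric.
    if sep and tail.isdigit():
        return head, tail
    return record_id, None


# The failing case from this issue: the id keeps its trailing "1000".
print(split_id("opendes:wks:master-data--Well:1000:"))
# A genuinely versioned reference still splits off the version.
print(split_id("opendes:wks:master-data--Well:1000:123456"))
```

The key design point is that the trailing colon, not the digits themselves, disambiguates "id ending in a number" from "id plus version", which is why stripping that colon earlier in the pipeline reintroduces the bug.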
That first if condition shown above should catch this case, since the record id we pass in has a trailing colon. However, the trailing colon is removed earlier in the process, in the method `_extract_external_references`:
![image](/uploads/1e971d49fe55ffb7ff5268cd4a84c6b2/image.png)
If you try to bypass this by removing the trailing colon in the manifest itself, the validation step throws an error and keeps you from ingesting the manifest. The only way to get past this for now is to comment out the lines of code circled in the image above.

Participants: Siarhei Khaletski (EPAM), Kateryna Kurach (EPAM)

---

**Issue #57: [POC] Install and investigate Airflow 2.0 [GONRG-2214]**
*Kateryna Kurach (EPAM) · 2022-08-23*
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/57
Install Airflow 2.0 and test:
- Backward compatibility
- Scheduler performance
- Scale the web server by scaling the size of the node that the web server is using
- Test PostgreSQL
- Review other features that can improve performance
Link to GCP issue-tracking: https://jiraeu.epam.com/browse/GONRG-2214

---

**Issue #20: Master Data - Load manifest**
*Meena Rathinavel · 2022-08-23*
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/20

Milestone: M1 - Release 0.1
Participants: Clifford Patterson, James O'Boyle, Rohit Kurhekar

---

**Issue #13: Perform ACL Check**
*Meena Rathinavel · 2022-08-23*
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/13

- Does my role permit me to add data?

Milestone: M1 - Release 0.1
Participants: Clifford Patterson, James O'Boyle
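The ACL question "does my role permit me to add data?" can be sketched as a pure group-membership check (illustrative only; in the platform this decision is made by the Entitlements and Storage services, not by DAG code, and the group names below are hypothetical):

```python
def role_permits_add(user_groups, acl_owners):
    """Sketch: a caller may add or modify a record when they belong to
    at least one of the record's ACL owner groups. Hypothetical helper,
    not part of the ingestion DAG codebase."""
    return bool(set(user_groups) & set(acl_owners))


# Hypothetical example groups for illustration.
print(role_permits_add(
    ["data.default.owners@opendes.example.com"],
    ["data.default.owners@opendes.example.com"],
))
```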