Manifest Ingestion DAG issues
https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/10
File structure updates. Airflow pluggable approach (Siarhei Khaletski, EPAM, 2021-02-09)

## Change Type:
- [x] Feature
- [ ] Bugfix
- [ ] Refactoring
## Context and Scope
The existing code base has a few disadvantages:
1. there is no standardized approach to common modules kept close to the DAGs
2. the code base is not modular, i.e. it can't be split into independent modules
This ADR proposes a vision of how to make the project more pluggable, i.e. an attempt to standardize the code base, including vendor modules.
The Decision section covers the following topics:
1. Airflow project structure update
2. How to plug in the local python packages
3. Caveats about Airflow internals
Furthermore, it should be noted that the proposal implies two streams of improvement:
1. Strategic (late R3, post R3)
- Multiple API for deployments
* Operators (reusable components)
* DAG
* Libs
Libraries would be developed by a number of vendors and hosted on the platform. DAGs would be composed, for instance within a UI, and sent to an API endpoint to be processed.
2. Immediate needs (R3)
- Single endpoint / approach for module code deployment
* DAGs
* Plugins
The second case is detailed in the proposal below.
## Decision
#### Vendor contributions
The proposed approach will allow the following:
1. Each vendor can keep their code in a separate repository
2. Vendors can contribute to core functionality
3. Vendors' ingestion extensions will live in separate Git repositories
Following the steps above, each vendor can develop their own extensions separately and deliver them when needed.
The repositories can take the following representation:
~~~
/IngestionDAGs.git #ingestion core functionality
/Vendor1.git
/Vendor2.git
~~~
Some caveats follow:
1. Extension repositories must follow the proposed code structure (see below)
2. There is a list of supported libraries that should be maintained by the operator. Library versions should be documented by CSPs
#### Code structure update
Our proposal is to split the current code base according to the following structure:
~~~
src/
├── dags/
│ ├── commons/
│ └── common_utils.py #for instance common functions to prepare DAG params/constants
│ ├── vendor_1/
│ │ ├── libs/
│ │ └── utils.py # the vendor utilities/functions
│ │ └── dag.py # the vendor DAGs here
│ └── vendor_2/
│ ├── libs/
│ ...
├── plugins/
│ ├── commons/
│ └── common_utils.py #for instance common functions to prepare operators params/constants
│ ├── vendor_1/
│ │ ├── libs/
│ │ └── utils.py # the vendor utilities/functions
│ │ ├── operators/ # the vendor operators here
│ │ ├── hooks/ # the vendor hooks here
│ │ └── ...
│ └── vendor_2/
│ ...
tests/
│ └── module (or vendor)
└── requirements.txt
~~~
Let's look deeper at the structure.
All the code will be split into module or vendor folders. Each folder will contain separate libs and dags folders. The dags folder can hold DAG files as well as sub-folders with DAGs. The libs folder can hold utility modules, etc.
The tests folder will hold unit and integration tests, split by module or vendor.
The plugins folder will be split by modules or vendors too. Files from this directory have to follow the Airflow Plugins convention. We propose to use the following approach:
~~~
...
plugins/
└── vendor_1/
├── commons
└── vendor_utils.py
├── operators
└── vendor_operator.py
├── hooks
└── vendor_hook.py
├── macros
├── ...
└── __init__.py
~~~
#### Using the Airflow Plugins Mechanism
Airflow has a built-in plugins system that requires creating AirflowPlugin instances. This, however, overcomplicates the issue and confuses many people. Airflow is even considering deprecating the Plugins mechanism for hooks and operators going forward.
**(!) According to the documentation, the Plugins mechanism must still be used, but only for plugins that make changes to the webserver UI.**
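For illustration, a minimal sketch of what plugin registration could look like when it is genuinely needed for a webserver UI extension (the plugin and blueprint names are hypothetical):

```python
# plugins/vnd/__init__.py -- hypothetical sketch of the AirflowPlugin
# mechanism, reserved (per the note above) for webserver UI extensions.
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint

# A Flask blueprint adding templates/static assets to the webserver UI.
vnd_blueprint = Blueprint(
    "vnd_plugin",
    __name__,
    template_folder="templates",
    static_folder="static",
)


class VndUiPlugin(AirflowPlugin):
    # The name under which Airflow registers this plugin.
    name = "vnd_ui_plugin"
    # Only UI-related components go through the plugin system; operators
    # and hooks are imported as plain Python modules instead (see below).
    flask_blueprints = [vnd_blueprint]
```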
How it works:
Let’s assume you have an Airflow Home directory with the following structure.
**(!) We will assume that the vendor name is vnd**
~~~
vnd/
├── commons
└── dags
└── vnd_dag.py
plugins/
└── vnd/
├── operators
└── vnd_operator.py
├── hooks
└── vnd_hook.py
├── sensors
└── vnd_sensor.py
└── __init__.py
~~~
The _vnd_dag_ wants to use _vnd_operator_ and _vnd_sensor_. Also, _vnd_operator_ wants to use **vnd_hook**. When Airflow is running, it adds _DAGS_FOLDER_, _PLUGINS_FOLDER_, and _config/_ to the Python module search path, so any Python files in those folders are importable. So from our _vnd_dag.py_ file, we can simply use
```python
from vnd.operators.my_operator import MyOperator
from vnd.sensors.my_sensor import MySensor
```
Since the _plugins_ directory at the bucket root is added to the path, the imports above start from the vendor module name.
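As a sketch of what such a plain-module operator could look like (an Airflow 2-style example; the module and class names mirror the hypothetical layout above):

```python
# plugins/vnd/operators/my_operator.py -- hypothetical vendor operator,
# importable as a plain module because PLUGINS_FOLDER is on the Python path.
from airflow.models import BaseOperator


class MyOperator(BaseOperator):
    """Toy operator showing where vendor-specific logic would live."""

    def execute(self, context):
        # Pull the manifest prepared by an upstream task from XCom; the
        # returned value is whatever that task's execute() returned.
        manifest = context["ti"].xcom_pull(task_ids="prepare_manifest")
        self.log.info("Processing manifest: %s", manifest)
        return manifest  # stored back into XCom as `return_value`
```

In _vnd_dag.py_ this operator is then imported exactly as shown above, `from vnd.operators.my_operator import MyOperator`, with no `AirflowPlugin` involved.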
**(!) Due to Airflow internals, it is strongly not recommended to put many files into _dags/commons_ or _plugins/commons_. We recommend installing such code as a package via _pip_.**
## Rationale
Some vendors provided their parsers, but it was hard to make them plug-and-run. There were many questions about where to put the parsers and how to import and use them in operators.
In the absence of a common approach and documentation, external modules can cause runtime errors.
## Consequences
1. An MR with the updated code base has to be created
2. README.md has to contain information about the structure and conventions.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/113
Manifest-based Ingestion - WPC does not ingest properly without Datasets in manifest (Norman Medina, 2024-03-21)

Ingesting a work product component record without the linked dataset records (via `Datasets[]` or a custom field) in the manifest results in the WPC record not being ingested, even if the dataset records were already ingested into OSDU. Furthermore, the Airflow logs do not show any errors regarding this; the WPC record simply disappears from the `XCOM` at the `provide_manifest_integrity_task` step.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/112
Osdu_Ingest - fails to detect issue with GeoJSON structure in Spatial block (Debasis Chatterjee, 2024-01-22)

Environment: M22/Azure/Preship
Osdu_Ingest run ID = aaf9eaa0-beec-421f-8cbe-f87f6373ed4e
['opendes:master-data--SeismicAcquisitionSurvey:DC19JAN']
Input JSON payload has some issue with GeoJSON part.
validate_manifest_schema_task shows clean log.
Later the record gets created.
When we check the record using Storage, it shows the Spatial data (WGS84 coordinates).
Search does not show that information any more.
Troubleshooting reveals that Spatial data is not indexed since GeoJSON syntax is incorrect.
"geo-json shape parsing error: must be a valid FeatureCollection attribute: SpatialLocation.Wgs84Coordinates",
The question: why did the "schema validation" step not detect the problem and stop record creation?
cc @Yan_Sushchynski
See enclosed file with additional information.
[2024-01-19-EDS-Seismic.txt](/uploads/4175777071cda54bd465b41eb0264f30/2024-01-19-EDS-Seismic.txt)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/111
is_batch option still supporting (Bruce Jin, 2023-11-14)

In the `manifest_ingestion` DAGs, we have 2 payload types, which lead to 2 flows: one is batch upload, the other is a 3-step flow with schema and integrity validations.
My question is, are we still using the batch_upload option on the left?
Since we are not checking schema and integrity on that flow, and we have the option to split 1 large file into batches in `processing_single_manifest_file_task`. ![Screenshot_2023-11-13_at_3.24.18_PM](/uploads/8a9242b16809d5dcaf1237c4c9152049/Screenshot_2023-11-13_at_3.24.18_PM.png)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/110
Provide suitable clue (error message) when the batch size is large and no records are processed/created (Debasis Chatterjee, 2023-10-19)

The JSON payloads are clean. Actually, for my test cases these were JSON files from OSDU Reference data.
And Policy service is not enabled in my OSDU instance.
I ran the payload the first time. It failed to create records.
I checked all the log files and they are all clean, which is very misleading.
I then split the initial payload into smaller pieces and then the job goes through smoothly.
What we need is a clear error message such as "in your current system configuration, it is not possible to handle a payload larger than XXX".

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/107
Process manifest task of Osdu_ingest DAG calls the Storage Service for the Dataset Id in WorkProductComponent irrespective of the outcome of validate referential integrity, resulting in creation of new record version (Meghnath Saha, 2022-12-12)

It has been observed that the process manifest task of the Osdu_ingest DAG calls the Storage Service for the DatasetId in the WorkProductComponent irrespective of the outcome of the referential integrity validation, resulting in the creation of a new record version. Also, in dataload_r3.py the FileId updated in the WorkProductComponent is **file_id:file_version**. As a result, the following challenges are encountered.
1. If the WorkProduct and WorkProductComponent are not processed by the Airflow Manifest Ingestion DAG due to a failure in referential integrity validation, then the file source information used in the first attempt cannot be used for reprocessing the manifest, because the file version in the file source JSON is no longer the latest and the referential integrity validation fails when it is reused. As a result, ingestion of the WPC is tightly coupled with the upload of the Datasets, which generates the File Source information used by the dataload_r3.py script to replace the surrogate key in the WPC manifest.
2. open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers_1_1_0/*.json use the same dataset; similarly for open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs'/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs_1_1_0'/*.json, and the same is the case with the manifests for Volve. Because of the current behavior of the DAG and dataload_r3.py described above, the same File Source information generated from the upload of the Datasets cannot be reused. As a workaround, the dataset files s3://osdu-seismic-test-data/r1/data/provided/markers/ are copied into two different directories, markers and markers_1_1_0, so that the files are uploaded separately, generating unique FileIds. Similarly for well logs.
@anujgupta @shrikgar @sukanta.bhattacharjee @aparial FYI

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/106
Manifest Ingestion by Reference - point to a large set of identical files (Debasis Chatterjee, 2022-05-10)

Discussed with Jean Francois Rainaud recently.
Such a collection has identical manifests for different records, e.g. for 5000 TNO Wellbores.
https://community.opengroup.org/osdu/platform/data-flow/data-loading/open-test-data/-/tree/master/rc--3.0.0/4-instances/TNO/master-data/Wellbore
It is feasible to make use of **File Collection** type Dataset.
https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/dataset/FileCollection.Generic.1.0.0.md
The program can point to a Dataset record which is a file collection and handle processing of all 5000 records.
Thus there can be two alternatives for the new program (Manifest Ingestion by Reference): one with a large (concatenated) JSON file and the other with a "collection".
This thought is actually triggered by user feedback (see below).
Manifest Ingestion Issues:
1. While ingesting a set of batch files, the files are picked up by the script, which invokes the DAG.
a. The DAG has a limitation of only 32 concurrent runs. Hence, when the Python script triggers 100 files, only 32 are taken at a time, and once a job finishes, the next one is picked up.
b. During concurrent runs, some of the DAG runs fail, but Airflow still shows success, which makes it painful to identify the failed file unless the customer reports that the file did not ingest properly.
2. The TNO dataset takes almost 2-3 hrs to ingest (~5000 wells), and we are concerned about massive volumes of data (TBs) and how many days they will take to ingest. Therefore, ingestion performance needs to improve further.
Regards,
Jegan (Accenture)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/102
Manifest-based Ingestion - Xcom summary (Debasis Chatterjee, 2022-03-15)

Emerged when discussing with @fhoueto.amz and @spencer earlier today -
For ingestion of a JSON payload with a large number of records, it would be useful to obtain an overall summary count of successes and failures.
The XCom summary showed created IDs and failed IDs. That is good, but if the list is long, it is hard to keep track of what worked and what failed.
So, in addition, we are requesting a summary such as:
Attempting to input '50' Master Data "Well" records.
Succeeded = 40
Failed = 10
cc - @chad , @Devendra_R , @epeysson for information

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/101
Performance review of main ingestion functions' improvements from M6 to M10 (Yan Sushchynski, EPAM, 2022-08-11)

<details><summary>GCP results - Click to expand</summary>
I was testing the main functions of Manifest Based Ingestion on my local machine from M6 to M10 releases.
Results are provided in the following table.
| Function | Manifest | M10_optimized (sec) | M9 (sec) | M8 (sec) | M7 (sec) | M6 (sec) |
|-----------------------------------------------|------------------------------|---------------------|----------|----------|----------|----------|
| schema_validator.ensure_manifest_validity | LogCurveType (42917 records) | 113 | 1453 | 1453 | 1453 | 1453 |
| | LogCurveType (800 records) | 2.6 | 25.38 | 25.38 | 25.38 | 25.38 |
| | WorkProduct | 2.5 | 2.685 | 2.895 | 2.895 | 2.895 |
| manifest_integrity_validator.ensure_integrity | LogCurveType (42917 records) | 14.94 | 15.07 | 14.67 | 40.2 | ** |
| | LogCurveType (800 records) | 5.494 | 4.677 | 5.82 | 5.141 | 3751 |
| | WorkProduct | 0.0013 | 0.001 | 0.001446 | 0.001852 | 0.001781 |
| single_manifest_processor.process_manifest | LogCurveType (42917 records) | | 2056* | ** | ** | ** |
| | LogCurveType (800 records) | | 43.18* | 439.3 | 439.3 | 439.3 |
| | WorkProduct | | 2.544 | 2.454 | 2.887 | 2.6 |
*_Sent batches of 400 records to Storage Service_
**_Can't execute this test for reasonable time (it may last more than 24h)_
### Performance improvements throughout M6-M10 releases.
#### M10 (?)
After analyzing the previous releases, some bottlenecks were found.
The slowest part of Manifest Ingestion, besides Process Manifest, was Schema Validation. After some research, it was found that the common way of using `jsonschema.validate` has a lot of overhead, since validator classes and instances are created on each schema validation.
The solution was to create a `jsonschema` validator once per unique schema and reuse it against the corresponding records. This approach is roughly 10 times faster than the usual `jsonschema.validate`.
E.g., `M9 Schema Validation` of 42917 LogCurveType records was _1453_ seconds, and it is _113.1_(!) seconds on the `M10` release.
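A minimal sketch of this optimization, assuming the standard `jsonschema` package (this is not the actual DAG code):

```python
from jsonschema import validators


def validate_records(records, schema):
    """Validate many records against one schema, reusing a single validator."""
    # Resolve and instantiate the validator class for this schema once...
    validator_cls = validators.validator_for(schema)
    validator_cls.check_schema(schema)
    validator = validator_cls(schema)
    valid, skipped = [], []
    # ...then reuse it for every record, instead of calling
    # jsonschema.validate(record, schema), which rebuilds it on each call.
    for record in records:
        errors = list(validator.iter_errors(record))
        (skipped if errors else valid).append(record)
    return valid, skipped
```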
#### M9
In the previous releases, each of the Manifest's records was saved in the Storage Service one by one, which caused a lot of requests to Storage.
After switching to the Storage Service's batch saving (up to 500 records per request) for storing the Manifest's records, it is possible to avoid extra requests to Storage.
E.g., `M8 manifest processing` of 800 LogCurveType records took _439_ seconds, meanwhile `M9 manifest processing` with batches of **400 records** took _43_ seconds.
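A rough sketch of the batching idea (the Storage endpoint path and response field are assumptions, not the exact DAG code):

```python
import requests

BATCH_SIZE = 400  # up to 500 is allowed; the tests above used 400


def save_records_in_batches(records, storage_records_url, headers,
                            batch_size=BATCH_SIZE):
    """Save records via Storage batch createOrUpdate instead of one-by-one."""
    saved_ids = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # One request per batch of records (assumed PUT .../records endpoint).
        response = requests.put(storage_records_url, json=batch, headers=headers)
        response.raise_for_status()
        saved_ids.extend(response.json().get("recordIds", []))
    return saved_ids
```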
#### M8
Improved `Manifest Integrity Validation` performance by sending batches of the external OSDU ids of all the Manifest's records to the Search Service. Before, these ids were searched one by one, which caused extra calls to the Search Service.
E.g., `M7 manifest integrity check` of 42917 LogCurveType records took _40.2_ seconds, meanwhile `M8 manifest integrity check` of the same Manifest took _14.67_ seconds.
#### M7
Improved `Manifest Integrity Validation` performance by extracting all the external references of the Manifest into a single set of unique ids, which is only then searched in the OSDU Search Service. This significantly reduced the number of requests to the Search Service; earlier, each Manifest record's external references were searched separately, which called the Search Service with the same requests many times.
E.g., `M6 manifest integrity check` of 800 LogCurveType records took _3751_ seconds, meanwhile `M7 manifest integrity check` of the same Manifest took _5.141_ seconds.
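Combining the M7 and M8 ideas, a hedged sketch (the Search endpoint and query syntax are assumptions; `extract_refs` is a hypothetical helper returning a record's external reference ids):

```python
import requests


def find_existing_ids(manifest_records, extract_refs, search_query_url,
                      headers, batch_size=500):
    # M7: deduplicate external references across the whole manifest first.
    unique_ids = set()
    for record in manifest_records:
        unique_ids.update(extract_refs(record))
    # M8: query the Search Service in batches instead of id-by-id.
    found = set()
    ids = sorted(unique_ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        body = {
            "kind": "*:*:*:*",
            "query": "id:({})".format(" OR ".join(f'"{i}"' for i in batch)),
            "returnedFields": ["id"],
            "limit": len(batch),
        }
        response = requests.post(search_query_url, json=body, headers=headers)
        response.raise_for_status()
        found.update(r["id"] for r in response.json().get("results", []))
    return found
```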
</details>
<details><summary>AWS results - Click to expand</summary>
#### M12
* The Manifest by Reference implementation performs only marginally faster than the current implementation in the validation and integrity check stages. If the ADR's design of adding a POST request to the stage were accepted, these marginal improvements might actually become slower.
* The reference implementation showed a 9x performance decrease compared to the existing implementation for the process_manifest step.
| Function | Manifest | Original Manifest Ingestion (avg sec) | Manifest By Reference (avg sec) |
|-----------------------------------------------|----------------|---------------------------------------|---------------------------------|
| schema_validator.ensure_manifest_validity | 4kb Manifest | 2.85 | 3.12 |
| | 128kb Manifest | 6 | 6 |
| | 4mb Manifest | 6.12 | 5.9 |
| manifest_integrity_validator.ensure_integrity | 4kb Manifest | 2.85 | 2.98 |
| | 128kb Manifest | 4 | 3.5 |
| | 4mb Manifest | 4.27 | 4 |
| single_manifest_processor.process_manifest | 4kb Manifest | 2.73 | 24.7 |
| | 128kb Manifest | 4 | 25 |
| | 4mb Manifest | 2.9 | N/A** |
| Total time | 4kb Manifest | 8.43 | 30.8 |
| | 128kb Manifest | 14 | 34.5 |
| | 4mb Manifest | 13.29 | N/A** |
</details>

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/92
Manifest-based Ingestion - add persistable reference information by using content from Reference data (Debasis Chatterjee, 2022-12-09)

Current approach - specify the persistable reference inside the load manifest (JSON file).
We propose that the Manifest Ingestion process get this information from Reference data and add the entry "on the fly" just before using the Storage Service to create the new record. This way, the Data Loader does not need to worry about adding this information himself or herself.
cc - @Keith_Wall , @ChrisZhang , @epeysson , @Kateryna_Kurach, @chad , @blasscoc (for information)
```
"meta": [
{
"kind": "Unit",
"name": "ms",
"persistableReference": "{\"abcd\":{\"a\":0.0,\"b\":0.001,\"c\":1.0,\"d\":0.0},\"symbol\":\"ms\",\"baseMeasurement\":{\"ancestry\":\"T\",\"type\":\"UM\"},\"type\":\"UAD\"}",
"unitOfMeasureID": "{{data-partition-id}}:reference-data--UnitOfMeasure:ms:",
"propertyNames": [
"RecordLength"
]
},
```
For example, see this retrieval query.
Body
```
{
"kind": "{{data-partition-id}}:wks:reference-data--UnitOfMeasure:1.0.0",
"limit":5000,
"query": "id: \"{{data-partition-id}}:reference-data--UnitOfMeasure:ms\"",
"returnedFields": ["*"]
}
```
Response
```
{
"results": [
{
"data": {
"AttributionPublication": "Energistics Unit of Measure Dictionary V1.0",
"PersistableReference": "{\"abcd\":{\"a\":0.0,\"b\":0.001,\"c\":1.0,\"d\":0.0},\"symbol\":\"ms\",\"baseMeasurement\":{\"ancestry\":\"T\",\"type\":\"UM\"},\"type\":\"UAD\"}",
"InactiveIndicator": false,
"UnitDimensionCode": "T",
"UnitQuantityID": "odesprod:reference-data--UnitQuantity:T:",
"Code": "ms",
"Source": "Workbook Published/UnitOfMeasure.1.0.0.xlsx; commit SHA c1d72417.",
"Name": "millisecond",
"AttributionAuthority": "Energistics",
"IsBaseUnit": false,
"AttributionRevision": "1.0",
"ID": "ms",
"CoefficientC": 1.0,
"UnitDimensionName": "time",
"CoefficientD": 0.0,
"CoefficientA": 0.0,
"CoefficientB": 0.001
},
"kind": "odesprod:wks:reference-data--UnitOfMeasure:1.0.0",
"source": "wks",
"acl": {
"viewers": [
"data.default.owners@odesprod.osdu-gcp.go3-nrg.projects.epam.com"
],
"owners": [
"data.default.owners@odesprod.osdu-gcp.go3-nrg.projects.epam.com"
]
},
"type": "reference-data--UnitOfMeasure",
"version": 1617286097841123,
"createTime": "2021-04-01T14:08:17.885Z",
"authority": "odesprod",
"namespace": "odesprod:wks",
"legal": {
"legaltags": [
"odesprod-demo-legaltag"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"createUser": "osdu-sa-airflow-composer@osdu-service-prod.iam.gserviceaccount.com",
"id": "odesprod:reference-data--UnitOfMeasure:ms"
}
],
"totalCount": 1
}
```
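A sketch of how the ingestion step could assemble the `meta` entry "on the fly" from such a search result (the helper and argument names are illustrative):

```python
def build_meta_entry(uom_record, property_names):
    """Build a Frame of Reference `meta` entry from a UnitOfMeasure record.

    `uom_record` is one entry of `results` from the search response above;
    `property_names` lists the record properties the unit applies to.
    """
    data = uom_record["data"]
    return {
        "kind": "Unit",
        "name": data["Code"],  # e.g. "ms"
        "persistableReference": data["PersistableReference"],
        "unitOfMeasureID": uom_record["id"] + ":",
        "propertyNames": property_names,  # e.g. ["RecordLength"]
    }
```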
Enclosed is an end-to-end example of Unit Conversion (Frame of Reference) in GCP, R3M8, using the manifest-based ingestion process.
[OSDU_PTP_M8_TeamA_GCP-Manifest-Ingestion-unit-convert-Debasis-Naufal.txt](/uploads/73b4e9e586c9c7649d8581fd84e86100/OSDU_PTP_M8_TeamA_GCP-Manifest-Ingestion-unit-convert-Debasis-Naufal.txt)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/90
ADR: Airflow DAG run output to Workflow service (Yan Sushchynski, EPAM, 2023-07-05)
## Introduction
The DAG consists of several steps. The manifest’s entities (Master-data records, Reference-data records etc.) can be marked as skipped and extracted from the original manifest on each step, if they don’t pass validations (e.g., they don’t follow proper schemas, have inconsistent data, can’t be stored in Storage due to wrong ACL).
At the moment, each step's skipped-entities info is stored in XComs. The valid manifest without the skipped records is also passed to the downstream step in the `return_value` field.
![image](/uploads/9baf17a02c1924be8ac5ebe188b274f9/image.png)
The aggregated result, grouped by each previous task, is stored in the `update_status_finished_task`'s XCom.
![image](/uploads/9facb75a9cb3b3c5dd015f5d9ee76d45/image.png)
This approach lets us quickly see what is skipped on each step of the DAG. However, to see the skipped ids, the user must visit the Airflow UI and must know the exact `dag_run ID` of each workflow. Also, it gets difficult to keep track of skipped records if the Manifest contains hundreds or even thousands of records. The most prominent example is ingesting `LogCurveType` Reference-data (https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/reference-values/-/blob/master/Manifests/reference-data/LOCAL/LogCurveType.1.0.0.json) with tens of thousands of records.
As the end user must interact with the Workflow Service only and must not be aware of the Airflow ingestion engine, a mechanism for notifying the Workflow Service about the current status of each task must be developed.
Either way, we must extend the existing Workflow Service API.
Relates to https://community.opengroup.org/osdu/platform/system/home/-/issues/87
## Proposals
### Get XCom outputs through Workflow Service using Stable API.
This is the most straightforward way and doesn't require changing the DAGs' implementations. The Stable API lets the user get XCom entries for concrete `dag_runs` and `task_ids`.
So, when we request the workflow run status, the Workflow Service can get the XCom entry containing the aggregated report from the `update_status_finished_task` step.
The link to the method description for getting XCom is here:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_xcom_entry
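For instance, the Workflow Service could fetch the aggregated report roughly like this (a sketch against the Airflow 2 stable REST API; authentication details are omitted and assumed):

```python
import requests


def get_dag_run_report(airflow_url, dag_id, dag_run_id, auth):
    """Read the aggregated report XCom of update_status_finished_task."""
    url = (
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}"
        "/taskInstances/update_status_finished_task/xcomEntries/return_value"
    )
    response = requests.get(url, auth=auth)
    response.raise_for_status()
    # The XCom value is returned serialized in the "value" field.
    return response.json()["value"]
```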
![image](/uploads/aaf0c6522999d8f947984d4450f88747/image.png)
Possible response on `/v1/workflow/{workflow_name}/workflowRun/{runId}`:
```json
{
"workflowId": "string",
"runId": "string",
"startTimestamp": 0,
"endTimestamp": 0,
"status": "FINISHED",
"report": {
"saved_record_ids": {},
"skipped_ids": {
"process_single_manifest_file_task": [
{
"id": "<data-partition-id>:work-product-component--SeismicBinGrid:Auto_Test_999493329875",
"kind": "work-product-component--SeismicBinGrid",
"reason": "400 Client Error: Bad Request for url."
}
]
}
},
"submittedBy": "string"
}
```
Pros:
- Easy to implement.
- Doesn’t require changes in Airflow Operators.
- Few changes in the Workflow Service.
Cons:
- XComs can be huge, which can cause problems with sending such messages via HTTP and reading them.
### Save status in Dataset/File Service as files
Another approach is not to pass skipped ids via XComs but to save the skipped-entities report as a file using the Dataset Service.
We can add new endpoints to the Workflow Service for updating the current workflow run status, so Airflow will send a report to it on each task. At the same time, the Workflow Service will create a new dataset by uploading the report as a file to the Dataset Service.
After the workflow is finished, the Workflow Service will show the `ID`s of the datasets with the `FileSource` of the report in the workflow run. Then, it will be possible to get a `signedURL` for the report files using the retrieval instructions of the Dataset/File Service.
Also, saving reports as files in the Dataset Service can be implemented on the Airflow side; then the Workflow Service will get the `Dataset IDs` only.
![image](/uploads/534b547631f4bc72e4510c3b2f47faba/image.png)
Possible response on `/v1/workflow/{workflow_name}/workflowRun/{runId}`:
```json
{
"workflowId": "string",
"runId": "string",
"startTimestamp": 0,
"endTimestamp": 0,
"status": "FINISHED",
"report": {"task_id": "<dataset-id-of-report>"},
"submittedBy": "string"
}
```
Pros:
- Saves large reports in the Dataset Service, not in XComs.
- Follows the "Manifest as a reference" approach.
- Using the Dataset Service is not CSP-specific.
- The user can get the required report as a file from the Dataset Service using the dataset-id only.
Cons:
- Requires extra changes in Workflow Service.
- Need to think about how to display the reports if they are stored as files in the Dataset Service.
- All records in Dataset Service are indexed.
- We need to consider creating a specific kind for reports.
### Save status in Cloud Storage with signed URL
Another approach is to use a `signed URL` from the Cloud Storage. Airflow Operators can request a new file location and `signed URL` in some landing zone (bucket) from the Workflow Service(?). The report of skipped ids can then be stored via this signed URL.
Every task will request a signed URL and the "real" location of the Storage object (e.g., `gs://…`, `s3://…`) at the start of its execution. Each task will use the `signed URL` to save intermediate data. We can't pass signed URLs around, because they have an expiration date. That's why the tasks' XCom content will consist of Storage paths. The final task will collect all XComs from the upstream tasks and provide an aggregated report grouped by upstream task.
When the user asks for the `workflow_run` status, the Workflow Service must be able to convert cloud-storage paths into `signed URLs` to make the reports accessible to users.
![image](/uploads/d6473c29882821b28037a6574968798c/image.png)
Possible response on `/v1/workflow/{workflow_name}/workflowRun/{runId}`:
```json
{
"workflowId": "string",
"runId": "string",
"startTimestamp": 0,
"endTimestamp": 0,
"status": "FINISHED",
"report": {"task_id": "<signed-url-to-report>"},
"submittedBy": "string"
}
```
Pros:
- We use a separate landing zone instead of the Dataset Service so as not to litter it with transient data.
- Reduces XComs' content.
Cons:
- Need to implement significant changes in Workflow Service and add extra responsibilities to it.
- Working with Landing Zone is cloud-specific.
- Probably, we'll need to introduce a new service to save reports to the landing zone and retrieve them.
## Open OSDU Architectural questions.
1. We need to discuss the structure of the report.
- What fields of records must be present in the report?
- What kind of information must be in this report?
- How can we make this report readable for users?
2. For now, it is hard to identify a record if it doesn't have a system-generated or unique ID.
3. Also, we need to think about access to the workflow results depending on the `ACL` of the records. That means the user is allowed to get the report only about the skipped records with a corresponding `ACL`. Also, there is the situation where the same report contains records with different `ACLs`.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/84
Manifest-based Ingestion - check for coherent value in triplet - wpc WellLog (LogCurveType and parents tree) (Debasis Chatterjee, 2022-08-23)

https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/E-R/work-product-component/WellLog.1.0.0.md
data.Curves[].LogCurveMainFamilyID
data.Curves[].LogCurveFamilyID
data.Curves[].LogCurveTypeID
The integrity check needs to detect incoherent values, if provided by the Data Loader by mistake.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/83
Manifest-based Ingestion - add check for coherent values in pairs (ex: Play and PlayType in Master entity "Field") (Debasis Chatterjee, 2022-08-23)

Airflow Console - it shows the provide_manifest_integrity_task step.
![Thomas-Master-data-Field](/uploads/e3c3414784505cca209c6547270f568a/Thomas-Master-data-Field.PNG)
In this step, a check is performed to see if a referenced value (for a Reference entity or a Master entity) actually exists.
Ex: Master entity Organisation -> Reference entity OrganisationType
Ex: Work-product Component WellLog -> Master data Wellbore
Ex: for Master data "Organisation"
`"OrganisationTypeID": "{{data-partition-id}}:reference-data--OrganisationType:BogusValue:",`
Portion of the log file showing the reason for the failure:
```
[2021-07-30 14:24:40,991] {validate_referential_integrity.py:188} WARNING - Resource with
kind odesprod:wks:master-data--Organisation:1.0.0 and
id: 'odesprod:master-data--Organisation:Katalyst30Jul1' was rejected.
Missing ids '{'odesprod:reference-data--OrganisationType:BogusValue:'}'
```
It is in this step that I propose adding a consistency check for "paired values" in the Master entity "Field", e.g. Play and PlayType.
> https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/E-R/master-data/Field.1.0.0.md
Ex: the Master entity "Play" has a record "Play-Thomas1" which is of PlayType="Shale".
But the Data Loader specifies an incoherent combination when creating a new "Field" record:
Play="Play-Thomas1" but PlayType="OilSands" instead of "Shale".
We propose detection of such inconsistency during manifest-based ingestion, as sketched below.
![Thomas-Master-entity-Field](/uploads/f22192aac578abcf490a3e237f24dbd0/Thomas-Master-entity-Field.jpg)
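A sketch of what such a paired-value check could look like (the field names are illustrative, based on the Field schema linked above; `fetch_record` is a hypothetical helper returning a stored record by id):

```python
def check_play_pair(field_record, fetch_record):
    """Flag a Field record whose PlayType contradicts the referenced Play."""
    for play_ref in field_record["data"].get("Plays", []):  # illustrative shape
        play_record = fetch_record(play_ref["PlayID"])
        stored_type = play_record["data"].get("PlayTypeID")
        provided_type = play_ref.get("PlayTypeID")
        if provided_type and stored_type and provided_type != stored_type:
            return ("Incoherent pair: Play '{}' is stored with PlayType '{}', "
                    "but the manifest says '{}'").format(
                        play_ref["PlayID"], stored_type, provided_type)
    return None  # no inconsistency found
```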
cc : @gehrmann

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/70
Manifest by reference - Create an operator to push/pull manifests based on record id (Ben Lasscock, 2021-06-30)

In the proposed system, only the manifest "record id" will be propagated through Airflow using XCOM. In this issue, develop a method to obtain a manifest using the dataset service given a "record id" (which might be a signed url to a file).

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/69
Manifest by reference - Use dataset service to move the manifest to the storage area (Ben Lasscock, 2022-01-18)

Manifest by reference requires a method to move the manifest from the landing zone to temporary storage on the platform. We propose using the dataset service for this, with the expectation that the dataset service can move the file and return a signed url for use by the ingestion workflow. This URL will be communicated to the workflow service (by a POST request).

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/67
ADR : Manifest by reference (Ben Lasscock, 2023-12-19)

## Status
* [x] Proposed
* [x] Trialing
* [x] Under review
* [x] Approved
* [ ] Retired
# Context
A lack of performance of the first implementation of the Ingestion_Dags_R3 was reported by pre-shipping team A and confirmed by tests carried out by EPAM in early April 2021.
In early April 2021, pre-shipping team A and EPAM identified two important issues with the performance of the manifest-based ingestion using Airflow 1.10.15. First, there was found to be a limit on the maximum size of the manifests [1-3] (up to 4Mb), which limited the utility of the application. Second, there have been questions about the likely scalability of the ingestion given the architectural design of the Airflow system [4]. Other issues, namely the minimum time required to ingest a single record and the limitation of submitting one DAG per second, have also been identified but are not addressed by this ADR (these will be alleviated by a move to Airflow 2.0).
[1] https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues/64
[2] https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues/98
[3] https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues/99
[4] Feedback from our weekly PMC meetings.
### Apache Airflow and XCOM
As we see in Fig. 1, the manifest is initially posted to the ingestion service (DAG) and then flows through operators within the DAG. At each node in the graph (an operator), the manifest must be loaded/saved to/from XCOM (see appendix for definitions).
In a manifest based ingestion workflow, the operators apply transformations to the manifest itself, which might include actions like validation (removing invalid records), replacing surrogate keys, etc. For a general use case, if a manifest is initially large, it can remain large throughout. This creates two problems: first, there is an overhead for serializing and deserializing the manifest at each operator. Second (see the appendix), the size of objects that can be stored in XCOM depends on the database used in the backend, resulting in issues [1-3]. The "max manifest size" of the application also becomes strongly platform dependent if we rely on XCOM to store manifests. Moreover, Airflow is a workflow orchestration tool, not a processing engine like Apache Spark; Airflow is not optimized to manage big data, so from an architectural standpoint, we need to change how we use XCOM.
![XCOM-existing](/uploads/d2a65e2706243c29786bce79c8ace249/XCOM-existing.png)
The solution envisaged is: rather than pass the full manifest payload to the Ingestion Workflow, pass a pointer (via dataset id, or some other construct) to the manifest to be processed.
# Proposal
The purpose of this project is to improve the performance of the manifest based ingestion, by reducing the overhead of passing (potentially large) manifests through an https interface, when such an operation is not needed. This approach is also intended to better handle larger manifests for manifest ingestion.
This project MUST be conducted after a necessary evolution of the overall development environment, which has to take into account the move to Airflow 2.0 and may change the process used today by adding a preparation phase before reusing the already-developed operators.
On a first pass (in March) the following behavior was proposed:
1. In this approach, a user or process would create one or more manifests (could be done using tooling)
2. The user or process then uploads the manifests to OSDU using the Dataset service (e.g., get storage instructions, upload file, store metadata record, get the record id)
3. Invoke the workflow service, passing in one or more record ids that point to the uploaded manifest(s)
4. Create a DAG operator capable of fetching the manifest(s) from storage using the Dataset service (get retrieval instructions)
Then, in late May, we (GEOSIRIS) presented the following diagram:
![XCOM-Updated](/uploads/fe3e68cc2d3c06dfd86da11d1ea6a57f/XCOM-Updated.png)
In this proposal, the Airflow context still exists but contains only a reference to a storage location holding the full manifest. We could set up (or reuse) a method before starting the DAG which uses a POST (e.g. from Postman) to load the manifest into this storage area (from the landing zone with the Dataset Service).
Then, we obtain a manifest ID which can be sent from Postman to the first operator. The interaction with the Airflow context via XCOM remains, but it contains only the manifest id. The operator then uses this id to POST and GET the full manifest from this storage area via the Dataset Service, and at the end, if necessary, the manifest can be deleted from this storage area.
In this situation the manifest is not only read but also written to the storage area with the Dataset Service. Currently we are not persisting manifests [in the platform], and the consensus is that, by default, we should not be storing the manifests after the records are ingested. We can discuss the interest of being able to record the updated manifest versions produced by the ingestion workflow service, but this could be out of scope here.
If it is decided that storing the manifests is in scope, we should keep at least two versions of each manifest in the dataset area: the original manifest and the manifest post-processing.
### Scope
In scope:
1. Develop an efficient method to POST the manifest to/from the landing zone and the operators (by using the Dataset Service)
2. Demonstrate how existing operators should use this feature.
3. (optionally) store the final manifest in the persistent zone.
Out of scope:
1. Re-implement existing operators to use this feature.
2. Storing of status of ingested records for logging with the AdminUI.
### Milestones
1. Build up locally a complete development environment based on the existing one, after the migration from Airflow 1.10.15 to Airflow 2.0.1 by the EPAM team. This will allow implementing and testing a cloud-agnostic solution which can be delivered to the EPAM team and others.
2. Enhance the preparation and loading services to provide a way to record an original manifest via the dataset service as a file to which a URL is attached. This URL will be communicated to the workflow service (by a POST request).
As we will have to provide this capability, it would be good if we could impose that, during the preparation and loading step, the other files stored by the dataset service (e.g. LAS, LIS, and WITSML) be handled before storing the original manifest, in order to ensure that the referenced URLs for these files are correctly associated with the manifest before the ingestion workflow.
At the end of the preparation and loading step, each manifest will be associated with a URL.
Another enhancement of the preparation and loading could be envisaged (presented at the end).
3. Instead of directly pushing/pulling a full manifest with XCom inside Airflow's running context, the new process will communicate the record id of a manifest to the ingestion workflow. The first operator gets the full manifest from the dataset service using the record id it received. Some modifications may be made to the manifest, and then the operator stores the new version in the dataset service and obtains a new record id for this updated manifest. This id is then given to the following operator by pushing it into the workflow context with XCom. Thus, after each step in which an operator updates the manifest, a new record id is requested to store the new version, and that new record id is passed to the following operator.
If an operator needs access to all manifests created during the workflow, then instead of saving only the last updated manifest record id in the Airflow context with XCom, operators could save the list of all record ids received plus their own updated manifest record id. This corresponds to the representation in the "New approach" figure above, and could be useful, for example, for an operator that verifies the results of all other operators. During the whole ingestion workflow, all the updated manifests will be kept alive by the dataset service in a data lake. Depending on the final decision, at the end of the process (after all operators have completed successfully), a "cleaning" operator could be added to remove all manifests created by this ingestion workflow; or the last manifest, whose record id will be sent to the storage service, is kept along with the original one. A sketch of this per-operator pattern is given after this milestone list.
4. Optional enhancement: a proposal for a more substantial development of the preparation and loading step.
An idea to generalize this approach and deliver only one record id to an ingestion workflow able to manage many manifests (in batch) could be to define, before launching the workflow service, a standard way to create a "manifest file package": a file containing references to the manifest file names and their locations in a data lake.
This Manifest_File_Package will be communicated to the workflow service.
Example of a Manifest_File_Package:
```json
{
  "kind": "osdu:wks:Manifest_File_Package:1.0.0",
  "Datasets": [
    {
      "data": {
        "DatasetProperties": {
          "FileSourceInfo": {
            "FileSource": "",
            "Name": "load_log_1051_ana01_1962_comp.json",
            "PreloadFilePath": "s3://osdu-manifest-provided/load_log_1051_ana01_1962_comp.json"
          }
        }
      }
    },
    {
      "data": {
        "DatasetProperties": {
          "FileSourceInfo": {
            "FileSource": "",
            "Name": "load_1067_aps01_1971_comp.json",
            "PreloadFilePath": "s3://osdu-manifest-provided/1067_aps01_1971_comp.json"
          }
        }
      }
    }
  ]
}
```
The ingestion workflow will need to access the following URLs: s3://osdu-manifest-provided/load_log_1051_ana01_1962_comp.json
In this case all the manifests will be accessible in the data lake independently and communicated one by one to the ingestion workflow from the record id delivered in the dataset properties of the Manifest_File_Package delivered to the ingestion workflow (we will have only one POST request for processing many manifests).
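As referenced in milestone 3, here is a sketch of the per-operator pattern (the dataset-service calls are hypothetical placeholders, not a real client API):

```python
from airflow.models import BaseOperator


def dataset_service_get(record_id):
    """Hypothetical: fetch a manifest via Dataset Service retrieval instructions."""
    raise NotImplementedError


def dataset_service_put(manifest):
    """Hypothetical: store a manifest via Dataset Service, returning a record id."""
    raise NotImplementedError


class ManifestByReferenceOperator(BaseOperator):
    """Pull a manifest by record id, transform it, store the new version."""

    def execute(self, context):
        ti = context["ti"]
        # Only the record id travels through XCom, never the full manifest.
        record_id = ti.xcom_pull(key="manifest_record_id")
        manifest = dataset_service_get(record_id)
        updated = self.transform(manifest)
        new_record_id = dataset_service_put(updated)
        ti.xcom_push(key="manifest_record_id", value=new_record_id)
        return new_record_id

    def transform(self, manifest):
        # Operator-specific work (validation, surrogate-key replacement, ...).
        return manifest
```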
### Appendix
Apache Airflow [docs](https://airflow.apache.org/docs/apache-airflow/1.10.1/concepts.html?highlight=xcom):
An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators….
… This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described elsewhere in this document.
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of "cross-communication".... Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/64
[Manifest-based] Deferred integrity check (ELT) and/or Current implementation (ETL) (Kateryna Kurach, EPAM, 2021-06-23)

There is a situation when the Manifest contains entities that have links to other pieces of data; these links can refer either to entities inside the Manifest or to entities already stored in OSDU. In the current implementation, we check the entities' integrity during manifest-based ingestion, and it can take a lot of time to check every entity's references to other ones. Also, there is a problem when the ingested entity doesn't have a unique id or has a surrogate key; this causes issues with identifying entities skipped due to inconsistency. The solution may be to store entities as they are, get unique OSDU ids, replace surrogate keys with real ids, and then start a background DAG that will check the data consistency of each record. For sure, a mechanism for setting the current status of records (consistent, not consistent, not verified) must be invented.
This solution has to be discussed in more detail.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/59
Develop log formatter for Airflow logs (Kateryna Kurach, EPAM, 2021-04-28)

We had an OSDU community sync where the problem of log formatting was discussed. It was decided that a repository will be created where developers will provide formatters and handlers for the standard 'logging' Python library.
https://community.opengroup.org/osdu/platform/system/lib/core/python-core-common - repository

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/58
ADR: General purpose batch write DAG operator (Alan Henson, 2023-07-05)

## Status
- [X] Draft
- [ ] Proposed
- [ ] Trialing
- [ ] Under Review
- [ ] Approved
- [ ] Retired
## Context
There are a wide variety of volume-based use cases that drive how ingestion with the OSDU(TM) data platform will occur. The use cases span from a single record to millions of records. There are also multiple sources of data in multiple formats. Additionally, the Storage Service `createOrUpdate` API endpoint is by default programmed to receive at most 500 records at a time. As such, any ingestion workflow must determine how many records it needs to save and, if that number exceeds 500, batch its writes accordingly.
However, the lowest common denominator is a record that will be stored in OSDU via the Storage API. Therefore, we have the ability to design and build a DAG operator capable of receiving a list of records, batching them according to the Storage Service's `createOrUpdate` configuration, performing the writes, capturing the results, and making them available via logging. This approach will prevent other ingestion workflows from implementing custom batching, which reduces code duplication and enables a move toward standardization. A sketch of such an operator follows.
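As a rough sketch of such an operator (the endpoint path, response shape, and hook-free `requests` usage are assumptions for illustration):

```python
import requests
from airflow.models import BaseOperator


class BatchStorageWriteOperator(BaseOperator):
    """Write records to Storage `createOrUpdate` in batches of at most 500."""

    def __init__(self, storage_records_url, headers=None, batch_size=500, **kwargs):
        super().__init__(**kwargs)
        self.storage_records_url = storage_records_url  # assumed .../records
        self.headers = headers or {}
        self.batch_size = batch_size

    def execute(self, context):
        # Records arrive in order from an upstream task via XCom.
        records = context["ti"].xcom_pull(key="records") or []
        saved_ids = []
        for start in range(0, len(records), self.batch_size):
            batch = records[start:start + self.batch_size]
            response = requests.put(self.storage_records_url, json=batch,
                                    headers=self.headers)
            response.raise_for_status()
            ids = response.json().get("recordIds", [])
            # Log each batch's outcome, mirroring the XCom logging style.
            self.log.info("Batch %d saved: %s", start // self.batch_size, ids)
            saved_ids.extend(ids)
        return saved_ids  # pushed to XCom as `return_value`
```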
## Scope
- A single DAG Operator that has an expected set of inputs, outputs, and errors
- The DAG Operator will have the ability to receive a list of records, which it will batch and send to the Storage Service's `createOrUpdate` API endpoint
- The DAG Operator will write the records in the order provided by the list (starting with position 0 - assuming a zero-based list)
- The DAG Operator will log the ID of each record and its outcome (success, error) using the XCom logging style used by Manifest Ingestion
- The DAG Operator will not handle Surrogate Keys (or should it?)
## Decision
- Create a common DAG Operator that can batch and write records to the Storage Service's `createOrUpdate` API endpoint.
## Rationale
- This approach will standardize the writing step of ingestion, provide batching for the Storage Service's limit on `createOrUpdate`, and reduce code duplication by creating a reusable DAG Operator.
## Consequences
- No consequences as the DAG Operator is optional. This ADR does not suggest making the use of the generic batch operator a requirement for DAG implementations.
## When to revisit
- N/A
## Tradeoff Analysis - Input to decision
- No tradeoffs as leveraging the DAG Operator is optional. Other ingestion workflows may opt to exclude it from their DAG.
## Decision timeline
Decision ready to be made.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/56
Audit trail shows generic service account and not the actual Data Loader's name (Kateryna Kurach, EPAM, 2023-08-21)

Original issue identified by the Pre-shipping team:
https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues/110
Please see a new record created by the "Ingestion workflow" in the AWS Pre-ship (R3M4) environment. The issue is that the real username is not captured; instead we see a service account name: **[serviceprincipal@testing.com](mailto:serviceprincipal@testing.com)**
GET {{osduonaws_base_url}}/api/storage/v2/records/osdu:master-data--Well:katetest2:mar24
```json
{
  "data": {
    "ResourceSecurityClassification": "osdu:reference-data--ResourceSecurityClassification:Public:",
    "Source": "NL_TNO",
    "SpatialLocation": {
      "Wgs84Coordinates": {
        "type": "FeatureCollection",
        "features": [
          {
            "type": "Feature",
            "geometry": { "type": "Point", "coordinates": [3.51906683, 55.68101428] },
            "properties": {}
          }
        ]
      }
    },
    "FacilityID": "10110909",
    "FacilityTypeID": "osdu:reference-data--FacilityType:WELL:",
    "FacilityOperator": [
      {
        "FacilityOperatorID": "410464",
        "FacilityOperatorOrganisationID": "osdu:master-data--Organisation:HESS:"
      }
    ],
    "FacilityName": "DC-A05-01",
    "FacilityNameAlias": [
      { "AliasName": "DC-A05-01", "AliasNameTypeID": "osdu:reference-data--AliasNameType:WELL_NAME:" }
    ],
    "FacilityEvent": [
      {
        "FacilityEventTypeID": "osdu:reference-data--FacilityEventType:SPUD_DATE:",
        "EffectiveDateTime": "1999-06-03T00:00:00"
      }
    ],
    "VerticalMeasurements": [
      {
        "VerticalMeasurementID": "Kelly Bushing",
        "VerticalMeasurement": 36.6,
        "VerticalMeasurementPathID": "osdu:reference-data--VerticalMeasurementPath:DEPTH_DATUM_ELEV:"
      }
    ],
    "NameAliases": [],
    "GeoContexts": []
  },
  "meta": [],
  "id": "osdu:master-data--Well:katetest2:mar24",
  "version": 1616620705209879,
  "kind": "osdu:wks:master-data--Well:1.0.0",
  "acl": {
    "viewers": ["data.default.viewers@osdu.testing.com"],
    "owners": ["data.default.owners@osdu.testing.com"]
  },
  "legal": {
    "legaltags": ["osdu-public-usa-dataset-1"],
    "otherRelevantDataCountries": ["US"],
    "status": "compliant"
  },
  "createUser": "serviceprincipal@testing.com",
  "createTime": "2021-03-24T21:18:24.936Z",
  "modifyUser": "serviceprincipal@testing.com",
  "modifyTime": "2021-03-24T21:18:25.223Z"
}
```