# OSDU Data Platform - Data Loading Quick Start Guide
## Contents
[[_TOC_]]
## Objective
This quickstart guide is intentionally simple, with the objective of getting you up to speed with the basics of data loading and data ingestion. By the end of the guide, you should have a grasp of the key concepts and methods of both.
## Important Documentation :exclamation:
These are the important links/documents that you should read first and be familiar with:
1. [OSDU Schema Usage Guide](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/README.md) - This is the latest schema usage guide that you should be familiar with to better understand the data schema structure.
2. [OSDU Data Definitions](https://community.opengroup.org/osdu/data/data-definitions) - This page contains the latest data definitions schema and reference values.
3. [OSDU Core Services](https://community.opengroup.org/osdu/documentation/-/wikis/Core-Services-Overview) - This contains the latest core services API and documentation that the OSDU platform supports.
## Overview
The OSDU Data Platform is versatile and designed to support multiple data loading use cases. The approaches recommended in this document are meant to offer a perspective for data ingestion. This document does not intend to prescribe the only path to data ingestion, and the approach provided is illustrative of some of the platform capabilities. We encourage you to engage with the OSDU member community with questions and feedback.
This data loading guide describes the latest practices for ingesting data into the OSDU Data Platform. The contents are expected to change and evolve, as the data loading capabilities of the platform are continuously updated. Once the workflows mature, this content will be updated and reflected in the official documentation.
This document addresses end-to-end data loading from the perspective of the end-user, which in most cases is a member of the information management or data platform capabilities team. Hence, this guide assumes that this end-user has some basic technical knowledge of HTTP Web APIs, JSON data structures, and Python.
## Introduction
The Data Flow workstream covers the full end-to-end process of bringing data into the OSDU platform. It comprises two parts, Data Loading and Data Ingestion:
* Loading – this workstream captures all the work necessary to ready the data for ingestion. Activities might include:
* Data fetching and organization
* Data massaging and formatting
* Manifest/Metadata creation
* Loading source data to a landing zone or staging area
* Ingestion – this workstream brings data into the OSDU Data Platform. There are several data ingestion services available.
### Data Platform Overview (Simplified)
Below is a simplified overview of the data platform. There are more interactions and layers between services that are not fully illustrated below.
1. [Reference Data](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data) - These are the standard names for data values. For example, the reference value for measured depth is MD and for elevation is ELEV. Before these values can be used, the reference data must first be loaded into the OSDU platform. There are three governance levels for reference data:
   * [Fixed](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data/FIXED) - Pre-determined by agreement in the OSDU Forum and shall not be changed. This allows interoperability between companies.
   * [Open](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data/OPEN) - Agreed by the OSDU Forum, but companies may extend the list with custom values. Custom values shall not conflict with Forum values. This allows some level of interoperability between companies.
   * [Local](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data/LOCAL) - The OSDU Forum makes no declaration about the values, and companies need to create their own lists. Such lists do not benefit much from interoperability, and agreed-upon values are hard to come by.
2. [Master Data](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Generated/master-data) - A record of the information about business objects that we manage in the OSDU record catalog. For example, a list of field names with well names and their associated wellbore names.
3. [Work Product](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Generated/work-product) - A record that ties together a set of work product components, such as a group of well logs inside a wellbore.
4. [Work Product Components](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Generated/work-product-component) - A record that describes the business content of a single well log, such as the log data information and the top and bottom depths of the well log.
   * Here is the [list](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/E-R/work-product-component#supported-bulk-standards) of the supported bulk standards in OSDU.
5. [File](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Generated/dataset) - A record that describes metadata about a digital file, such as the file size and checksum of a well log, but does not describe the business content of the file.
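To make the grouping of these record types concrete, below is a skeletal manifest sketched as a Python dictionary. This is a minimal sketch: the kind version and the exact field names are assumptions based on the R3 manifest schema, so verify them against the Data Definitions repository before use.

```python
# Skeleton of an R3-style manifest. The kind version and field names are
# assumptions -- check the Data Definitions repository for the
# authoritative Manifest schema.
manifest = {
    "kind": "osdu:wks:Manifest:1.0.0",
    "ReferenceData": [],              # reference-data records; load these first
    "MasterData": [],                 # e.g. master-data--Well, master-data--Wellbore records
    "Data": {
        "WorkProduct": {},            # the work product record, e.g. a set of well logs
        "WorkProductComponents": [],  # e.g. work-product-component--WellLog records
        "Datasets": [],               # dataset--File records describing the physical files
    },
}
```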
### Getting the data into the OSDU Data Platform
Once you are familiar with the OSDU data types, you should know that there are several [data flow services](https://community.opengroup.org/osdu/documentation/-/wikis/Core-Services-Overview#data-flow-services-and-apis) available to bring data into the OSDU Data Platform. There are pros and cons to each approach, as detailed in the links below.
* [CSV Parser Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser) – Schlumberger has developed an ingestion workflow capable of parsing a CSV file into a schema and loading each entry as a record into OSDU. There is future work to enrich the flattened schema structure created by the CSV parser into an R3-style schema.
* [Manifest-based Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags) – The manifest ingestion workflow leverages a manifest schema defined by the Data Definitions team to bring data into the OSDU Data Platform.
* [WITSML Parser Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics-osdu-integration) – Energistics has created an ingestion workflow capable of parsing WITSML into R3 schema formats.
The fundamental idea in each of these ingestion methods is to trigger the [Storage service API](https://community.opengroup.org/osdu/platform/system/storage) to create the records. Alternatively, one can call the Storage service API directly to create the records, but note that this approach performs very little validation (it is very forgiving) and could lead to unexpected behavior.
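For illustration only, a direct Storage service call might look like the minimal Python sketch below. The base URL, ACL groups, legal tag, and record contents are all hypothetical placeholders and must match your partition's configuration; the sketch assumes the Storage v2 `PUT /api/storage/v2/records` endpoint, which accepts a JSON array of records.

```python
import requests

# Hypothetical environment values -- replace with your own.
OSDU_BASE_URL = "https://your-osdu-instance.example.com"
headers = {
    "Authorization": "Bearer <access-token>",
    "data-partition-id": "opendes",  # example data partition ID
    "Content-Type": "application/json",
}

# A minimal, hypothetical record. The Storage API performs only basic
# validation here -- it will accept records that a manifest ingestion
# workflow would reject, which is why this path is "forgiving".
record = {
    "kind": "osdu:wks:reference-data--UnitOfMeasure:1.0.0",
    "acl": {
        "viewers": ["data.default.viewers@opendes.example.com"],
        "owners": ["data.default.owners@opendes.example.com"],
    },
    "legal": {
        "legaltags": ["opendes-demo-legaltag"],
        "otherRelevantDataCountries": ["US"],
    },
    "data": {"Name": "metre"},
}

# Storage v2 accepts a JSON array of records in a single PUT call.
response = requests.put(
    f"{OSDU_BASE_URL}/api/storage/v2/records",
    headers=headers,
    json=[record],
)
print(response.status_code, response.json())
```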
This guide assumes you have access to a working OSDU environment; please contact your cloud service provider for access.
### Steps
In this quickstart guide, we will use the [open-test-data](https://community.opengroup.org/osdu/platform/open-test-data) repository to demonstrate the data loading process, focusing on one of the three methods described above: Manifest-based Ingestion.
**Manifest-based Ingestion**
1. Load reference data into the OSDU Data Platform
   * For the TNO example, the reference data [manifests](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/4-instances/TNO/reference-data) should first be loaded into the OSDU platform.
   * Any other missing reference data can be found in the OSDU community [reference data](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data) repository, which is maintained by the Data Definitions team.
2. Prepare the master/WP/WPC data manifest JSON files
   * Here is a set of [Python data preparation scripts](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/2-scripts) to help with the manifest generation.
   * You can either [learn](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-generate-manifests-using-scripts) to generate them from scratch with the [scripts](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/2-scripts) or use the ones that have been [generated](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/4-instances) from the scripts.
3. Load the master/WP/WPC data into the OSDU Data Platform
   * Send a POST request to `{OSDU_BASE_URL}/api/workflow/v1/workflow/Osdu_ingest/workflowRun` with the manifest JSON in the request body to trigger the workflow ingestion service, as shown in the example below:
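Here is a minimal Python sketch of such a request, assuming a valid bearer token and data partition ID. The `executionContext` payload shape and the `AppKey` value are assumptions; verify the exact request body against your deployment's Workflow service documentation.

```python
import json
import requests

# Hypothetical environment values -- replace with your own.
OSDU_BASE_URL = "https://your-osdu-instance.example.com"
ACCESS_TOKEN = "<bearer-token>"  # obtained from your identity provider
DATA_PARTITION = "opendes"       # example data partition ID

headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "data-partition-id": DATA_PARTITION,
    "Content-Type": "application/json",
}

# Load the manifest JSON prepared in step 2.
with open("manifest.json") as f:
    manifest = json.load(f)

# Assumed request body shape: the manifest travels inside the workflow's
# executionContext. Verify against your Workflow service documentation.
body = {
    "executionContext": {
        "Payload": {
            "AppKey": "quickstart-example",  # hypothetical app key
            "data-partition-id": DATA_PARTITION,
        },
        "manifest": manifest,
    }
}

response = requests.post(
    f"{OSDU_BASE_URL}/api/workflow/v1/workflow/Osdu_ingest/workflowRun",
    headers=headers,
    json=body,
)
response.raise_for_status()
print(response.json())  # typically includes a runId for tracking the workflow run
```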
### Common Tasks
This section runs through the common tasks in data loading and ingestion. Refer to the links in each item to dive deeper.
1. [How to generate manifests using scripts](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-generate-manifests-using-scripts) - by Yanbin Zhang [Chevron]
2. [How to load LAS data (Manifest-based Ingestion, metadata only, without Wellbore DDMS)](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-load-a-LAS-data-(Manifest-based-Ingestion)) - by Ivar Sørheim [Equinor]
3. [How to perform a CSV ingestion with the Dataset service](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/CSV-Ingestion) - by Chad Leong [SLB]
4. [How to perform a CSV ingestion with the File service](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/CSV-Ingestion-File-Service) - by Samiullah Ghousudeen [BP]
5. [How to set up sdutil for uploading seismic data in Windows](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/Step-by-Step-of-Setting-up-Python-Environment-for-sdutil-in-Windows) - by Chad Leong [SLB]
6. [How to load seismic data via Seismic DDMS - SEGY to oZGY (Storage service API)](https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-zgy-conversion/-/blob/master/doc/testing.md) - by Andras Szalai [EPAM]
7. [How to load seismic data via Seismic DDMS - SEGY to oZGY](https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-zgy-conversion/-/blob/master/doc/gcp/QUICKSTART.md) - by Yan Sushchynski [EPAM]
8. [How to load seismic data via Seismic DDMS - SEGY to oVDS (Storage service API)](https://community.opengroup.org/osdu/platform/data-flow/ingestion/segy-to-vds-conversion/-/blob/master/docs/gcp/QUICKSTART.md) - by Yan Sushchynski [EPAM]
9. [How to load WITSML data (WITSML Parser)](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-load-a-WITSML-data-(WITSML-Parser)) - by Kateryna Kurach [EPAM]
10. [How to load a generic file in OSDU using the File/Dataset service](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-load-a-generic-file-in-OSDU-using-File-or-Dataset-Service) - by Kateryna Kurach [EPAM]
11. [How to check for errors in an Airflow DAG](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-check-for-error-in-Airflow-DAG) - by Chad Leong [SLB]
12. [How to search for ingested records](https://community.opengroup.org/osdu/platform/system/search-service/-/blob/master/docs/tutorial/SearchService.md) (see the search sketch after this list)
13. [Troubleshooting Index Status of Data Ingested](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/Troubleshooting-Index-Status-of-Data-Ingested) - by Samiullah Ghousudeen [BP]
14. [Wellbore DDMS Data Loader Utility Quickstart Guide](https://community.opengroup.org/osdu/platform/data-flow/data-loading/wellbore-ddms-data-loader/-/wikis/Wellbore-DDMS-Data-Loader-Utility-Quickstart-Guide) - by Samiullah Ghousudeen [BP]
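As a companion to item 12 above, here is a minimal Python sketch of querying the Search service for ingested records. The endpoint follows the Search service tutorial linked above; the kind, query string, and environment values are illustrative assumptions.

```python
import requests

# Hypothetical environment values -- replace with your own.
OSDU_BASE_URL = "https://your-osdu-instance.example.com"
headers = {
    "Authorization": "Bearer <access-token>",
    "data-partition-id": "opendes",  # example data partition ID
    "Content-Type": "application/json",
}

# Example query: find up to 10 Well master-data records.
query = {
    "kind": "osdu:wks:master-data--Well:1.0.0",
    "query": "*",   # Lucene-style query string; "*" matches everything
    "limit": 10,
}

response = requests.post(
    f"{OSDU_BASE_URL}/api/search/v2/query",
    headers=headers,
    json=query,
)
response.raise_for_status()
for result in response.json().get("results", []):
    print(result["id"], result["kind"])
```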
## Bulk loading
Once the basic data loading concepts are understood, the next step is to bulk load the data available in your system. Refer to the link below for loading bulk data.
* [OSDU CLI Data Loader](https://community.opengroup.org/osdu/platform/data-flow/data-loading/osdu-cli)
## Worked examples
Here are some [worked examples](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Examples/WorkedExamples) prepared by the Data Definitions team for different entities.
## Domain Data Management Services (DDMS)
* [Seismic DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic) – This effort is part of the Seismic DMS work. Its ingestion workflow runs within the Workflow Service and, as such, is related to the overall data ingestion efforts.
* [Wellbore DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/wellbore) – Wellbore Domain Data Management Services, including type-safe entity access and optimized accessors for bulk data such as logs, trajectories, and checkshots.
* [Well Delivery DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/well-delivery) - Well Delivery and Well Construction related domain data management services.
* [Reservoir DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/reservoir) - Reservoir-related domain data management services to support static modeling data types as covered by RESQML, including seismic interpretations, structural models, and 2D and 3D property grids.
## Terms and Acronyms
| Term | Description |
|------|-------------|
| Airflow | Airflow is the designated workflow engine for OSDU. Airflow is used to schedule and orchestrate the different workflows in OSDU for data flow. Best practices can be found [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/wikis/Ingestion-DAG-Best-Practices). |
| Manifest | A manifest is a container specifically designed for facilitating metadata into the OSDU platform. As of this writing, a manifest has structures to support holding metadata records of the following types: Reference Data, Master Data, Work Product, Work Product Component, and Datasets. |
| Source Data | Source Data might be an Excel file, LAS/DLIS files, Seismic data, text files, data streams, databases, etc. One of the goals of OSDU is to support storing source data in its original format to preserve lineage. Metadata is created to allow this source data to remain in its original format yet remain searchable and discoverable within the platform. |
| DDMS | Domain Data Management Services. This provides a single consistent set of APIs and methods to access the data objects regardless of the domain workflow. Here's a [list of the DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services) currently being developed. |
## FAQs
### Referential Integrity Check - Failed Example
Note that referential integrity is not about Reference entities specifically, but about anything that is "referenced", such as one Master data record (a Well) referring to another Master record (an Organisation).
In the **provide_manifest_integrity_task** operator, you will see a warning log below:
```plaintext
{validate_referential_integrity.py:209} WARNING - Resource with kind osdu:wks:master-data--Well:1.0.0 and id: 'osdu:master-data--Well:1000' was rejected.
```
To address this, you need to ingest the missing referenced records before the ones that depend on them.
### Referential Integrity Check - Success Example
In the **process_manifest_task** operator, you will see the success log below:
```
[2021-07-05 22:52:20,294] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:22,087] {process_manifest_r3.py:173} INFO - Records 'osdu:reference-data--GeoPoliticalEntityType:Parish' were saved using Storage service.
[2021-07-05 22:52:22,087] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:22,790] {process_manifest_r3.py:173} INFO - Records 'osdu:reference-data--GeoPoliticalEntityType:Community' were saved using Storage service.
[2021-07-05 22:52:22,791] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:23,048] {process_manifest_r3.py:173} INFO - Records 'osdu:reference-data--GeoPoliticalEntityType:Prefecture' were saved using Storage service.
[2021-07-05 22:52:23,048] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:23,319] {process_manifest_r3.py:173} INFO - Records 'osdu:reference-data--GeoPoliticalEntityType:Principality' were saved using Storage service.
[2021-07-05 22:52:23,320] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:23,487] {process_manifest_r3.py:173} INFO - Records 'osdu:reference-data--GeoPoliticalEntityType:Province' were saved using Storage service.
[2021-07-05 22:52:23,487] {process_manifest_r3.py:165} INFO - Sending records to Storage service
[2021-07-05 22:52:23,682] {process_manifest_r3.py:173} INFO - Records
```
### Invalid ACL - Failed Example
If the ACL groups in a manifest record do not match the target tenant or domain, the **process_manifest_task** operator reports an error like the one below, and the rejected records are listed under `skipped_ids`:
```plaintext
[2022-03-14, 09:05:14 UTC] {process_manifest_r3.py:132} ERROR - Response status: 400. Response content: {"code":400,"reason":"Invalid ACL","message":"Acl not match with tenant or domain"}.
[2022-03-14, 09:05:14 UTC] {authorization.py:137} ERROR - {"code":400,"reason":"Invalid ACL","message":"Acl not match with tenant or domain"}
| skipped_ids | [{'id': 'osdu:master-data--Organisation:SLB', 'kind': 'osdu:wks:master-data--Organisation:1.0.0', 'reason': '400 Client Error: Bad Request for url: http://os-storage.osdu-services:8080/api/storage/v2/records'}] |
```
To address this, make sure the `acl.viewers` and `acl.owners` groups in the manifest records match the data partition you are loading into.
## Contributing
We welcome all kinds of contributions, including ideas, workflow requests, and documentation. The preferred way of submitting a contribution is to either open an issue on GitLab or submit a merge request.
Contents of the guide are contributed by different forum members.