Update links that were moved - OSDU Data Platform Data Loading Quick Start Guide authored by Débora Barretto
@@ -14,7 +14,7 @@ These are the important links/documents that, you should first try to read and b
1. [OSDU Schema Usage Guide](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/README.md) - This is the latest schema usage guide that you should be familiar with to better understand the data schema structure.
2. [OSDU Data Definitions](https://community.opengroup.org/osdu/data/data-definitions) - This page contains the latest data definitions schema and reference values.
3. [OSDU Core Services](https://community.opengroup.org/osdu/documentation/-/wikis/Core-Services-Overview) - This contains the latest core services API and documentation that the OSDU platform supports. 3. [OSDU Core Services](https://community.opengroup.org/groups/osdu/platform/-/wikis/Core-Services-Overview) - This contains the latest core services API and documentation that the OSDU platform supports.
## Overview
@@ -24,6 +24,14 @@ This data loading guide attempts to describe the latest practices for ingesting
This document addresses end-to-end data loading from the perspective of the end user, who in most cases is a member of the information management or data platform capabilities team. Hence, this guide assumes that the end user has some basic technical knowledge of HTTP web APIs, JSON data structures, and Python.
## Terms and Acronyms
| Term | Description |
|------|-------------|
| Airflow | Airflow is the designated workflow engine for OSDU. Airflow is used to schedule and orchestrate the different workflows in OSDU for data flow. Best practices can be found [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/wikis/Ingestion-DAG-Best-Practices). |
| Manifest | A manifest is a container specifically designed for facilitating metadata into the OSDU platform. As of this writing, a Manifest has structures to support holding metadata records of the following types: Reference Data, Master Data, Work Product, Work Product Component, and Datasets. |
| Source Data | Source Data might be an Excel file, LAS/DLIS files, Seismic data, text files, data streams, databases, etc. One of the goals of OSDU is to support storing source data in its original format to preserve lineage. Metadata is created to allow this source data to remain in its original format yet remain searchable and discoverable within the platform. |
| DDMS | Domain Data Management Services. This provides a single consistent set of APIs and methods to access the data objects regardless of the domain workflow. Here's a [list of the DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services) currently being developed. |
## Introduction
The Data Flow workstream covers the full end-to-end process of bringing data into the OSDU platform. The Data Ingestion workstream is part of the Data Loading workstream.
@@ -59,11 +67,11 @@ These are the main OSDU data types:
### Getting the data into the OSDU Data Platform
Once you are familiar with the OSDU data types, you must understand that there are several [data flow services](https://community.opengroup.org/osdu/documentation/-/wikis/Core-Services-Overview#data-flow-services-and-apis) to bring the data into the OSDU data platform. There are pros & cons in each approach as detailed in each section link below. Once you are familiar with the OSDU data types, you must understand that there are several [data flow services](https://community.opengroup.org/groups/osdu/platform/-/wikis/Core-Services-Overview#data-flow-services-and-dags) to bring the data into the OSDU data platform. There are pros & cons in each approach as detailed in each section link below.
* [CSV Parser Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser) – Schlumberger has developed an ingestion workflow capable of parsing a CSV file into a schema and loading each entry as a record into OSDU. There is future work to enrich the flattened schema structure created by the CSV parser into an R3-style schema.
* [Manifest-based Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags) – The manifest ingestion workflow uses a manifest schema defined by the Data Definitions team to load data into the OSDU Data Platform.
* [WITSML Parser Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics-osdu-integration) – Energistics have created an ingestion workflow capable of parsing WITSML into R3 schema formats. * [WITSML Parser Ingestion](https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser) – Energistics have created an ingestion workflow capable of parsing WITSML into R3 schema formats.
The fundamental idea in each of these ingestion methods is to call the [storage service API](https://community.opengroup.org/osdu/platform/system/storage) to create the records. Alternatively, one can call the storage service API directly to create the records, but note that this approach is very forgiving and could lead to unexpected behavior.
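As a rough illustration of the direct route, the sketch below builds (but does not send) a request that would create one record through the storage service. The `/api/storage/v2/records` path, the `data-partition-id` header, the record fields, and every example value (base URL, token, partition, kind, ACL groups, legal tag) are assumptions for illustration and are not taken from this guide; verify them against the storage service documentation for your environment.

```python
import json
import urllib.request


def build_storage_put_request(base_url, token, partition, records):
    """Build a PUT request that would create records via the storage service."""
    # Assumed endpoint path; check it against your deployment.
    url = f"{base_url.rstrip('/')}/api/storage/v2/records"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "data-partition-id": partition,  # assumed header name
    }
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Hypothetical record; all values below are placeholders.
example_record = {
    "kind": "osdu:wks:master-data--Well:1.0.0",
    "acl": {
        "viewers": ["data.default.viewers@opendes.example.com"],
        "owners": ["data.default.owners@opendes.example.com"],
    },
    "legal": {
        "legaltags": ["opendes-example-legal-tag"],
        "otherRelevantDataCountries": ["US"],
    },
    "data": {"FacilityName": "Example Well"},
}

storage_request = build_storage_put_request(
    "https://osdu.example.com", "<access-token>", "opendes", [example_record]
)
# To actually send it: urllib.request.urlopen(storage_request)
```

Because this path skips the ingestion workflows, anything those workflows would have checked before writing is left to the caller.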
@@ -78,16 +86,16 @@ This guide assumes you have access to a working OSDU environment, please contact
### Steps
In this quickstart guide, we will use the [open-test-data](https://community.opengroup.org/osdu/platform/open-test-data) to demonstrate the steps above. In this example, we describe one of the three methods described above - Manifest-based Ingestion. In this quickstart guide, we will use the [open-test-data](https://community.opengroup.org/osdu/data/open-test-data) to demonstrate the steps above. In this example, we describe one of the three methods described above - Manifest-based Ingestion.
* **Manifest-based Ingestion**
1. Load reference data in the OSDU data platform
* For TNO example, the reference data [manifests](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/4-instances/TNO/reference-data) should first be loaded into the OSDU platform. * For TNO example, the reference data [manifests](https://community.opengroup.org/osdu/data/open-test-data/-/tree/master/rc--3.0.0/4-instances/TNO/reference-data) should first be loaded into the OSDU platform.
* Any other missing reference data can be found in the OSDU community [reference data](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/ReferenceValues/Manifests/reference-data). This repository is maintained by the Data Definitions team.
2. Prepare the master/WPC data manifests JSON
* Here is a set of [Python data preparation scripts](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/2-scripts) to help with the manifest generation. * Here is a set of [Python data preparation scripts](https://community.opengroup.org/osdu/data/open-test-data/-/tree/master/rc--3.0.0/2-scripts) to help with the manifest generation.
* You can either [learn](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-generate-manifests-using-scripts) to generate them from scratch with the [scripts](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/2-scripts) or use the ones that have been [generated](https://community.opengroup.org/osdu/platform/open-test-data/-/tree/master/rc--3.0.0/4-instances) from the scripts. * You can either [learn](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-generate-manifests-using-scripts) to generate them from scratch with the [scripts](https://community.opengroup.org/osdu/data/open-test-data/-/tree/master/rc--3.0.0/2-scripts) or use the ones that have been [generated](https://community.opengroup.org/osdu/data/open-test-data/-/tree/master/rc--3.0.0/4-instances) from the scripts.
3. Load master/WP/WPC data in the OSDU data platform
* Send a POST request to `{OSDU_BASE_URL}/api/workflow/v1/workflow/Osdu_ingest/workflowRun` with the manifest JSON in the request body to trigger the workflow ingestion service, as shown in the example below:
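A minimal Python sketch of this request, under stated assumptions: the `executionContext` / `Payload` / `manifest` body layout, the `data-partition-id` header, and all example values (base URL, token, partition, `AppKey`, the empty manifest) are illustrative and not taken from this guide. It only builds the request; sending it is left as a commented-out line.

```python
import json
import urllib.request

WORKFLOW_NAME = "Osdu_ingest"  # workflow name from the endpoint above


def build_workflow_run_request(base_url, token, partition, manifest):
    """Build the POST request that would trigger a manifest ingestion run."""
    url = (
        f"{base_url.rstrip('/')}/api/workflow/v1/workflow/"
        f"{WORKFLOW_NAME}/workflowRun"
    )
    # Assumed body layout; confirm it against your workflow service version.
    payload = {
        "executionContext": {
            "Payload": {"AppKey": "test-app", "data-partition-id": partition},
            "manifest": manifest,
        }
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "data-partition-id": partition,  # assumed header name
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical, empty manifest used only to show the shape of the call.
example_manifest = {
    "kind": "osdu:wks:Manifest:1.0.0",
    "ReferenceData": [],
    "MasterData": [],
    "Data": {},
}

workflow_request = build_workflow_run_request(
    "https://osdu.example.com", "<access-token>", "opendes", example_manifest
)
# To actually send it: urllib.request.urlopen(workflow_request)
```

In practice the manifest would be one of the JSON files prepared in step 2, loaded with `json.load` and passed in place of `example_manifest`.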
@@ -225,13 +233,13 @@ This section runs through the common tasks in data loading and ingestions. Refer
9. [How to check for error in Airflow Dag](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/How-to-check-for-error-in-Airflow-DAG) - by Chad Leong [SLB]
10. [How to search for ingested record](https://community.opengroup.org/osdu/platform/system/search-service/-/blob/master/docs/tutorial/SearchService.md)
11. [Troubleshooting Index Status of Data Ingested](https://community.opengroup.org/groups/osdu/platform/data-flow/data-loading/-/wikis/Troubleshooting-Index-Status-of-Data-Ingested) - by Samiullah Ghousudeen [BP]
12. [Wellbore DDMS Data Loader Utility Quickstart guide](https://community.opengroup.org/osdu/platform/data-flow/data-loading/wellbore-ddms-data-loader/-/wikis/Wellbore-DDMS-Data-Loader-Utility-Quickstart-Guide) - by Samiullah Ghosudeen [BP] 12. [Wellbore DDMS Data Loader Utility Quickstart guide](https://community.opengroup.org/osdu/ui/data-loading/wellbore-ddms-data-loader/-/wikis/Wellbore-DDMS-Data-Loader-Utility-Quickstart-Guide) - by Samiullah Ghosudeen [BP]
## Bulk loading
Once the basic data loading concept is understood, the next step is to bulk load the data available in your system. Refer to the link below for loading bulk data.
OSDU CLI Data Loader (https://community.opengroup.org/osdu/platform/data-flow/data-loading/osdu-cli) OSDU CLI Data Loader (https://community.opengroup.org/osdu/ui/data-loading/osdu-cli)
## Worked examples
@@ -276,6 +284,8 @@ Here are some [worked examples](https://community.opengroup.org/osdu/data/data-d
## Domain Data Management Services
[Domain Data Management Services - Homepage](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/home)
[Seismic DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic) – This effort is a part of the Seismic DMS efforts. This workflow runs within the Workflow Service, and as such is related to the overall data ingestion efforts. Here is the end-to-end workflow:
![image__2_](uploads/ddfb5fc89ffb26723f446d1b7a34e792/image__2_.png)
@@ -286,15 +296,6 @@ Here are some [worked examples](https://community.opengroup.org/osdu/data/data-d
[Reservoir DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/reservoir) - Reservoir related domain data management services, to support static modeling data types as covered by RESQML, incl. seismic interpretations, structural models, 2D and 3D property grids.
## Terms and Acronyms
| Term | Description |
|------|-------------|
| Airflow | Airflow is the designated workflow engine for OSDU. Airflow is used to schedule and orchestrate the different workflows in OSDU for data flow. Best practices can be found [here](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/wikis/Ingestion-DAG-Best-Practices). |
| Manifest | A manifest is a container specifically designed for facilitating metadata into the OSDU platform. As of this writing, a Manifest has structures to support holding metadata records of the following types: Reference Data, Master Data, Work Product, Work Product Component, and Datasets. |
| Source Data | Source Data might be an Excel file, LAS/DLIS files, Seismic data, text files, data streams, databases, etc. One of the goals of OSDU is to support storing source data in its original format to preserve lineage. Metadata is created to allow this source data to remain in its original format yet remain searchable and discoverable within the platform. |
| DDMS | Domain Data Management Services. This provides a single consistent set of APIs and methods to access the data objects regardless of the domain workflow. Here's a [list of the DDMS](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services) currently being developed. |
## FAQs
### Referential Integrity Check - Failed Example