File DMS
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
We have a few things that are specific to managing data that is stored in a File:
- We need to be able to get the information a cloud SDK requires to upload a File to a particular cloud provider's object storage. As it stands today, the getFileUploadLocation endpoint generates a signed URL for a Folder, which creates security issues for some cloud providers. Instead, the FileUploadLocation should be an object containing the data required to work with that particular cloud provider's object storage SDK.
- We need to be able to register a Dataset for the File once it is uploaded using the cloud provider's SDK. This endpoint is a convenience method on top of the Dataset Registry service, which applies standard File Dataset semantics.
- We need to be able to provide access to a File stored on a cloud provider's object storage via the cloud provider's SDK. For example, OpenVDS needs certain information to access a SegY File stored on AWS S3, so that it can create the additional files it produces as part of its data enrichment process. Thus, based on a Dataset Registry for a SegY File, OpenVDS would need Delivery Instructions that can be used along with the cloud provider's SDK to access the File on the cloud provider's object storage.
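To make the first point concrete: instead of a bare signed URL string, the FileUploadLocation could be a structured object carrying what a provider SDK needs. A minimal sketch follows; every field and function name below is an assumption for illustration (AWS shown purely as an example), not part of any published schema.

```python
from dataclasses import dataclass

# Hypothetical shape of a FileUploadLocation object; the actual schema
# would be defined per cloud provider.
@dataclass
class FileUploadLocation:
    provider: str            # e.g. "aws", "azure", "gcp"
    region: str              # region hosting the object store
    bucket: str              # bucket/container the client may write to
    key: str                 # object key reserved for this upload
    access_key_id: str       # short-lived credentials scoped to this key
    secret_access_key: str
    session_token: str
    expires_in_seconds: int  # credential lifetime

def to_boto3_kwargs(loc: FileUploadLocation) -> dict:
    """Map the location onto the keyword arguments a boto3-style S3
    client constructor accepts (illustrative mapping only)."""
    return {
        "region_name": loc.region,
        "aws_access_key_id": loc.access_key_id,
        "aws_secret_access_key": loc.secret_access_key,
        "aws_session_token": loc.session_token,
    }
```

A client would then construct its S3 client from these kwargs and upload to `loc.bucket` / `loc.key`, with the scoped credentials removing the need for a Folder-wide signed URL.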
All of these are core concepts for a DMS (Data Management System): we need to be able to read and write data to a particular data storage mechanism, all through the same interface. The File DMS brings together everything a client application needs to interface with a File stored on a cloud provider's object storage, with access managed via the cloud provider's SDK.
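The "same interface" idea can be sketched as a small protocol, assuming the three capabilities above map onto one method each. All names here are illustrative, not a published API.

```python
from typing import Any, Protocol, runtime_checkable

# Hypothetical surface of the File DMS; method names are assumptions.
@runtime_checkable
class FileDMS(Protocol):
    def get_file_upload_location(self) -> Any:
        """Return the object a cloud SDK needs to upload a File."""
        ...

    def register_file_dataset(self, upload_location: Any) -> str:
        """Register a Dataset for the uploaded File via the Dataset
        Registry and return the Dataset Registry Record ID."""
        ...

    def get_delivery_instructions(self, record_id: str) -> Any:
        """Return what a cloud SDK (e.g. one driven by OpenVDS) needs
        to read the File from object storage."""
        ...
```

Each cloud provider would supply its own implementation behind this one client-facing shape.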
The File DMS was first proposed by Schlumberger. Unfortunately, that proposal did not resolve the issue of the FileUploadLocation being a signed URL instead of an object. In addition, the concept of a Dataset Registry had not yet been introduced. Now that it has, it makes sense to bring all of these concepts together.
The File DMS is being proposed as a Core Service because it is data-workflow agnostic. The Data Flow project should focus on how to build data workflows with the framework provided by the Core Services.
Key Use Cases
CSV
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store CSV file using the cloud provider SDK for their object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the CSV data using the Data Workflow service (code available in GitLab), given the Dataset Registry Record ID for the CSV file as an input parameter; this will execute the “CSV ingestion DAG”. That DAG is not there yet, but with a little help from SLB I think we can get it there
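The four steps above can be sketched end to end. The client objects and method names below are stand-ins for the real File DMS, Dataset Registry, and Data Workflow APIs, which this document does not specify.

```python
def ingest_csv(file_dms, object_store, dataset_registry, data_workflow,
               local_path: str) -> str:
    """Run the CSV use case end to end and return the Dataset Registry
    Record ID. All collaborators are hypothetical client objects."""
    # 1. Get the upload location object from the File DMS.
    location = file_dms.get_file_upload_location()
    # 2. Store the CSV file using the cloud provider's SDK.
    object_store.upload(local_path, location)
    # 3. Register the Dataset for the File; get back the record ID.
    record_id = dataset_registry.register_file_dataset(location)
    # 4. Trigger the "CSV ingestion DAG" with the record ID as input.
    data_workflow.start("csv_ingestion",
                        dataset_registry_record_id=record_id)
    return record_id
```

The WITSML flow below is identical in shape; only the file format and the DAG name change.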
WITSML
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store WITSML file using the cloud provider SDK for their object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the WITSML data using the Data Workflow service (code available in GitLab), given the Dataset Registry Record ID for the WITSML file as an input parameter; this will execute the “WITSML ingestion DAG”. That DAG is not there yet, but with a little help from Energistics I think we can get it there
SegY
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store SegY file using the cloud provider SDK for their object storage
- Register the Dataset (for the SegY File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- Store Manifest file (describing the SegY file) using the cloud provider SDK for their object storage
- Register the Dataset (for the Manifest File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the SegY data using the Data Workflow service (code available in GitLab); this will execute the “SegY Manifest ingestion DAG”. We already have this DAG, based on the EPAM data loader script
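The SegY variant differs from CSV and WITSML only in that two files are uploaded and two Datasets are registered before the DAG runs. Sketched with the same hypothetical client objects as above:

```python
def ingest_segy(file_dms, object_store, dataset_registry, data_workflow,
                segy_path: str, manifest_path: str):
    """SegY use case: two uploads, two Dataset registrations, one DAG
    run. All collaborators are hypothetical client objects."""
    def upload_and_register(path):
        location = file_dms.get_file_upload_location()
        object_store.upload(path, location)
        return dataset_registry.register_file_dataset(location)

    segy_id = upload_and_register(segy_path)          # the SegY file itself
    manifest_id = upload_and_register(manifest_path)  # Manifest describing it
    # The "SegY Manifest ingestion DAG" already exists, per the EPAM
    # data loader script; how it receives the record IDs is assumed here.
    data_workflow.start("segy_manifest_ingestion",
                        dataset_registry_record_ids=[segy_id, manifest_id])
    return segy_id, manifest_id
```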
Decision
- Introduce the File DMS as a new Core Service
- Deprecate the current File Service
- Remove the File capabilities from the generic Delivery Service
Rationale
A number of issues have been raised about the File Service and the use of FileID throughout the Core Services; this proposal addresses all of those issues with minimal impact on the overall system.
Consequences
If this proposal is not accepted, each cloud provider will end up with its own implementation of the File Service, and no standard is possible as currently defined.
Tradeoff Analysis - Input to decision
As we move more into R3 and various DMS and DDMS components, this problem is going to be harder to resolve. We need to act now in order to avoid significant rework in the future.
Decision criteria and tradeoffs
The decision is simple: act to resolve the issue identified, or do nothing.
Decision timeline
Code is already in open source.