[ADR-0009 of wg-data-architecture] Universal_Data_Content_ARRAY_of_Values_API
- The objective of this ADR is to propose a common API to access all types of optimized storage of Data Content Arrays of Values.
- The information required to specify the behaviour of this API should be available from the catalog (in the shared context).
- This API should be implemented to access the optimized storage of Content Arrays of Values on the supports provided by the diverse DDMSs (e.g. a parquet file, an oVDS file collection, PostgreSQL blobs) or from the catalog itself.
- The overall objective is to allow a given DDMS-1 to link the data values previously provided by another DDMS-0 on its preferred "DDMS-0 native" support to a DDMS-1 data content schema entity, and to read the values directly from the DDMS-0 support without copying them onto a DDMS-1 support.
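As an illustration of this linking, a DDMS-1 record could reference the DDMS-0 native dataset by URN instead of carrying a copy of the values. A minimal sketch, reusing the dataset URN from the WellLog example given later in this ADR; the entity kind and field layout here are assumptions, not a normative schema:

```
{
  "id": "namespace:work-product-component--ExampleDdms1Entity:0000aaaa-1111-2222-3333-444455556666",
  "data": {
    "DDMSDatasets": [
      "urn://wddms-3/uuid:20840361-adc0-4842-999b-5639bd07bb38"
    ]
  }
}
```

The DDMS-1 entity carries only the reference; the values themselves stay on the DDMS-0 support.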
Status
- Proposed
- Approved
- Implementing (incl. documenting)
- Testing
- Released
Context & Scope
Main objective: facilitate the delivery of the optimized information gathered in the OSDU platform through datasets and DDMSs.
OSDU aims to be a cross-domain platform. Some core entities like Well and Wellbore are relevant to many domains, which may want to associate domain specific properties with the entities on different DDMSs.
Today, a solution to associate DDMS-specific information with OSDU core entities is presented in https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/Guides/Chapters/93-OSDU-Schemas.md#appendix-d34-x-osdu-side-car-type-to-side-car-relationship
Goal 1: "Slim Entities". It is OSDU's goal to keep the shared context relevant to everybody (= domain independent) and unambiguous.
Goal 2: "Agile Domains". Domains must be empowered to promote change for the benefit of the domain without impacting all other domains. As a consequence, the schema must be split into a shared context for interoperability and a bounded, domain-specific context for the domain.
Side-Car Pattern for Schemas: the shared context is captured by the 'main', domain-independent entities as schema definitions (the 'motorbike' in the analogy). Bounded context from a domain is added by a side-car schema. The side-car entity extension refers to the shared context by id. This is illustrated in the following diagram: see OSDU_ARCHITECTURE/side_car_capture.png
The center column shows the shared context. Generic discoverability is provided via platform services like Search and GIS. Domain specific extensions are defined by the domains independently. Such extensions can use domain driven language, which may be ambiguous outside the bounded domain context. Often domains create their own Domain Data Management Services (DDMS). Such services understand the composition of shared and bounded contexts and can shield applications and users from the complexity of the side-car record implementations. This means that the DDMSs can return the combination of the bounded and shared contexts on queries.
It would also be valuable to ensure that a DDMS-2 can directly access Data Array Values attached to core entities defined in the shared context and generated by a DDMS-1. If this is not possible (as is the case today), an application faces one of two situations:
- The Data Array Values must be accessed through the API of DDMS-1, without any relationship to the DDMS-2 bounded context; the application then has to rebuild the relationships itself.
- The Data Array Values must first be read through the API of DDMS-1, then copied in another "shape" into DDMS-2 and attached to its own bounded context.
Difficulty one: these two situations are not satisfactory, and they are very common (Seismic DDMS <-> Reservoir DDMS, Well DDMS <-> Reservoir DDMS, Seismic DDMS <-> Well DDMS, RAFS DDMS -> Reservoir DDMS).
Difficulty two: today the core entities do not make the information describing Data Array Values properties mandatory (value types (boolean, integer, float, double, string), number of columns, size of columns). If an application does not deliver this information to the catalog, it is impossible for another application to read and use the values. Interoperability between applications is then impossible, because the DDMSs cannot deliver all the content.
This ADR intends to address these two difficulties.
Decision to be made
The shared context could fully define how all applications access Data Arrays of Values, in file data content as well as in DDMS data content. Data Arrays of Values are not difficult to describe, and we could deliver a Data Array Values API abstraction covering all file data content and all DDMS data content. Using this abstract API, every Data Array of Values, whether file-based or DDMS-based, would be accessible to a DDMS that was not at the origin of those values.
Note: the information required to access a Data Array of Values should be mandatory (see item 3/ of the method description).
Description of the proposed method when the Data Array Values are embedded in external content (and when we add an abstract column-based table to a WPC such as WellLog):
1/ The WPC designed to deliver the data content should have a link to a persistent support (e.g. a Dataset file such as a parquet file, a Dataset file collection such as an oVDS collection, the URI of a Dataset ETP dataspace, a URN, etc.).
2/ Inside this persistent dataset support, the Data Array of Values concerning this WPC is associated with the id of the WPC (e.g. "id": "namespace:work-product-component--WellLog:c2c79f1c-90ca-5c92-b8df-04dbe438f414").
3/ In addition, inside the WPC, the information describing the data content should be accessible: "ColumnName", "ValueType" (double, number, string, boolean), "ValueCount" (number of columns or dimensions), "ColumnSize" (number of values in the column). "ValueType" can be refined further (see the Energistics ETP V1.2 data types below). For each "ColumnName" we should also have "UnitOfMeasureID", "UnitQuantityID" and "PropertyTypeID".
IMPORTANT WARNING: we must find a way to impose that this information is present in the WPC (e.g. by enhancing the validation step during ingestion).
By default no more information is given, but this looks sufficient to proceed.
Example: for a WellLog WPC, here is the information to deliver into the catalog (the "=" remarks map fields to their Energistics ETP V1.2 equivalents):

    "id": "namespace:work-product-component--WellLog:c2c79f1c-90ca-5c92-b8df-04dbe438f414"
    "DDMSDatasets": [
      "urn://wddms-3/uuid:20840361-adc0-4842-999b-5639bd07bb38"
    ]
    {
      "ColumnName": "CO2-SAT-Fraction-VP",   = "Array metadata" in Energistics ETP V1.2
      "ValueType": "double",                 = "DataArrayType" in Energistics ETP V1.2
      "ValueCount": 1,
      "ColumnSize": 7,                       = "dimension" in Energistics ETP V1.2
      "UnitQuantityID": "namespace:reference-data--UnitQuantity:unitless:",
      "PropertyType": {
        "PropertyTypeID": "namespace:reference-data--PropertyType:8a9930de-6d50-4165-8bcd-8ddf2e6aa7fa:",
        "Name": "Co2 Volume Fraction"
      }
    }
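The mandatory-metadata warning above could be enforced at ingestion time. A minimal validation sketch, assuming the catalog record carries the fields shown in the WellLog example; the field names follow that example and are illustrative, not a normative schema:

```python
# Sketch of an ingestion-time check that the mandatory Data Array
# metadata is present on a WPC column descriptor (see the WellLog
# example above). Field names are illustrative, not normative.

ALLOWED_VALUE_TYPES = {"boolean", "integer", "float", "double", "number", "string"}
MANDATORY_FIELDS = {"ColumnName", "ValueType", "ValueCount", "ColumnSize"}

def validate_array_metadata(column: dict) -> list:
    """Return a list of validation errors for one column descriptor."""
    errors = []
    missing = MANDATORY_FIELDS - set(column.keys())
    if missing:
        errors.append("missing mandatory fields: %s" % sorted(missing))
    value_type = column.get("ValueType")
    if value_type is not None and value_type not in ALLOWED_VALUE_TYPES:
        errors.append("unsupported ValueType: %r" % value_type)
    return errors

# The descriptor from the WellLog example above passes the check.
descriptor = {
    "ColumnName": "CO2-SAT-Fraction-VP",
    "ValueType": "double",
    "ValueCount": 1,
    "ColumnSize": 7,
}
assert validate_array_metadata(descriptor) == []
```

A check of this shape could be plugged into the validation step of the ingestion workflow, rejecting WPCs whose Data Array metadata is incomplete.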
"Value Type" reference in Energistics ETP V1.2 documentation.
"Energistics.Etp.v12.Datatypes.ArrayOfBoolean", "Energistics.Etp.v12.Datatypes.ArrayOfNullableBoolean", "Energistics.Etp.v12.Datatypes.ArrayOfInt", "Energistics.Etp.v12.Datatypes.ArrayOfNullableInt", "Energistics.Etp.v12.Datatypes.ArrayOfLong", "Energistics.Etp.v12.Datatypes.ArrayOfNullableLong", "Energistics.Etp.v12.Datatypes.ArrayOfFloat", "Energistics.Etp.v12.Datatypes.ArrayOfDouble", "Energistics.Etp.v12.Datatypes.ArrayOfString", "Energistics.Etp.v12.Datatypes.ArrayOfBytes",
Note that an "existing" method is already available when the Data Array Values are small and should preferably be embedded in the catalog itself; it is restricted on "value type" to the original list. In that case a "ColumnValues" tag should be used with "number", "double", "string" or "boolean".
If this information is embedded in the shared context, we will be able to access the Data Content Arrays of Values.
Based on this information, we could provide the specification of an API to reference, write and read these Data Content Arrays of Values.
This API could then be used "internally" by all DDMSs to associate these values with their own data content schema (bounded context). Regardless of the context in which the Data Content Arrays of Values were stored, every DDMS could manage these data. As a first step, each ingestion DAG or DDMS could use this information to associate data content with the shared context.
DDMSs could use DataArray-specific services to transfer large binary arrays of homogeneous data values. For example, with Energistics domain standards (see ETP V1.2 protocol 9, page 287: https://docs.energistics.org/EO_Resources/ETP_Specification_v1.2_Doc_v1.1.pdf), this data is often stored as an HDF5 file.
This API could provide a DataArray transfer which:
- Supports arrays of values of different types (boolean, integer, float, double, string). This array data is typically associated with a data object (that is, it is the binary array data for the data object).
- Imposes no limit on the dimensions of the array: multi-dimensional arrays may have any number of dimensions.
- Was originally designed in the Energistics standard to support transfer of the data typically stored in HDF5 files, but can also be used to transfer this type of data when HDF5 files are not required or used (e.g. parquet files, oVDS file collections, PostgreSQL bulk data, time-series databases).
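The DataArray transfer described above can be expressed as an abstract interface that each DDMS implements for its own storage support. A minimal Python sketch, with an in-memory backend standing in for a real support (parquet, oVDS, blobs, ...); all names are illustrative, not a proposed OSDU service contract:

```python
from abc import ABC, abstractmethod

class DataArrayStore(ABC):
    """Abstract access to a Data Content Array of Values, independent
    of the storage support (HDF5, parquet, oVDS collection, blobs...)."""

    @abstractmethod
    def put_array(self, wpc_id, column_name, values):
        """Write the array of values associated with a WPC column."""

    @abstractmethod
    def get_array(self, wpc_id, column_name, start=0, count=None):
        """Read values, optionally a sub-range, so that large arrays
        can be transferred in chunks."""

class InMemoryStore(DataArrayStore):
    """Toy backend used only to illustrate the contract."""

    def __init__(self):
        self._data = {}

    def put_array(self, wpc_id, column_name, values):
        self._data[(wpc_id, column_name)] = list(values)

    def get_array(self, wpc_id, column_name, start=0, count=None):
        values = self._data[(wpc_id, column_name)]
        end = len(values) if count is None else start + count
        return values[start:end]

# Usage: DDMS-1 reads values that DDMS-0 wrote, through the common API.
wpc_id = "namespace:work-product-component--WellLog:c2c79f1c-90ca-5c92-b8df-04dbe438f414"
store = InMemoryStore()
store.put_array(wpc_id, "CO2-SAT-Fraction-VP", [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.0])
assert store.get_array(wpc_id, "CO2-SAT-Fraction-VP", start=2, count=3) == [0.2, 0.3, 0.2]
```

The point of the abstraction is that the consumer addresses the array by WPC id and column name only; the support-specific access path stays behind the implementation.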
Rationale
This proposal is based on experience gathered by the Energistics standards teams: an effective separation between the metadata about Arrays of Values (written in XML or JSON files) and the Arrays of Values themselves (written in a compressed binary format).
All DDMSs could interact with the catalog at the metadata level about a Data Content Array of Values. A DDMS can reference a Data Content Array of Values without copying it, and can benefit from the optimized access to those values developed by another DDMS.
Consequences
From a first query on the catalog, all Data Content Arrays of Values will be accessible, either directly or through a more sophisticated DDMS query.
This will not imply many changes on the shared-context data definition side (e.g. update the abstract column-based table and add it to every WPC which must handle Data Content Array Values). Some additional data definition effort may be necessary to cover all Data Content Arrays of Values handled by the diverse DDMSs. The APIs of the DDMSs themselves will not change, but the link between the shared context and each bounded context should be updated: every DDMS should deliver a way to reference, write and read its specific Data Content Arrays of Values from the information contained in the catalog (shared context).
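From a client's point of view, the "first query on the catalog, then direct access" flow could look like the following sketch. The record layout follows the WellLog example earlier in this ADR; the readers and their registration are hypothetical, standing in for real parquet / oVDS / blob implementations:

```python
from urllib.parse import urlparse

# Hypothetical per-support readers; a real deployment would register
# parquet / oVDS / blob implementations behind the same signature.
def read_from_wddms(urn, column_name):
    # Stand-in for a call to the DDMS that owns the dataset.
    return [0.0, 0.1, 0.2]

READERS = {"wddms-3": read_from_wddms}

def resolve_reader(urn):
    """Pick a reader from the storage support encoded in the URN,
    e.g. "wddms-3" in "urn://wddms-3/uuid:...". Illustrative only."""
    return READERS[urlparse(urn).netloc]

# Catalog record following the WellLog example in this ADR.
record = {
    "id": "namespace:work-product-component--WellLog:c2c79f1c-90ca-5c92-b8df-04dbe438f414",
    "DDMSDatasets": ["urn://wddms-3/uuid:20840361-adc0-4842-999b-5639bd07bb38"],
}
urn = record["DDMSDatasets"][0]
values = resolve_reader(urn)(urn, "CO2-SAT-Fraction-VP")
```

The consumer never copies the array onto its own support: it resolves the URN found in the shared context and reads through whichever DDMS owns the values.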