In its simplest depiction, a data environment can be described as three zones.
- The first is the loading (and ingestion and indexing) zone, which preserves data and format from the source system.
- The second is the data processing zone, which focuses on supporting intelligent discovery and enrichment through classification [COMMENT: Define classification], assessment [COMMENT: Define assessment. OSDU uses this term for a specific functional step of ingestion.] and aggregation [COMMENT: Define aggregation. There are multiple possible meanings.]
- The third is the consumption (or delivery) zone, which optimizes data content schema, format and entitlements for a particular consumption workflow step.
NOTE: This addresses transactional data at the OSDU work-product-component and file (OpenDES Entity Kind) level. Master data and reference data follow a similar but simpler data flow.
Data Flow for data content
- All source data originates from external systems
- It is secured/entitled to achieve isolation
- It is tagged and indexed for discovery with general and type-appropriate metadata property values
- It can be refined and enriched, producing new derivatives of the data
- It is exported with the quality, security and content data schema required by the consumer, often through the manifestation of a fit-for-purpose pond [COMMENT: Define pond.]
Further, the cycle can be repeated with the output of the data processing pipeline [COMMENT: Improve the terminology used here. data processing pipeline is not clear.] loaded, ingested, and indexed to produce derivative data that can be discovered and consumed by others in its own right.
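The flow above, including the repeat of the cycle on derivative data, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an OSDU interface: the Record class and the load, entitle, index, enrich, and export functions are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one data item; not an OSDU schema."""
    record_id: str
    content: bytes
    acl: frozenset = frozenset()                  # groups entitled to read
    metadata: dict = field(default_factory=dict)  # property values used for discovery
    derived_from: Optional[str] = None            # lineage: id of the parent record


def load(record_id: str, content: bytes) -> Record:
    # Loading keeps the source content unchanged.
    return Record(record_id, content)


def entitle(record: Record, groups) -> Record:
    # Entitlements isolate the record to the named groups.
    return replace(record, acl=frozenset(groups))


def index(record: Record, **tags) -> Record:
    # Tagging adds metadata property values for later discovery.
    return replace(record, metadata={**record.metadata, **tags})


def enrich(record: Record, new_id: str,
           transform: Callable[[bytes], bytes]) -> Record:
    # A derivative is a new record; the parent is left untouched.
    return Record(new_id, transform(record.content), record.acl,
                  dict(record.metadata), derived_from=record.record_id)


def export(record: Record, requester_groups) -> bytes:
    # Export only to consumers that share at least one entitled group.
    if not record.acl & frozenset(requester_groups):
        raise PermissionError(f"no entitlement for {record.record_id}")
    return record.content


raw = index(entitle(load("log-1", b"depth,gr\n"), {"team-a"}), kind="well-log")
derived = enrich(raw, "log-1-upper", bytes.upper)  # the cycle repeats on the output
```

The derivative keeps a pointer to its parent, so it can be discovered and consumed in its own right while its origin remains traceable.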
LOADING, INGESTION, and INDEXING ZONE
Loading, ingestion, and indexing is the act of absorbing information from outside OSDU. It can be implemented either as exposing data (creating a reference to externally hosted data that will have OSDU-compatible behavior) or as adding data (creating materialized data content) to OSDU. The process focuses on minimizing friction and maximizing the amount of information that can be captured. The act of loading, ingestion, and indexing, and the logical layer representing transfer to OSDU, should not be mistaken for a single implementation (a zone or a framework). [COMMENT: What is meant here by 'a zone or a framework'?] Loading, ingestion, and indexing actions represent a contract with well-defined rules. Every entry of data into OSDU should follow this contract; otherwise, concerns such as compliance and lineage cannot be assured. [COMMENT: More than these concerns. Strengthen this statement.]
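One way to picture "a contract with well-defined rules" is a single validation step that every entry path must pass, whichever component implements it. The sketch below is an assumption made for illustration: the IngestRequest fields, the validate function, and the specific rules (a declared source system for lineage, a right-of-use statement for compliance, and exactly one of materialized content or an external reference) are hypothetical and are not taken from an OSDU specification.

```python
from dataclasses import dataclass
from typing import Optional


class ContractViolation(ValueError):
    """Raised when an entry does not satisfy the (hypothetical) contract."""


@dataclass(frozen=True)
class IngestRequest:
    kind: str                           # type of the data being entered
    source_system: str                  # assumed lineage rule: origin must be stated
    right_of_use: str                   # assumed compliance rule: policy must be stated
    content: Optional[bytes] = None     # set when adding (materializing) data
    external_uri: Optional[str] = None  # set when exposing externally hosted data


def validate(request: IngestRequest) -> str:
    """Return "add" or "expose", or raise ContractViolation."""
    required = ("kind", "source_system", "right_of_use")
    missing = [name for name in required if not getattr(request, name)]
    if missing:
        raise ContractViolation("missing: " + ", ".join(missing))
    has_content = request.content is not None
    has_uri = request.external_uri is not None
    if has_content == has_uri:
        # Both set or neither set: the mode of entry is ambiguous.
        raise ContractViolation("provide exactly one of content or external_uri")
    return "add" if has_content else "expose"
```

Because the check is a function of the request alone, it can sit in front of any loader or framework, which matches the point that the contract is a logical layer and not one implementation.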
There are two key concerns in the Ingestion Zone:
Governance is focused on understanding the policies around Right of Use and ensuring that these policies are honored. [COMMENT: Reconsider the term 'Governance' with a more precise term for the functionality, e.g. access entitlement.]
The act of ingestion extracts enough additional or improved metadata property values from the actual data content (or elsewhere) to reach a metadata status that ensures the data content can be discovered, consumed, and improved in the future. Frequently the act of ingestion triggers an enrichment activity, which can process the data content to form new representations. [COMMENT: What does the following phrase mean: "with a normalize Frame of Reference and schema to facilitate search"?] [NOTE: OSDU previously used the word enrichment to represent a type of ingestion step similar to extraction, but open to developing better metadata from sources other than the data content.]
OSDU supports the discovery and consumption of (access to) existing OSDU information as well as the creation of derivative information that is enriched through processes that anyone can introduce and configure for execution within OSDU. Common enrichment processes include quality assurance (checking data content suitability for a defined process), mastering (composing information from multiple types of data content items), and the creation of new data content from the federation of existing sources. [COMMENT: What does 'federation of existing sources' mean?]
Discovery is the ability to find data in OSDU based on different perspectives (criteria), using metadata that was captured during ingestion. [COMMENT: Why did the earlier version also mention captured during enrichment? On the other hand, is Discover open to looking at non-indexed source data?]
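Finding data by criteria over captured metadata can be illustrated with a small inverted index. This is a sketch only: MetadataIndex and its add and find methods are hypothetical names, and it assumes metadata values are hashable and that all criteria must match (a logical AND).

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index from (property, value) pairs to record ids."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, record_id, metadata):
        # Called at ingestion time with the metadata captured for the record.
        for key, value in metadata.items():
            self._index[(key, value)].add(record_id)

    def find(self, **criteria):
        # Return ids of records that match every criterion.
        if not criteria:
            return set()
        matches = [self._index.get((key, value), set())
                   for key, value in criteria.items()]
        return set.intersection(*matches)


idx = MetadataIndex()
idx.add("log-1", {"kind": "well-log", "basin": "north"})
idx.add("log-2", {"kind": "well-log", "basin": "south"})
idx.add("seis-1", {"kind": "seismic", "basin": "north"})
north_logs = idx.find(kind="well-log", basin="north")
```

Each keyword argument is one perspective on the data; combining them narrows the result without touching the data content itself.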
Enrichment is the act of creating new data from existing and making this new data available in OSDU. Enrichment can happen at any time data is in OSDU. It can be triggered during ingestion functions, by explicit application request, or as part of a consumption function.
Data Flow for master and reference data
- All master and reference data originates as type-specific data model instance records from external systems
- It is secured/entitled to achieve isolation
- It uses its data model property values and indexing for discovery with general and type-appropriate data values.
- It can be refined and enriched, producing new derivatives of the master and reference data with appropriate data version creation entitlements.
- It is exported with the quality, security and content data schema required by the consumer, often through the manifestation of a fit-for-purpose pond [COMMENT: Define pond if this applies to master and reference data.]
Data is kind of important in a data ecosystem
This is the functional view of data in the data ecosystem. It describes how new data types are added and represented within the system.
Consumption is the act of using information. Like ingestion, consumption should not be mistaken for a single implementation (a zone or a framework); it is a logical layer, present at all exit points of the system and governed by consumption contracts.
Consume is the act of using data in OSDU. It can be as simple as using data in place or as complicated as setting up a specific schema/security model/representation that is fit for purpose for the consuming application.
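The two ends of that range can be contrasted in a short sketch. The function names consume_in_place and consume_as, the field names, and the idea of passing a set of visible fields are assumptions made for illustration; they do not describe an OSDU API.

```python
def consume_in_place(record):
    # Simplest case: the consumer reads the record as stored.
    return record


def consume_as(record, field_map, visible):
    """Build a fit-for-purpose view of a record.

    field_map maps stored field names to the consumer's field names.
    visible is the set of stored fields the consumer is entitled to see.
    """
    return {target: record[source]
            for source, target in field_map.items()
            if source in visible and source in record}


stored = {"WellName": "A-1", "TotalDepth": 3200, "Operator": "X"}
view = consume_as(
    stored,
    field_map={"WellName": "name", "TotalDepth": "td_m", "Operator": "operator"},
    visible={"WellName", "TotalDepth"},
)
```

The stored record is never modified; the view is rebuilt for each consuming application, so schema and security choices stay specific to that consumer.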