Note: The following documentation still uses SLB's vocabulary and has not yet been updated to reflect OSDU's.
How DDMS Fits into the Data Ecosystem
In its simplest depiction, the DELFI Data Ecosystem comprises Domain Data Management Services (referred to as DDMS below), Core Services, and a number of Mandatory and Helper services. Domain Data Management Services are contributed and managed by the domains. The Data Ecosystem contributes the Core services, used for registration and data discovery; the Mandatory services, used to enforce additional attributes reflected in the architectural principles; and the Helper services, used to implement cross-cutting, non-mandatory functionality of the Domain Data Management Services.
A conceptual implementation of the DELFI Data Ecosystem is a hub with a centralized deployment of Core Services (the HUB service below) and different Domain Data Management Services leveraging re-deployable Mandatory and Helper services. This enables data sharing across workflows and facilitates the enforcement of compliance, security, and data lineage.
All other aspects of the domain data will be implemented by the DDMS, including how data is stored, merge/enrichment rules, and consumption patterns.
A DDMS meets the following requirements, further classified into capability, architectural, operational and openness/extensibility requirements:
| # | Requirement | Description | Classification |
|---|---|---|---|
| 1 | Data can be ingested with low friction | Need to integrate seamlessly with systems of record, starting with the industry standards | Capability |
| 2 | New data is available in workflows with minimal latency | Deliver new data in the context of the end-user workflow, seamlessly and fast | Capability |
| 3 | Domain data and services are highly usable | The business anticipates a large set of use cases where domain data is used in various workflows; consumption must be simple and efficient | Capability |
| 4 | Scalable performance for E&P workflows | E&P data has specific access requirements, well beyond standard cloud storage; scaling E&P data requires E&P workflow experience and insight | Capability |
| 5 | Data is available for visual analytics and discovery (Viz/BI) | Deliver a minimum set of visualization capabilities on the data | Capability |
| 6 | One source of truth for data | Drive towards reduction of duplication | Capability |
| 7 | Data is secured, and access governed | Securely stored and managed | Architectural |
| 8 | All data is preserved and immutable | Ability to associate data with milestones and have data/workflows traceable across the ecosystem | Architectural |
| 9 | Data is globally identifiable | No risk of overwriting or creating non-unique relationships between data and activities | Architectural |
| 10 | Data lineage is tracked | Required for auditability, re-creation of the workflow, and learning from work previously done | Architectural |
| 11 | Data is discoverable | Possible to find and consume back ingested data | Architectural |
| 12 | Provisioning | Efficient provisioning of the DDMS and automatic integration with the Data Ecosystem | Operational |
| 13 | Business Continuity | Deliver on industry expectations for business continuity (RPO, RTO, SLA) | Operational |
| 14 | Cost | Cost-efficient delivery of data | Operational |
| 15 | Auditability | Deliver the forensics required to support cyber security incident investigations | Operational |
| 17 | Domain-Centric Data APIs | | Openness and Extensibility |
| 18 | Workflow composability and customizations | | Openness and Extensibility |
| 19 | Data-Centric Extensibility | | Openness and Extensibility |
Data can be ingested with low friction
Ingestion from on-prem or hosted systems (file storage, databases, etc.) must be made low-friction through the use of dropfiles, scripts, and/or agents (a minimal agent sketch follows the list below).
The DDMS prioritizes support for ingestion workflows from:
Systems of Record and Application stores
E&P industry standards (see Appendix C with examples)
3rd party Systems of Record and Application stores
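To make the dropfile/agent pattern above concrete, here is a minimal sketch of a drop-folder agent, assuming a hypothetical directory layout and ingestion call; it is an illustration, not a prescribed implementation.

```python
# Hypothetical drop-folder agent: files placed in WATCH_DIR are submitted for
# ingestion and then moved aside, so source systems only need to write files.
import shutil
import time
from pathlib import Path

WATCH_DIR = Path("/data/dropzone")   # assumed, agreed drop location
DONE_DIR = WATCH_DIR / "done"        # processed files end up here

def submit_for_ingestion(path: Path) -> None:
    # Placeholder: in practice this would call the DDMS ingestion API.
    print(f"submitting {path.name} for ingestion")

def watch(poll_seconds: float = 5.0) -> None:
    DONE_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for path in WATCH_DIR.glob("*"):
            if path.is_file():
                submit_for_ingestion(path)
                shutil.move(str(path), str(DONE_DIR / path.name))
        time.sleep(poll_seconds)
```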
New data is available in workflows with minimal latency
The latency from the point of data ingestion (or data generation) to the data being available in workflows should be minimal. Consequently, a DDMS should support multiple access patterns: a small amount of data becomes available for consumption instantly; a large amount of data becomes available for consumption incrementally or, if domain logic demands that it all be available at once, after a reasonable time interval.
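A minimal sketch of how a DDMS might model these three availability patterns; the mode names and status fields are assumptions for illustration only.

```python
from enum import Enum

class AvailabilityMode(Enum):
    INSTANT = "instant"          # small data: consumable immediately after ingestion
    INCREMENTAL = "incremental"  # large data: consumable chunk by chunk as it lands
    ATOMIC = "atomic"            # domain logic demands all-or-nothing availability

def is_consumable(status: dict, mode: AvailabilityMode) -> bool:
    # The status fields ("ingested", "chunks_available", "complete") are hypothetical.
    if mode is AvailabilityMode.INSTANT:
        return status["ingested"]
    if mode is AvailabilityMode.INCREMENTAL:
        return status["chunks_available"] > 0
    return status["complete"]
```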
Domain data and services are highly usable
The DDMS must register itself and be discoverable by workflows looking to consume data from that domain. Furthermore, the DDMS must provide a standardized API for consuming its data.
The DDMS must own and manage all enrichments within the domain.
Note: Even if multiple representations of the same data need to exist to support different domain workflows, access control and legal constraints apply to the conceptual entity, not just one of its representations: the durable identity of data items is therefore key, regardless of representation.
Consumable: provide shareable widgets, deployable in other solutions, that can be invoked on these enriched data entities. Examples: “my team’s most recent reservoir models”, “top-5 most popular seismic cubes”, or “latest approved FA models”.
Scalable performance for E&P workflows
Our final goal is for customers to be able to choose the optimal point between cost and performance, and a DDMS should have a roadmap to achieve this goal. The first step is to provide acceptable performance for mainstream usage patterns, both within the domain and cross-domain. Then, as requirements for scale are received, a DDMS should parallelize the choke points and finally expose a model which allows user selection.
Data is available for visual analytics and discovery (Viz/BI)
Identifying relevant data (discovery) and understanding patterns in data (charts, dashboards) require that all data in the ecosystem, including raw data and results, is available to these workflows. This requires the ability to pick attributes and relationships across domains: the real value of BI is cross-domain, not single-domain.
One source of truth for data
For data, the domain data management service provides the ultimate source of truth. The DDMS abstracts where it sources the data from (systems of record, or systems of reference etc.), how it stores the data (including more than once for cost or performance reasons), and what domain knowledge-based transformations it chooses to apply. Note that copies of data are possible in different consumption zones (e.g., data ponds) for multiple workflows, and also to support features like search and visualization.
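As an illustration of this abstraction (all names are hypothetical), a DDMS interface might look like the sketch below: consumers see one authoritative access point, while sourcing, storage layout, and consumption-zone copies stay hidden behind it.

```python
from abc import ABC, abstractmethod

class DomainDataService(ABC):
    """Single source of truth for one domain's data."""

    @abstractmethod
    def get(self, record_id: str) -> dict:
        """Return the authoritative record, wherever and however it is stored."""

    @abstractmethod
    def materialize(self, record_id: str, zone: str) -> str:
        """Copy a record into a consumption zone (e.g. a data pond for search or
        visualization) without changing the authoritative source; return the
        location of the copy."""
```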
Data is secured, and access governed
Only authorized identities have access to data. Data compliance is enforced at all entry/exit points.
Entitlements: Leverage the DE entitlement service to define ACLs (NB: these need to be managed centrally to ensure consistency and auditability). There is a central control plane, but for performance reasons enforcement can be decentralized.
Encryption: Data must be stored encrypted. Allow for storage models where customers can bring their own encryption key.
Customer data isolation: Design the deployment structure following best practices on separating customer data tenants and the shared service tenant.
Compliance: Leverage the DE compliance service to define data tagging (NB: these tags need to be managed centrally to ensure consistency and auditability). There is a central control plane, but for performance reasons enforcement can be decentralized.
Audit logging: Implement the required granularity of logging for cyber security forensics, including the essential attributes to capture in the audit log (caller, user, tenant, timestamp, action, kind(s), ids, response status).
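An illustrative audit-log entry carrying the essential attributes listed above; the field names and values are examples, not a mandated schema.

```python
audit_entry = {
    "caller": "seismic-interpretation-app",   # calling service/application
    "user": "jdoe@example.com",               # end-user identity
    "tenant": "acme-energy",                  # customer tenant
    "timestamp": "2020-03-14T09:26:53Z",      # UTC time of the action
    "action": "READ",                         # e.g. CREATE / READ / UPDATE / DELETE
    "kinds": ["acme:seismic:trace:1.0.0"],    # kind(s) of the records touched
    "ids": ["acme:seismic:cube-0042"],        # affected record id(s)
    "response_status": 200,                   # outcome of the request
}
```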
All data is preserved and immutable
Ingestion and retention minimize data loss for raw data. Metadata should be captured for all data without exception. (By exception, transient data, limited to a specific workflow for a specific user persona, is subject to cost considerations.) Versioning is preferred.
Keep original: as data gets ingested into the DDMS, ensure the raw input data is preserved, either through explicit storage or by referencing the System of Record, which has the responsibility to persist the original (driven by cost considerations).
Ingestion: when data is loaded again, a new version of the record/data is created and the version number increased. If the newly loaded data is the same as a previous version of the data (for example, as proven by the CRC of a seismic cube), then the ingestion should be skipped (a record of the reload activity may be stored); see the sketch after this list.
Improving data: Improved data is new data. Enrichment is a workflow which results in new data.
Central registration: any data generated in the app should be registered in the data ecosystem with the appropriate legal and security tags and with metadata for discoverability. For example, to enable machine learning, the system needs to understand why an interpreter chose one result over the others; this context is important to improve “data trust” and to facilitate ML in the future. This requires the data ecosystem to be aware of all data, not just end results.
Handling bulk data: there are scenarios where the bulk data behind the data record is moved to cold storage, offline, or even gets deleted (given it can be re-created). The choice to archive or re-create is a decision of the domain team. The data record continues to exist in the Data Ecosystem.
Deleting data: data records will not be purged; instead, a soft delete will be performed, making the record and the associated bulk data unavailable to users. If the customer requests a hard delete, the record needs to keep proof of the deletion, its bulk data needs to be scrambled (encrypted with a random key, for example), and a delete action is to be performed by the cloud vendor.
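A minimal sketch of the versioning and deletion rules above, with a hypothetical in-memory store: re-ingesting identical data is skipped based on a checksum (standing in for the CRC), and deletes are soft by default.

```python
import datetime
import hashlib

store: dict[str, list[dict]] = {}   # record_id -> list of versions

def ingest(record_id: str, payload: bytes, user: str) -> dict:
    checksum = hashlib.sha256(payload).hexdigest()   # stand-in for the CRC above
    versions = store.setdefault(record_id, [])
    if versions and versions[-1]["checksum"] == checksum:
        # Same content as the latest version: skip (optionally log the reload).
        return versions[-1]
    version = {
        "version": len(versions) + 1,   # version number increased on reload
        "checksum": checksum,
        "created_by": user,
        "created_at": datetime.datetime.utcnow().isoformat() + "Z",
        "deleted": False,
    }
    versions.append(version)
    return version

def soft_delete(record_id: str) -> None:
    # The record and its bulk data become invisible to users but are not purged.
    for version in store.get(record_id, []):
        version["deleted"] = True
```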
Data is globally identifiable
All data must be identified by a globally unique, durable identifier assigned by the Data Ecosystem. Domains must provide matching rules for identifying similar data ingested from various sources. Use of context-specific data identities is strongly discouraged, as it prevents leveraging data across workflows and hinders interchange. In addition to its identity, each data item must at least provide basic audit attributes for tracking: creation date, creating user, and instance version (if unique, like a timestamp; otherwise let the system generate one).
Required: Leverage the DE identity service to ensure all data items in your service have a globally unique ID across DELFI.
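An illustrative record header showing a globally unique, durable identifier plus the basic audit attributes named above; the identifier scheme and field names are assumptions, not a normative format.

```python
record_header = {
    "id": "acme:well:7f3a9c12-4e1b-4b1e-9a2d-0c5e8f6d3a71",  # assigned by the Data Ecosystem
    "created_on": "2020-03-14T09:26:53Z",                    # creation date
    "created_by": "jdoe@example.com",                        # creating user
    "version": 3,                                            # instance version
}
```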
Data lineage is tracked
All transformations and workflows must provide lineage. It must be possible to track all derived data back to the original data (for example, to honor subscription expirations where derived data must be archived) and to track different versions of input data loaded at different points in time.
Required: the moment data is transformed from its raw input into another representation/storage, the new data set must refer to the raw input (lineage).
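A hypothetical lineage block attached to a derived dataset: the derived record points back to the exact versions of its inputs, so the workflow can be re-created and subscription expirations can be honored downstream. Field names are illustrative.

```python
derived_record = {
    "id": "acme:horizon:0042",
    "lineage": {
        "derived_from": [
            {"id": "acme:seismic:cube-0042", "version": 2},   # raw input, pinned version
        ],
        "workflow": "horizon-autotracking",                   # transformation that produced it
        "executed_at": "2020-03-15T11:02:00Z",
    },
}
```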
Data is discoverable
All domain data management services provide data ready to index. Users want awareness of all data in DELFI and the associated data quality (measured actuals vs. interpreter work-in-progress vs. qualified best results, etc.). This implies we need an index that spans both work-in-progress and final data; quality should convey whether a data item is trustworthy and suitable for a downstream workflow.
Discoverability of data requires a perspective of the data kind that is conducive to search. Typically the perspective is defined by the domain that owns the kind, but there are situations where the kind is an intersection of several domains (example: wellbore). In such cases, though a domain may have a view specific to its discipline, a mapping must be provided to the discovery perspective (also known as a well-known entity); an illustration follows below.
Required: implement the DE search service (Elastic). DE will consolidate these indexes into a common (global search) discovery.
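An illustrative index document for the wellbore example above, combining a well-known-entity kind, cross-domain relationships, and a quality indicator; all field names and values are assumptions.

```python
index_doc = {
    "id": "acme:wellbore:WB-1138",
    "kind": "wellbore",                     # well-known entity for cross-domain discovery
    "quality": "qualified-best-result",     # vs. "measured-actual", "work-in-progress"
    "attributes": {
        "well": "acme:well:W-07",           # relationship usable across domains
        "spud_date": "2019-11-02",
    },
}
```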
Provisioning
The DDMS deployment/provisioning is automated and implements CI/CD independently of the other parts of the Data Ecosystem. The DDMS registers its services and integrates with the Data Ecosystem core components. We should strive to keep the number of independent projects and data scopes for a given tenant under control by aggregating them at least at a domain/discipline level.
Business continuity (RTO, RPO, SLI)
The services are designed to deliver business continuity through disaster recovery and backup/restore policies. The service delivers on industry expectations around RTO and RPO.
Disaster Recovery: Ability to deploy multi-zone, multi-region
Data Residency: Ability to restrict cross-region deployment (in case of residency constraints)
RPO (Recovery Point Objective) limits how far to roll back in time and defines the maximum allowable amount of lost data measured in time from a failure occurrence to the last valid backup.
To protect against data center failure: implement multi-region support
To protect against accidental data loss/deletion/service failure: implement data versioning, perform regular backup, depending on the tiering of the data criticality (for example customer data <X min)
Understand how you generate your backup (completeness, consistency of data), and validate it can be restored (requires drills)
RTO (Recovery Time Objective) is related to downtime and represents how long it takes to restore from the incident until normal operations are available to users.
To optimize recovery time, implement multi-region support
Investigate ways to incrementally restore data from backup
Ability to redeploy the services in another zone/region, should this be required
The recovery metrics associated with IT Service Continuity are:
SLI (Service Level Indicators): instrument the DDMS with service level indicators that drive the understanding of the availability of the service.
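A minimal sketch of one such indicator, an availability SLI computed as the ratio of successful requests to total requests in a measurement window; the function and its error threshold are assumptions for illustration.

```python
def availability_sli(status_codes: list[int]) -> float:
    """status_codes: HTTP responses observed in the measurement window."""
    if not status_codes:
        return 1.0
    ok = sum(1 for code in status_codes if code < 500)  # server errors count against the SLI
    return ok / len(status_codes)

# e.g. availability_sli([200, 200, 503, 200]) == 0.75
```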
Cost
Cost is another aspect that is unique to the cloud, as the cost of copies is something we must cope with as a service provider. Services must therefore be designed to minimize copies (of the same version) as much as possible and to optimize for cost as data volumes increase. This includes policies for offline and cold storage and the use of cost-sensible cloud vendor storage methods.
Auditability
The service allows for data access auditing and supports the essential attributes required for cyber security incident forensics. See also #7.
Users should be able to access data from any domain even if they don’t have a subscription to the domain workflow/solutions; this ensures that any domain data is open and accessible in the DELFI data ecosystem.
Required: all data services must be accessible under one profile (ingest, discover, view). There will be no additional subscription control on different parts of the Data Ecosystem (for example, a seismic interpreter will be able to discover and use time-series data, and a drilling engineer will be able to discover and use geological models, without any constraints imposed by the workflow app and/or the associated subscription).
Multi-tenant support with separation of services and data (isolated by tenant).
Automation of tenant deployment
Availability of the domain data management service is independent from the domain workflow or associated subscription.
Openness and Extensibility Principles
Domain-Centric Data APIs
Need to provide APIs for data types (wells, logs, tops, etc.) that are domain-oriented and independent of the domain workflow. These APIs should be published in the DELFI developer portal and API gateway. This allows partners and customers to ingest or manipulate data in DELFI and ensures that all domain solutions can benefit from the data produced. We want to minimize APIs that are specific to Petrophysics, Geology, or Drilling for the same entity. This ties back to the definition of a domain: each team (or, worse, each workflow) is not a domain.
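A hypothetical, non-normative sketch of what such domain-oriented endpoints could look like for the well entity: one API per data type, shared by every workflow, rather than one API per discipline.

```python
# Route table for a domain-centric "well" API (paths and verbs are illustrative).
DOMAIN_DATA_API = {
    "GET  /api/well/v1/wells/{id}":      "fetch a well, regardless of consuming workflow",
    "GET  /api/well/v1/wells/{id}/logs": "logs attached to a well",
    "GET  /api/well/v1/wells/{id}/tops": "formation tops for a well",
    "POST /api/well/v1/wells":           "ingest a new well (open to partners and customers)",
}
```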
Workflow composability and customizations
Need the ability for customers to change or replace workflow steps in DELFI, for example replacing the log curation workflow in Swan. This implies that Data APIs should allow custom extensions to manipulate data in domain workflows.
We want to compose components developed in different centers (correlation, forecasting and mapping, etc.). When the user creates new data, that data must become available to the other components without waiting or explicit push/pull, and the canvases must reflect the new data as quickly as possible (as sketched below).
Also, if the user needs to update a KB, it should be done once, not in multiple places, and all workflows should reflect the correction immediately.
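A minimal sketch (all names hypothetical) of the event-driven behavior described above: when one component writes new data, subscribed components are notified instead of polling or re-pulling.

```python
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(kind: str, handler: Callable[[dict], None]) -> None:
    subscribers[kind].append(handler)

def publish(kind: str, record: dict) -> None:
    for handler in subscribers[kind]:
        handler(record)   # e.g. a canvas refreshing itself with the new data

# Usage: a mapping canvas reacts as soon as the correlation component saves a pick.
subscribe("well-top", lambda rec: print("canvas refresh:", rec["id"]))
publish("well-top", {"id": "acme:top:TX-12"})
```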