EDS Data Management Service
We are often left to bridge the gap between architectural principles, which stay at a high and abstract level, and the actual implementation detail. This is an attempt to bridge that gap with a set of Lightweight Architecture Decision Records (LADRs) that are simple to follow and can be implemented by the developers of a given team or project.
EDS DMS Iteration 1
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
A data management service is needed to handle the retrieval of external datasets from data providers. The user in the local OSDU platform will have identified datasets for which they want to view the full data. The metadata for these datasets will already have been fetched and ingested into the local OSDU platform. This service is concerned only with retrieving the actual data behind the datasets.
An OSDU environment will have ConnectedSourceDataJob records and ConnectedSourceRegistryEntry records as outlined here: https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/extern-data/docs/-/tree/master/Design%20Documents/Schemas
These records contain information about an external source, such as the service account information that the EDS DMS uses to connect to the external source's dataset API and retrieve data from it.
Additionally, the dataset service will route datasets with an abstract external data source schema to the EDS DMS, providing a seamless experience for the user.
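The routing decision described above can be illustrated with a minimal sketch. It assumes that external datasets are recognised by the entity part of their kind, `dataset--External`, as in the example record in this document. The function names are hypothetical and are not part of the dataset service.

```python
# Assumption: external datasets are recognised by the entity part of their kind.
EXTERNAL_ENTITY = "dataset--External"


def parse_kind(kind: str) -> dict:
    """Split a kind of the form authority:source:entity:version."""
    parts = kind.split(":")
    if len(parts) != 4:
        raise ValueError(f"unexpected kind format: {kind!r}")
    authority, source, entity, version = parts
    return {
        "authority": authority,
        "source": source,
        "entity": entity,
        "version": version,
    }


def routes_to_eds_dms(kind: str) -> bool:
    """Return True when a dataset of this kind should be handed to the EDS DMS."""
    return parse_kind(kind)["entity"] == EXTERNAL_ENTITY
```

With this check, a kind such as `opendes:wks:dataset--External:1.0.0` would be routed to the EDS DMS, while `opendes:wks:dataset--File.Generic:1.0.0` would stay with the regular dataset flow.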
Proposed request body to the local EDS DMS from the local user:
{
  "datasetRegistryIds": ["id-1"]
}
Example external dataset record:
{
  "id": "{{data-partition-id}}:dataset--External:some-guid3",
  "kind": "{{data-partition-id}}:wks:dataset--External:1.0.0",
  "data": {
    "Name": "name",
    "DatasetProperties": {
      "ConnectedSourceDataJobId": "opendes:ConnectedDataSourceJob:some-guid",
      "ConnectedSourceRegistryEntryId": "opendes:ConnectedSourceRegistryEntry:some-guid1",
      "SourceDataPartitionId": "opendes",
      "SourceRecordId": "opendes:dataset--File.Generic:c7153a1da6c94e6dbad70f84f89de95a"
    }
  },
  "namespace": "{{data-partition-id}}:osdu",
  "legal": {
    "legaltags": [
      "{{data-partition-id}}-public-usa-dataset"
    ],
    "otherRelevantDataCountries": [
      "US"
    ],
    "status": "compliant"
  },
  "acl": {
    "viewers": [
      "data.default.viewers@{{data-partition-id}}.testing.com"
    ],
    "owners": [
      "data.default.owners@{{data-partition-id}}.testing.com"
    ]
  }
}
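As an illustration of how a service might read the fields it needs from such a record, here is a minimal sketch. The `ExternalSourceRef` class and the `extract_source_ref` function are hypothetical names. Only the four `DatasetProperties` keys come from the example record above.

```python
from dataclasses import dataclass

# The four keys shown under DatasetProperties in the example record.
REQUIRED_PROPERTIES = (
    "ConnectedSourceDataJobId",
    "ConnectedSourceRegistryEntryId",
    "SourceDataPartitionId",
    "SourceRecordId",
)


@dataclass(frozen=True)
class ExternalSourceRef:
    """Hypothetical holder for the pointers to the external source."""

    data_job_id: str
    registry_entry_id: str
    source_partition_id: str
    source_record_id: str


def extract_source_ref(record: dict) -> ExternalSourceRef:
    """Read the external source pointers from an external dataset record."""
    props = record.get("data", {}).get("DatasetProperties", {})
    missing = [key for key in REQUIRED_PROPERTIES if not props.get(key)]
    if missing:
        raise ValueError(f"record is missing DatasetProperties: {missing}")
    return ExternalSourceRef(
        data_job_id=props["ConnectedSourceDataJobId"],
        registry_entry_id=props["ConnectedSourceRegistryEntryId"],
        source_partition_id=props["SourceDataPartitionId"],
        source_record_id=props["SourceRecordId"],
    )
```

Failing early on a missing property keeps an incomplete record from reaching the call to the external source.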
Proposed response body from the local EDS DMS, which is also the standard response from the dataset service (AWS example; dataset service retrieval instructions differ by CSP):
{
  "delivery": [
    {
      "datasetRegistryId": "id-1",
      "externalRetrievalInstruction": {
        "signedUrl": "",
        "signedUrlExpiration": "",
        "unsignedUrl": "",
        "createdAt": "",
        "fileName": "",
        "connectionString": "",
        "credentials": "",
        "region": ""
      }
    }
  ]
}
As with everything in the dataset service, the contents of "externalRetrievalInstruction" differ based on the type of data being requested. The example shown above is for a file dataset.
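A caller could consume the proposed response along the following lines. This is a minimal sketch. The key name follows the spelling used in the paragraph above, and the helper names `index_delivery` and `missing_ids` are hypothetical. The instruction is treated as an opaque dictionary because its contents depend on the dataset type and the CSP.

```python
from typing import Dict, List

# Key name as spelled in the paragraph above.
INSTRUCTION_KEY = "externalRetrievalInstruction"


def index_delivery(response: dict) -> Dict[str, dict]:
    """Map each datasetRegistryId in the response to its retrieval instruction."""
    instructions: Dict[str, dict] = {}
    for item in response.get("delivery", []):
        instructions[item["datasetRegistryId"]] = item.get(INSTRUCTION_KEY, {})
    return instructions


def missing_ids(requested: List[str], response: dict) -> List[str]:
    """Return the requested ids that have no entry in the response."""
    delivered = index_delivery(response)
    return [dataset_id for dataset_id in requested if dataset_id not in delivered]
```

Checking for missing ids gives the caller a simple way to notice a dataset that was requested but not delivered.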
Little CSP-specific implementation will be needed for this service, since it exists merely as a proxy to external systems. The only CSP-specific part is the retrieval and use of secrets, and that logic already exists in other OSDU services. Beyond that, only DevOps work should be required for a cloud provider to make this service available in their platform once development is complete.
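One way to keep the secret handling behind a single provider-specific seam is sketched below. All names here (`SecretStore`, `InMemorySecretStore`, `resolve_credentials`) are hypothetical and are not taken from existing OSDU services. The in-memory store is only a stand-in for local runs.

```python
from typing import Dict, Protocol


class SecretStore(Protocol):
    """Hypothetical CSP seam: each cloud provider supplies its own implementation."""

    def get_secret(self, name: str) -> str:
        ...


class InMemorySecretStore:
    """Stand-in for local runs and tests; not a real secret backend."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = dict(secrets)

    def get_secret(self, name: str) -> str:
        if name not in self._secrets:
            raise KeyError(f"secret not found: {name}")
        return self._secrets[name]


def resolve_credentials(store: SecretStore, secret_names: Dict[str, str]) -> Dict[str, str]:
    """Replace each secret name with its value using the provider-specific store."""
    return {field: store.get_secret(name) for field, name in secret_names.items()}
```

Under this assumption, the rest of the service depends only on `SecretStore`, so a cloud provider would swap in its own implementation without touching the proxy logic.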
The assumption with EDS DMS is that the external retrieval instructions it gets from a data provider are retrieved from a dataset service, or from a service that has the same API as the dataset service.
There will be a need to cache bulk data in the local OSDU environment in the future, but the initial iteration of EDS DMS is concerned only with external retrieval. Logic to help enforce licenses may also be added to EDS DMS in the future, but not in this iteration.
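The proxy behaviour implied by this assumption can be sketched as follows. The two callables are hypothetical placeholders: `get_local_record` stands for a lookup of the local external dataset record, and `fetch_instruction` stands for the authenticated call to the provider's dataset service. Neither corresponds to an existing OSDU client. The response key name follows the spelling used in the prose of this document.

```python
from typing import Callable, Dict, List


def retrieve_external(
    dataset_registry_ids: List[str],
    get_local_record: Callable[[str], dict],
    fetch_instruction: Callable[[str, str, str], dict],
) -> Dict[str, list]:
    """Resolve each local id to the retrieval instruction returned by the provider."""
    delivery = []
    for local_id in dataset_registry_ids:
        # The local record points at the external source and its source record.
        props = get_local_record(local_id)["data"]["DatasetProperties"]
        # Hypothetical call: the provider is assumed to expose the dataset service API.
        instruction = fetch_instruction(
            props["ConnectedSourceRegistryEntryId"],
            props["SourceDataPartitionId"],
            props["SourceRecordId"],
        )
        # Answer with the id the local user asked for, not the provider's id.
        delivery.append(
            {
                "datasetRegistryId": local_id,
                "externalRetrievalInstruction": instruction,
            }
        )
    return {"delivery": delivery}
```

Passing the two steps in as callables keeps the sketch free of any HTTP or storage details, which would differ by environment.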
Please see this design for more info: https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/extern-data/docs/-/blob/master/Design%20Documents/OSDU%20on%20AWS%20-%20EDS%20DMS.pptx
Decision
Accept the design for iteration 1 to enable development.
Rationale
- Enable retrieval of external data
- Execute retrieval seamlessly without any burden on the user’s part
- Implement caching in future update
Consequences
It's the first microservice required of cloud providers to enable EDS support. This is a greenfield application required for EDS to work.
When to revisit
Once the first iteration has seen some use, we will better understand how to implement caching and can revisit the design for iteration 2.
Tradeoff Analysis - Input to decision
EDS will be a major contribution to OSDU's success. Without a way for customers to actually retrieve external data, EDS functionality won’t work.
Alternatives and implications
Decision criteria and tradeoffs
Decision timeline
Decision ready to be made