Osdu_Ingest - Provide additional integrity check to catch inconsistencies in denormalized data (Ex: Master Entity "Play")
While looking at an example provided by the Development team (CSV Ingestion), I have a question regarding the Master Data (Play) definition.
data.GeoContexts[].BasinID -> Basin
data.GeoContexts[].GeoTypeID -> BasinType
Isn’t the second field unnecessary, since that information can be found in the Basin master record itself? Example: "BasinTypeID": "namespace:reference-data--BasinType:ArcWrenchOceanContinent:"
In addition, offering two separate fields in the “Play” master record opens up the possibility of conflicting information.
So a suitable integrity check is required.
See notes from @gehrmann:
Hi Debasis,
The schema is de-normalised to support queries by BasinType/GeoPoliticalEntityType/...
Yes, every de-normalisation carries the risk of introducing contradictions. This was accepted as a trade-off and considered worthwhile in the interest of easier query handling.
Finally, it is possible to organise the master-data as parent-child structures with self-references. This is most easily understood with the GeoPoliticalEntity hierarchy: country, state, county, ...
Best regards, Thomas
Additional notes from Thomas
Hi Debasis,
I am not sure that extra validation during ingestion alone is sufficient. Basin, Play, Prospect and GeoPoliticalEntity are all master-data and therefore subject to continuous improvement, so a record that was consistent at ingestion time can become inconsistent later. I would think a generic set of data quality rules, which can be re-evaluated after any change, might be a better choice.
The schema, by the way, does mark derived properties (=de-normalised properties) - please check the schema definitions with the dedicated extension tag x-osdu-is-derived:
Example for AbstractGeoBasinContext:
"x-osdu-is-derived": {
"RelationshipPropertyName": "BasinID",
"TargetPropertyName": "BasinTypeID"
}
In other words: the property GeoTypeID is derived via the sibling property BasinID linking to the target object's property BasinTypeID.
This decoration has been done in other places as well. It should be possible to create a generic implementation of a quality rule covering all of the derived/de-normalised values.
Best regards, Thomas
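
The generic quality rule suggested above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not an existing OSDU implementation: it assumes the x-osdu-is-derived tag sits on the definition of the derived property (here GeoTypeID), that the referenced record stores the target property under "data", and that IDs can be compared as plain strings. The names derived_rules, check_derived, SCHEMA_PROPERTIES and BASINS, the Basin record ID and the "SomeOtherType" value are hypothetical; the resolve callback stands in for whatever record lookup the platform provides.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Any]

# Hypothetical schema fragment modelled on the AbstractGeoBasinContext example.
# Assumption: the tag is attached to the derived property's own definition.
SCHEMA_PROPERTIES: Dict[str, Any] = {
    "BasinID": {"type": "string"},
    "GeoTypeID": {
        "type": "string",
        "x-osdu-is-derived": {
            "RelationshipPropertyName": "BasinID",
            "TargetPropertyName": "BasinTypeID",
        },
    },
}

# Hypothetical in-memory stand-in for a lookup of master-data records by ID.
BASINS: Dict[str, Record] = {
    "namespace:master-data--Basin:ExampleBasin:": {
        "data": {
            "BasinTypeID": "namespace:reference-data--BasinType:ArcWrenchOceanContinent:"
        }
    }
}


def derived_rules(properties: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (derived, relationship, target) property names for each tagged property."""
    for name, definition in properties.items():
        tag = definition.get("x-osdu-is-derived")
        if tag:
            yield name, tag["RelationshipPropertyName"], tag["TargetPropertyName"]


def check_derived(
    obj: Record,
    properties: Dict[str, Any],
    resolve: Callable[[str], Optional[Record]],
) -> List[str]:
    """Return a list of violations for one object, e.g. one GeoContexts entry."""
    problems: List[str] = []
    for derived, relationship, target in derived_rules(properties):
        # Nothing to compare if either side of the rule is absent.
        if derived not in obj or relationship not in obj:
            continue
        referenced = resolve(obj[relationship])
        if referenced is None:
            problems.append(f"{relationship}={obj[relationship]!r} could not be resolved")
            continue
        expected = referenced.get("data", {}).get(target)
        if obj[derived] != expected:
            problems.append(
                f"{derived}={obj[derived]!r} conflicts with "
                f"{relationship} -> {target}={expected!r}"
            )
    return problems


consistent = {
    "BasinID": "namespace:master-data--Basin:ExampleBasin:",
    "GeoTypeID": "namespace:reference-data--BasinType:ArcWrenchOceanContinent:",
}
conflicting = {
    "BasinID": "namespace:master-data--Basin:ExampleBasin:",
    "GeoTypeID": "namespace:reference-data--BasinType:SomeOtherType:",
}

if __name__ == "__main__":
    for entry in (consistent, conflicting):
        print(check_derived(entry, SCHEMA_PROPERTIES, BASINS.get))
```

Because the rule is driven only by the schema tag, the same function could be re-run over existing Play records whenever a Basin record changes, rather than only at ingestion time.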