Currently, user needs to provide both the Reference Entity information and the persistable reference.
This is evident when the user needs to specify a unit of measure or a coordinate reference system.
This is inefficient and also error-prone.
What if the Data Loader or the user makes a mistake in the persistable reference value, so that the two values are inconsistent?
The proposal is to save the user this trouble. Let the user provide link to existing Reference entity and ID.
Programs such as Manifest-based Ingestion could then query the Reference value and add the required line to the JSON file that is actually used to store/populate the record.
Debasis Chatterjee changed title from Avoid the need to provide persistable reference (Unit system, Cooordinate Reference System to Avoid the need to provide persistable reference (Unit system, Coordinate Reference System
Debasis Chatterjee changed title from Avoid the need to provide persistable reference (Unit system, Coordinate Reference System to Avoid the need to provide persistable reference (Unit system, Coordinate Reference System) information
Debasis Chatterjee changed title from Avoid the need to provide persistable reference (Unit system, Coordinate Reference System) information to Avoid the need to provide persistable reference information (Unit system, Coordinate Reference System)
As of now the normalizer code uses the FoR definitions in the meta[] – the self-contained persistableReference strings. The schema is prepared to carry the relationships to reference-data records, but these are currently not used by the normalizer.
The Unit (catalog) service and CRS catalog service are only used to look up definitions and copy the persistable reference strings into the respective data records (during data preparation). The OSDU reference-data records UnitOfMeasure and CoordinateReferenceSystem offer the same capabilities – and there are initiatives under way to make the Unit service and CRS Catalog service read from the reference-data records instead of the current hard-coded resources. The term ‘initiatives’ is deliberately elastic as there is no commitment for an implementation yet. Debasis knows more.
Debasis Chatterjee changed title from Avoid the need to provide persistable reference information (Unit system, Coordinate Reference System) to [ADR] Avoid the need to provide persistable reference information (Unit system, Coordinate Reference System)
It is still very unclear to me what the actual proposal to relieve the situation is.
What is the input 'to save the user this trouble' (i.e. providing the persistableReference string)?
The incoming data are not guaranteed to use, e.g., the Energistics Unit of Measure standard. Many legacy LAS files evidently use the unit F as a depth unit, whereas in the OSDU UnitOfMeasure F means farad (capacitance). What I am trying to say is that this mapping is non-trivial and context-dependent. In fact, Equinor submitted member GitLab schema#257. The branch associated with the issue contains a proposal, which is waiting for validation by Equinor.
I believe this ADR requires substantial elaboration and coordination with already ongoing activities in Data Definitions before it can be approved.
"Save user the trouble" - Today in Manifest-based Ingestion, if the user wants some fields to be converted from "ft" to "m", they have to specify two rows: one row pointing to the UnitOfMeasure entry in Reference data for "ft", and another row with the actual persistableReference (a long string). My proposal is that the user simply provides the pointer to the UnitOfMeasure entry in Reference data for "ft"; the DAG/program then creates the additional row (persistable reference) behind the scenes when creating the record through the Storage service (PUT). This is a "win-win" and avoids any breaking change to existing programs.
We could make the "persistable reference" information optional. But then again, that opens us up to the usual problem of duplicated information: what if the persistable reference from UnitOfMeasure (Reference entity) differs from the user-supplied persistableReference?
Are you suggesting that we give the user the option to provide only the persistable reference in the ingestion JSON/payload when the unit is non-standard and does not exist as Reference data?
When it comes to well logs, the curve unit of measure is yet another ball game, and it is better to keep that out of scope for this subject. First, it refers to bulk data, which is handled only by Wellbore DDMS. Do you envisage a need to normalize actual curve values, for example when porosity curve values are in percent (30%, with limits of 0-100%) in one well log and in fractional units (0.3, with limits of 0-1) in another?
Please let me know if the "ask" of this specific ADR is clear. If not then I am happy to connect with you by Slack or otherwise to discuss.
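To make the "ask" concrete, the backfill step described above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual DAG code: the function name, the `lookup` callable, the record id, and the placeholder persistable reference string are all hypothetical, and the `meta[]` field names (`unitOfMeasureID`, `persistableReference`) are used as I understand them from the schema.

```python
import copy


def backfill_persistable_references(record, lookup):
    """Return a copy of `record` whose meta[] items carry a persistableReference.

    `lookup` is a hypothetical callable that maps a reference-data record id
    to the persistable reference string stored on that record.
    """
    result = copy.deepcopy(record)  # leave the user's manifest untouched
    for item in result.get("meta", []):
        ref_id = item.get("unitOfMeasureID")
        if ref_id and not item.get("persistableReference"):
            item["persistableReference"] = lookup(ref_id)
    return result


# Hypothetical stand-in for a query against UnitOfMeasure reference data.
FAKE_REFERENCE_DATA = {
    "ns:reference-data--UnitOfMeasure:ft:": "<persistable reference for ft>",
}

# The user supplies only the id of the existing reference value.
manifest_record = {
    "meta": [
        {
            "kind": "Unit",
            "unitOfMeasureID": "ns:reference-data--UnitOfMeasure:ft:",
            "propertyNames": ["TotalDepth"],
        }
    ]
}

# What would be sent to Storage (PUT) after the backfill.
stored = backfill_persistable_references(
    manifest_record, FAKE_REFERENCE_DATA.__getitem__
)
```

The user-facing payload stays at one row per unit; the second row exists only in the record that reaches Storage.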
Clarify "Currently, user needs to provide both the Reference Entity information and the persistable reference" to "Currently, user needs to provide both the id of the Reference Value instance and the persistable reference". Rationale: This clarifies the scope: the input is the id, i.e. the relationship to the reference value record.
Clarify "The proposal is to save the user this trouble. Let the user provide link to existing Reference entity and ID." to "The proposal is to save the user this trouble. Let the user provide the relation to the existing Reference Value instance id." Rationale: data.ID is an optional external identifier; it is not always present while the system property id is.
Clarify "You can find historical context in the following issues" to "The following historical issues will be used to request the implementation changes". Rationale: this clarifies this ADR's role as umbrella issue. Action is required for all the individual issues since the implementation affects multiple, independent services.
In the scope section, Unit Conversion appears as a service at the same level as CRS conversion. The two are not directly comparable. There is no Unit conversion service; the only thing served is an ABCD or scale/offset parameter set holding the recipe for client-side unit conversion. The current Unit Catalog service implementation offers an endpoint that takes [namespace, unit symbol] as input and returns conversion parameters, which satisfies the spirit of this ADR request. However, see next point for improvement of the ADR:
I suggest restructuring the scope:
Manifest-based Ingestion (both unit and CRS in meta[].persistableReference and in data.{SpatialPropertyName}.AsIngestedCoordinates.persistableReferenceCrs)
CRS conversion to also support reference value id as fromCRS and toCRS.
Catalog services
Unit Catalog service uses a hard-coded set of unit and measurement definitions, which are independent of OSDU's reference value content. It has to be noted that the current Unit Catalog service offers far more legacy definitions than the clean Energistics Unit of Measure V1.0 content offered by OSDU. There is an initiative (see member GitLab #257 for details), which may bring OSDU reference value content on par with the current, hard-coded Unit Catalog service content. Once the data model has been decided, the rework of the Unit Catalog service can commence.
CRS Catalog service uses a hard-coded set of CRS and transformation definitions. This service implementation needs to be reworked to receive the content from the OSDU CoordinateReferenceSystem and CoordinateTransformation reference value records.
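As an aside on the "recipe for client unit conversion" mentioned above: with an ABCD parameter set, the client applies the Energistics formula y = (A + B·x) / (C + D·x) to get from the unit to its base unit. A minimal sketch follows; the function name is hypothetical, and the parameter values for ft and degF are my understanding of the Energistics definitions, so treat them as illustrative rather than authoritative.

```python
def to_base_unit(value, a, b, c, d):
    """Convert `value` to the base unit using the ABCD recipe.

    Energistics formula: y = (A + B*x) / (C + D*x)
    """
    return (a + b * value) / (c + d * value)


# Illustrative parameter sets (assumed values, not read from any service).
FT_TO_M = {"a": 0.0, "b": 0.3048, "c": 1.0, "d": 0.0}
DEGF_TO_K = {"a": 2298.35, "b": 5.0, "c": 9.0, "d": 0.0}

depth_m = to_base_unit(1000.0, **FT_TO_M)      # 1000 ft expressed in metres
freezing_k = to_base_unit(32.0, **DEGF_TO_K)   # 32 degF expressed in kelvin
```

The service only hands out the four numbers; the arithmetic itself always runs on the client.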
For this ADR, we need to clarify the backward-compatibility behavior in case there are existing clients that still generate the payload file in the old way. Manifest ingestion has two options:
- always ignore the persistableReference in the file and generate its own (my preferred option)
- use the one in the file if there is one, and generate one only if there is none.
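The difference between the two options shows up only when an old-style client supplies both the id and its own string. A rough sketch, with hypothetical names (`PrPolicy`, `resolve_pr`, `lookup`) and placeholder strings, none of which come from the ingestion code:

```python
from enum import Enum


class PrPolicy(Enum):
    ALWAYS_REGENERATE = "always_regenerate"  # option 1: ignore the file's value
    KEEP_IF_PRESENT = "keep_if_present"      # option 2: generate only if missing


def resolve_pr(meta_item, lookup, policy):
    """Pick the persistableReference to store for one meta[] item."""
    supplied = meta_item.get("persistableReference")
    ref_id = meta_item.get("unitOfMeasureID")
    if ref_id is None:
        # Nothing to look up; fall back to whatever the client supplied.
        return supplied
    if policy is PrPolicy.KEEP_IF_PRESENT and supplied:
        return supplied
    return lookup(ref_id)


def lookup(ref_id):
    # Hypothetical stand-in for a reference-data query.
    return f"<reference-data PR for {ref_id}>"


old_style_item = {
    "unitOfMeasureID": "ns:reference-data--UnitOfMeasure:ft:",
    "persistableReference": "<client-supplied PR>",
}

option_1 = resolve_pr(old_style_item, lookup, PrPolicy.ALWAYS_REGENERATE)
option_2 = resolve_pr(old_style_item, lookup, PrPolicy.KEEP_IF_PRESENT)
```

Under option 1 the stored value always matches the reference data; under option 2 a mistyped client string would survive ingestion.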
Also this should target M11 or after.
Finally, the comments of Thomas are relevant and we should also cover them.
Thanks @debasisc, I joined only in the later part of the call. I missed the new time set in the invite.
Going through the recording, I wonder whether either of the approaches will resolve one of the fundamental issues raised in the original description above. For example, how can we ensure that the ID and the Persistable Reference are in sync?
I would imagine that we treat the persistable reference as optional and ingest the record without it (as long as the ID is present). If any other program needs it (such as the normalizer, or a consumer app that understands the text better than the CRS ID), then the CRS look-up services can help, and the persistable reference can be fetched on demand. That way:
We keep the data consistent - no de-normalized or inconsistent entries
The data footprint is also small; there is no need to process and store repeated entries/text for 100s of records.
A variation of the above could be to use the persistable reference text if the ID is not present, and to derive the ID and store it.
We could still do a trade-off on approach 1 vs 2, but we may only be addressing part of the usability and data consistency issues. So I thought I'd note that here. Thank you.
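A rough sketch of the on-demand idea above, where records store only the ID and consumers resolve the persistable reference when they need it. Everything here is hypothetical: the function names, the ID, and the simulated lookup stand in for whatever CRS look-up service would actually be called.

```python
from functools import lru_cache

LOOKUPS = []  # records each simulated round trip to the look-up service


def fetch_persistable_reference(crs_id):
    # Hypothetical stand-in for a call to a CRS look-up service.
    LOOKUPS.append(crs_id)
    return f"<persistable reference for {crs_id}>"


@lru_cache(maxsize=1024)
def persistable_reference_for(crs_id):
    """Resolve the persistable reference on demand, caching by ID."""
    return fetch_persistable_reference(crs_id)


# 100 records that carry only the ID, no repeated text.
EXAMPLE_CRS_ID = "ns:reference-data--CoordinateReferenceSystem:example:"
records = [
    {"id": f"rec-{i}", "coordinateReferenceSystemID": EXAMPLE_CRS_ID}
    for i in range(100)
]

resolved = [
    persistable_reference_for(r["coordinateReferenceSystemID"]) for r in records
]
```

With caching, 100 records sharing one CRS cost a single lookup, which is the footprint argument in a nutshell; the cost is that consumers depend on the look-up service being available.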
I do not think the pr referenced by id (in the reference data) and the pr stored with the data are expected to be in sync; rather, the pr stored with the data is what was used. The rule is that the id "trumps" the pr, and users cannot use the pr with the data if an id is used. The reason for storing the pr with the data is only (in my mind) that removing it was not allowed, to avoid a breaking change. The opportunity in storing it with the data when an id is used is to provide an audit trail of which explicit definition was used to normalize the data at the time of loading (reference data does get updated, or perhaps something went wrong for whatever reason, and then this provides a way to troubleshoot). Perhaps for some it also provides a definition to another party when data gets transferred, in case the reference gets orphaned (which also should never happen).
@alexnarayanan - True. As @bert.kampes says, the initial intent of this ADR is to give users a better experience and spare them the trouble of providing the persistable reference. They can get by simply by providing an existing ID (be it from the stock population from the Data Definition team or from their own company's custom CRS).
Bottom line, user will no longer provide two values, just one existing ID of CRS. Such as below -
The Ingestion process will do the required lookup and backfill the persistable reference value in the JSON manifest right before the Storage/PUT call that creates the record. Thus everyone in the "food chain" stays happy.
I think we would benefit from adopting SWOT analysis as part of developing ADRs. ADRs come from performing trade-off analysis. Using SWOT starts from the claim that the alternative under consideration is the "best" alternative; filling in a table with its Strengths, Weaknesses, Opportunities and Threats (god-given risks) then simplifies the decision-making process. In addition, the decision becomes transparent to others who lack the deep technological insight.
Thanks for your comments @bert.kampes , @debasisc, @elandre . I agree, a SWOT or trade-off is a good exercise to compare against all options. Such an analysis may also help surface any fundamental gaps and help us challenge some of the original decisions, so we can make the designs better.
I agree that storing PR (which is a function of CRS_ID, say PR = f(CRS_ID)) in addition to CRS_ID in Storage & Index records will be a convenient look-up for any consuming app and serves the "food chain". However, I believe there will be trade-offs in doing so; this ADR already presents a few: inefficient, error-prone, harder to use. That led me to ask why we store PR for every storage record.
A thorough analysis & comparison can help us find that best alternative.
This is about transition, @alexnarayanan. Storing the persistableReference string is where we are coming from; it is the only definition that makes the normalization work today.
migrating over to using the reference-data relationship via ID is the desire of the community.
don't burn the bridge while walking over it - set the rules on either side of the bridge.
eventually, define a breaking change schema which only carries the reference-data relationship via ID.
We have to accept that this transition will take time and we need to live with this potential confusion for quite a while until everybody is ready to adopt the 2.0 version of all schemas.
@chad - Do you know of any recent change in Indexer such that it performs normalization (CRS values from asIngested to WGS84) even if the user does not provide persistable reference information?
I did a test as @bert.kampes and I were discussing this subject.
There's no change as far as I'm aware of - quick check on your test case; I get this error in the indexer:
{"index":{"trace":["CRS conversion: bad request from crs converter, no conversion applied. Response From CRS Converter: {\"code\":400,\"reason\":\"Error\",\"message\":\"Bad request\"}."],"statusCode":400,"lastUpdateTime":"2023-05-22T07:59:40.899Z"}}
Could you test two variations. First, when the record has no PR, confirm it is pulled from the ref data and populated. Second, a test to respond to Bert's comment about the value of retaining the PR and not writing over it, which agrees with IOGP principles. I think a test could be as simple as modifying a PR, then updating the record, and confirming the original PR is retained.
Thomas - Yes, the record exists and has persistable reference.
Chad - There is no indexing error.
My point is - does this mean the Data Loader no longer needs to provide the persistableReference string in the JSON payload at the time of creating the record? Then this is great news and this ADR can be closed after we check the behaviour in all 4 CSPs.
Reviewed the ADR note. I see loads of agreement, from 2 years to 5 months ago, from everybody contributing in the Comments (overwhelming agreement to use UnitOfMeasureID to pull persistableReference metadata from the ref list), but no indication anything was actually done, so the problem is still outstanding. Unfortunately, I don’t seem able to contribute to these Issues as my OSDU Gitlab login does not seem to work against these. The problem remains in M18 and we are being told to populate persistableReference as a workaround to the Search Normalizer problem. My objections to doing this are noted in Storage Issue #188.
My opinion on the Normalizer Java code change is to support both UnitOfMeasureID & persistableReference as follows based on honouring the UoM Ref List as the ‘source of truth’ for OSDU Unit definitions:
Where UnitOfMeasureID is populated in Meta[], the Normalizer uses the ID to perform lookup into UnitOfMeasure Reference List to obtain persistableReference
Where UnitOfMeasureID & persistableReference are both populated in Meta[], the Normalizer uses the ID to perform lookup into UnitOfMeasure Reference List to obtain persistableReference, i.e. persistableReference in Meta[] is ignored
Where UnitOfMeasureID is not populated but persistableReference is populated in Meta[], the Normalizer uses the Meta[] content to index the records
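The three cases above amount to a small precedence rule. The sketch below is Python for brevity, although the Normalizer itself is Java; the function name, the `lookup` callable, the returned source labels, and the placeholder strings are all hypothetical.

```python
def select_persistable_reference(meta_item, lookup):
    """Return (persistable_reference, source) for one Meta[] item.

    The UnitOfMeasure reference list is treated as the source of truth:
    whenever an ID is present, the Meta[] string is ignored.
    """
    uom_id = meta_item.get("unitOfMeasureID")
    supplied = meta_item.get("persistableReference")
    if uom_id:
        # Cases 1 and 2: ID present, with or without a Meta[] string.
        return lookup(uom_id), "reference-data"
    if supplied:
        # Case 3: legacy record that only carries the string.
        return supplied, "meta"
    return None, "none"


def lookup(uom_id):
    # Hypothetical stand-in for a UnitOfMeasure reference list lookup.
    return f"<PR from ref list for {uom_id}>"


case_1 = select_persistable_reference({"unitOfMeasureID": "uom:ft"}, lookup)
case_2 = select_persistable_reference(
    {"unitOfMeasureID": "uom:ft", "persistableReference": "<stale PR>"}, lookup
)
case_3 = select_persistable_reference(
    {"persistableReference": "<legacy PR>"}, lookup
)
```

Returning the source alongside the value is an addition of this sketch; it would let the Normalizer log which path was taken for a given record.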
This would support leveraging the UoM Reference list as the ‘source of truth’ for UoM Definitions (as it must be), avoid the issues described in Issue #188 (gargantuan bloating of all records in OSDU and increased Ingestion complexity/ETL overhead), and support legacy records only populated w/ persistableReference in Meta[]. I would also like to see the OSDU Release Documentation & any OSDU Developer Training Materials clearly describe the change (large bold font), stating that persistableReference is no longer required to be populated in Meta[] once this is resolved in Mxx (hopefully M22).