# Indexer issues
https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues

## [Issue 155](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/155): Augmented Index - Use Case 3 (Wellbore Name) not working in M22 Preship Environment
Reported by Norman Medina · Last updated 2024-03-07

I was testing the augmented index feature on the M22 Preship environment, trying to implement the use cases documented in this [tutorial](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/docs/tutorial/IndexAugmenter.md#use_cases). Use cases 1 and 5 worked for me. Use case 3 did not: the field `WellboreName` was not returned by search after reindexing. I tried this three times, but it still didn't work.
I used the snippet provided in the tutorial page and didn't modify anything in it.

## [Issue 154](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/154): Augmented Index - Use Case 2 (Country Names) not working in M22 Preship Environment
Reported by Norman Medina · Last updated 2024-03-07

I was testing the augmented index feature on the M22 Preship environment, trying to implement the use cases documented in this [tutorial](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/docs/tutorial/IndexAugmenter.md#use_cases). Use cases 1 and 5 worked for me. Use case 2 did not: the field `CountryNames` was not returned by search after reindexing.
I also tried this on `osdu:wks:master-data--Wellbore:1.0.0`, but replaced the item in `RelatedConditionMatches` with `^[\\w\\-\\.]+:reference-data--GeoPoliticalEntityType:Province:$` and changed the `Name` to `ProvinceNames`; the custom `ProvinceNames` field is not appearing either.
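For what it's worth, the pattern itself looks correct: after JSON unescaping, `\\w` becomes `\w`, and the expression matches a Province `GeoTypeID` in any data partition. A quick check in Python (the IDs below are hypothetical examples):

```python
import re

# The RelatedConditionMatches pattern as it reads after JSON unescaping.
pattern = r"^[\w\-\.]+:reference-data--GeoPoliticalEntityType:Province:$"

# Hypothetical GeoTypeID values, for illustration only.
province_id = "opendes:reference-data--GeoPoliticalEntityType:Province:"
country_id = "opendes:reference-data--GeoPoliticalEntityType:Country:"

print(bool(re.match(pattern, province_id)))  # True
print(bool(re.match(pattern, country_id)))   # False
```

So the missing `ProvinceNames` field does not appear to be caused by the regular expression.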
Please see below the reference data I used:
```
[
{
"acl": {
"owners": [
"{{New_OwnerDataGroup}}@{{data-partition-id}}{{domain}}"
],
"viewers": [
"{{New_ViewerDataGroup}}@{{data-partition-id}}{{domain}}"
]
},
"legal": {
"legaltags": [
"{{LegalTagNameExists}}"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"meta": [],
"data": {
"Code": "osdu:wks:master-data--Wellbore:1.",
"Configurations": [
{
"Name": "ProvinceNames",
"Policy": "ExtractAllMatches",
"Paths": [
{
"RelatedObjectsSpec": {
"RelatedObjectID": "data.GeoContexts[].GeoPoliticalEntityID",
"RelatedObjectKind": "osdu:wks:master-data--GeoPoliticalEntity:1.",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:reference-data--GeoPoliticalEntityType:Province:$"
],
"RelatedConditionProperty": "data.GeoContexts[].GeoTypeID"
},
"ValueExtraction": {
"ValuePath": "data.GeoPoliticalEntityName"
}
}
],
"UseCase": "As a user I want to find objects by a province name."
}
]
},
"id": "{{data-partition-id}}:reference-data--IndexPropertyPathConfiguration:wks:master-data--Wellbore:1.",
"kind": "osdu:wks:reference-data--IndexPropertyPathConfiguration:1.0.0",
"version": 0
}
]
```

## [Issue 152](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/152): Unable to search ingested records using search API
Reported by Mohd Asad Shaikh · Last updated 2024-03-08

Hi Team,
I am not able to search artifacts using the search service; however, I can retrieve them using the storage service. Attached are the DAG ingestion success response and the empty search response.
![Dag_Success_result](/uploads/0a5813e7d6d4d8f6a08052fc23ef8852/Dag_Success_result.png)
![image__3_](/uploads/6638d7ff797274aaacc35e7a09e9dbf3/image__3_.png)
![search_result_](/uploads/d316028de45b17c3be16e9d4a575c5c1/search_result_.png)

## [Issue 151](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/151): Augmenter throws null pointer exception when casting the related object ids retrieved from the GeoContext
Reported by Zhibin Mai · Last updated 2024-02-26

The Augmenter throws a NullPointerException when it tries to get the reference id and the reference object id is null. The bug was introduced by MR [620](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/620), included in M21.
The issue was discovered in an M22 deployment when the augmenter tried to cast a null reference object id from a GeoContext object. GeoContext has 5 reference object id properties defined in the schema; in each GeoContext, only 1 reference object id is non-null.
Here is the example:
- Format in Storage record
```
{
"GeoContexts": [{
"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:111111:",
"GeoTypeID": "opendes:reference-data--GeoPoliticalEntityType:LicenseBlock:"
}, {
"FieldID": "opendes:master-data--Field:444444:"
}
]
}
```
- Format in index record
```
{
"GeoContexts": [{
"BasinID": null,
"FieldID": null,
"PlayID": null,
"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:111111:",
"GeoTypeID": "opendes:reference-data--GeoPoliticalEntityType:LicenseBlock:",
"ProspectID": null
}, {
"BasinID": null,
"FieldID": "opendes:master-data--Field:444444:",
"PlayID": null,
"GeoPoliticalEntityID": null,
"GeoTypeID": "Field",
"ProspectID": null
}
]
}
```
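A sketch of the defensive handling that avoids the exception (Python for brevity; the service itself is Java, and the structures below are simplified): skip null related object ids instead of casting them unconditionally.

```python
def extract_related_object_ids(geo_contexts, id_property):
    """Collect non-null related object ids from a list of GeoContext dicts.

    Skipping None values avoids the NullPointerException described here,
    where only one of the five reference id properties is non-null per entry.
    """
    ids = []
    for ctx in geo_contexts:
        value = ctx.get(id_property)
        if value is None:  # the missing null check that caused the NPE
            continue
        ids.append(str(value))
    return ids


geo_contexts = [
    {"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:111111:",
     "FieldID": None},
    {"GeoPoliticalEntityID": None,
     "FieldID": "opendes:master-data--Field:444444:"},
]
print(extract_related_object_ids(geo_contexts, "FieldID"))
# ['opendes:master-data--Field:444444:']
```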
With this bug, the Augmenter assumes the related object id "FieldID" has a value and casts it to String without checking for null, so a NullPointerException is thrown. Augmenting of the record fails, though normal indexing is not affected.

Milestone: M23 - Release 0.26 · Zhibin Mai

## [Issue 139](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/139): Too many results returned after bagofwords feature
Reported by Guillaume Caillet · Last updated 2024-01-19

Hi,
When enabling the [BagOfWords feature](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/113), some search queries with a "query" filter return too many results.
I've reproduced the issue on several AWS environments, and I don't have the issue if the indexer is deployed with the feature flag `featureFlag.bagOfWords.enabled` set to False.
I have attached the 3 records and schema I used (these are from the `os-search` integration tests in `testing/integration-tests/search-test-core/src/main/resources/testData/records_1.json`)
[records.json](/uploads/196fce2d3f739b3c4349bd4e5075aeed/records.json)
[schema.json](/uploads/990d8ac4242d6a09921e16236f6a72e5/schema.json)
(I didn't delete these 3 records from the `main.osdu-gl.osdu.aws` environment, so if you have access to it, you should be able to reproduce these queries.)
Once the records are indexed, issue a `search` query with the following payload:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICE9"
}
```
All 3 records are returned instead of 0 (there is no "OFFICE9" text in any of the 3 records).
The same happens if I use a "valid" query matching at least one record, for example:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICE4"
}
```
This also returns 3 records instead of one.
The issue seems to occur only when using a digit suffix. If I use a letter, it works properly, for example:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICEZ"
}
```
This properly returns 0 results.
I have managed to reproduce the issue directly on the Elasticsearch server using its REST API, so I think the issue is not with the Search service:
POST https://localhost:9200/opendes-search1704732571020-test-data--integration-1.0.1/_search (I'm using k8s port-forwarding to connect directly to the ES server)
with the following payload
```
{
"from": 0,
"size": 10,
"timeout": "1m",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"query_string": {
"query": "OFFICE9"
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
}
]
}
}
}
```
This returns 3 results when BagOfWords is enabled, only 1 if not.

Milestone: M22 - Release 0.25 · Mark Chance, Stanisław Bieniecki

## [Issue 138](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/138): Datetime formatting/parsing issues result in field not appearing in search index
Reported by Mark Chance · Last updated 2024-03-04

**Subject:** Certain "date" type attributes unavailable via SEARCH API but available by STORAGE API

The QA team just highlighted that some "date" related fields have gone missing again from the SEARCH API. Please note that no schema updates/changes have happened; QA (as end users) are ingesting and retrieving data (CRUD) to and from the schema.

```
{
    "kind": "tenant1:wks:work-product-component--Sheet:1.0.0",
    "query": "\"tenant1:work-product-component--Sheet:d92b4ff85fd040dba9009209e85a3c31\""
}
```
Through SEARCH:
![Search.png](/uploads/8c66dfb8d694298852a39f7d7eb50918/Search.png)

Through STORAGE:
![Storage.png](/uploads/7af8e15227941a43f1a3a8b6440931aa/Storage.png)

This is fixed by https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/694

Milestone: M23 - Release 0.26 · Mark Chance

## [Issue 137](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/137): String array becomes String after index
Reported by Zhibin Mai · Last updated 2024-01-24

A String array becomes a String after it is indexed. The bug appears to have been introduced by [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649).
To illustrate the problem, I used one example from Augmenter Configuration that has String array attributes.
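Independently of the screenshots below, the violated invariant can be stated as a small check (a Python sketch with hypothetical payloads; the mapper itself is Java): attributes that are arrays in the storage record must remain arrays in the index document.

```python
def lost_array_attributes(storage_data, index_data):
    """Names of attributes that are arrays in storage but not in the index doc."""
    return [name for name, value in storage_data.items()
            if isinstance(value, list) and not isinstance(index_data.get(name), list)]


# Hypothetical payloads illustrating the reported regression.
storage = {"CountryNames": ["Austria", "Germany"], "WellUWI": "4242"}
good_index = {"CountryNames": ["Austria", "Germany"], "WellUWI": "4242"}
bad_index = {"CountryNames": "Austria Germany", "WellUWI": "4242"}  # array collapsed

print(lost_array_attributes(storage, good_index))  # []
print(lost_array_attributes(storage, bad_index))   # ['CountryNames']
```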
- Storage Format of part of data payload:
![image](/uploads/9dacff15a729788fffb02e916b704569/image.png)
- Index (document) Format of part of data payload returned by method in class StorageIndexerPayloadMapper
```
public Map<String, Object> mapDataPayload(ArrayList<String> asIngestedCoordinatesPaths, IndexSchema storageSchema, Map<String, Object> storageRecordData,
String recordId) {
Map<String, Object> dataCollectorMap = new HashMap<>();
//..
mapDataPayload(storageSchema.getDataSchema(), storageRecordData, recordId, dataCollectorMap);
//...
return dataCollectorMap;
}
```
![image](/uploads/dfe1df18988936c5b137c542edd58c96/image.png)
- Search result before re-index from local indexer service with the [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649):
![image](/uploads/7714272f8aa0286c90b278e7546d8b33/image.png)
- Search result after re-index from local indexer service with the [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649):
![image](/uploads/b51dbecdc83cc6279b71017d1f8f1b61/image.png)

Milestone: M22 - Release 0.25 · Mark Chance, Stanisław Bieniecki

## [Issue 136](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/136): The augmented attributes are not searchable
Reported by Zhibin Mai · Last updated 2024-01-26

A common issue, as mentioned in [IndexAugmenter.md](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/docs/tutorial/IndexAugmenter.md): the augmented attributes are not searchable until the records of the augmented kind(s) have been re-indexed.
It is understandable that augmenting existing records requires re-indexing the existing records of the augmented kind(s). However, there is a common scenario in which data admins/managers want to verify the effect of an augmenter configuration immediately after deploying it. They normally don't have permission to trigger a re-index; moreover, in this scenario they should not trigger a re-index of the whole kind(s) before they have finalized the augmenter configuration for the given kind(s).
If the indexer automatically updated the schema mapping of the augmented kind(s) in Elasticsearch whenever it detected that the augmenter configuration had changed, data admins/managers could see the effect of the configuration immediately by updating one of the existing data records or inserting a new one. This would tremendously reduce the time spent troubleshooting, as well as developing and deploying new or updated augmenter configurations.
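The lightweight operation described here would correspond to a `PUT /{index}/_mapping` call in Elasticsearch. A hedged Python sketch that only builds the request body (property names and field types are illustrative, not the service's actual code):

```python
def build_mapping_update(augmented_properties):
    """Build a request body for Elasticsearch `PUT /{index}/_mapping` that
    adds augmented properties (name -> field type) as searchable fields."""
    return {
        "properties": {
            name: {
                "type": es_type,
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
            for name, es_type in augmented_properties.items()
        }
    }


body = build_mapping_update({"CountryNames": "text"})
print(body["properties"]["CountryNames"]["type"])  # text
```

Adding new fields to an existing mapping this way does not require deleting the index, which is why it is cheap compared to a full re-index.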
Given that updating the schema mapping of the augmented kind(s) in Elasticsearch is a lightweight operation compared to re-indexing the whole kind(s), I think this enhancement is worth making.

Milestone: M23 - Release 0.26 · Zhibin Mai

## [Issue 120](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/120): Augmenter can't recursively resolve the schema (property/type pair) of the augmented properties
Reported by Zhibin Mai · Last updated 2023-12-19

The Augmenter is supposed to be able to augment properties from other augmented properties when updating the schema mapping and creating the document for Elasticsearch.
We illustrate the issue with the following examples:
### Well has augmented properties `CountryNames` from kind `osdu:wks:master-data--GeoPoliticalEntity:1.` and `WellUWI` from itself
```
{
"Name": "Well-IndexPropertyPathConfiguration",
"Description": "The index property list for master-data--Well:1., valid for all master-data--Well kinds for major version 1.",
"Code": "osdu:wks:master-data--Well:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "CountryNames",
"Policy": "ExtractAllMatches",
"UseCase": "As a user I want to find objects by a country name, with the understanding that an object may extend over country boundaries.",
"Paths": [{
"RelatedObjectsSpec": {
"RelatedConditionProperty": "data.GeoContexts[].GeoTypeID",
"RelatedConditionMatches": [
"opendes:reference-data--GeoPoliticalEntityType:Country:"
],
"RelatedObjectID": "data.GeoContexts[].GeoPoliticalEntityID",
"RelatedObjectKind": "osdu:wks:master-data--GeoPoliticalEntity:1.",
"RelationshipDirection": "ChildToParent"
},
"ValueExtraction": {
"ValuePath": "data.GeoPoliticalEntityName"
}
}
]
}, {
"Name": "WellUWI",
"Policy": "ExtractFirstMatch",
"UseCase": "As a user I want to discover and match Wells by their UWI. I am aware that this is not globally reliable, however, I am able to specify a prioritized AliasNameType list to look up value in the NameAliases array.",
"Paths": [{
"ValueExtraction": {
"RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
"RelatedConditionMatches": [
"opendes:reference-data--AliasNameType:UniqueIdentifier:",
"opendes:reference-data--AliasNameType:RegulatoryName:",
"opendes:reference-data--AliasNameType:PreferredName:"
],
"ValuePath": "data.NameAliases[].AliasName"
}
}
]
}
]
}
```
### Wellbore has augmented properties `CountryNames` and `WellUWI` from kind `osdu:wks:master-data--Well:1.`
```
{
"Name": "Wellbore-IndexPropertyPathConfiguration",
"Description": "The index property list for master-data--Wellbore:1., valid for all master-data--Wellbore kinds for major version 1.",
"Code": "osdu:wks:master-data--Wellbore:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "CountryNames",
"Policy": "ExtractFirstMatch",
"UseCase": "As a user I want to discover Wellbore instances by the well's name value.",
"Paths": [{
"RelatedObjectsSpec": {
"RelatedObjectID": "data.WellID",
"RelatedObjectKind": "osdu:wks:master-data--Well:1.",
"RelationshipDirection": "ChildToParent"
},
"ValueExtraction": {
"ValuePath": "data.CountryNames"
}
}
]
}, {
"Name": "WellUWI",
"Policy": "ExtractFirstMatch",
"UseCase": "As a user I want to discover Wellbore instances by the well's UWI value.",
"Paths": [{
"RelatedObjectsSpec": {
"RelatedObjectID": "data.WellID",
"RelatedObjectKind": "osdu:wks:master-data--Well:1.",
"RelationshipDirection": "ChildToParent"
},
"ValueExtraction": {
"ValuePath": "data.WellUWI"
}
}
]
}
]
}
```
When the indexer tries to resolve the schema for `Wellbore`, the resolved schema should include both `CountryNames` and `WellUWI`.
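The expected recursive behavior can be sketched as follows (Python; a simplified stand-in for the service's schema resolution, with kinds and property names taken from the configurations above):

```python
# kind -> {augmented property -> related kind it is pulled from, or None
# when the value comes from the record itself}; simplified from the configs.
CONFIGS = {
    "osdu:wks:master-data--Well:1.": {
        "CountryNames": "osdu:wks:master-data--GeoPoliticalEntity:1.",
        "WellUWI": None,
    },
    "osdu:wks:master-data--Wellbore:1.": {
        "CountryNames": "osdu:wks:master-data--Well:1.",
        "WellUWI": "osdu:wks:master-data--Well:1.",
    },
}


def resolve_augmented_properties(kind, seen=None):
    """Recursively collect augmented property names, following related kinds."""
    seen = set() if seen is None else seen
    if kind in seen or kind not in CONFIGS:
        return set()
    seen.add(kind)  # guard against configuration cycles
    names = set()
    for name, related_kind in CONFIGS[kind].items():
        names.add(name)
        if related_kind is not None:
            names |= resolve_augmented_properties(related_kind, seen)
    return names


print(sorted(resolve_augmented_properties("osdu:wks:master-data--Wellbore:1.")))
# ['CountryNames', 'WellUWI']
```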
However, in the current implementation, the resolved schema for `Wellbore` does not include the augmented properties `CountryNames` and `WellUWI`. As a result, these two properties are not searchable even though their values are created in the `Wellbore` records.

Assignee: Zhibin Mai

## [Issue 118](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/118): Avoid using query by cursor if possible
Reported by Zhibin Mai · Last updated 2023-12-02

In M20, we created MR [601](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/601), which tried to improve the performance of the augmenter and reduce the usage of queries with cursor. With that MR, only two places (getting related children records) still use a query with cursor.
However, queries with cursor are expensive: most Elasticsearch deployments allow at most 500 cursor queries per minute. The reason we still use them is that normal queries can return at most 10,000 records, and when fetching children records for a given set of parent records we cannot be sure the result will stay under that limit.
During stress tests with large datasets, we found lots of errors from the cursor queries when re-indexing 100k wellbores that have 5M well logs in total (each wellbore has 50 well logs on average). Based on our knowledge of the Augmenter, in more than 99% of cases the query results won't reach 10,000 records. We need a way to ensure both correctness (no results missed) and error-free queries.
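One way to combine correctness with far fewer cursor queries: run a normal query first and fall back to a cursor query only when the total hit count exceeds the 10,000 limit. A hedged sketch (Python; `search_fn` and `cursor_search_fn` are hypothetical stand-ins for the service's search client):

```python
QUERY_LIMIT = 10_000  # max hits a normal Elasticsearch query can return


def fetch_children(query, search_fn, cursor_search_fn):
    """Fetch all child records: normal query first, cursor fallback if needed.

    `search_fn(query, limit)` returns (total_count, records);
    `cursor_search_fn(query)` yields every record of the result set.
    """
    total, records = search_fn(query, QUERY_LIMIT)
    if total <= QUERY_LIMIT:
        return records                      # the >99% case: no cursor needed
    return list(cursor_search_fn(query))    # rare case: result set too large


# Tiny fake backend for illustration.
data = [f"rec{i}" for i in range(5)]
result = fetch_children("q", lambda q, n: (len(data), data[:n]), lambda q: iter(data))
print(len(result))  # 5
```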
The basic idea: the Augmenter will use normal queries by default; in case the totalCount from the query result reaches the limit (10,000), the query with cursor kicks in automatically.

Milestone: M22 - Release 0.25 · Zhibin Mai

## [Issue 114](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/114): The RelatedConditionMatches of the augmenter is not flexible
Reported by Zhibin Mai · Last updated 2023-10-03

The current implementation of RelatedConditionMatches in the augmenter has the following limitations:
1. The condition match is exact text match only. The following two cases demonstrate that regular expression matching is needed:
##### Case 1: Extend the properties from the related objects whose IDs are defined under data.LineageAssertions[].ID
```
{
"Name": "Document-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--Document:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "AssociatedFacilityNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectID": "data.LineageAssertions[].ID",
"RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:master-data\\-\\-Wellbore:[\\w\\-\\.\\:\\%]+$"
],
"RelatedConditionProperty": "data.LineageAssertions[].ID"
},
"ValueExtraction": {
"ValuePath": "data.FacilityName"
}
}
]
}, {
"Name": "AssociatedProjectNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectID": "data.LineageAssertions[].ID",
"RelatedObjectKind": "osdu:wks:master-data--SeismicAcquisitionSurvey:1.",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:master-data\\-\\-SeismicAcquisitionSurvey:[\\w\\-\\.\\:\\%]+$"
],
"RelatedConditionProperty": "data.LineageAssertions[].ID"
},
"ValueExtraction": {
"ValuePath": "data.ProjectName"
}
}
]
}
]
}
]
}
```
##### Case 2: Match the reference data values in any data partition (or ignoring the data partition)
```
{
"Name": "WellLog-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--WellLog:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "WellUWI",
"Policy": "ExtractFirstMatch",
"Paths": [{
"ValueExtraction": {
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:reference-data--AliasNameType:UniqueIdentifier:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:RegulatoryName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:PreferredName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:CommonName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:ShortName:$"
],
"RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
"ValuePath": "data.NameAliases[].AliasName"
}
}
]
}
]
}
```
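The partition-agnostic matching asked for in Case 2 is exactly what the regular expression form provides; a quick Python check (the IDs below are hypothetical):

```python
import re

# Pattern from Case 2 as it reads after JSON unescaping.
pattern = r"^[\w\-\.]+:reference-data--AliasNameType:UniqueIdentifier:$"

# Plain text matching would hardcode one partition prefix; the regex form
# matches the same reference value in any data partition.
ids = [
    "opendes:reference-data--AliasNameType:UniqueIdentifier:",
    "partition-b:reference-data--AliasNameType:UniqueIdentifier:",
    "opendes:reference-data--AliasNameType:PreferredName:",
]
print([bool(re.match(pattern, i)) for i in ids])  # [True, True, False]
```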
As required, to extend a property from a related record, the kind of the related record must be defined in the configuration. However, the Relationship type under ExtensionProperties does not define the kind of the target object. In some cases, the source record
Example: Extend the related object's name to the document, name of the related objects
2. RelatedConditionProperty is limited to a property of a one-level nested object.
In the above examples, both `data.NameAliases[].AliasNameTypeID` and `data.ExtensionProperties.Relationships[].TargetID` are properties of a one-level nested object. In some cases, RelatedConditionProperty can be a property of a multi-level nested object, for example:
```
{
"Name": "WellLog-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--WellLog:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "OrganisationNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectKind": "osdu:wks:master-data--Organisation:1.",
"RelatedObjectID": "data.TechnicalAssurances[].Reviewers[].OrganisationID",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:reference-data--ContactRoleType:ProjectManager:AccountOwner:$",
"^[\\w\\-\\.]+:reference-data--ContactRoleType:AccountOwner:$"
],
"RelatedConditionProperty": "data.TechnicalAssurances[].Reviewers[].RoleTypeID"
},
"ValueExtraction": {
"ValuePath": "data.OrganisationName"
}
}
]
}
]
}
```M21 - Release 0.24Thomas Gehrmann [slb]Zhibin MaiThomas Gehrmann [slb]https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/112ADR: Create field for case insensitive search2024-02-26T17:16:00ZMark ChanceADR: Create field for case insensitive search# ADR: Add keywordLower Index Mapping field
<a name="TOC"></a>
[[_TOC_]]
# Status
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
# Background
Application developers would like to provide their users a simple mechanism for searching, much like SQL `LIKE` queries combined with the `lower` function. Currently, none of the existing Elasticsearch fields implement this.
# Context & Scope
## Requirements
The desire is to support the following search query:
```json
{
"kind": "osdu:wks:master-data--Well:1.0.0",
"query": "data.FacilityName.keywordLower:exam*"
}
```
Which would return
```json
{
"results": [
{
"data": {
"FacilityName": "Example test"
},
"id": "osdu:master-data--Well:1012"
}
]
}
```
# Tradeoff Analysis
# Proposed solution
Add a field in the index called `keywordLower` in which all input is normalized to lower case.
For example, this mapping in master-data--Well would be created:
```json
"CurrentOperatorID": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"null_value": "null",
"ignore_above": 256
},
"keywordLower": {
"type": "keyword",
"normalizer": "lowercase",
"null_value": "null",
"ignore_above": 256
}
}
},
```
The 'keywordLower' field is added and has the additional attribute:
"normalizer": "lowercase"
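The intended behavior can be illustrated with a small simulation (Python). Elasticsearch applies the normalizer at index time; whether the query term is also normalized depends on the query type, so the sketch compares both sides in lower case explicitly:

```python
import fnmatch


def keyword_lower(value):
    """Simulate the `lowercase` normalizer: the indexed term is lowercased."""
    return value.lower()


docs = ["Example test", "EXAMPLE WELL", "Sample"]
indexed = [keyword_lower(d) for d in docs]

# A query such as data.FacilityName.keywordLower:exam* then matches
# case-insensitively, because the indexed term and the (already lower-case)
# query term are compared in lower case.
matches = [doc for doc, term in zip(docs, indexed)
           if fnmatch.fnmatchcase(term, "exam*")]
print(matches)  # ['Example test', 'EXAMPLE WELL']
```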
# Change Management
* Operators may need to re-ingest data or update the index. Is it possible to "patch" data to re-run the indexer on data already ingested?
# Decision
# Consequences
* The indexer code changes should have no noticeable impact on the system or applications (only additional property created).
* The index will be larger with the addition of the many instances of this field.
Draft MR: https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/618

Milestone: M22 - Release 0.25 · Stanisław Bieniecki

## [Issue 110](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/110): ReIndex API does not always update the schema mapping to ElasticSearch
Reported by Zhibin Mai · Last updated 2023-09-08

When an augmenter configuration is deployed, two operations are required in order to make use of the updated configuration:
1. Update the schema mapping with extended schema from the augmenter configuration to ElasticSearch
2. Re-index the records of the affected kind.
In this scenario, it is expected that users can still search the "old" data before the re-index is completed.
The current implementation of the ReIndex API does not always update the schema mapping in Elasticsearch if the forceClean option is not set to true. However, when forceClean is set to true, the original index is deleted/purged, so users may not be able to search the expected data before the new index is fully populated.

Assignee: Zhibin Mai

## [Issue 109](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/109): ADR: Full reindex API access must be elevated
Reported by Neha Khandelwal · Last updated 2023-12-01

[[_TOC_]]
# Status
* [x] Proposed
* [x] Trialing
* [x] Under review
* [ ] Approved
* [ ] Retired
# Context & Scope
The expected use case for the full reindex API is the disaster recovery scenario, as it reindexes everything in a data partition.
Currently, full reindex API access is set to the same level as the other reindex APIs. Because of this, users with **users.datalake.admin** permission can **accidentally** trigger a full reindex. To make matters worse, there is no API to cancel an ongoing re-index, so this operation can run for hours or days depending on the data-partition size, with an impact on cost and service performance.
# Requirements
We need to elevate the permission level for the full reindex API so that users with Admin access cannot accidentally trigger a full reindex.
# Tradeoff Analysis
This will be a breaking change, but it should have low impact as this API is used very rarely.
# Solution
The proposed solution is that the permission level for full reindex API should be elevated and set to **users.datalake.ops**.
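A hedged sketch of the elevated check (Python for illustration; the group names are from this ADR, the service code itself is Java):

```python
# Required group per API after the change (names taken from the ADR text;
# the API keys here are illustrative, not the service's route names).
REQUIRED_PERMISSION = {
    "reindex_kind": "users.datalake.admin",
    "reindex_full": "users.datalake.ops",  # elevated by this ADR
}


def is_allowed(api, user_groups):
    """Allow the call only if the caller carries the API's required group."""
    return REQUIRED_PERMISSION[api] in user_groups


admin_groups = {"users.datalake.admin"}
ops_groups = {"users.datalake.admin", "users.datalake.ops"}
print(is_allowed("reindex_full", admin_groups))  # False: no accidental full reindex
print(is_allowed("reindex_full", ops_groups))    # True
```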
# Consequences
* Change in indexer-core to Reindex API (permission elevation for full reindex) and PartitionSetup API (refactor)
* Indexer service documentation needs to be updated
# ADR Comments Below

Milestone: M21 - Release 0.24

## [Issue 108](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/108): Poor performance for index augmenter
Reported by Zhibin Mai · Last updated 2023-08-24

Though we have made several enhancements related to the index augmenter, directly or indirectly (creating a separate re-index topic, splitting big messages of 1,000 records into small messages of 50 records to support parallel indexing, etc.), we still found that indexing performance with the augmenter enabled is much worse than with it disabled. For example, for WellLog with multiple extension configurations, performance with the augmenter enabled is about 15 times slower than with it disabled.
With augmenter enabled,
1. Indexing one record individually requires 8 queries per record (for the given property configurations) to gather all the information needed to populate the extended properties. In this test, the cache does not take effect at all.
2. Indexing a kind with 291 WellLog records requires 6.8 queries per record on average. In this test case, the cache should play an important role; however, we found the cache mechanism has little effect.
As I ran the tests locally, the search latency is about 1.5 times that of a cloud environment. I estimate that performance with the augmenter enabled would still be about 10 times slower if we don't make any enhancement.

Milestone: M20 - Release 0.23 · Zhibin Mai

## [Issue 95](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/95): ADR: Index AsIngestedCoordinates
Reported by Keith Wall · Last updated 2024-03-18

# ADR: Index AsIngested Coordinates
@chad @gehrmann @Keith_Wall @LFlakes @josh.townsend @lifeiliu @Java1Guy @srabanaguha
<a name="TOC"></a>
[[_TOC_]]
# Status
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
# Background
- Discussed in OSDU Geomatics Integration workstream and supported by Shell, BP, Exxon and Equinor geomatics representatives.
- Discussed during AAF 2023-06-07 (led by Josh Townsend; there is a recording and limited notes).
- Further discussion in issue !95 (this issue)
- Which refers to related issues:
- #62 (1 year ago; reporting in M12 the AsIngestedCoordinates are not returned; kept open but with answer that GET storage can be used to retrieve the original record.)
- [Issue 70 on geomatics board](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/geomatics/home/-/issues/70) (this is only a placeholder pointing to this issue 95 for monitoring; it has some interesting comments that really belong here, as follows:)
- _"The Architectural Advice Forum did not endorse indexing the AsIngested Coordinates as spatial objects that would permit spatial search, but that is not needed or requested. As we discussed, there are at least two options that would allow return of the coordinates from search: (1) Index AsIngested as an ordinary array, or (2) Add data needed for search as extended properties."_
- _"Thomas @gehrmann and I discussed and agreed the most robust solution is to index the AsIngested coordinates and CRS as a simple array, not a spatial object."_
This ADR write-up, by Bert Kampes at the request of Chad Leong, is to help Shell developers get a clear idea of the proposed changes/specification, distilled from the above sources. The "way forward" solution is agreed, but it is not yet marked "Approved" until comments are received on this ADR specification design.
# Context & Scope
AsIngestedCoordinates are currently not returned by search; only the Wgs84Coordinates are (after normalization of ingested data that has an AbstractSpatialLocation). These Wgs84Coordinates are in a GeoJSON structure and can potentially contain a geometry with many vertices. At some point in the past, a determination was made in OSDU architecture that returning AsIngestedCoordinates would not be necessary. It is true that Wgs84Coordinates are normalized and used for search. However, AsIngestedCoordinates and the CRS are important properties to have available in Search results, for example for a list of wells.
The Geomatics Workstream and others have commented that AsIngestedCoordinates were not returned as was expected.
We learned AsIngestedCoordinates were omitted by design out of fear of performance degradation and because these coordinate values are not used for searches in most use cases. (However, they are used for discovery and QC across records, and existing solutions typically do allow search with logical operators.)
A use case is Well records. A developer may want to show a user all the wells from a platform in a table, where one of the properties is the original coordinates and CRS. Currently this is only possible by retrieving each record through storage; it would be more efficient to have it returned by Search. Well master data do not have an associated data file, such as a Wellbore might have in the form of a path in witsml.
Another use case is ingesting data without a BoundCRS, i.e., data that cannot be normalized to Wgs84. Then it is useful to have the original coordinates in the array so someone can see there were coordinates even though no Wgs84 coordinates were normalized.
See also attached pptx from AAF and description and comments on issue !95.
[Back to TOC](#TOC)
## AbstractSpatialLocation
* [link to AbstractSpatialLocation](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/abstract/AbstractSpatialLocation.1.1.0.md), which has:
* Quality metadata
* And includes [AbstractAnyCrsFeatureCollection](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/abstract/AbstractAnyCrsFeatureCollection.1.1.0.md?ref_type=heads), _A schema like GeoJSON FeatureCollection with a non-WGS 84 CRS context; based on https://geojson.org/schema/FeatureCollection.json. Attention: the coordinate order is fixed: Longitude/Easting/Westing/X first, followed by Latitude/Northing/Southing/Y, optionally height as third coordinate_, which has:
* features[].geometry.type: Point, MultiPoint, LineString, MultiLineString, Polygon or MultiPolygon.
* features[].geometry.coordinates (array),
* And properties for
* CoordinateReferenceSystemID
* persistableReferenceCrs
* VerticalCoordinateReferenceSystemID
* persistableReferenceVerticalCrs
* VerticalUnitID
* persistableReferenceUnitZ
[Back to TOC](#TOC)
## Requirements
In addition to the simplified Elastic GeoJSON derived from Wgs84Coordinates that are currently already returned (i.e., no change to Wgs84Coordinates):
* (Efficient) method to see the first AsIngested Coordinates, with their horizontal (and possibly vertical) CRS(s), and specific metadata on location quality (which are part of the AbstractSpatialLocation entity).
* It is expected that the first point coordinates are returned in search query responses if desired.
* The string properties are expected to be
1. usable in queries and
2. be returned in search query responses if desired.
* The coordinates of the first point are
1. numbers (in JSON speak floating point numbers), AsIngestedCoordinates.FirstPoint.X, AsIngestedCoordinates.FirstPoint.Y, AsIngestedCoordinates.FirstPoint.Z.
2. It is expected that the numbers can be used in simplistic box queries, provided the AsIngestedCoordinates.CoordinateReferenceSystemID (and AsIngestedCoordinates.VerticalCoordinateReferenceSystemID for 3D) are part of the query condition.
3. It is expected that the first point coordinates are returned in search query responses if desired.
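The box-query requirement above can be illustrated with a sketch of a Search request body, built here as a Python dict. The flat field names follow this proposal; the Lucene-style query syntax and the exact request shape are assumptions about the Search API, not a confirmed contract.

```python
# Sketch of a "simplistic box query" over the proposed flat fields.
# The CRS ID must be part of the condition, since X/Y values are only
# comparable within a single coordinate reference system.
def box_query(x_min, x_max, y_min, y_max, crs_id):
    """Build a hypothetical Search request restricted to one CRS."""
    return {
        "kind": "*:*:master-data--Well:*",
        "query": (
            f'data.AsIngestedCoordinates.CoordinateReferenceSystemID: "{crs_id}" '
            f"AND data.AsIngestedCoordinates.FirstPoint.X: [{x_min} TO {x_max}] "
            f"AND data.AsIngestedCoordinates.FirstPoint.Y: [{y_min} TO {y_max}]"
        ),
        "returnedFields": [
            "data.AsIngestedCoordinates.FirstPoint.X",
            "data.AsIngestedCoordinates.FirstPoint.Y",
            "data.AsIngestedCoordinates.CoordinateReferenceSystemID",
        ],
    }
```

The returnedFields list shows how a client could ask for only the flat coordinate properties instead of the full geometry.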
[Back to TOC](#TOC)
# Tradeoff Analysis
Discussion yielded that returning AsIngestedCoordinates as properties in the Search query response, only for the first point, and with some other SpatialLocation metadata is the correct tradeoff to satisfy Geomatics use cases and not burden the indexer performance or memory.
[Back to TOC](#TOC)
# Proposed solution (to be analyzed and implemented by Shell developers)
* The following approach is proposed. It says "proposed" because I am not intimately familiar with the code or all possible gotchas that you may run into when developing. It mainly describes, from an end-user perspective, what needs to be returned.
For a record being ingested, for example a Well that may somehow have following AsIngestedCoordinates:
```json
"data": {
// Pseudo json follows. feel free to replace with a real example
// AbstractSpatialLocation
"SomeLocation": {
"SpatialLocationCoordinatesDate": "2023-02-19",
"QuantitativeAccuracyBandID": "<1 m",
"QualitativeSpatialAccuracyTypeID": "Checked: Approved",
"CoordinateQualityCheckPerformedBy": "Bert",
"CoordinateQualityCheckDateTime": "2023-01-19",
"CoordinateQualityCheckRemarks": [
"good",
"really",
"vertical is good too"
],
"AppliedOperations": [
"conversion from ED_1950_UTM_Zone_31N to GCS_European_1950; 1 points converted",
"transformation GCS_European_1950 to GCS_WGS_1984 using ED_1950_To_WGS_1984_24; 1 points successfully transformed"
],
"SpatialParameterTypeID": "Outline",
"SpatialGeometryTypeID": "Point"
},
// AbstractAnyCrsFeatureCollection
"AsIngestedCoordinates": {
"CoordinateReferenceSystemID": "osdu:reference-data--CoordinateReferenceSystem:BoundProjected:EPSG::32021_EPSG::15851:",
"VerticalCoordinateReferenceSystemID": "osdu:reference-data--CoordinateReferenceSystem:Vertical:EPSG::5714:",
"VerticalUnitID": "osdu:reference-data--UnitOfMeasure:m:",
"persistableReferenceCrs": "{\"authCode\":{\"auth\":\"OSDU\",\"code\":\"32021079\"},\"lateBoundCRS\":{\"authCode\":{\"auth\":\"EPSG\",\"code\":\"32021\"},\"name\":\"NAD_1927_StatePlane_North_Dakota_South_FIPS_3302\",\"type\":\"LBC\",\"ver\":\"PE_10_9_1\",\"wkt\":\"PROJCS[\\\"NAD_1927_StatePlane_North_Dakota_South_FIPS_3302\\\",GEOGCS[\\\"GCS_North_American_1927\\\",DATUM[\\\"D_North_American_1927\\\",SPHEROID[\\\"Clarke_1866\\\",6378206.4,294.9786982]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],PROJECTION[\\\"Lambert_Conformal_Conic\\\"],PARAMETER[\\\"False_Easting\\\",2000000.0],PARAMETER[\\\"False_Northing\\\",0.0],PARAMETER[\\\"Central_Meridian\\\",-100.5],PARAMETER[\\\"Standard_Parallel_1\\\",46.18333333333333],PARAMETER[\\\"Standard_Parallel_2\\\",47.48333333333333],PARAMETER[\\\"Latitude_Of_Origin\\\",45.66666666666666],UNIT[\\\"Foot_US\\\",0.3048006096012192],AUTHORITY[\\\"EPSG\\\",32021]]\"},\"name\":\"NAD27 * OGP-Usa Conus / North Dakota CS27 South zone [32021,15851]\",\"singleCT\":{\"authCode\":{\"auth\":\"EPSG\",\"code\":\"15851\"},\"name\":\"NAD_1927_To_WGS_1984_79_CONUS\",\"type\":\"ST\",\"ver\":\"PE_10_9_1\",\"wkt\":\"GEOGTRAN[\\\"NAD_1927_To_WGS_1984_79_CONUS\\\",GEOGCS[\\\"GCS_North_American_1927\\\",DATUM[\\\"D_North_American_1927\\\",SPHEROID[\\\"Clarke_1866\\\",6378206.4,294.9786982]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],GEOGCS[\\\"GCS_WGS_1984\\\",DATUM[\\\"D_WGS_1984\\\",SPHEROID[\\\"WGS_1984\\\",6378137.0,298.257223563]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],METHOD[\\\"NADCON\\\"],PARAMETER[\\\"Dataset_conus\\\",0.0],OPERATIONACCURACY[5.0],AUTHORITY[\\\"EPSG\\\",15851]]\"},\"type\":\"EBC\",\"ver\":\"PE_10_9_1\"}",
"persistableReferenceVerticalCrs": "{\"authCode\":{\"auth\":\"EPSG\",\"code\":\"5714\"},\"name\":\"MSL_Height\",\"type\":\"LBC\",\"ver\":\"PE_10_9_1\",\"wkt\":\"VERTCS[\\\"MSL_Height\\\",VDATUM[\\\"Mean_Sea_Level\\\"],PARAMETER[\\\"Vertical_Shift\\\",0.0],PARAMETER[\\\"Direction\\\",1.0],UNIT[\\\"Meter\\\",1.0],AUTHORITY[\\\"EPSG\\\",5714]]\"}",
"persistableReferenceUnitZ": "{\"scaleOffset\":{\"scale\":1.0,\"offset\":0.0},\"symbol\":\"m\",\"baseMeasurement\":{\"ancestry\":\"Length\",\"type\":\"UM\"},\"type\":\"USO\"}",
"features": [ // NOTE: A well will only have a single AnyCrsPoint for the surface location, potentially 2D, rather than 3D (and then also no vertical CRS, etc.). But I added the 3D case and an additional AnyCrsLineString here just to make clear what to do in this case.
{
"type": "AnyCrsFeature",
"geometry": {
"type": "AnyCrsPoint",
"coordinates": [1500000.0, 12345678.0, 100.0]
}
},
{
"type": "AnyCrsFeature",
"geometry": {
"type": "AnyCrsLineString",
"coordinates": [[1400000.0, 12345666.0, 99.0], [1600000.0, 12345777.0, 101.0]]
}
} ],
// Wgs84 Coordinates
"Wgs84Coordinates": { etc. Not relevant }
}
}
```
The desired end result in a search query response would include the following properties. They are a direct copy of the input record's AbstractSpatialLocation fragment.
```json
{
"data": {
"AsIngestedCoordinates.FirstPoint.X": 222222.0, // Number (floating point), if given on ingest of course
"AsIngestedCoordinates.FirstPoint.Y": 111111.0, // Number.
"AsIngestedCoordinates.FirstPoint.Z": 100.0, // Number. Blank (null) unless the input had a Z value
"AsIngestedCoordinates.CoordinateReferenceSystemID": "xxx", // see note below. OSDU allows ingesting data with a PR and not with a reference to a CRS record id. What to do then?
"AsIngestedCoordinates.VerticalCoordinateReferenceSystemID": "xxx", // for the 3D Z value if in the input
"AsIngestedCoordinates.persistableReferenceCrs": "string xxx", // see note below.
"AsIngestedCoordinates.persistableReferenceVerticalCrs": "string xxx",
"AsIngestedCoordinates.persistableReferenceUnitZ": "string xxx",
"AsIngestedCoordinates.QuantitativeAccuracyBandID": "xxx",
"AsIngestedCoordinates.QualitativeSpatialAccuracyTypeID": "xxx",
"AsIngestedCoordinates.CoordinateQualityCheckPerformedBy": "xxx",
"AsIngestedCoordinates.CoordinateQualityCheckDateTime": "xxx",
"AsIngestedCoordinates.CoordinateQualityCheckRemarks[]": "(string array)",
"AsIngestedCoordinates.AppliedOperations[]": "(string array)"
}
}
```
Note:
* AsIngestedCoordinates.FirstPoint.Type is not needed because Wgs84Coordinates will have the original type. Though perhaps it is useful to know, in case the FirstPoint came from something like an "AnyCrsMultiPoint".
* AsIngestedCoordinates.SpatialLocationCoordinatesDate is not needed because the QC time is already there and this field is more for plate motion, which seems not needed at the moment. We could add it though.
[Back to TOC](#TOC)
## Accepted Limitations / things to work out
The following are some accepted limitations of the proposed solution, e.g., that we agree to index only the first point in a flat array and not as a geometry, for reasons of performance. There are also some questions which the developers will have to contemplate and propose a solution for (which may be that there is no solution).
* Only the first point of the AsIngested geometry is indexed if the geometry contains more than one point.
* It may be useful to add a switch or flag to search so the user can decide when to include the geometry in the response (I would argue then both Wgs84 and AsIngested); it is fine if they are returned by default but can be omitted. But I expect this is already the case using the ReturnedFields.
  - In itself it seems not a bad option to omit the geometry by default, because it can be large for 2D lines and so on. But that is not the intention of this issue.
* What to do if the ingested Geometry is complex?
* _It is not relevant to the implementation, but please clarify whether AnyCrsFeatureCollection can indeed contain both Points and LineStrings (for example) or has to contain only a single feature. The name "collection" suggests it can be a complex combination of types._
* If the AsIngested geometry contains multiple types or a oneOf, then
- Index the Point if it exists, else the first point of a MultiPoint, else the first point of a LineString, else of a MultiLineString, else of a Polygon, else of the MultiPolygon (else nothing, there is no geometry!).
* What to do if there is a PR but no CRS id on input?
- Option 1 is to not return the CRS and no coordinates but that is not satisfactory.
- Option 2 is to not return the CRS but coordinates.
- Option 3 is to return the PR in the CRS ID field.
- Option 4 is to return the PR as PR (preferred).
- Option 5 is to look up the id of the PR (but we do not have a function for that and it would take time...). In a way this is ideal, though, but we expect people to ingest data with a (bound CRS) record id.
* Can somehow the CRS Name (Hor and Vert) be returned?
- Option 1 is no. I think we have to accept this, because the name is not part of the input.
- Option 2 is yes. Because the normalizer will print in OperationsApplied the CRS Name (at least for the horizontal which is most important).
- Option 3 is to look up the CRS by id and then retrieve some parameters (for example the PR to augment the stored and indexed record with the numerical definition used at the time of normalization; as a permanent record frozen in time what was applied at the time of ingestion - which was the original requirement in 2021 for ingested data to look up the PR and store it with the data but this was said not to be possible.)
* Can somehow the AppliedOperations be returned or not too useful to bother?
- Option 1 is yes.
- Option 2 is no.
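The first-point selection rule listed above (index the Point if it exists, else the first point of a MultiPoint, else of a LineString, and so on) could look like the following minimal Python sketch. The feature layout follows the AnyCrsFeatureCollection example earlier in this ADR; the helper itself is hypothetical, not the actual implementation.

```python
# Hypothetical sketch of the proposed first-point selection rule:
# prefer a Point, else the first point of a MultiPoint, LineString,
# MultiLineString, Polygon, MultiPolygon; else there is no geometry.
PRIORITY = ["AnyCrsPoint", "AnyCrsMultiPoint", "AnyCrsLineString",
            "AnyCrsMultiLineString", "AnyCrsPolygon", "AnyCrsMultiPolygon"]

def first_point(feature_collection: dict):
    """Return [x, y] or [x, y, z] of the highest-priority geometry, or None."""
    by_type = {}
    for feature in feature_collection.get("features", []):
        geom = feature.get("geometry") or {}
        by_type.setdefault(geom.get("type"), geom.get("coordinates"))
    for gtype in PRIORITY:
        coords = by_type.get(gtype)
        if coords is None:
            continue
        if gtype == "AnyCrsPoint":
            return coords                  # already [x, y(, z)]
        if gtype in ("AnyCrsMultiPoint", "AnyCrsLineString"):
            return coords[0]               # first vertex
        if gtype == "AnyCrsMultiLineString":
            return coords[0][0]            # first vertex of first line
        if gtype == "AnyCrsPolygon":
            return coords[0][0]            # first vertex of outer ring
        if gtype == "AnyCrsMultiPolygon":
            return coords[0][0][0]         # first vertex of first polygon
    return None
```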
[Back to TOC](#TOC)
# Change Management
* Operators may need to re-ingest data or update the index. Is it possible to "patch" data to re-run the indexer on data already ingested?
# Decision
* Implement by Shell developers working on Search Service.
# Consequences
* The indexer code changes should have no noticeable impact on the system or applications (only additional properties returned).
[Back to TOC](#TOC)
#EOF.M22 - Release 0.25Mark ChanceMark Chancehttps://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/94ADR: Replay API2023-12-13T11:36:34ZAkshat JoshiADR: Replay API
<a name="ppadhi"></a>OSDU - Replay API
# Table of Contents
[Context ](#_toc119676063)
[Decision ](#_toc119676075)
[Design ](#_toc119676076)
[Requirements to address ](#_toc119676077)
## Status
* [x] Proposed
* [ ] Trialing
* [ ] Under review
* [ ] Approved
* [ ] Retired
## <a name="_toc119676063"></a>Context
This ADR is centered on the implementation of the new Replay API within OSDU's Storage service. The purpose of the Replay API is to publish messages that indicate changes to records, which are subsequently received and processed by consumers. Note that consumers must handle these messages idempotently.
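As a minimal illustration of the idempotency requirement above, a consumer could deduplicate on record id and version, so that replayed messages that duplicate earlier record-change messages become no-ops. The message fields used here are assumptions, not the actual Replay message contract.

```python
# Hypothetical sketch of idempotent handling of replayed record-change
# messages: the consumer tracks the last version it applied per record id
# and skips anything it has already seen.
class IdempotentConsumer:
    def __init__(self):
        self.applied = {}    # record id -> last applied version
        self.processed = 0

    def handle(self, message: dict) -> bool:
        """Apply the message once; return True if it caused work."""
        record_id, version = message["id"], message["version"]
        if self.applied.get(record_id, -1) >= version:
            return False     # duplicate or stale: replay is a no-op
        self.applied[record_id] = version
        self.processed += 1  # stand-in for the real indexing work
        return True
```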
## <a name="_toc119676075"></a>Decision
The Replay API will address the following:
a) **Disaster recovery -** All records in storage are brought back to RPO (Recovery Point Objective) state.
b) **Responsibility of publishing record change messages for consumer services** -
1. **Indexer Service** - The Indexer service will be the consumer of the reindex events.
2. **Schema Service**- Correction of indices after changes to structure of the storage records of a particular kind.
## <a name="_toc119676076"></a>Design
The following options were considered for the design:
|**Options**|**Pro**|**Cons**|**Work Required**|
| :- | :- | :- | :- |
|1. Using **Airflow** + Message Broker + Storage Service + Workflow Service|<p>- Proven workflow engine</p><p>- Fewer new implementations in the Storage service, so less work required by other CSPs.</p>|<p>- The process becomes slower and less efficient.</p><p>- Many HTTP calls between Airflow and AKS</p><p>- Airflow will require access to internal infrastructure to operate in the most efficient manner.</p><p>- Some required features are not yet available in ADF Airflow</p><p>- Parallelization may spawn thousands of tasks waiting to be scheduled. **Scalability can be an issue.**</p><p>- Concurrency and safety guarantees are tricky (allowing no more than one reindex per kind)</p>|<p>**Airflow**</p><p>- DAG using TaskGroups, Dynamic Task Mapping, concurrency handling.</p><p>- Build pipelines to integrate the new DAG.</p><p>**Storage Service**</p><p>- Implement new APIs to publish messages to the message broker.</p><p>**Indexer Service**</p><p>**Workflow Service**</p><p>- New APIs to support observability</p><p>- Design for checkpointing</p>|
|2. Using **Storage Service** + **Message Broker**|<p>- Simple; fewer moving parts</p><p>- Fast and efficient</p>|- Parallelization may require state management.|<p>**Storage Service**</p><p>- New APIs exposing the Replay functionality (ReplayAll, ReplayKind, GetReplayStatus)</p><p>- New modules for replay message processing</p><p>**Indexer Service**</p><p>- Delete-all-kinds API</p>|
**Design Approach for option 2:**
![Aspose.Words.71972436-70f7-48df-8f1c-d2035f55ce34.004](/uploads/362b2ef367dc8e21657ba87f7777c60d/Aspose.Words.71972436-70f7-48df-8f1c-d2035f55ce34.004.png)
**Implementation Steps:**
Attaching the swagger yaml describing the Replay API.
[ReplayAPI_2.0.yaml](/uploads/2337ac52ea50c34ae50937c7086bfb9e/ReplayAPI_2.0.yaml)Akshat JoshiAkshat Joshihttps://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/93()2023-07-02T23:01:20ZAkshat Joshi()()()https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/91Use specific topic instead of the storage record change topic to send the re-...2024-03-01T12:04:33ZZhibin MaiUse specific topic instead of the storage record change topic to send the re-index eventsIn current implementation of Azure indexer, re-index events share the same topic of the storage record change events. It creates several kinds of problems:
1. It creates unnecessary load on the storage service, as many other services monitor the storage change events and react, e.g. data sync with external datastores
2. It could affect index/re-index performance if the storage service is busy
3. It creates unnecessary duplicate copies of the data, e.g. multiple copies/versions of wks records with the exact same content could be created
4. Events generated from re-index or index extension could block storage record change events, which could impact SLO requirements in terms of index update latency
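As a minimal sketch of the fix for these problems, re-index events would be routed to a dedicated topic rather than the shared record-change topic, so replays do not compete with live ingestion traffic. The topic names and event shape here are assumptions for illustration.

```python
# Hypothetical sketch: route re-index events to a dedicated topic instead
# of the shared storage record-change topic.
RECORDS_CHANGED_TOPIC = "records-changed"  # live storage change events
REINDEX_TOPIC = "reindex"                  # dedicated re-index events

def route_event(event: dict) -> str:
    """Pick the publish topic based on the event's origin."""
    if event.get("operation") == "reindex":
        return REINDEX_TOPIC
    return RECORDS_CHANGED_TOPIC
```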
We should use a dedicated re-index topic to send and receive the re-index events.M19 - Release 0.22Zhibin MaiZhibin Maihttps://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/90ADR: new reindex API to reindex the given records2023-10-03T14:39:44ZMingyang ZhuADR: new reindex API to reindex the given records
## Status
- [ ] Proposed
- [ ] Trialing
- [ ] Under review
- [X] Approved
- [ ] Retired
## Context
As of now, the indexer has a reindex API that reindexes a whole given kind. The API is useful in scenarios where index data needs to be migrated because of bug fixes, new indexer features, etc. Sometimes it may not be necessary to reindex the entire kind if we know the exact impact, so it would be good to have a reindex API that reindexes only the given records.
The use cases of the new API could be:
1. If there is an indexer bug or a new indexer feature deployed, and we know exactly which records were impacted, we could use such an API to reindex only those records
2. When a user ingests data and the data is successfully created in storage but fails to be indexed for any reason, an application could use such an API to manually fix the impacted records instead of reindexing the whole kind
## API spec
```yaml
paths:
  "/api/indexer/v2/reindex/records":
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ReindexRecordsRequest'
components:
  schemas:
    ReindexRecordsRequest:
      type: object
      properties:
        recordIds:
          type: array
          items:
            type: string
          example: ["recordId1", "recordId2"]
```
## Limit
We will initially limit the number of records per request to 1,000.
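Given this limit, a caller with a larger record set would need to batch its requests. A minimal client-side sketch follows; the endpoint path comes from the spec above, while the helper functions are hypothetical.

```python
# Hypothetical client-side batching for the proposed
# POST /api/indexer/v2/reindex/records endpoint, respecting the
# initial 1,000-record limit per request.
MAX_RECORDS_PER_REQUEST = 1000

def chunk(record_ids, size=MAX_RECORDS_PER_REQUEST):
    """Split record ids into request-sized batches."""
    return [record_ids[i:i + size] for i in range(0, len(record_ids), size)]

def build_requests(record_ids):
    """One POST body per batch, matching the ReindexRecordsRequest schema."""
    return [{"recordIds": batch} for batch in chunk(record_ids)]
```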
M19 - Release 0.22Mingyang ZhuMingyang Zhu