Indexer issues
https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues
2020-10-09T21:29:54Z

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/6
Branch merged to master with broken AWS integration tests
2020-10-09T21:29:54Z
Matt Wise

!29 was merged to master with a failing pipeline. The AWS tests were passing before the merge, but the merge was completed even though the pipeline failed.
The breakage was the result of changing the [testing/indexer-test-core/pom.xml](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/testing/indexer-test-core/pom.xml) os-core-common version from 0.3.6 to 0.3.12 which seems to have changed the dependency for jackson data mapper (which is required by AWS)
This change should have had CSP approval since it touches core code, but it was not approved, nor was the pipeline passing first as required.

M1 - Release 0.1
David Diederich d.diederich@opengroup.org, ethiraj krishnamanaidu, Dania Kodeih (Microsoft), Joe, Daniel Scholl, Matt Wise, David Diederich d.diederich@opengroup.org

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/2
Elasticsearch version upgrade
2023-08-18T15:51:32Z
ethiraj krishnamanaidu

Current Version
- Elastic Server: 6.8.1
- Elastic Client version(OSDU Indexer service): 6.6.1
Proposed Version upgrade
* Elastic Server: 7
* Elastic Client version(OSDU Indexer service): 7
We need to upgrade the client and the Elastic server version. This would require the following changes:
* Update the Indexer service (code change plus library version upgrade).
* Upgrade the Elastic server in all clouds (AWS, Azure, Google, IBM, etc.).

M1 - Release 0.1
ethiraj krishnamanaidu, Dmitriy Rudko, ethiraj krishnamanaidu

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/1
[Indexer] Support for indexing documents with nested arrays of objects
2024-01-11T12:28:02Z
Gary Murphy

JSON documents with nested arrays of objects are not currently indexed by the Indexer. The capability needs to be added so that search queries on such documents can be executed. Because allowing too many levels of nested arrays to be searched causes performance issues, it is proposed that limits be put on the number of levels allowed for nested indexing. Additionally, in cases where an abstract base schema is defined for the attribute type (example: AbstractFacilityEvent in AbstractFacility.json), the indexer should only support indexing the abstract base schema entities and not extensions added to the concrete definition.

M1 - Release 0.1

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/38
Indexer not able to handle three coordinates
2021-10-25T20:52:20Z
An Ngo

Sample record:
```
"data": {
"SpatialLocation": {
"AsIngestedCoordinates": {
"features": [
{
"geometry": {
"coordinates": [
313405.9477893702,
6544797.620047403,
6.561679790026246
],
"bbox": null,
"type": "AnyCrsPoint"
},
"bbox": null,
"properties": {},
"type": "AnyCrsFeature"
}
],
"bbox": null,
"properties": {},
"persistableReferenceCrs": "reference",
"persistableReferenceUnitZ": "reference",
"type": "AnyCrsFeatureCollection"
}, "Wgs84Corrdinates": {
"type": "FeatureCollection",
"bbox": null,
"features": [
{
"type": "Feature",
"bbox": null,
"geometry": {
"type": "Point",
"bbox": null,
"coordinates": [
5.7500000010406245,
59.000000000399105,
1.9999999999999998
]
},
"properties": {}
}
],
"properties": {},
"persistableReferenceCrs": null,
"persistableReferenceUnitZ": "reference"
},
"msg": "testing record 2",
"X": 16.00,
"Y": 10.00,
"Z": 0
}
}
```
M9 - Release 0.12

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/37
Indexer returns only LAST geometry for "type": "AnyCrsGeometryCollection"
2021-10-25T20:52:46Z
An Ngo

The Indexer does not return all the converted geometries for "type": "AnyCrsGeometryCollection"; it returns only the last geometry.
Sample record:
```
"AsIngestedCoordinates": {
"features": [
{
"geometry": {
"type": "AnyCrsGeometryCollection",
"bbox": null,
"geometries": [
{
"type": "Point",
"bbox": null,
"coordinates": [
500000.0,
7000000.0
]
},
{
"type": "LineString",
"bbox": null,
"coordinates": [
[
501000.0,
7001000.0
],
[
502000.0,
7002000.0
]
]
}
]
},
"bbox": null,
"properties": {},
"type": "AnyCrsFeature"
}
],
"bbox": null,
"properties": {},
"persistableReferenceCrs": "{\"lateBoundCRS\":{\"wkt\":\"PROJCS[\\\"ED_1950_UTM_Zone_32N\\\",GEOGCS[\\\"GCS_European_1950\\\",DATUM[\\\"D_European_1950\\\",SPHEROID[\\\"International_1924\\\",6378388.0,297.0]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],PROJECTION[\\\"Transverse_Mercator\\\"],PARAMETER[\\\"False_Easting\\\",500000.0],PARAMETER[\\\"False_Northing\\\",0.0],PARAMETER[\\\"Central_Meridian\\\",9.0],PARAMETER[\\\"Scale_Factor\\\",0.9996],PARAMETER[\\\"Latitude_Of_Origin\\\",0.0],UNIT[\\\"Meter\\\",1.0],AUTHORITY[\\\"EPSG\\\",23032]]\",\"ver\":\"PE_10_3_1\",\"name\":\"ED_1950_UTM_Zone_32N\",\"authCode\":{\"auth\":\"EPSG\",\"code\":\"23032\"},\"type\":\"LBC\"},\"singleCT\":{\"wkt\":\"GEOGTRAN[\\\"ED_1950_To_WGS_1984_23\\\",GEOGCS[\\\"GCS_European_1950\\\",DATUM[\\\"D_European_1950\\\",SPHEROID[\\\"International_1924\\\",6378388.0,297.0]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],GEOGCS[\\\"GCS_WGS_1984\\\",DATUM[\\\"D_WGS_1984\\\",SPHEROID[\\\"WGS_1984\\\",6378137.0,298.257223563]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433]],METHOD[\\\"Position_Vector\\\"],PARAMETER[\\\"X_Axis_Translation\\\",-116.641],PARAMETER[\\\"Y_Axis_Translation\\\",-56.931],PARAMETER[\\\"Z_Axis_Translation\\\",-110.559],PARAMETER[\\\"X_Axis_Rotation\\\",0.893],PARAMETER[\\\"Y_Axis_Rotation\\\",0.921],PARAMETER[\\\"Z_Axis_Rotation\\\",-0.917],PARAMETER[\\\"Scale_Difference\\\",-3.52],AUTHORITY[\\\"EPSG\\\",1612]]\",\"ver\":\"PE_10_3_1\",\"name\":\"ED_1950_To_WGS_1984_23\",\"authCode\":{\"auth\":\"EPSG\",\"code\":\"1612\"},\"type\":\"ST\"},\"ver\":\"PE_10_3_1\",\"name\":\"ED50 * EPSG-Nor N62 2001 / UTM zone 32N [23032,1612]\",\"authCode\":{\"auth\":\"SLB\",\"code\":\"23032023\"},\"type\":\"EBC\"}",
"persistableReferenceUnitZ": "{\"baseMeasurement\":{\"ancestry\":\"Length\",\"type\":\"UM\"},\"scaleOffset\":{\"offset\":0.0,\"scale\":0.3048},\"symbol\":\"ft\",\"type\":\"USO\"}",
"type": "AnyCrsFeatureCollection"
}
```
M9 - Release 0.12

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/43
Records of a new Kind can be unsearchable due to race condition
2022-08-23T13:30:00Z
Nitin-slb

If an Elasticsearch index (with proper schema definition and mapping) is not created ahead of record ingestion via Storage, Elasticsearch creates a default index mapping when the Indexer processes the first records with that schema. The Search service does not work with this default mapping.
Creating an index with proper mapping and making a shard ready typically takes a few seconds, and an issue has been noticed when multiple Indexer service instances try to index a new kind. One instance will try to create the index, while another instance will see the index as created and start indexing with the default mapping. This makes the kind/entity unsearchable.
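One way to picture a guard against this race is sketched below in Python. It is a minimal in-memory illustration, not the Indexer's actual code: `InMemoryIndexStore`, `ensure_index`, `index_record`, and `ingest` are hypothetical names, and a single-process lock stands in for whatever cross-instance coordination a real deployment would need. The only point is that no record is written until the schema-derived mapping exists.

```python
import threading
from typing import Dict, List


class InMemoryIndexStore:
    """Hypothetical stand-in for Elasticsearch, for illustration only."""

    def __init__(self) -> None:
        self.mappings: Dict[str, str] = {}
        self.docs: Dict[str, List[dict]] = {}
        self._lock = threading.Lock()

    def ensure_index(self, kind: str) -> None:
        # Check and create under one lock so that no other thread can
        # observe a half-created index and fall back to a default mapping.
        with self._lock:
            if kind not in self.mappings:
                self.mappings[kind] = "schema-derived"
                self.docs[kind] = []

    def index_record(self, kind: str, record: dict) -> None:
        # Every writer makes sure the proper mapping exists before indexing.
        self.ensure_index(kind)
        with self._lock:
            self.docs[kind].append(record)


def ingest(store: InMemoryIndexStore, kind: str, count: int, workers: int) -> None:
    """Index `count` records from each of `workers` concurrent threads."""

    def work(worker_id: int) -> None:
        for i in range(count):
            store.index_record(kind, {"id": f"{worker_id}-{i}"})

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Across several Indexer instances an in-process lock is not enough; the same check-then-create step would have to be coordinated externally, or writers would have to wait until the proper mapping is in place.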
Simple (and common) scenario:
- An ingestion job is created that uses a new kind for the incoming records.
- The ingestion job starts using multiple threads.
- When the first indexer thread encounters the new kind on the incoming records, the index needs to be created, and index creation starts.
- In the few seconds the first indexer thread is creating the "real" index, other threads process N records (likely 1.5 * number of seconds for index creation + # of threads) using the default mapping.
- The N records created using the default mapping are unusable.

M11 - Release 0.14

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/34
Reindex API not working
2022-06-22T16:33:43Z
Neelesh Thakur

The Reindex API fails for most requests with the following error message:
```json
{
"code": 415,
"reason": "Unsupported media type",
"message": "upstream server responded with unsupported media type: text/html"
}
```
This happens for most requests on larger data sets (> 100K records), and the reindex never finishes.
The Reindex API uses a different Storage API to retrieve records and, per the log message, reindexing stops working if it encounters a non-JSON response from the Storage service.
Another insight: the Reindex API uses the Storage "get records by kind" API with a cursor. The cursor returned by this API is at times too long; when we make a request to this API with such a cursor (tried with the Postman REST client), we get a non-JSON 400 response.
```html
<!doctype html>
<html lang="en">
<head>
<title>HTTP Status 400 – Bad Request</title>
<style type="text/css">
body {
font-family: Tahoma, Arial, sans-serif;
}
h1,
h2,
h3,
b {
color: white;
background-color: #525D76;
}
h1 {
font-size: 22px;
}
h2 {
font-size: 16px;
}
h3 {
font-size: 14px;
}
p {
font-size: 12px;
}
a {
color: black;
}
.line {
height: 1px;
background-color: #525D76;
border: none;
}
</style>
</head>
<body>
<h1>HTTP Status 400 – Bad Request</h1>
</body>
</html>
```
M11 - Release 0.14

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/70
[Azure] Jackson conflic dependencies
2022-07-27T01:01:25Z
Ernesto Gutierrez

Jackson XML version 2.11.4 conflicts with the new Jackson core 2.13.2. This causes intermittent indexer behavior: some entities are indexed and others are not.

M12 - Release 0.15

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/63
[Bug] [CRS normalization] CRS conversion with implicitly specified Z coordinate is not working if used AnyCrsMultiPolygon as a geometry type
2022-05-19T07:35:09Z
Kateryna Kurach (EPAM)

Steps to reproduce:
1. Ingest the attached manifest [3D_polygon.txt](/uploads/1a87a2fb39c473a43af6df8547e7c10c/3D_polygon.txt)
2. Execute the following request:
POST https://{{SEARCH_HOST}}/query
{
"kind": "*:*:*:*",
"limit": 300,
"query": "id: \"odesprod:work-product-component--SeismicBinGrid:12may3Dpolygon\""
}
Expected result:
Coordinates are transformed into the WGS84 CRS and are present in the output.
Actual result:
No coordinate information is displayed.

M12 - Release 0.15
Rustam Lotsmanenko (EPAM) rustam_lotsmanenko@epam.com, Rustam Lotsmanenko (EPAM) rustam_lotsmanenko@epam.com

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/71
Indexer not creating new index in Elasticsearch when new schema is added
2022-09-12T21:55:55Z
Yifei Xu

It was noticed that Elasticsearch indexes are not created when we register a schema. Instead, they are created when we ingest the data for the first time. Index mappings are created automatically based on the ingested record, not based on the schema. Due to this behavior, many attributes and data types are not properly indexed.
We want to understand if this is the intended behavior in the core code logic. This was at least observed on AWS.
Steps to Reproduce:
- Create new OSDU environment with sample data (Except “osdu:wks:dataset--FileCollection.Generic:1.0.0” data)
- Search for FileCollection Schema {{osdu_base_url}}/api/schema-service/v1/schema/osdu:wks:dataset--FileCollection.Generic:1.0.0. This will return the schema structure.
- Login to Elastic search container
- Run CURL to list indices matching FileCollection curl -u elastic:<pwd> https://localhost:9200/_cat/indices -k | grep -i file
- There will not be any index for FileCollection
- Use Dataset Service to add a record for FileCollection without Data.DatasetProperties.FileSourceInfos
- Login to Elastic search container search for the index using command curl -u elastic:<pwd> https://localhost:9200/_cat/indices -k | grep -i file
- Now new index will be created for FileCollection based on the payload and not by the Schema structure.
- The index will not have any mapping for Data.DatasetProperties.FileSourceInfos
Here are some important questions:
1. Should an index be created after a new schema is created?
1. If not, how will the index be created when a record is added (for cases with and without a schema already present in the system)?
1. What should happen to the index when the schema is updated?
@fhoueto.amz @gustavurda @debasisc @chad
M14 - Release 0.17
Yifei Xu, Gustavo Urdaneta, Yifei Xu

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/74
Not possible to upgrade core-common in parent pom without migration from springfox to springdoc-openapi
2023-03-31T11:25:39Z
Rustam Lotsmanenko (EPAM) rustam_lotsmanenko@epam.com

Currently, the core-common dependency in the Indexer root pom is quite outdated and, furthermore, points to release candidate versions, which could easily be erased during the repository clean-up routine:
https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/pom.xml#L16
This core-common version propagates old spring-boot dependencies to provider modules:
~~~
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ indexer-service ---
[INFO] org.opengroup.osdu.indexer:indexer-service:pom:0.17.0-SNAPSHOT
[INFO] +- org.opengroup.osdu:os-core-common:jar:0.14.0-rc8:compile
[INFO] | +- org.springframework.boot:spring-boot-starter-web:jar:2.4.12:compile
[INFO] | | +- org.springframework.boot:spring-boot-starter:jar:2.4.12:compile
[INFO] | | | +- org.springframework.boot:spring-boot:jar:2.4.12:compile
~~~
But if we upgrade it to the latest release version `0.16.1`, it will bring in new Spring dependencies:
~~~
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ indexer-service ---
[INFO] org.opengroup.osdu.indexer:indexer-service:pom:0.17.0-SNAPSHOT
[INFO] +- org.opengroup.osdu:os-core-common:jar:0.16.1:compile
[INFO] | +- org.springframework.boot:spring-boot-starter-web:jar:2.7.2:compile
[INFO] | | +- org.springframework.boot:spring-boot-starter:jar:2.7.2:compile
[INFO] | | | +- org.springframework.boot:spring-boot:jar:2.7.2:compile
~~~
And they are not compatible with springfox, which is used for API documentation by the Indexer service:
https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/pom.xml#L150
Since springfox no longer receives updates and is not compatible with new versions of spring-boot, it will block further dependency upgrades:
https://github.com/springfox/springfox/issues/3462
The upgrade will cause runtime errors, and the Indexer service will not be able to start up:
~~~
org.springframework.context.ApplicationContextException: Failed to start bean 'documentationPluginsBootstrapper'; nested exception is java.lang.NullPointerException
at org.springframework.context.support.DefaultLifecycleProcessor.doStart(DefaultLifecycleProcessor.java:181)
at org.springframework.context.support.DefaultLifecycleProcessor.access$200(DefaultLifecycleProcessor.java:54)
at org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.start(DefaultLifecycleProcessor.java:356)
at java.lang.Iterable.forEach(Iterable.java:75)
at org.springframework.context.support.DefaultLifecycleProcessor.startBeans(DefaultLifecycleProcessor.java:155)
at org.springframework.context.support.DefaultLifecycleProcessor.onRefresh(DefaultLifecycleProcessor.java:123)
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:935)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:586)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:147)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:734)
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:408)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:308)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
at org.opengroup.osdu.indexer.IndexerGcpApplication.main(IndexerGcpApplication.java:33)
Caused by: java.lang.NullPointerException: null
at springfox.documentation.spring.web.WebMvcPatternsRequestConditionWrapper.getPatterns(WebMvcPatternsRequestConditionWrapper.java:56)
~~~
M15 - Release 0.18
Rustam Lotsmanenko (EPAM) rustam_lotsmanenko@epam.com, Rustam Lotsmanenko (EPAM) rustam_lotsmanenko@epam.com

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/78
ADR: Normalized kind indexed field
2023-09-26T06:00:27Z
Mingyang Zhu

<a name="TOC"></a>
[[_TOC_]]
# Status
- [ ] Proposed
- [x] Approved
- [ ] Retired
# Context & Scope
The schema id includes the semantic version and is indexed as "kind" in the OSDU indexer service. The Indexer indexes each "kind" as a separate index in Elasticsearch. Therefore, records from different schemas have different "kind" and "index" values in Elasticsearch, even for schemas with the same major version. So far there is no attribute that search can use directly to group (aggregateBy payload) the data by schema major version. However, an application user may want either to group all major versions of one data type or, in some cases, to care only about the latest version within the same major version. We would like to propose an approach to enable this for OSDU applications.
[Back to TOC](#TOC)
---
# Requirement
- The proposed solution should solve the index major version issue without significant performance degradation
- The proposed solution should be compatible with the existing business data that upstream OSDU applications store
[Back to TOC](#TOC)
---
## Approach 1
Elasticsearch allows passing a script to create a [runtime field](https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html) and then searching or aggregating by that field. The indexed "kind" field already has all the information; we only need to remove the minor and patch versions from it. We could solve the problem on the OSDU search side by building a pre-defined runtime field script for users to consume.
The advantage of this approach is that we don't need to re-index the existing data. However, the server needs to run the script at query time, so there is a performance cost. We ran load tests to compare aggregateBy on an indexed field and on a runtime field. The degradation is significant, about 70% slower at median and 90th-percentile latency, so we passed on this approach.
[Back to TOC](#TOC)
---
## Approach 2 (Proposed)
Taking performance into account, we have to physically index the new field. We propose to index this additional field under the [record tags](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/Chapters/06-LifecycleProperties.md#619-record-tags) field with a new sub-attribute key "normalizedKind". The value of "normalizedKind" is derived from the original "kind" value by removing the minor and patch versions. For example, a master-data--Wellbore record of kind "osdu:wks:master-data--Wellbore:1.1.0" will have a new field tags.normalizedKind with the value "osdu:wks:master-data--Wellbore:1".
- Example of how to use the new field in search query
```
{
"query": "tags.normalizedKind:\"osdu:wks:master-data--Wellbore:1\""
}
```
- Example of how to use the new field in search aggregateBy
```
{
"aggregateBy": "tags.normalizedKind"
}
```
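The derivation of the normalized kind can be sketched in a few lines of Python. `normalized_kind` is a hypothetical helper name, and the sketch assumes the kind always has the form `authority:source:entity-type:major.minor.patch`; the Indexer's real implementation may differ.

```python
def normalized_kind(kind: str) -> str:
    """Drop the minor and patch versions from a kind, keeping the major version."""
    # Split off the version, which is the part after the last colon.
    prefix, _, version = kind.rpartition(":")
    major = version.split(".", 1)[0]
    if not prefix or not major.isdigit():
        raise ValueError(f"unexpected kind format: {kind!r}")
    return f"{prefix}:{major}"
```

With the example from this ADR, `normalized_kind("osdu:wks:master-data--Wellbore:1.1.0")` yields `"osdu:wks:master-data--Wellbore:1"`.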
**This approach requires re-indexing operation during deployment to take effect on existing data.**
[Back to TOC](#TOC)

M16 - Release 0.19
Mingyang Zhu, Zhibin Mai, Mingyang Zhu

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/91
Use specific topic instead of the storage record change topic to send the re-index events
2024-03-01T12:04:33Z
Zhibin Mai

In the current implementation of the Azure indexer, re-index events share the same topic as the storage record change events. This creates several kinds of problems:
1. It creates unnecessary load on the storage service, as many other services monitor the storage change events and react, e.g. data sync with external datastores.
2. It could affect the index/re-index performance if the storage service is busy.
3. It creates unnecessary duplicate copies of the data, e.g. multiple copies/versions of wks records with the exact same content could be created.
4. Events generated from re-index or index-extension could block storage record change events, which could impact SLO requirements for index update latency.
We should use a specific topic for sending and receiving the re-index events.

M19 - Release 0.22
Zhibin Mai, Zhibin Mai

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/90
ADR: new reindex API to reindex the given records
2023-10-03T14:39:44Z
Mingyang Zhu
## Status
- [ ] Proposed
- [ ] Trialing
- [ ] Under review
- [X] Approved
- [ ] Retired
## Context
As of now, the indexer has a reindex API that reindexes a whole given kind. The API is useful in scenarios where index data needs to be migrated because of bug fixes, new indexer features, etc. Sometimes it is not necessary to reindex the entire kind if we know the exact impact, so it would be good to have a reindex API that reindexes only the given records.
The use cases of the new API could be:
1. If there is an indexer bug or a new indexer feature is deployed, and we know exactly which records are impacted, we could use this API to reindex only those records.
2. When a user ingests data and the data is successfully created in storage but fails to be indexed for any reason, an application could use this API to fix the impacted records manually instead of reindexing the whole kind.
## API spec
```yaml
paths:
  "/api/indexer/v2/reindex/records":
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ReindexRecordsRequest'
# "components:" parent added so that the $ref above resolves
components:
  schemas:
    ReindexRecordsRequest:
      type: object
      properties:
        recordIds:
          type: array
          items:
            type: string
          example: ["recordId1", "recordId2"]
```
## Limit
We will initially limit the number of given records to 1000.
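A caller holding more record IDs than the limit would have to split them across several requests. The Python sketch below shows one way to build the request bodies; `reindex_request_bodies` and `MAX_RECORDS_PER_REQUEST` are hypothetical names, and only the `recordIds` field and the 1000-record limit come from this ADR. Sending the bodies to the endpoint is left out.

```python
import json
from typing import Iterator, List

# Initial limit proposed in this ADR; treated here as a client-side constant.
MAX_RECORDS_PER_REQUEST = 1000


def reindex_request_bodies(
    record_ids: List[str], batch_size: int = MAX_RECORDS_PER_REQUEST
) -> Iterator[str]:
    """Yield JSON bodies for the reindex-records API, one per batch of IDs."""
    if not 1 <= batch_size <= MAX_RECORDS_PER_REQUEST:
        raise ValueError(f"batch_size must be between 1 and {MAX_RECORDS_PER_REQUEST}")
    for start in range(0, len(record_ids), batch_size):
        batch = record_ids[start:start + batch_size]
        yield json.dumps({"recordIds": batch})
```

Each yielded string would be sent as the body of a separate POST request with the `application/json` content type.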
M19 - Release 0.22
Mingyang Zhu, Mingyang Zhu

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/108
Poor performance for index augmenter
2023-08-24T20:45:43Z
Zhibin Mai

Although we made several enhancements related to the index augmenter, directly or indirectly, such as creating a separate re-index topic and splitting the big message with 1000 records into small messages with 50 records to support parallel indexing, we still found that index performance with the augmenter enabled is much worse than with the augmenter disabled. For example, for WellLog with multiple extension configurations, indexing with the augmenter enabled is about 15 times slower than with the augmenter disabled.
With augmenter enabled:
1. Indexing one record individually: each record (for the given property configurations) requires 8 queries to get all the information needed to populate the extended properties. In this test case, the cache does not take effect at all.
2. Indexing a kind with 291 WellLog records: each record requires 6.8 queries on average. In this test case, the cache should play an important role. However, we found that the cache mechanism has little effect.
As I ran the tests locally, the search latency was about 1.5 times longer than the search latency in the cloud environment. I estimate that the performance with the augmenter enabled would still be about 10 times slower in the cloud environment if we don't make any enhancements.

M20 - Release 0.23
Zhibin Mai, Zhibin Mai

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/114
The RelatedConditionMatches of the augmenter is not flexible
2023-10-03T07:51:41Z
Zhibin Mai

The current implementation of RelatedConditionMatches in the augmenter has the following limitations:
1. The condition match is a text match only. The following two cases demonstrate that regular expression matching is needed:
##### Case 1: Extend the properties from the related objects whose IDs are defined under data.LineageAssertions[].ID
```
{
"Name": "Document-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--Document:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "AssociatedFacilityNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectID": "data.LineageAssertions[].ID",
"RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:master-data\\-\\-Wellbore:[\\w\\-\\.\\:\\%]+$"
],
"RelatedConditionProperty": "data.LineageAssertions[].ID"
},
"ValueExtraction": {
"ValuePath": "data.FacilityName"
}
}
]
}, {
"Name": "AssociatedProjectNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectID": "data.LineageAssertions[].ID",
"RelatedObjectKind": "osdu:wks:master-data--SeismicAcquisitionSurvey:1.",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:master-data\\-\\-SeismicAcquisitionSurvey:[\\w\\-\\.\\:\\%]+$"
],
"RelatedConditionProperty": "data.LineageAssertions[].ID"
},
"ValueExtraction": {
"ValuePath": "data.ProjectName"
}
}
]
}
]
}
```
##### Case 2: Match the reference data values in any data partition (or ignoring the data partition)
```
{
"Name": "WellLog-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--WellLog:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "WellUWI",
"Policy": "ExtractFirstMatch",
"Paths": [{
"ValueExtraction": {
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:reference-data--AliasNameType:UniqueIdentifier:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:RegulatoryName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:PreferredName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:CommonName:$",
"^[\\w\\-\\.]+:reference-data--AliasNameType:ShortName:$"
],
"RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
"ValuePath": "data.NameAliases[].AliasName"
}
}
]
}
]
}
```
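To illustrate why regular expressions help in Case 2, the Python sketch below applies two of the patterns above to alias entries from different data partitions. It is an illustration only: `matching_alias_names` is a hypothetical helper, the sample aliases are invented, and the augmenter itself is not written in Python, so its regex dialect and matching rules may differ.

```python
import re
from typing import Dict, List

# Two of the patterns from the configuration above, with JSON escaping removed.
RELATED_CONDITION_MATCHES = [
    r"^[\w\-\.]+:reference-data--AliasNameType:UniqueIdentifier:$",
    r"^[\w\-\.]+:reference-data--AliasNameType:RegulatoryName:$",
]


def matching_alias_names(name_aliases: List[Dict[str, str]], patterns: List[str]) -> List[str]:
    """Return AliasName values whose AliasNameTypeID matches any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [
        alias["AliasName"]
        for alias in name_aliases
        if any(c.search(alias.get("AliasNameTypeID", "")) for c in compiled)
    ]


# Invented sample data: the same reference values in two different partitions.
sample_aliases = [
    {"AliasName": "W-1", "AliasNameTypeID": "opendes:reference-data--AliasNameType:UniqueIdentifier:"},
    {"AliasName": "W-2", "AliasNameTypeID": "other-partition:reference-data--AliasNameType:RegulatoryName:"},
    {"AliasName": "W-3", "AliasNameTypeID": "opendes:reference-data--AliasNameType:Borehole:"},
]
```

A plain text match would need one entry per data partition, whereas the leading `[\w\-\.]+` accepts any partition prefix.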
As required, to extend a property from a related record, the kind of the related record must be defined in the configuration. However, the Relationship type under ExtensionProperties does not define the kind of the target object. In some cases, the source record
Example: Extend the related object's name to the document, name of the related objects
2. RelatedConditionProperty is limited to a property of a one-level nested object.
In the above examples, both `data.NameAliases[].AliasNameTypeID` and `data.ExtensionProperties.Relationships[].TargetID` are properties of a one-level nested object. In some cases, RelatedConditionProperty can be a property of a multi-level nested object. For example:
```
{
"Name": "WellLog-IndexPropertyPathConfiguration",
"Code": "osdu:wks:work-product-component--WellLog:1.",
"AttributionAuthority": "OSDU",
"Configurations": [{
"Name": "OrganisationNames",
"Policy": "ExtractAllMatches",
"Paths": [{
"RelatedObjectsSpec": {
"RelationshipDirection": "ChildToParent",
"RelatedObjectKind": "osdu:wks:master-data--Organisation:1.",
"RelatedObjectID": "data.TechnicalAssurances[].Reviewers[].OrganisationID",
"RelatedConditionMatches": [
"^[\\w\\-\\.]+:reference-data--ContactRoleType:ProjectManager:AccountOwner:$",
"^[\\w\\-\\.]+:reference-data--ContactRoleType:AccountOwner:$"
],
"RelatedConditionProperty": "data.TechnicalAssurances[].Reviewers[].RoleTypeID"
},
"ValueExtraction": {
"ValuePath": "data.OrganisationName"
}
}
]
}
]
}
```
M21 - Release 0.24
Thomas Gehrmann [slb], Zhibin Mai, Thomas Gehrmann [slb]

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/109
ADR: Full reindex API access must be elevated
2023-12-01T12:38:36Z
Neha Khandelwal

[[_TOC_]]
# Status
* [x] Proposed
* [x] Trialing
* [x] Under review
* [ ] Approved
* [ ] Retired
# Context & Scope
Expected use-case for the full reindex API is for disaster recovery scenario as it reindexes everything in a data-part...[[_TOC_]]
# Status
* [x] Proposed
* [x] Trialing
* [x] Under review
* [ ] Approved
* [ ] Retired
# Context & Scope
The expected use case for the full reindex API is disaster recovery, as it reindexes everything in a data-partition.
Currently, access to the full reindex API is set to the same level as the other reindex APIs. As a result, users with the **users.datalake.admin** permission can **accidentally** trigger a full reindex. To make matters worse, there is no API to cancel an ongoing reindex, so the operation can run for hours or days depending on the data-partition size. This can affect cost and service performance.
# Requirements
We need to elevate the permission level for the full reindex API so that users with Admin access cannot accidentally trigger a full reindex.
# Tradeoff Analysis
This will be a breaking change, but it should have low impact because this API is used very rarely.
# Solution
The proposed solution is that the permission level for full reindex API should be elevated and set to **users.datalake.ops**.
# Consequences
* Change in indexer-core to Reindex API (permission elevation for full reindex) and PartitionSetup API (refactor)
* Indexer service documentation needs to be updated
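
A minimal sketch of the proposed access rule, assuming a simple lookup from operation to required group. The group names `users.datalake.admin` and `users.datalake.ops` come from this ADR; the class `ReindexAccessPolicy`, its methods, and the operation names are hypothetical and do not reflect the actual indexer-core code.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch: which entitlement group each reindex operation requires. */
final class ReindexAccessPolicy {
    // Group names are taken from the ADR; operation names are illustrative only.
    static final String ADMIN_GROUP = "users.datalake.admin";
    static final String OPS_GROUP = "users.datalake.ops";

    /** Returns the group required for an operation, or null if it is unknown. */
    static String requiredGroup(String operation) {
        switch (operation) {
            case "fullReindex":
                return OPS_GROUP;   // elevated by this ADR
            case "reindexKind":
            case "reindexRecords":
                return ADMIN_GROUP; // unchanged
            default:
                return null;
        }
    }

    /** True only if the caller holds the group required for the operation. */
    static boolean isAllowed(String operation, Set<String> callerGroups) {
        String required = requiredGroup(operation);
        return required != null && callerGroups.contains(required);
    }

    public static void main(String[] args) {
        Set<String> adminOnly = Collections.singleton(ADMIN_GROUP);
        System.out.println(isAllowed("fullReindex", adminOnly));
        System.out.println(isAllowed("reindexKind", adminOnly));
    }
}
```

With this rule, a caller that holds only `users.datalake.admin` is rejected for the full reindex but can still use the other reindex operations, so `main` prints `false` and then `true`. The sketch checks for an exact group match; whether ops users should also inherit admin access is left open here.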
# ADR Comments Below

M21 - Release 0.24

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/139
Too many results returned after bagofwords feature
2024-01-19T19:47:34Z, Guillaume Caillet

Hi,
When enabling the [BagOfWords feature](https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/113), some search queries with a "query" filter return too many results.
I've reproduced the issue on several AWS environments, and it does not occur if the indexer is deployed with the feature flag `featureFlag.bagOfWords.enabled` set to False.
I have attached the 3 records and the schema I used (these are from the `os-search` integration tests in `testing/integration-tests/search-test-core/src/main/resources/testData/records_1.json`).
[records.json](/uploads/196fce2d3f739b3c4349bd4e5075aeed/records.json)
[schema.json](/uploads/990d8ac4242d6a09921e16236f6a72e5/schema.json)
(I didn't delete these 3 records from the `main.osdu-gl.osdu.aws` environment, so if you have access to it, you should be able to reproduce these queries.)
Once the records are indexed, issue a `search` query with the following payload:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICE9"
}
```
All 3 records are returned instead of 0 (none of the 3 records contains the text "OFFICE9").
The same happens with a "valid" query that matches at least one record, for example:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICE4"
}
```
This also returns 3 records instead of 1.
The issue seems to occur only when the query ends with a digit. With a letter suffix it works properly, for example:
```
{
"kind": "opendes:search1704732571020:test-data--Integration:1.0.1",
"query": "OFFICEZ"
}
```
This correctly returns 0 results.
I have managed to reproduce the issue directly on the Elasticsearch server through its REST API, so I don't think the issue is in the Search service:
POST https://localhost:9200/opendes-search1704732571020-test-data--integration-1.0.1/_search (I'm using k8s port-forwarding to connect directly to the ES server)
with the following payload:
```
{
"from": 0,
"size": 10,
"timeout": "1m",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"query_string": {
"query": "OFFICE9"
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
}
]
}
}
}
```
Returns 3 results when BagOfWords is enabled, only 1 if not.

M22 - Release 0.25, Mark Chance, Stanisław Bieniecki, Mark Chance

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/137
String array becomes String after index
2024-01-24T08:54:10Z, Zhibin Mai

A String array becomes a String after it is indexed. The bug was probably introduced by [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649).
To illustrate the problem, I used one example from Augmenter Configuration that has String array attributes.
- Storage Format of part of data payload:
![image](/uploads/9dacff15a729788fffb02e916b704569/image.png)
- Index (document) Format of part of data payload returned by the following method in class `StorageIndexerPayloadMapper`:
```
public Map<String, Object> mapDataPayload(ArrayList<String> asIngestedCoordinatesPaths, IndexSchema storageSchema, Map<String, Object> storageRecordData,
String recordId) {
Map<String, Object> dataCollectorMap = new HashMap<>();
//..
mapDataPayload(storageSchema.getDataSchema(), storageRecordData, recordId, dataCollectorMap);
//...
return dataCollectorMap;
}
```
![image](/uploads/dfe1df18988936c5b137c542edd58c96/image.png)
- Search result before re-index from local indexer service with the [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649):
![image](/uploads/7714272f8aa0286c90b278e7546d8b33/image.png)
- Search result after re-index from local indexer service with the [MR 649](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/649):
![image](/uploads/b51dbecdc83cc6279b71017d1f8f1b61/image.png)

M22 - Release 0.25, Mark Chance, Stanisław Bieniecki, Mark Chance

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/118
Avoid using query by cursor if possible
2023-12-02T13:58:46Z, Zhibin Mai

In M20, we created MR [601](https://community.opengroup.org/osdu/platform/system/indexer-service/-/merge_requests/601), which tried to improve the performance of the augmenter and reduce the usage of queries with cursor. With that MR, only two places (getting related children records) still use queries with cursor.
However, queries with cursor are expensive: most Elasticsearch deployments allow at most 500 queries with cursor per minute. We still use them because a normal query can return at most 10,000 records, and when fetching children records for a given set of parent records we cannot be sure that the result will stay under 10,000.
During stress tests with large datasets, we saw many errors from queries with cursor when re-indexing 100k wellbores that have 5M welllogs in total (each wellbore has 50 welllogs on average). Based on our knowledge of the Augmenter, in more than 99% of cases the query results won't reach 10,000 records. We need a way to ensure both correctness (no results missed) and error-free queries.
The basic idea is that the Augmenter uses normal queries by default. If the totalCount of a query result reaches the limit (10,000), it automatically switches to a query with cursor.

M22 - Release 0.25, Zhibin Mai, Zhibin Mai
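
The fallback described in this issue could be sketched as follows. This is only an illustration under stated assumptions: `ChildRecordFetcher`, `SearchClient`, `QueryResult` and `CursorPage` are hypothetical names introduced here, not the real Augmenter or search client API. Only the 10,000 limit and the use of `totalCount` come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "normal query first, cursor only when needed". */
final class ChildRecordFetcher {
    /** Maximum number of records a normal query can return, per the issue. */
    static final int NORMAL_QUERY_LIMIT = 10_000;

    /** Result of a normal query: returned record ids plus the reported total. */
    static final class QueryResult {
        final List<String> recordIds;
        final long totalCount;
        QueryResult(List<String> recordIds, long totalCount) {
            this.recordIds = recordIds;
            this.totalCount = totalCount;
        }
    }

    /** One page of a cursor query; a null nextCursor means there are no more pages. */
    static final class CursorPage {
        final List<String> recordIds;
        final String nextCursor;
        CursorPage(List<String> recordIds, String nextCursor) {
            this.recordIds = recordIds;
            this.nextCursor = nextCursor;
        }
    }

    /** Assumed abstraction over the search service, for illustration only. */
    interface SearchClient {
        QueryResult query(String query, int limit);
        CursorPage queryWithCursor(String query, String cursor);
    }

    static List<String> fetchChildren(SearchClient client, String query) {
        QueryResult result = client.query(query, NORMAL_QUERY_LIMIT);
        if (result.totalCount < NORMAL_QUERY_LIMIT) {
            // Common case: everything fits in one normal query.
            return result.recordIds;
        }
        // The total reached the limit, so the normal result may be truncated.
        // Discard it and page through the full result set with a cursor.
        List<String> all = new ArrayList<>();
        String cursor = null;
        do {
            CursorPage page = client.queryWithCursor(query, cursor);
            all.addAll(page.recordIds);
            cursor = page.nextCursor;
        } while (cursor != null);
        return all;
    }
}
```

In the common case this costs a single normal query and never touches the cursor quota. The price is one extra, discarded query in the rare case where the limit is reached.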