## Indexer service

### Table of contents <a name="TOC"></a>

- [Indexer service](#indexer-service)
- [Introduction](#introduction)
- [Indexer API access](#indexer-api-access)
- [API Reference](#api-reference)
  - [Version info endpoint](#version-info-endpoint)
  - [Reindex](#reindex)
  - [Delete API](#delete)
  - [Data Partition provision](#data-partition-provision)
  - [Schema change](#schema-change)
- [Schema Service adoption](#schema-service-adoption)
  - [R3 Schema Support](#r3-schema-support)
- [Troubleshoot Indexing Issues](#troubleshoot-indexing-issues)
  - [Get indexing status](#get-indexing-status)

## Introduction <a name="introduction"></a>

The Indexer API provides a mechanism for indexing documents that contain structured or unstructured data. Documents and
indices are saved in a separate persistent store optimized for search operations. The indexer API can index any number
of documents.

The indexer indexes the attributes defined in the schema. A schema can be created at the time of record ingestion in the OSDU Data Platform
via the Schema Service. The Indexer service also adds a number of OSDU Data Platform meta attributes, such as id, kind,
parent, acl, namespace, type, version, legaltags, and index, to each record at the time of indexing.

## Indexer API access <a name="indexer-api-access"></a>

* Required roles

  The Indexer service requires that users (and service accounts) have dedicated roles in order to use it. Users must be a member of `users.datalake.viewers`, `users.datalake.editors`, `users.datalake.admins`, or `users.datalake.ops`. Roles can be assigned using the [Entitlements Service](/solutions/osdu/tutorials/core-services/entitlementsservice). Please look at the API documentation for the specific requirements of each endpoint.

  In addition to service roles, users __must__ be a member of data groups to access the data.

* Required headers

  The OSDU Data Platform stores data in different partitions, depending on the different accounts in the OSDU system.

  A user may belong to more than one account. As a user, after logging into the OSDU portal, you need to select the account you wish to be active.
  Likewise, when using the Indexer APIs, you need to specify the active account in the header called `data-partition-id`. The correct `data-partition-id` can be obtained from the CFS services. The `data-partition-id` scopes the request to the mapped partition, for example:
  ```
  data-partition-id: opendes
  ```

* Optional headers

  The `correlation-id` is a traceable ID used to track the journey of a single request. It is typically a GUID passed in the `correlation-id` header. It is best practice to provide the `correlation-id` so that the request can be tracked through all the services.
  ```
  correlation-id: 1e0fef08-22fd-49b1-a5cc-dffa21bc0b70
  ```
  If a service is initiating the request, it should generate an ID. If the `correlation-id` is not provided, the Indexer service generates a new ID so that the request remains traceable.
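
As a minimal sketch of the header rules above, the following Python helper assembles the required and optional headers and generates a `correlation-id` when the caller does not supply one. The function name `build_headers` and the `token` parameter are illustrative assumptions, not part of the Indexer API.

```python
import uuid


def build_headers(data_partition_id, token, correlation_id=None):
    """Assemble headers for an Indexer API call (illustrative helper)."""
    if not data_partition_id:
        raise ValueError("data-partition-id is required")
    return {
        "data-partition-id": data_partition_id,
        # Reuse the caller's ID when present, otherwise start a new trace.
        "correlation-id": correlation_id or str(uuid.uuid4()),
        "authorization": f"Bearer {token}",
        "content-type": "application/json",
    }


headers = build_headers("opendes", token="<JWT>")
print(headers["data-partition-id"])  # opendes
```

Generating the ID on the client side means the same value can be logged by the caller before the request is sent.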

[Back to table of contents](#TOC)

## API Reference

### Version info endpoint

Provides build and git related information.

#### Request

```http
GET /api/indexer/v2/info HTTP/1.1
```

#### Example response:

```json
{
  "groupId": "org.opengroup.osdu",
  "artifactId": "indexer-gcp",
  "version": "0.10.0-SNAPSHOT",
  "buildTime": "2021-07-09T14:29:51.584Z",
  "branch": "feature/GONRG-2681_Build_info",
  "commitId": "7777",
  "commitMessage": "Added copyright to version info properties file",
  "connectedOuterServices": [
    {
      "name": "elasticSearch",
      "version": "..."
    },
    {
      "name": "redis",
      "version": "..."
    }
  ]
}
```
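
A client can use this response to report which build is deployed and which backing services it is connected to. The sketch below parses a trimmed copy of the example response; the helper name `summarize_version_info` is an assumption made for illustration.

```python
import json

# Trimmed copy of the example response shown above.
RAW_RESPONSE = """
{
  "artifactId": "indexer-gcp",
  "version": "0.10.0-SNAPSHOT",
  "connectedOuterServices": [
    {"name": "elasticSearch", "version": "..."},
    {"name": "redis", "version": "..."}
  ]
}
"""


def summarize_version_info(raw):
    """Return (build label, list of connected service names)."""
    info = json.loads(raw)
    services = [s["name"] for s in info.get("connectedOuterServices", [])]
    return f'{info["artifactId"]} {info["version"]}', services


build, services = summarize_version_info(RAW_RESPONSE)
print(build)     # indexer-gcp 0.10.0-SNAPSHOT
print(services)  # ['elasticSearch', 'redis']
```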

This endpoint reads its information from files generated by the `spring-boot-maven-plugin` and `git-commit-id-plugin` plugins. The paths to the generated files must be specified in the matching properties:

- `version.info.buildPropertiesPath`
- `version.info.gitPropertiesPath`

[Back to table of contents](#TOC)

### Reindex <a name="reindex"></a>

The Reindex API allows users to re-index a `kind` without re-ingesting the records via the Storage API. Reindexing a kind is an asynchronous operation: when a user calls this API, it responds with HTTP 200 if it can launch the re-indexing, or
with an appropriate error code if it cannot. The current status of the indexing can be tracked by calling the Search API and querying for this particular kind. Please be advised that re-indexing may take anywhere from a few seconds to a few hours, as
multiple factors contribute to latency, such as the number of records in the kind and the current load on the Indexer service.

#### Request

```http
POST /api/indexer/v2/reindex HTTP/1.1

{
  "kind": "opendes:welldb:wellbore:1.0.0"
}
```

<details><summary>**Curl**</summary>

```bash
curl --request POST \
  --url '/api/indexer/v2/reindex' \
  --header 'accept: application/json' \
  --header 'authorization: Bearer <JWT>' \
  --header 'content-type: application/json' \
  --header 'data-partition-id: opendes' \
  --data '{
  "kind": "opendes:welldb:wellbore:1.0.0"
}'
```

</details>

#### Prerequisite

Users must be a member of the `users.datalake.admins` or `users.datalake.ops` group.

#### Query parameters

`force_clean` <br />
&emsp;&emsp;(optional, Boolean) If a kind has previously been indexed with a schema and you wish to apply the latest schema changes before re-indexing, then use this query parameter. It drops the current index schema, applies the latest schema changes, and re-indexes the records. If `false`, the Reindex API
uses the same schema and overwrites records with the same ids. The default value is `false`.

#### Request body

`kind` <br />
&emsp;&emsp;(required, String) Kind to be re-indexed. 
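
The request above can also be assembled programmatically. The following sketch builds (but does not send) a reindex request with the optional `force_clean` query parameter, using only the Python standard library. The host `https://osdu.example.com` and the helper name `build_reindex_request` are placeholders assumed for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://osdu.example.com"  # assumption: placeholder host


def build_reindex_request(kind, token, partition, force_clean=False):
    """Build a POST request for the reindex endpoint without sending it."""
    query = urlencode({"force_clean": str(force_clean).lower()})
    return Request(
        url=f"{BASE_URL}/api/indexer/v2/reindex?{query}",
        data=json.dumps({"kind": kind}).encode("utf-8"),
        method="POST",
        headers={
            "accept": "application/json",
            "authorization": f"Bearer {token}",
            "content-type": "application/json",
            "data-partition-id": partition,
        },
    )


req = build_reindex_request(
    "opendes:welldb:wellbore:1.0.0", "<JWT>", "opendes", force_clean=True
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. Because reindexing is asynchronous, a successful response only means the job was launched.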


[Back to table of contents](#TOC)

### Delete API <a name="delete"></a>

The Delete API is used to delete the index for a specific kind.
Only users who belong to the Entitlements group `users.datalake.ops` can call this API.

```http
DELETE /api/indexer/v2/index?kind=opendes:welldb:wellbore:1.0.0 HTTP/1.1
```

<details><summary>**Curl**</summary>

```bash
curl --request DELETE \
  --url '/api/indexer/v2/index?kind=opendes:welldb:wellbore:1.0.0' \
  --header 'authorization: Bearer <JWT>' \
  --header 'content-type: application/json' \
  --header 'data-partition-id: opendes'
```

</details>
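
Because the kind is passed as a query parameter here rather than in the body, a client may want to percent-encode it. The sketch below builds (but does not send) the delete request; the host and the helper name `build_delete_index_request` are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_delete_index_request(kind, token, partition,
                               base_url="https://osdu.example.com"):
    """Build a DELETE request for a kind's index without sending it."""
    # urlencode percent-encodes the colons in the kind (':' becomes '%3A').
    query = urlencode({"kind": kind})
    return Request(
        url=f"{base_url}/api/indexer/v2/index?{query}",
        method="DELETE",
        headers={
            "authorization": f"Bearer {token}",
            "content-type": "application/json",
            "data-partition-id": partition,
        },
    )


delete_req = build_delete_index_request(
    "opendes:welldb:wellbore:1.0.0", "<JWT>", "opendes"
)
print(delete_req.get_method(), delete_req.full_url)
```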

### Data Partition provision <a name="data-partition-provision"></a>

Configures the Search backend for a data partition.

```http
PUT /api/indexer/v2/partitions/provision HTTP/1.1
```

<details><summary>**Curl**</summary>

```bash
curl --request PUT \
  --url '/api/indexer/v2/partitions/provision' \
  --header 'accept: application/json' \
  --header 'authorization: Bearer <JWT>' \
  --header 'content-type: application/json' \
  --header 'data-partition-id: opendes'
```
</details>

#### Prerequisite

Users must be a member of the `users.datalake.ops` group.

> __NOTE__: This API should be run at least once during data partition provisioning to configure the required resources and settings.

[Back to table of contents](#TOC)

### Schema change <a name="schema-change"></a>

Schema change event listener endpoint.

> __Note:__ This is an internal API and should not be exposed publicly.

#### Request

```http
POST /api/indexer/v2/_dps/task-handlers/schema-worker HTTP/1.1

{
    "messageId": "676894654",
    "publishTime": "2017-03-19T00:00:00",
    "attributes": {
        "data-partition-id": "opendes",
        "correlation-id": "b5a281bd-f59d-4db2-9939-b2d85036fc7e"
    },
    "data": "[{\"kind\":\"slb:indexer:test-data--SchemaEventIntegration:1.0.0\",\"op\":\"create\"}]"
}
```

#### Request body

`messageId` <br />
&emsp;&emsp;(optional, String) Event message id.

`publishTime` <br />
&emsp;&emsp;(optional, String) Event publish time.

`attributes.data-partition-id` <br />
&emsp;&emsp;(required, String) Data partition id for which this message is targeted.

`attributes.correlation-id` <br />
&emsp;&emsp;(optional, String) Correlation-id to enable tracing.

`data` <br />
&emsp;&emsp;(required, String) Schema change event message as a JSON string. Only `create` and `update` events are supported.
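
Note that `data` is a JSON document serialized into a string, so the event list is encoded twice when the whole message is serialized. The sketch below illustrates this under the field descriptions above; the helper name `build_schema_event_message` is an assumption, and the optional `messageId` and `publishTime` fields are omitted.

```python
import json

SUPPORTED_OPS = ("create", "update")


def build_schema_event_message(partition, events, correlation_id=None):
    """Build a schema change message; `events` is a list of kind/op dicts."""
    for event in events:
        if event["op"] not in SUPPORTED_OPS:
            raise ValueError(f"unsupported schema event op: {event['op']}")
    attributes = {"data-partition-id": partition}
    if correlation_id:
        attributes["correlation-id"] = correlation_id
    # `data` must be a string, so the event list is serialized separately.
    return {"attributes": attributes, "data": json.dumps(events)}


message = build_schema_event_message(
    "opendes",
    [{"kind": "slb:indexer:test-data--SchemaEventIntegration:1.0.0",
      "op": "create"}],
)
print(json.dumps(message, indent=2))
```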

## Schema Service adoption <a name="schema-service-adoption"></a>

The Indexer service is in the process of adopting schemas from the Schema Service instead of the Storage Service. The Indexer
Service retrieves a schema from the Schema Service if the schema is not found in the Storage Service. This change affects
only the Azure implementation so far. Later, the call to the Storage Service will be deprecated and then removed (after the end
of the deprecation period).

[Back to table of contents](#TOC)

### R3 Schema Support <a name="r3-schema-support"></a>

The Indexer service supports R3 schemas. These schemas are created via the Schema service.

The following end-to-end workflow can be exercised as an example (please update the schema based on your environment):

* Ingest the R3 schema for `opendes:wks:master-data--Wellbore:1.0.0`. The Schema service payload can be
  found [here](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/testing/indexer-test-core/src/main/resources/testData/r3-index_record_wks_master.schema.json).

* Ingest an R3 master-data Wellbore record. The Storage service payload can be
  found [here](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/testing/indexer-test-core/src/main/resources/testData/r3-index_record_wks_master.json).

* Records can be searched via the Search service. Here is a sample payload:

```http
POST /api/search/v2/query HTTP/1.1
Content-Type: application/json
data-partition-id: opendes

{
    "kind": "opendes:wks:master-data--Wellbore:1.0.0",
    "spatialFilter": {
        "field": "data.SpatialLocation.Wgs84Coordinates",
        "byBoundingBox": {
            "topLeft": {
                "longitude": -100.0,
                "latitude": 52.0
            },
            "bottomRight": {
                "longitude": 100.0,
                "latitude": 0.0
            }
        }
    }
}
```
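
The same payload can be generated from coordinates. The sketch below builds the bounding-box query body and rejects boxes whose corners are swapped. The corner-ordering check is an assumption about how bounding boxes are commonly validated, not documented Search service behavior, and the helper name `bounding_box_query` is hypothetical.

```python
def bounding_box_query(kind, field, top_left, bottom_right):
    """Build a search payload; corners are dicts with longitude/latitude."""
    # Assumed sanity check: the top-left corner lies north-west of the other.
    if top_left["latitude"] < bottom_right["latitude"]:
        raise ValueError("topLeft must not be south of bottomRight")
    if top_left["longitude"] > bottom_right["longitude"]:
        raise ValueError("topLeft must not be east of bottomRight")
    return {
        "kind": kind,
        "spatialFilter": {
            "field": field,
            "byBoundingBox": {
                "topLeft": dict(top_left),
                "bottomRight": dict(bottom_right),
            },
        },
    }


payload = bounding_box_query(
    "opendes:wks:master-data--Wellbore:1.0.0",
    "data.SpatialLocation.Wgs84Coordinates",
    top_left={"longitude": -100.0, "latitude": 52.0},
    bottom_right={"longitude": 100.0, "latitude": 0.0},
)
print(payload["spatialFilter"]["byBoundingBox"]["topLeft"])
```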

[Back to table of contents](#TOC)

## Troubleshoot Indexing Issues <a name="troubleshoot-indexing-issues"></a>

### Get indexing status <a name="get-indexing-status"></a>

The Indexer service adds internal metadata to each record to register the status of the indexing. The metadata includes
the status and the date and time of the last indexing. This additional meta block helps to see the details of indexing. The
format of the index meta block is as follows:

```json
{
  "index": {
    "trace": [
      String,
      String
    ],
    "statusCode": Integer,
    "lastUpdateTime": Datetime
  }
}
```

Example:

```json
{
  "results": [
    {
      "index": {
        "trace": [
          "datetime parsing error: unknown format for attribute: endDate | value: 9000-01-01T00:00:00.0000000",
          "datetime parsing error: unknown format for attribute: startDate | value: 1990-01-01T00:00:00.0000000"
        ],
        "statusCode": 400,
        "lastUpdateTime": "2018-11-16T01:44:08.687Z"
      }
    }
  ],
  "totalCount": 31895
} 
```

Details of the index block:

1) trace: This field collects all the issues related to the indexing. Within each entry, details are concatenated using '|'. This is a String field.
2) statusCode: This field determines the category of the error. This is an integer field. It can have the following values:
    * 200 - All OK
    * 404 - Schema is missing in Storage
    * 400 - Some fields were not properly mapped with the schema defined
3) lastUpdateTime: This field captures the last time the record was updated by the Indexer service. This is a datetime
   field, so you can do range queries on this field.
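
As an illustration of how a client might interpret the index block, the sketch below maps each `statusCode` to the meanings listed above and collects the trace entries. The helper name `describe_index_status` is an assumption, and the sample record is copied from the example response.

```python
STATUS_MEANINGS = {
    200: "All OK",
    404: "Schema is missing in Storage",
    400: "Some fields were not properly mapped with the schema defined",
}


def describe_index_status(record):
    """Summarize the index meta block of one search result."""
    index = record.get("index", {})
    code = index.get("statusCode")
    return {
        "statusCode": code,
        "meaning": STATUS_MEANINGS.get(code, "Unknown status"),
        "issues": index.get("trace", []),
        "lastUpdateTime": index.get("lastUpdateTime"),
    }


record = {
    "index": {
        "trace": [
            "datetime parsing error: unknown format for attribute: endDate | value: 9000-01-01T00:00:00.0000000",
            "datetime parsing error: unknown format for attribute: startDate | value: 1990-01-01T00:00:00.0000000",
        ],
        "statusCode": 400,
        "lastUpdateTime": "2018-11-16T01:44:08.687Z",
    }
}
summary = describe_index_status(record)
print(summary["meaning"], len(summary["issues"]))
```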

You can query the index status using the following example query:

```bash
curl --request POST \
  --url /api/search/v2/query \
  --header 'Authorization: Token' \
  --header 'Content-Type: application/json' \
  --header 'data-partition-id: Data partition id' \
  --data '{"kind": "*:*:*:*","query": "index.statusCode:404","returnedFields": ["index"]}'
```

> __NOTE__: By default, the API response excludes the `index` attribute block. The user must specify `index` in `returnedFields` in order to see it in the response.

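The above query returns all records that could not be indexed because their schema is missing (status code 404). To find records with field mismatch problems, query for `index.statusCode:400` instead.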
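
The query body can be generated for any status code. The sketch below builds the same payload as the curl example and can optionally add a range clause on `index.lastUpdateTime`. The range syntax (`[from TO to]`) is an assumption based on Lucene-style query strings and should be checked against the Search service documentation; the helper name `index_status_query` is hypothetical.

```python
def index_status_query(status_code, kind="*:*:*:*",
                       updated_from=None, updated_to=None):
    """Build a search payload that filters records by index status."""
    query = f"index.statusCode:{status_code}"
    if updated_from and updated_to:
        # Assumed Lucene-style range clause on the datetime field.
        query += f" AND index.lastUpdateTime:[{updated_from} TO {updated_to}]"
    return {
        "kind": kind,
        "query": query,
        # 'index' is excluded by default, so request it explicitly.
        "returnedFields": ["index"],
    }


missing_schema = index_status_query(404)
recent_mismatches = index_status_query(
    400, updated_from="2018-11-01", updated_to="2018-11-30"
)
print(missing_schema["query"])  # index.statusCode:404
print(recent_mismatches["query"])
```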

Please refer to the [Search service](https://community.opengroup.org/osdu/platform/system/search-service/-/blob/master/docs/api/search_openapi.yaml#L28) documentation for examples on different kinds of search
queries.

[Back to table of contents](#TOC)