This service indexes unstructured content that currently exists as documents (PDF, TIFF, DOC, etc.), so that keywords and phrases can be searched within those documents.
## Scope
Supported document formats are PDF, TIFF, and JPEG/JPG.
## Indexing Steps
Document indexing is a pipeline made up of the following steps:
- Get Metadata of Document from Datalake
- Download Document
- Split the document into multiple pages
- Generate Default/Thumbnail Image for each page
- Extract text for each page (OCR with Google Vision)
- Save text for each page in Elasticsearch
- Save consolidated text for the document in Elasticsearch
```mermaid
graph LR
Start --> A[Get Document Metadata]
A --> B[Download Document]
B --> C[Split Into Pages]
C --> D[Generate Page Image]
D --> E[Extract Text]
E --> F[Save Text In Elasticsearch]
```
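The steps above can be sketched as a single function that chains the stages together. This is a minimal illustration, not the service's actual code: `index_document`, `IndexedDocument`, every callable parameter, and the `"<id>/page/<n>"` key format are hypothetical names chosen for the example, and running OCR on the generated page image is an assumption based on the order shown in the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IndexedDocument:
    """Text collected for one document, page by page."""
    document_id: str
    page_texts: List[str] = field(default_factory=list)

    @property
    def consolidated_text(self) -> str:
        # Consolidated text is simply the page texts joined in page order.
        return "\n".join(self.page_texts)


def index_document(
    document_id: str,
    get_metadata: Callable[[str], Dict[str, str]],  # stands in for the data lake lookup
    download: Callable[[Dict[str, str]], bytes],
    split_pages: Callable[[bytes], List[bytes]],
    render_image: Callable[[bytes], bytes],
    extract_text: Callable[[bytes], str],           # stands in for the OCR call
    save: Callable[[str, str], None],               # stands in for the search index write
) -> IndexedDocument:
    """Run the indexing steps in order; each stage is injected as a callable."""
    metadata = get_metadata(document_id)
    content = download(metadata)
    result = IndexedDocument(document_id)
    for number, page in enumerate(split_pages(content), start=1):
        image = render_image(page)
        text = extract_text(image)
        # Hypothetical key format for per-page entries.
        save(f"{document_id}/page/{number}", text)
        result.page_texts.append(text)
    # Save the consolidated text once all pages are processed.
    save(document_id, result.consolidated_text)
    return result


# Usage with in-memory stand-ins for every external system.
store: Dict[str, str] = {}
result = index_document(
    "doc-1",
    get_metadata=lambda doc_id: {"id": doc_id, "location": "in-memory"},
    download=lambda metadata: b"first page|second page",
    split_pages=lambda content: content.split(b"|"),
    render_image=lambda page: page,
    extract_text=lambda image: image.decode("utf-8"),
    save=store.__setitem__,
)
```

Injecting each stage as a callable keeps the ordering logic testable without access to the data lake, the OCR provider, or the search index.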
## Accessing Service
This service is triggered automatically when a record is updated.
## Authorization
### Service Level Authorization
A user must belong to the following group for service-level authorization:
- service.document-indexing.editor
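A service-level check of this kind could look like the sketch below. The group name comes from the list above; the function name `has_service_access` and the idea that the caller's groups arrive as a plain collection of strings are assumptions made for illustration.

```python
from typing import Iterable

# Group required for service-level authorization (from the list above).
SERVICE_EDITOR_GROUP = "service.document-indexing.editor"


def has_service_access(user_groups: Iterable[str]) -> bool:
    """Return True when the user holds the service-level editor group."""
    return SERVICE_EDITOR_GROUP in set(user_groups)
```

A caller would typically run this check before any indexing work starts and reject the request when it returns `False`.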
### Data Level Authorization
A request should contain the source reference. The user's data authorization should be checked against this specific source.
For example, if document indexing is running for the source/tenant 'OGA', data authorization should be validated by checking that the user has the following group associated: