## Document Indexing Service
This service indexes unstructured content that currently exists as documents (PDF, TIFF, DOC, etc.), making it possible to search for keywords and phrases within those documents.

## Scope
Supported document formats include PDF, TIFF, and JPEG/JPG.

## Indexing Steps
Document indexing is a pipeline made up of the following steps:

- Get the document's metadata from the Datalake
- Download the document
- Split the document into individual pages
- Generate a default/thumbnail image for each page
- Extract the text of each page (OCR with Google Vision)
- Save the text of each page in Elasticsearch
- Save the consolidated text of the document in Elasticsearch
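
The steps above can be sketched as a small orchestration function. This is a minimal illustration, not the service's actual implementation: `IndexingStages`, `index_document`, and every stage callable are hypothetical names standing in for the real Datalake, OCR, and search-index integrations. It also assumes that OCR runs on the image generated for each page and that the consolidated text is the page texts joined by newlines.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class IndexingStages:
    """Hypothetical stage callables; each stands in for a real integration."""
    get_metadata: Callable[[str], Dict]          # metadata lookup in the Datalake
    download: Callable[[Dict], bytes]            # fetch the raw document
    split_pages: Callable[[bytes], List[bytes]]  # one entry per page
    render_image: Callable[[bytes], bytes]       # default/thumbnail image per page
    extract_text: Callable[[bytes], str]         # OCR on the page image
    save_page_text: Callable[[str, int, str], None]
    save_document_text: Callable[[str, str], None]


def index_document(document_id: str, stages: IndexingStages) -> str:
    """Run the pipeline for one document and return its consolidated text."""
    metadata = stages.get_metadata(document_id)
    content = stages.download(metadata)

    page_texts: List[str] = []
    for page_number, page in enumerate(stages.split_pages(content), start=1):
        image = stages.render_image(page)
        text = stages.extract_text(image)
        # Per-page text is saved first so individual pages are searchable.
        stages.save_page_text(document_id, page_number, text)
        page_texts.append(text)

    # Assumption: the consolidated text is the page texts joined by newlines.
    consolidated = "\n".join(page_texts)
    stages.save_document_text(document_id, consolidated)
    return consolidated
```

Passing the stages in as callables keeps the ordering logic separate from the external systems, so the pipeline can be exercised with in-memory stubs.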

```mermaid
graph LR
    Start --> A[Get Document Metadata]
    A --> B[Download Document]
    B --> C[Split Into Pages]
    C --> D[Generate Page Image]
    D --> E[Extract Text]
    E --> F[Save Text In Solr]
```

## Accessing Service
This service is triggered automatically whenever a record is updated.

## Authorization
### Service Level Authorization
A user must belong to the following group for service-level authorization:

- service.document-indexing.editor

### Data Level Authorization
A request should contain the source reference. The service should check the user's data authorization for that specific source.

For example, if a document is being indexed for the source/tenant 'OGA', data authorization should be validated by checking that the user belongs to the following group:

- data.oga.editor
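
The two authorization levels can be combined into a single check, sketched below. The function names are hypothetical, and the `data.<source>.editor` naming pattern is an assumption generalized from the single 'OGA' example above. How the user's groups are obtained is outside the scope of this sketch.

```python
from typing import Iterable

# Service-level group named in this document.
SERVICE_GROUP = "service.document-indexing.editor"


def data_group_for(source: str) -> str:
    """Build the data-level group name for a source/tenant.

    Assumption: groups follow the pattern 'data.<source in lowercase>.editor',
    inferred from the 'OGA' -> 'data.oga.editor' example.
    """
    return f"data.{source.lower()}.editor"


def is_authorized(user_groups: Iterable[str], source: str) -> bool:
    """Return True only if both the service-level and data-level groups are present."""
    groups = set(user_groups)
    return SERVICE_GROUP in groups and data_group_for(source) in groups
```

For instance, a user holding both `service.document-indexing.editor` and `data.oga.editor` would pass the check for source 'OGA', while a user holding only one of the two would not.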