Stephen Whitley · 4a3d20a8
--- a/OSDU-(C)/Design-and-Implementation/API-Specifications/Documentation/consumption-services/DocumentIndexingService.md
+++ b/OSDU-(C)/Design-and-Implementation/API-Specifications/Documentation/consumption-services/DocumentIndexingService.md
+## Document Indexing Service
+This service indexes unstructured content that currently exists as documents (PDF, TIFF, DOC, etc.), so that keywords and phrases within those documents can be searched.
+## Scope
+Supported document formats include PDF, TIFF, and JPEG/JPG.
+## Indexing Steps
+Document indexing is a pipeline that involves the following steps:
+ - Get Metadata of Document from Datalake
+ - Download Document
+ - Split the document into multiple pages
+ - Generate Default/Thumbnail Image for each page
+ - Extract text for each page (OCR with Google Vision)
+ - Save text for each page in Elasticsearch
+ - Save consolidated text for the document in Elasticsearch
+```mermaid
+graph LR
+Start --> A[Get Document Metadata]
+A --> B[Download Document]
+B --> C[Split Into Pages]
+C --> D[Generate Page Image]
+D --> E[Extract Text]
+E --> F[Save Text In Elasticsearch]
+```
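
The pipeline above can be sketched in Python as a chain of stages. This is a minimal illustration, not the service's actual implementation: `index_document`, `IndexedDocument`, and every stage callable are hypothetical names, and the OCR and search-index calls are passed in as plain functions so that no real Google Vision or Elasticsearch client API is assumed. Following the diagram, text is extracted from the generated page image.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IndexedDocument:
    """Result of indexing one document (hypothetical shape)."""
    document_id: str
    page_texts: List[str] = field(default_factory=list)
    full_text: str = ""


def index_document(
    document_id: str,
    get_metadata: Callable[[str], Dict[str, str]],
    download: Callable[[Dict[str, str]], bytes],
    split_pages: Callable[[bytes], List[bytes]],
    render_image: Callable[[bytes], bytes],
    extract_text: Callable[[bytes], str],
    save_page_text: Callable[[str, int, str], None],
    save_document_text: Callable[[str, str], None],
) -> IndexedDocument:
    """Run the indexing steps in order; every stage is injected (assumption)."""
    metadata = get_metadata(document_id)       # Get metadata from the data lake
    content = download(metadata)               # Download the document
    pages = split_pages(content)               # Split into pages

    result = IndexedDocument(document_id)
    for page_number, page in enumerate(pages, start=1):
        image = render_image(page)             # Default/thumbnail image per page
        text = extract_text(image)             # OCR step
        save_page_text(document_id, page_number, text)
        result.page_texts.append(text)

    # Consolidated text for the whole document, saved once at the end.
    result.full_text = "\n".join(result.page_texts)
    save_document_text(document_id, result.full_text)
    return result
```

Passing the stages as parameters keeps the ordering logic testable with in-memory stand-ins, for example a dictionary in place of the search index.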
+## Accessing Service
+This service is triggered automatically when a record is updated.
+## Authorization
+### Service Level Authorization
+A user must belong to the following group for service-level authorization:
+ - service.document-indexing.editor
+### Data Level Authorization
+Each request should contain the source reference, and the service should check the user's data authorization for that source.
+For example, if document indexing runs for the source/tenant 'OGA', the service should verify that the user belongs to the following group:
+ - data.oga.editor
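
The two authorization levels can be combined into a single check, sketched below in Python. The helper names `data_group_for` and `is_authorized` are hypothetical, and the `data.<source>.editor` naming pattern is an assumption generalized from the single 'OGA' example above. How the user's groups are obtained is out of scope here.

```python
from typing import Iterable

# Service-level group named in this document.
SERVICE_GROUP = "service.document-indexing.editor"


def data_group_for(source: str) -> str:
    """Build the data-level group for a source (assumed naming pattern)."""
    return f"data.{source.strip().lower()}.editor"


def is_authorized(user_groups: Iterable[str], source: str) -> bool:
    """Require both the service-level group and the source's data-level group."""
    groups = set(user_groups)
    return SERVICE_GROUP in groups and data_group_for(source) in groups
```

Under these assumptions, a user holding both `service.document-indexing.editor` and `data.oga.editor` passes the check for source 'OGA', while a user with only one of the two groups does not.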