ADR: Provide search capability in legal tags
Context
Is it possible to add a search capability for legal tags based on legal tag attributes, including those in extensionProperties? There might be hundreds of thousands of legal tags, so the number of legal tags must be taken into account when designing the solution.
For more details please check #36
Problem Statement
Add a search capability for legal tags based on legal tag attributes, including those in extensionProperties. The design must account for volume: there might be hundreds of thousands of legal tags.
Potential Solution Approach
One idea is to implement the search using regular expressions (regex), which seems simple both to use and to implement. "offset" and "sort" still require further discussion and can be implemented in a later iteration.
This approach requires retrieving every legal tag document from the datastore, which might cause performance issues.
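A minimal sketch of the regex idea, assuming the legal tags have already been retrieved and are held in memory as nested maps. The class name, the method names, and the sample tags are hypothetical and are not taken from the Legal Service code base; list values such as countryOfOrigin are matched against their string form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class LegalTagRegexSearch {

    // Walks a dotted path such as "properties.extensionProperties.AgreementParty"
    // through nested maps; returns null when any segment is missing.
    static Object resolve(Map<String, Object> tag, String path) {
        Object current = tag;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    // Returns the tags whose attribute at `path` contains a match for `regex`.
    // Every tag is visited, which is the performance concern noted above.
    static List<Map<String, Object>> search(
            List<Map<String, Object>> tags, String path, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> tag : tags) {
            Object value = resolve(tag, path);
            if (value != null && pattern.matcher(value.toString()).find()) {
                hits.add(tag);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        Map<String, Object> tagA = Map.of(
            "name", "tag-a",
            "properties", Map.of(
                "countryOfOrigin", List.of("US"),
                "extensionProperties", Map.of("AgreementParty", "RDS")));
        Map<String, Object> tagB = Map.of(
            "name", "tag-b",
            "properties", Map.of(
                "countryOfOrigin", List.of("NO"),
                "extensionProperties", Map.of("AgreementParty", "Other")));

        List<Map<String, Object>> hits = search(
            List.of(tagA, tagB),
            "properties.extensionProperties.AgreementParty",
            "^RDS$");
        for (Map<String, Object> hit : hits) {
            System.out.println(hit.get("name"));
        }
    }
}
```

The cost of this sketch grows linearly with the number of tags, because each search compiles one pattern and then tests it against every retrieved document.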
Alternatively, Elasticsearch (ES) can be used to index and search legal tag documents.
After our team discussion on May 9, the following approaches came up:
- Because the underlying architecture of the Legal Service is specific to each CSP, the search must be implemented per CSP. In AWS, the legal tag datastore is DynamoDB, with MongoDB as an additional configuration option. In Azure it is Cosmos DB. Google uses an object datastore. For AWS, the DynamoDB table currently has no index, so a full table scan is the only way to search legal tags. While this may be the easiest approach to implement, performance may degrade as the number of legal tags grows. Whether to adopt this approach depends on the POC being run against hundreds of thousands of legal tags in DynamoDB.
- The next approach is to add an index to the datastore. This may need to be considered if the POC (details below) shows considerable performance overhead. It might not be the quickest to implement, because the legal tag datastore table would have to be changed for each CSP involved.
- Another approach is to delegate legal tag search entirely to something like Elasticsearch. In that case, we would need to follow the same pattern as the Storage service and Search service.
- Use the List LegalTags API to fetch the legal tag records and search each tag in the application. After initial analysis, searching within each LegalTag object needs more work. After a detailed requirements discussion, it became clear that complex search operations must be supported. To satisfy that requirement, it is better to use a service that supports search; implementing search in the application tier is not feasible.
Tasks
- Identify the persistence layer of the legal tag service - Identified. CSP specific.
- Run a POC implementing different use cases to justify the decision on the solution - Ongoing
- Run the Legal Service locally.
- Run a POC on Elasticsearch to index legal tags and then execute searches on them.
POC
A POC is being implemented to fetch legal tags from the datastore. It currently targets DynamoDB, the AWS implementation. The upper limit under consideration is a few hundred thousand records in the datastore. The POC is currently a standalone Java application.
The next phase of the POC will invoke the OSDU Core API services to retrieve more than 10k records from DynamoDB and record the performance.
Another scenario to consider: once the records are retrieved, how well does the search perform, especially on extension properties?
Open Questions and answers based on discussions on May 25, 2023
- The structure of the extensionProperties attributes has no definitive format and will differ between companies. Shell might follow a structure that differs from that of other companies, so extensionProperties needs to be flexible.
- The format of all legal tags is largely standardized; this refers to the structure outside extensionProperties. Currently it has the Name, Description, properties, and isValid attributes, in addition to id and dataPartitionId.
- Search should support searching on all legal tag attributes.
- The new API is expected to be a POST request whose body contains the query. The query can be quite complex.
- Most likely, the feature will be available as a separate API under legal services.
- Multi-attribute search along with complex queries must be supported.
- Sort and search must be supported together, along with limit and offset.
- The response should return the entire legalTag with all attributes.
- DSL is preferred to Lucene syntax; this needs to be revisited from an implementation standpoint.
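The combination of query, sort, limit, and offset described in the answers above can be sketched as follows. This is only an in-memory illustration under stated assumptions: the `SearchRequest` class, its field names, and the use of a Java `Predicate` in place of a real query DSL are all hypothetical, and the actual request body format is still to be decided.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class LegalTagSearchRequestSketch {

    // Hypothetical request shape; none of these field names come from the ADR.
    static final class SearchRequest {
        final Predicate<Map<String, String>> query; // stand-in for a parsed query
        final String sortBy;                        // attribute to sort on
        final boolean ascending;
        final int offset;                           // number of matches to skip
        final int limit;                            // maximum matches to return

        SearchRequest(Predicate<Map<String, String>> query, String sortBy,
                      boolean ascending, int offset, int limit) {
            this.query = query;
            this.sortBy = sortBy;
            this.ascending = ascending;
            this.offset = offset;
            this.limit = limit;
        }
    }

    // Filters first, then sorts, then applies offset and limit, so that
    // paging is stable with respect to the chosen sort order.
    static List<Map<String, String>> execute(
            List<Map<String, String>> tags, SearchRequest request) {
        Comparator<Map<String, String>> order = Comparator.comparing(
            (Map<String, String> tag) -> tag.getOrDefault(request.sortBy, ""));
        if (!request.ascending) {
            order = order.reversed();
        }
        return tags.stream()
            .filter(request.query)
            .sorted(order)
            .skip(request.offset)
            .limit(request.limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical, flattened sample tags.
        List<Map<String, String>> tags = List.of(
            Map.of("name", "a", "isValid", "true"),
            Map.of("name", "b", "isValid", "true"),
            Map.of("name", "c", "isValid", "false"),
            Map.of("name", "d", "isValid", "true"));

        SearchRequest request = new SearchRequest(
            tag -> "true".equals(tag.get("isValid")), "name", false, 1, 1);

        // Valid tags sorted descending are d, b, a; skipping one and taking one leaves b.
        for (Map<String, String> tag : execute(tags, request)) {
            System.out.println(tag);
        }
    }
}
```

The sketch returns whole tag objects, in line with the requirement that the response contain the entire legalTag with all attributes.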
Findings
We have implemented a POC that runs a full DynamoDB table scan. Currently DynamoDB is deployed in the M16 environment and the POC is run from a local laptop. This covers only the AWS implementation; we have not yet researched the approach for other CSPs. We are running different scenarios and recording the time the POC takes for each. The scenarios and recorded times are:
Scenario | Time Recorded |
---|---|
Full table scan. The table currently has 10,528 items | < 0.020 sec |
Full table scan with a filter condition (isValid = 1) | < 0.020 sec |
Program performing a full table scan with a filter condition and comparing values across the entire table. Example: search for 'US' under properties.countryOfOrigin. Table scan time remains the same. | Program execution time: 18.2 sec |
Program performing a full table scan with a filter condition and comparing values from the extension properties attribute across the entire table. Example: search for 'RDS' under properties.extensionproperties.AgreementParty. Table scan time remains the same. This includes the user input time for the search pattern ("RDS") and the search attribute (properties.extensionproperties.AgreementParty). | Program execution time: ~14 sec |
Executed the OSDU API (List LegalTags) to retrieve the legal tags. Total program execution time includes establishing the connection to the remote API and printing the response to standard output. | 3.598 sec |
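Because some of the program execution times above include user input, it may help to time the comparison step on its own. The following is a rough sketch of how that could be measured with `System.nanoTime()`; the class, the method names, and the synthetic data are hypothetical stand-ins and do not reproduce the POC or its DynamoDB access.

```java
import java.util.ArrayList;
import java.util.List;

public class SearchTimingSketch {

    // Counts entries that contain `needle`; stands in for the POC's comparison step.
    static int countMatches(List<String> values, String needle) {
        int count = 0;
        for (String value : values) {
            if (value.contains(needle)) {
                count++;
            }
        }
        return count;
    }

    // Builds synthetic attribute values; every fourth entry is "RDS".
    static List<String> syntheticValues(int size) {
        List<String> values = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            values.add(i % 4 == 0 ? "RDS" : "OTHER-" + i);
        }
        return values;
    }

    public static void main(String[] args) {
        // Same item count as the table in the findings; the data itself is made up.
        List<String> values = syntheticValues(10_528);

        // Start the clock only after input and data are ready, so that
        // user input and data retrieval are excluded from the measurement.
        long start = System.nanoTime();
        int matches = countMatches(values, "RDS");
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        System.out.println(matches + " matches in " + elapsedMicros + " microseconds");
    }
}
```

A single run like this is only indicative, since JIT warm-up can dominate such short measurements; repeated runs would give a steadier figure.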