Federated Search PoC Design for Multi-Region OSDU
Background
Multi-Region OSDU has been an important yet complex community initiative since it was first introduced in February 2020 in this ADR. It returned to community focus at the June 2022 EA sub-committee F2F meeting. In August 2022, PETRONAS initiated a conversation with MSFT and SLB to collaborate, and the three began brainstorming and researching requirements and feasible solutions. In the meantime, Shell gathered wide-ranging user feedback and produced a detailed “Multi Region Use Cases” document. Based on these efforts, MSFT created a phased design proposal for multi-region OSDU that aligns with customer requirements as closely as possible and can be tackled incrementally. The outcome and test results from each phase’s PoC determine whether a next phase is needed and, if so, which direction it should take. Once a PoC meets an acceptable SLA, implementation can begin. This phased approach makes the best use of the effort invested.
Scope and Purpose
This doc is a guide for the PoC work of the first phase, the Federated Search approach. It describes the Federated Search approach in detail: which new APIs are needed and how to implement them, which tests are needed, and which performance data will be collected to estimate latency under typical usage scenarios. These numbers will determine whether the Federated Search approach meets user expectations. If it does, it can be implemented at the lowest cost of all the approaches considered and deliver benefits to customers quickly. If it does not, we can move on to the next architectural model. Even in that case, the PoC work will help pinpoint bottlenecks and gaps and will establish baseline latency data for the next phase.
Federated Search Details
The Federated Search approach builds on top of the current OSDU implementation and requires no change to the existing APIs in the existing OSDU services. Several new federated search APIs will be added to Search service, Storage service, Well Delivery DDMS and Wellbore DDMS, and a new multi-region configuration service will be created for multi-region administration and configuration tasks. Three of the new APIs are in scope of the PoC. The new configuration service is not in scope of the PoC; manual configuration will be done instead.
How Federated Search Works
Deployment
A global OSDU Administrator deploys multiple instances of OSDU to multiple CSP regions, each instance with its own data partitions. Data ingestion works the same as today. A data record only exists in the OSDU instance in the home region where the data is ingested. There is no raw data replication or catalog data replication or search index replication between instances.
Region and Group Configuration
The Administrator creates a global “Regions” table that holds the ID, endpoint, and data partition names of every deployed OSDU instance. The Administrator then configures “Cross-Region Partition Groups” to specify which partitions are logically associated to form a cross-region search group. Multiple such groups can be configured. Each partition ID in a “Cross-Region Partition Group” has the format “Instance ID: Partition Name” to guarantee its uniqueness. In the actual implementation, the configuration APIs will be part of the new Multi-region Configuration Service. In the PoC, the configuration will be done by manually creating two Cosmos DB tables.
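The two tables can be pictured with a minimal Python sketch. The class names, field names, and the `validate_group` helper are hypothetical illustrations; the source does not define a schema for the Cosmos DB tables, only that regions carry an ID, an endpoint, and partition names, and that group members use the “Instance ID: Partition Name” format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One row of the "Regions" table (field names are assumptions)."""
    instance_id: str
    endpoint: str
    partitions: tuple[str, ...]


@dataclass(frozen=True)
class PartitionGroup:
    """One row of the "Cross-Region Partition Groups" table (assumed shape)."""
    group_id: str
    members: tuple[str, ...]  # each entry is "Instance ID: Partition Name"


def make_partition_id(instance_id: str, partition: str) -> str:
    """Build the globally unique "Instance ID: Partition Name" identifier."""
    return f"{instance_id}: {partition}"


def split_partition_id(partition_id: str) -> tuple[str, str]:
    """Split "Instance ID: Partition Name" back into its two parts."""
    instance_id, sep, partition = partition_id.partition(":")
    if not sep or not instance_id.strip() or not partition.strip():
        raise ValueError(f"not in 'Instance ID: Partition Name' form: {partition_id!r}")
    return instance_id.strip(), partition.strip()


def validate_group(group: PartitionGroup, regions: list[Region]) -> None:
    """Raise ValueError if a group member is not a partition of a known region."""
    known = {
        make_partition_id(region.instance_id, partition)
        for region in regions
        for partition in region.partitions
    }
    unknown = [member for member in group.members if member not in known]
    if unknown:
        raise ValueError(f"unknown partitions in group {group.group_id}: {unknown}")
```

Prefixing the partition name with the instance ID is what lets two instances both own a partition with the same name without colliding inside a group.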
Two New Global Search APIs and One New Global Storage Query API
Two new APIs, “POST /api/global_search/v2/query” and “POST /api/global_search/v2/query_with_cursor”, will be added to the core Search Service. They are the global counterparts of the current local APIs “POST /api/search/v2/query” and “POST /api/search/v2/query_with_cursor”, which search only in a local partition. The Global Search APIs search a “Cross-Region Partition Group” across multiple regions.
In the Storage Service, there is a set of four query APIs that retrieve data records from a local partition. Four global versions of these query APIs will be implemented to retrieve records from a “Cross-Region Partition Group”. For simplicity, the PoC will add only one new global query API to the Storage Service, “GET /api/storage/v2/global_query/kinds”, as the global counterpart of the local API “GET /api/storage/v2/query/kinds”.
In Wellbore DDMS and Well Delivery DDMS, there are several local query APIs. Global versions of these APIs will need to be added in the actual implementation. For simplicity, they will not be included in the PoC.
The request body and request headers for the new global APIs are the same as for the local versions, except for the Partition ID in the request header: the new APIs require a “Cross-Region Partition Group” ID instead of a single Partition ID. This ID specifies the default partitions for the search. An optional query parameter, “remote_partitions”, lets the user select a subset of the default partitions to search. For example, if a group includes “Instance 1: Partition A, Instance 2: Partition A and Instance 3: Partition A” and the “remote_partitions” query parameter has the value “Instance 1: Partition A, Instance 3: Partition A”, the global search will skip Instance 2: Partition A. This gives users the flexibility to skip remote partitions to optimize performance.
The global APIs return the aggregation of the search results from all the partitions belonging to the group, or from the subset specified in the optional query parameter. The local partition is always implicitly included in the search. The optional query parameter will be validated against the “Cross-Region Partition Group” ID in the header. If any of the remote partitions does not belong to the group, a “400 Bad Request” error is returned. For simplicity, this validation will not be implemented in the PoC.
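The partition selection and aggregation rules above can be sketched as follows. This is an illustration under stated assumptions, not the service implementation: `BadRequest`, `resolve_partitions`, and `merge_results` are hypothetical names, and the `results` / `totalCount` response fields are assumed to mirror the shape of the local search response.

```python
from typing import Optional, Sequence


class BadRequest(Exception):
    """Stands in for an HTTP "400 Bad Request" response in this sketch."""


def resolve_partitions(
    group_members: Sequence[str],
    local_partition: str,
    remote_partitions: Optional[str] = None,
) -> list[str]:
    """Return the partitions to search, with the local partition always first.

    `remote_partitions` is the raw, comma-separated query parameter value.
    When it is absent, every member of the group is searched.
    """
    if remote_partitions is None or not remote_partitions.strip():
        requested = [m for m in group_members if m != local_partition]
    else:
        requested = [p.strip() for p in remote_partitions.split(",") if p.strip()]
        outside = [p for p in requested if p not in group_members]
        if outside:
            # Full validation; the PoC skips this step.
            raise BadRequest(f"partitions not in group: {outside}")
    ordered = [local_partition] + [p for p in requested if p != local_partition]
    return list(dict.fromkeys(ordered))  # drop duplicates, keep order


def merge_results(responses: Sequence[dict]) -> dict:
    """Concatenate per-partition results and sum their counts (assumed fields)."""
    merged: dict = {"results": [], "totalCount": 0}
    for response in responses:
        merged["results"].extend(response.get("results", []))
        merged["totalCount"] += response.get("totalCount", 0)
    return merged
```

Simple concatenation ignores cross-partition ordering and paging; how a real aggregation should rank or page merged results is not specified in this document.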
POST /api/global_search/v2/query?remote_partitions=p1, p2, …
Header: cross-region-partition-group-id
POST /api/global_search/v2/query_with_cursor?remote_partitions=p1, p2, …
Header: cross-region-partition-group-id
GET /api/storage/v2/global_query/kinds?remote_partitions=p1, p2, …
Header: cross-region-partition-group-id
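As an illustration of the request shapes listed above, the following sketch assembles the URL, header, and body without sending anything. The function name, the base URL, and the comma-joined encoding of “remote_partitions” are assumptions; authorization and any other headers that the local APIs already require are omitted.

```python
import json
from typing import Optional, Sequence
from urllib.parse import urlencode

GROUP_HEADER = "cross-region-partition-group-id"


def build_global_request(
    method: str,
    base_url: str,
    path: str,
    group_id: str,
    remote_partitions: Optional[Sequence[str]] = None,
    body: Optional[dict] = None,
) -> dict:
    """Describe a global API call as a plain dict (nothing is sent)."""
    url = base_url.rstrip("/") + path
    if remote_partitions:
        # Assumed encoding: one comma-separated, percent-encoded value.
        url += "?" + urlencode({"remote_partitions": ",".join(remote_partitions)})
    headers = {GROUP_HEADER: group_id}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return {"method": method, "url": url, "headers": headers, "data": data}


# Hypothetical usage: a global search restricted to two remote partitions.
example = build_global_request(
    "POST",
    "https://osdu.example.com",  # placeholder endpoint
    "/api/global_search/v2/query",
    group_id="group-1",
    remote_partitions=["p1", "p2"],
    body={"kind": "*:*:*:*"},
)
```

The resulting dict can be passed to any HTTP client, or mirrored by hand in Postman.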
For the new global APIs, the user must meet the data access entitlement requirements of every participating partition. Otherwise, the global APIs return a “401 Unauthorized” error.
Testing Federated Search under Multi-Region Deployment
To measure latency adequately, the PoC will use one multi-region deployment and run multiple tests against it.
Deployment and configuration
Three OSDU on Azure instances will be created in three Azure regions on different continents (West US, West Europe, and East Asia), each with one data partition. The “Regions” table and a “Cross-Region Partition Groups” table will be created manually, with one group that includes the three partitions from the three instances. Each of the three Elasticsearch clusters will be configured with the other two as remote clusters, using the Elasticsearch Cluster Update Settings API.
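The remote cluster configuration can be sketched as the request bodies one would send to `PUT /_cluster/settings` on each cluster. The cluster aliases and seed host names below are placeholders, and the sketch assumes the seed-based (sniff mode) remote configuration on Elasticsearch’s default transport port 9300; the actual PoC settings may differ.

```python
import json


def remote_cluster_settings(remotes: dict[str, list[str]]) -> dict:
    """Body for PUT /_cluster/settings that registers the given remote clusters."""
    return {
        "persistent": {
            "cluster": {
                "remote": {alias: {"seeds": seeds} for alias, seeds in remotes.items()}
            }
        }
    }


def settings_per_cluster(seeds_by_cluster: dict[str, list[str]]) -> dict[str, dict]:
    """For each cluster, build settings that register every other cluster as remote."""
    return {
        cluster: remote_cluster_settings(
            {other: seeds for other, seeds in seeds_by_cluster.items() if other != cluster}
        )
        for cluster in seeds_by_cluster
    }


# Placeholder seed nodes for the three regions named in the deployment.
SEEDS = {
    "west_us": ["es-westus.example.com:9300"],
    "west_europe": ["es-westeurope.example.com:9300"],
    "east_asia": ["es-eastasia.example.com:9300"],
}

if __name__ == "__main__":
    for cluster, body in settings_per_cluster(SEEDS).items():
        print(f"# apply on {cluster}")
        print(json.dumps(body, indent=2))
```

Each cluster ends up with exactly two remote entries, matching the “two remote clusters for each” setup described above.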
Tests
Upload different test data sets to each instance and run manual tests in Postman. Test the local search and storage query APIs and their global counterparts with the same request body, and log the response time for each test. When running the global API tests, vary the optional query parameter to select different numbers of remote partitions. Test with different kinds and different availability of data. The delta between the response time of a local request and that of the corresponding global request gives a rough estimate of the latency caused by cross-region search.
| Local | Global |
|---|---|
| Search/query | Search/query with 1 remote partition |
| | Search/query with 2 remote partitions |
| Search/query_with_cursor Initial request | Search/query_with_cursor Initial request with 1 remote partition |
| | Search/query_with_cursor Initial request with 2 remote partitions |
| Search/query_with_cursor request | Search/query_with_cursor request with 1 remote partition |
| | Search/query_with_cursor request with 2 remote partitions |
| Storage/query/kinds | Storage/query/kinds with 1 remote partition |
| | Storage/query/kinds with 2 remote partitions |
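The local versus global delta described above can be computed from logged response times with a small helper. This is a sketch only: the function names are hypothetical, and using the median of repeated runs is an assumption made here to dampen outliers, not something the test plan prescribes.

```python
import statistics
import time
from typing import Callable, Sequence


def time_call_ms(call: Callable[[], object], repeats: int = 5) -> list[float]:
    """Run `call` several times and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def cross_region_overhead_ms(local_ms: Sequence[float], global_ms: Sequence[float]) -> float:
    """Rough cross-region latency estimate: median global time minus median local time."""
    return statistics.median(global_ms) - statistics.median(local_ms)
```

Response times copied from Postman can be passed straight to `cross_region_overhead_ms`, one call per row of the test matrix.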
Next Step After Testing
After the above tests are done and latency data is collected, the direction for the next step will be determined from the latency data.
If the latency data does not meet the SLA, the Federated Search approach is not an acceptable option, and we will move to the next architecture model, “External Partition”, which requires EDS shadow record replication to decrease search latency.
On the other hand, if the latency data is within the acceptable SLA, an additional PoC step can follow: testing the fetch of a relatively large amount of raw data from a remote region using the File service. The details of this PoC step will be written later if it is deemed necessary.
POC Code Branch
- Search POC branch – Code repository
- Storage POC branch – Code repository