ADR: Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query
Applications or systems will likely need their system/meta data to be searchable via OSDU search, but that system/meta data is not expected to appear in the results of a normal keyword search.
For example, an application stores its system data in storage under kind "xyz" (please ignore the kind syntax in this example).
When users try to search data with keyword "wellbore", the data from kind "xyz" should not be included in the search result if users do search as below:
Case 1:
{ "kind": "*:*:*:*", "query": "wellbore"}
When an application (workflow) searches its system data with the keyword "wellbore", the data from kind "xyz" should be included in the search result if the kind "xyz" is explicitly specified in the search query, e.g.
Case 2:
{ "kind": "xyz", "query": "wellbore"}
To achieve this objective and provide a general solution, we propose to use a reserved name in the "authority" or "source" field for kinds of the system/metadata.
If those kinds are not explicitly specified in the search query, as in Case 1 above, the data from those kinds won't be included in the search result.
If those kinds are explicitly specified in the search query, as in Case 2 above, the data from those kinds will be included in the search result.
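The two rules above can be sketched as follows. This is only an illustration, not the Search service implementation: the reserved name `system-meta-data`, the helper names, and the example kinds are all assumptions, and the sketch assumes the reserved name sits in the "authority" field and that only an exact kind name counts as "explicitly specified".

```python
from fnmatch import fnmatchcase

# Hypothetical reserved name; the actual value is still an open question.
RESERVED_AUTHORITY = "system-meta-data"


def is_system_kind(kind: str) -> bool:
    """True if the kind's authority (first ':'-separated part) is the reserved name."""
    return kind.split(":", 1)[0] == RESERVED_AUTHORITY


def kinds_to_search(requested: str, known_kinds: list[str]) -> list[str]:
    """Resolve a query's "kind" value against the known kinds.

    A system/meta kind is returned only when the request names it exactly;
    a wildcard pattern never pulls one in.
    """
    matched = []
    for kind in known_kinds:
        if kind == requested:
            matched.append(kind)  # Case 2: explicitly specified
        elif fnmatchcase(kind, requested) and not is_system_kind(kind):
            matched.append(kind)  # Case 1: wildcard match, domain data only
    return matched


# Hypothetical kinds, for illustration only.
KNOWN_KINDS = [
    "osdu:wks:master-data--Wellbore:1.0.0",
    "system-meta-data:wks:app-config:1.0.0",
]
```

With these kinds, `kinds_to_search("*:*:*:*", KNOWN_KINDS)` returns only the Wellbore kind, while passing the full `system-meta-data:...` kind name returns that kind.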
The reserved name should be meaningful and unusual enough to avoid naming conflicts with existing schemas. It is an open question what it should be. Here are a few proposals for the reserved name:
"system" -- it may be too common
"system-meta"
"system-meta-data" -- should not be common if it is used as "authority"
Whether the reserved name should go in "authority" or "source" is another open question. Here is what we think:
| Field | Pro | Con |
| --- | --- | --- |
| authority | Those indices can be filtered precisely. | It could cause name conflicts among tenants in a multi-tenant environment where they share the same services. |
| source | It should not cause name conflicts among tenants in a multi-tenant environment if each tenant has its own authority for its kinds. | Those indices might not be filterable precisely: if the entity type field contains the same keyword, those indices will be filtered out too. |
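The precision trade-off between the two fields can be illustrated with simple wildcard matching on index names. This is a sketch under assumptions: index names are taken to follow the form authority-source-entitytype-version, the reserved name `system-meta-data` and all three index names are hypothetical, and `fnmatchcase` merely stands in for whatever index pattern matching the search backend applies.

```python
from fnmatch import fnmatchcase

RESERVED = "system-meta-data"  # hypothetical reserved name

# Hypothetical index names in the assumed form authority-source-entitytype-version.
INDICES = [
    "system-meta-data-wks-app-config-1.0.0",   # reserved name used as authority
    "osdu-system-meta-data-app-config-1.0.0",  # reserved name used as source
    "osdu-wks-system-meta-data-1.0.0",         # domain kind whose entity type equals the reserved name
]


def excluded_by_authority(index: str) -> bool:
    # The authority is the leading segment, so a prefix pattern is unambiguous.
    return fnmatchcase(index, f"{RESERVED}-*")


def excluded_by_source(index: str) -> bool:
    # The source sits in the middle of the name, so the pattern needs a leading
    # wildcard and can also hit an entity type with the same keyword.
    return fnmatchcase(index, f"*-{RESERVED}-*")
```

Here the authority pattern excludes only the first index, whereas the source pattern excludes the second index and also the third, which is a domain kind.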
Any input is welcomed before finalizing the solution.
Once we have a conclusion, Thomas will include this reserved keyword in the schema guide.
Is this ADR proposing to limit search results? Won't it be confusing to the users? I believe we didn't have such rules previously that hide part of the results, except access control lists.
Rustam, we should not use access control. Users still need to search and access that data in application-specific workflows.
Basically, we have two kinds of data, system/metadata and domain data, and both need to be stored and searched to support different workflows. For a normal keyword search, the user is expected to want domain data, so the system/metadata should not be included in the search result.
It is a common issue. OSDU has just never addressed it before.
Thanks, @zhibinmai, I was referring to ACLs just as a way to hide some data from users (for those who have no access to that data). So, to access that data, the user should be granted the required access through the Entitlements (or Policy) service.
My question is, what should the user do to access system/metadata? Is it something that should be kept in mind?
If I understand it correctly, this ADR could help divide the search of the configurations from regular domain data searches, for example, Indexer extensions. Am I right?
@Rustam_Lotsmanenko Correct! IndexPropertyConfigurations is an example that should apply this kind of solution. Actually, we have more use cases, e.g. making the schema content searchable (using keyword search and facet search) in schema-related workflows. OSDU search, or rather Elasticsearch, is a powerful tool if we make full use of it.
To summarize, if we assume that system-meta-data has been chosen as a reserved word in the record authority field and index name, Search should add explicit exclusions to all requests. For example:
POST /*-*-*-*,-system-meta-data*/_search
To fetch anything with the authority of system-meta-data, a specific kind must be included in the request:
POST /system-meta-data-wks-reference-data--indexpropertypathconfiguration-1.0.0,-system-meta-data*/_search
To get a list of all system-meta-data kinds, users may use the Schema service:
GET /api/schema-service/v1/schema?latestVersion=true&authority=system-meta-data
If the user wants to perform a multi-kind search for system-meta-data kinds, they must pass all required kinds in the request:
POST /api/search/v2/query {"kind": "system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.1.0, system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.0.0"}
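The relationship between the kinds in the request body and the index names in the Elasticsearch paths above can be sketched like this. The mapping (replace ":" with "-" and lower-case) is inferred from the examples in this comment, not taken from the Search service code, and the function names are hypothetical.

```python
def to_index_name(kind: str) -> str:
    """Map a kind to the index name form seen in the examples: ':' becomes '-', lower-cased."""
    return kind.strip().replace(":", "-").lower()


def to_index_names(kind_field: str) -> list[str]:
    """Split a comma-separated "kind" value and convert each non-empty entry."""
    return [to_index_name(k) for k in kind_field.split(",") if k.strip()]
```

For the multi-kind request above, `to_index_names` would yield one index name per listed version of IndexPropertyPathConfiguration.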
@Rustam_Lotsmanenko You are basically correct. However, we should not make users explicitly exclude the kinds for system/metadata. By default, they will be excluded, just as OSDU excludes kinds starting with a dot today (please see CrossTenantUtils.java).
We should distinguish these two kinds of workflows:
When I search domain data, I don't need to know about the kinds for system/metadata. I expect OSDU search will filter them for me.
When I search the system/metadata, I (or say the workflows) know what kind of system/metadata should be searched. The kind of the system/metadata will be explicitly specified in the OSDU search query by me or the workflows.
For IndexPropertyPathConfiguration, I think we will leave it as it is for now. The ADR is meant to support upcoming workflows with new kinds.
Thank you @zhibinmai,
The proposed solution is fine for addressing the raised issue. In general, we're not opposed to it. However, we believe it could be improved with a different, more transparent, and flexible approach.
One potential improvement is to update the Search API, adjust the interface, and introduce high-level segregation for data. For example, we could implement high-level data categorization like this:
Additional categories could also be introduced later.
You've mentioned that more types of system data will be introduced. What if we don't hide them under service behavior and, instead, introduce them as API body parameters?
This approach could be beneficial for users, as it would enable them to search configurations and system data more freely, without requiring knowledge of specific kinds or understanding underlying behavior.
It could be extended for new data categories.
We could keep searching domain data as default behavior to not interfere with existing workflows.
@Rustam_Lotsmanenko Thank you for your input. I have a few concerns about the proposed solution:
1. How do we categorize the "kinds" to create the category field when records are indexed?
2. Even if point 1 is resolved, all existing records need to be re-indexed.
3. It is flexible but might not benefit end users, as most of the time they search domain data. Of course, we can set the category to "domain" as the default value in the search service when it is not set.
Thanks for the feedback, @zhibinmai, please review:
1. Introduce the new schema parameter x-osdu-category. In case of absence, it would be considered as domain data.
2. There is no need to reindex already ingested domain data. However, the system/meta data does need to be reindexed.
3. I agree that it might not provide sufficient benefits. It depends on whether users find it useful to introduce high-level categories or not.
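The defaulting rule in the first point could look roughly like the sketch below. It assumes the schema is available as a plain dictionary with x-osdu-category at its top level; the helper names and the "configuration" category value used in the usage note are hypothetical.

```python
from typing import Optional

DEFAULT_CATEGORY = "domain"


def category_of(schema: dict) -> str:
    """Return the schema's x-osdu-category, or "domain" when the attribute is absent."""
    return schema.get("x-osdu-category", DEFAULT_CATEGORY)


def visible_in_search(schema: dict, requested_category: Optional[str] = None) -> bool:
    """A kind is searched when its category equals the requested one (default: domain)."""
    return category_of(schema) == (requested_category or DEFAULT_CATEGORY)
```

A schema without the attribute is treated as domain data and stays visible to existing queries, while a schema marked with, say, `"x-osdu-category": "configuration"` is only visible when that category is requested.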
A possible way of implementation:
During indexing, we can check for a field x-osdu-category and handle it differently. Instead of modifying the existing authority or source schema properties, we can add a category at the beginning of the index name and enforce it using a rarely used delimiter. For example, three dots: ...configuration. This format is valid for Elasticsearch and unusual enough to avoid misuse. The full index name may look like this:
@Rustam_Lotsmanenko The objective of the ADR is to find a solution that offsets a side effect: making system/metadata searchable via OSDU search could degrade the quality of search results for common users who are only interested in domain data.
So we need to find a way to distinguish system/metadata from the domain data so that all system/metadata can be excluded in the context of the domain data search.
The next question is whether we should exclude the system/metadata by default. There are pros and cons to excluding the system/metadata by default:
Pro: Adding system/metadata in OSDU index won't change existing search (expectation) as the added system/metadata are excluded by OSDU search by default.
Con: The excluding logic is hidden and hardcoded in the OSDU search. We may argue that it is not a good approach as it is hardcoded and lacks flexibility.
Back to your latest proposal, I think you agreed to exclude system/metadata from search result by default. However, I have a few concerns about your proposal:
Though we can automatically map the kind name to an index with an added category prefix, it may trigger large-scale refactoring in the OSDU search and index search, especially the search service. For example, to map kind names to index names in OSDU search given a list of kinds, we need to parse the schema to determine which index (indices) should get the category prefix when converting kind name to index name.
We are talking about similar goals but different approaches. In both cases, some agreement is needed on Schema to identify system/metadata kinds, along with code changes in the Search service for filtering. In addition to these two changes, the second approach also requires updates to the Indexer service to interpret and act on the x-osdu-category attribute.
We would also like this to be transparent, but at the same time our intention is not to rely on any special character combinations for system/metadata index names. We would be assuming that whatever character prefix we choose will not be used by the Search backend (Elasticsearch) now or in the near future.
We also consciously want to stay away from queries that target all kinds ("kind": "*:*:*:*"). We have observed that as the number of kinds in a partition grows, latency may increase.
Also, just to be clear, this use case is not limited to SLB; we have seen other requirements and issues in the community that can benefit from this feature.
Yes, this will require changes in the Indexer and could result in overhead at this moment. If there is no need for high-level segregation now, we could return to this discussion later if it arises.
As mentioned, we're okay with proceeding with the proposed solution. However, we need to choose the reserved name carefully, as it also introduces the assumption that it will not be misused now or in the future.
Regarding the wildcard kind, that was just an example. It seems more beneficial when there is no prior knowledge of the required kinds, but the search can still be limited to specific categories.
We don't want to make the decision on what the reserved authority/source should be. It should come from the Data Definition team so that the expected usage is clear in the community.
As part of this ADR, we should also document the recommendation on the reserved authority/source in the Schema service.
Chad Leong changed title from Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query to ADR: Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query
Should be fine from Azure's side. I see most of the doubts have been clarified. I hope this will be socialized well, so that neither existing nor new search results surprise users who already expect system metadata in the parent kind. Thanks!
Thanks @gehrmann. I wasn't even aware we have such good documentation on Schema & Kind.
These are separate documents/tutorials and are not included in the core-service documentation. Should these be linked back from the Schema service tutorial? It would be pretty useful.