ADR: Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query

changed milestone to %M20 - Release 0.23

Is this ADR proposing to limit search results? Won't it be confusing to the users? I believe we didn't have such rules previously that hide part of the results, except access control lists.

Rustam, we should not use access control. Users still need to search and access that data in application-specific workflows.

Basically, we have two kinds of data, system/metadata and domain data, that need to be stored and searched to support different workflows. For a normal keyword search, the user is expected to want domain data, so system/metadata should not be included in the search results.

It is a common issue; OSDU has just never addressed it before.

Thanks, @zhibinmai. I was referring to ACLs just as a way to hide some data from users (those who have no access to that data). To access that data, the user must be granted the required access through the Entitlements (or Policy) service.

My question is: what should the user do to access system/metadata? Is it something that should be kept in mind?

If I understand it correctly, this ADR could help divide the search of the configurations from regular domain data searches, for example, Indexer extensions. Am I right?

@Rustam_Lotsmanenko Correct! IndexPropertyPathConfiguration is an example that should use this kind of solution. Actually, we have more use cases, e.g. making the schema content searchable (using keyword search and facet search) in schema-related workflows. OSDU search, or rather Elasticsearch, is a powerful tool if we make full use of it.

changed the description

To summarize, if we assume that system-meta-data has been chosen as a reserved word in the record authority field and index name, Search should add explicit exclusions to all requests. For example:

POST /*-*-*-*,-system-meta-data*/_search

To fetch anything with the authority of system-meta-data, a specific kind must be included in the request:

POST /system-meta-data-wks-reference-data--indexpropertypathconfiguration-1.0.0,-system-meta-data*/_search

To get a list of all system-meta-data kinds, users may use the Schema service:

GET /api/schema-service/v1/schema?latestVersion=true&authority=system-meta-data
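
As a rough illustration of the lookup above, the query string can be assembled as follows. The host name is a placeholder, not a real deployment; only the path and the two parameters come from the request shown above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the host of your own OSDU deployment.
BASE_URL = "https://osdu.example.com"


def schema_listing_url(authority: str = "system-meta-data") -> str:
    """Build the Schema service URL that lists the latest schemas of one authority."""
    params = {"latestVersion": "true", "authority": authority}
    return f"{BASE_URL}/api/schema-service/v1/schema?{urlencode(params)}"


print(schema_listing_url())
```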

If the user wants to perform a multi-kind search for system-meta-data kinds, they must pass all required kinds in the request:

POST /api/search/v2/query
{
  "kind": "system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.1.0, system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.0.0"
}
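
A minimal sketch of how the Search service might turn the kind field into the index expressions shown above. The kind-to-index mapping (colons to hyphens, lower case) is inferred from the examples in this comment, the wildcard kind is written with colons to match the kind format of the multi-kind example, and the function names are hypothetical. This is not the actual Search service code.

```python
RESERVED_AUTHORITY = "system-meta-data"  # reserved word assumed in this thread


def kind_to_index(kind: str) -> str:
    """Map 'authority:source:entity:version' to an index name, as in the examples above."""
    return kind.strip().replace(":", "-").lower()


def index_expression(kind_field: str) -> str:
    """Join the indices of all comma-separated kinds and append the default exclusion."""
    indices = [kind_to_index(k) for k in kind_field.split(",") if k.strip()]
    return ",".join(indices + [f"-{RESERVED_AUTHORITY}*"])


print(index_expression("*:*:*:*"))
print(index_expression(
    "system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.0.0"
))
```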

@Rustam_Lotsmanenko You are basically correct. However, we should not make users explicitly exclude the kinds for system/metadata. They will be excluded by default, just as OSDU excludes kinds starting with a dot today (please see CrossTenantUtils.java).

We should distinguish these two kinds of workflows:

When I search domain data, I don't need to know about the kinds for system/metadata. I expect OSDU search will filter them for me.
When I search the system/metadata, I (or say the workflows) know what kind of system/metadata should be searched. The kind of the system/metadata will be explicitly specified in the OSDU search query by me or the workflows.
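
The two workflows can be illustrated with a small in-memory simulation. It is only a sketch: the index names are made up, fnmatch stands in for Elasticsearch wildcard matching, and the rule that a kind counts as explicit only when its authority is literally system-meta-data is an assumption about how the filter would be applied.

```python
from fnmatch import fnmatchcase
from typing import List

RESERVED_AUTHORITY = "system-meta-data"


def visible_indices(kind_field: str, all_indices: List[str]) -> List[str]:
    """Return the indices a query would hit under the default-exclusion rule."""
    hits: List[str] = []
    for kind in (k.strip() for k in kind_field.split(",")):
        if not kind:
            continue
        pattern = kind.replace(":", "-").lower()
        explicit = kind.split(":")[0] == RESERVED_AUTHORITY
        for index in all_indices:
            is_system = index.startswith(RESERVED_AUTHORITY + "-")
            # System/metadata indices are reachable only through an explicit kind.
            allowed = explicit or not is_system
            if fnmatchcase(index, pattern) and allowed and index not in hits:
                hits.append(index)
    return hits


indices = [
    "osdu-wks-master-data--wellbore-1.0.0",
    "system-meta-data-wks-reference-data--indexpropertypathconfiguration-1.0.0",
]
print(visible_indices("*:*:*:*", indices))
print(visible_indices(
    "system-meta-data:wks:reference-data--IndexPropertyPathConfiguration:1.0.0", indices
))
```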

For IndexPropertyPathConfiguration, I think we will leave it as it is for now. The ADR is to support upcoming workflows with new kinds.

Thank you, @zhibinmai. The proposed solution is fine for addressing the raised issue. In general, we're not opposed to it. However, we believe it could be improved with a different, more transparent, and flexible approach.

One potential improvement is to update the Search API, adjust the interface, and introduce high-level segregation for data. For example, we could implement high-level data categorization like this:


{
  "kind": "*.*.*.*",
  "category": "configurations",
  "query": "wellbore"
}

{
  "kind": "*.*.*.*",
  "category": "domain",
  "query": "wellbore"
}

Additional categories could also be introduced later.

You've mentioned that more types of system data will be introduced. What if we don't hide them under service behavior and, instead, introduce them as API body parameters?

This approach could be beneficial for users, as it would enable them to search configurations and system data more freely, without requiring knowledge of specific kinds or understanding underlying behavior.

It could be extended for new data categories.

We could keep searching domain data as default behavior to not interfere with existing workflows.
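
To make the proposed request shape concrete, here is a hedged sketch of how a request body with an optional category might be parsed. The category field is only a proposal in this thread, not part of the current Search API, and the class and default names are hypothetical.

```python
import json
from dataclasses import dataclass

DEFAULT_CATEGORY = "domain"  # assumed default so existing requests behave as before


@dataclass
class CategorizedQuery:
    kind: str
    query: str = ""
    category: str = DEFAULT_CATEGORY


def parse_query(body: str) -> CategorizedQuery:
    """Parse a JSON request body; a missing category falls back to domain data."""
    data = json.loads(body)
    return CategorizedQuery(
        kind=data["kind"],
        query=data.get("query", ""),
        category=data.get("category", DEFAULT_CATEGORY),
    )


print(parse_query('{"kind": "*.*.*.*", "query": "wellbore"}'))
print(parse_query('{"kind": "*.*.*.*", "category": "configurations", "query": "wellbore"}'))
```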

Please share your thoughts!

@Rustam_Lotsmanenko Thank you for your input. I have a few concerns about the proposed solution:

1. How do we categorize the kinds in order to populate the category field when records are indexed?
2. All existing records would need to be re-indexed, even if item 1 is resolved.
3. It is flexible but might not benefit end users, as most of the time they search domain data. Of course, we can default the category to "domain" in the Search service when it is not set.

Thanks for the feedback, @zhibinmai, please review:

1. Introduce a new schema parameter, x-osdu-category. When it is absent, the kind would be treated as domain data.
2. There is no need to reindex already ingested domain data. However, the system/metadata does need to be reindexed.
3. I agree that it might not provide sufficient benefits. It depends on whether users find it useful to introduce high-level categories or not.
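
The first point above could be sketched as follows. x-osdu-category is the attribute proposed in this comment, not an existing OSDU schema extension, and where it would sit in the schema document is an assumption (top level here).

```python
from typing import Any, Dict

DEFAULT_CATEGORY = "domain"


def schema_category(schema: Dict[str, Any]) -> str:
    """Return the proposed x-osdu-category of a schema, or 'domain' when absent."""
    value = schema.get("x-osdu-category")
    return value if isinstance(value, str) and value else DEFAULT_CATEGORY


print(schema_category({"title": "Wellbore"}))
print(schema_category({
    "title": "IndexPropertyPathConfiguration",
    "x-osdu-category": "configurations",
}))
```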

A possible way of implementation:

During indexing, we can check for a field x-osdu-category and handle it differently. Instead of modifying the existing authority or source schema properties, we can add the category at the beginning of the index name and mark it with a rarely used delimiter, for example three dots: ...configurations. This format is valid for Elasticsearch and unusual enough to avoid misuse. The full index name may look like this:

...configurations-osdu-wks-reference-data--indexpropertypathconfiguration-1.0.0
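
A sketch of how the Indexer might derive such an index name, under the assumptions of this proposal: the category comes from the hypothetical x-osdu-category attribute, domain data keeps its current un-prefixed name, and the kind-to-index mapping lower-cases the kind and replaces colons with hyphens.

```python
CATEGORY_DELIMITER = "..."  # delimiter suggested above; any rarely used marker would do
DEFAULT_CATEGORY = "domain"


def index_name(kind: str, category: str = DEFAULT_CATEGORY) -> str:
    """Build the index name for a kind, prefixing non-domain categories."""
    base = kind.replace(":", "-").lower()
    if category == DEFAULT_CATEGORY:
        return base  # domain indices keep their current names
    return f"{CATEGORY_DELIMITER}{category}-{base}"


print(index_name("osdu:wks:master-data--Wellbore:1.0.0"))
print(index_name(
    "osdu:wks:reference-data--IndexPropertyPathConfiguration:1.0.0", "configurations"
))
```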

Handling this in search would be relatively straightforward. Regular search requests for domain data:

{
  "kind": "*.*.*.*",
  "query": "wellbore"
}

Would become:

POST /*-*-*-*,-...*/_search

For categorized requests:

{
  "kind": "*.*.*.*",
  "category": "configurations",
  "query": "wellbore"
}

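We can use the following (without a trailing -...* exclusion, which would also remove the category indices just selected):

POST /...configurations-*-*-*-*/_search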
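
The translation from request body to index expression could look roughly like this. It is a sketch, not Search service code: the function name is hypothetical, and the wildcard kind is mapped to *-*-*-* as in the requests above. The categorized branch deliberately carries no trailing -...* exclusion, on the assumption that an exclusion placed after the wildcard would also remove the category's own indices.

```python
from typing import Optional

CATEGORY_DELIMITER = "..."
DEFAULT_CATEGORY = "domain"
# Both spellings of the all-kinds wildcard are accepted in this sketch.
ALL_KINDS = {"*.*.*.*", "*:*:*:*"}


def search_target(kind: str, category: Optional[str] = None) -> str:
    """Return the Elasticsearch index expression for a (possibly categorized) request."""
    pattern = "*-*-*-*" if kind in ALL_KINDS else kind.replace(":", "-").lower()
    if category is None or category == DEFAULT_CATEGORY:
        # Domain search: everything except indices that carry a category prefix.
        return f"{pattern},-{CATEGORY_DELIMITER}*"
    return f"{CATEGORY_DELIMITER}{category}-{pattern}"


print("POST /" + search_target("*.*.*.*") + "/_search")
print("POST /" + search_target("*.*.*.*", "configurations") + "/_search")
```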

Pros:

New categories could be introduced with time without code modifications.
No need to worry about conflicts in a multitenant environment. The only requirement would be to not use ... at the beginning of the schema authority.

Cons:

May not be beneficial if such data segregation remains rarely used.
Changes required in several services like Indexer and Search.

@Rustam_Lotsmanenko That is great. Let's see whether other reviewers have any input on your proposal. Thank you!

@Rustam_Lotsmanenko The objective of the ADR is to offset a side effect: making system/metadata searchable via OSDU search could degrade the quality of search results for common users who are only interested in domain data.

So we need to find a way to distinguish system/metadata from the domain data so that all system/metadata can be excluded in the context of the domain data search.

The next question is whether we should exclude the system/metadata by default. There are pros and cons to excluding it by default.
Pro: Adding system/metadata to the OSDU index won't change existing search behavior (or expectations), as the added system/metadata is excluded by OSDU search by default.
Con: The exclusion logic is hidden and hardcoded in OSDU search. One may argue that this is not a good approach because it lacks flexibility.

Back to your latest proposal, I think you agreed to exclude system/metadata from search results by default. However, I have a few concerns about your proposal:

1. We can't use the dot prefix. @nthakur reminded us that Elasticsearch follows the Linux file naming convention and treats any index starting with a dot as a hidden or system index. Here is the relevant documentation: https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html#system-indices.
2. Though we can automatically map the kind name to an index with an added category prefix, it may trigger large-scale refactoring in the OSDU Search and Indexer services, especially the Search service. For example, to map kind names to index names in OSDU search for a list of kinds, we need to parse the schemas to determine which indices should get the category prefix.

@gehrmann @nthakur @mzhu9 any suggestions or comments are welcome.
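
The second concern can be illustrated with a sketch of the extra lookup that kind-to-index conversion would need. The CATEGORY_BY_KIND table is a hypothetical stand-in for parsing each schema (through a Schema service call or a cache), and the cat-- prefix marker is a placeholder, since the actual delimiter was still undecided.

```python
from typing import Dict, List

# Hypothetical lookup result; a real service would have to read each schema.
CATEGORY_BY_KIND: Dict[str, str] = {
    "osdu:wks:reference-data--IndexPropertyPathConfiguration:1.0.0": "configurations",
}


def indices_for_kinds(kinds: List[str], prefix_marker: str) -> List[str]:
    """Resolve each kind to its index, adding a category prefix where one is declared."""
    indices = []
    for kind in kinds:
        base = kind.replace(":", "-").lower()
        category = CATEGORY_BY_KIND.get(kind)  # one schema lookup per requested kind
        indices.append(f"{prefix_marker}{category}-{base}" if category else base)
    return indices


print(indices_for_kinds(
    [
        "osdu:wks:master-data--Wellbore:1.0.0",
        "osdu:wks:reference-data--IndexPropertyPathConfiguration:1.0.0",
    ],
    prefix_marker="cat--",
))
```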

@zhibinmai Thank you for your response, and sorry for my late response.

Regarding your concerns:

1. Yes, such indices are hidden by default, but we can control this using special settings: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-hidden. This could help keep our data separate. In general, though, we don't have to use dots; we can use different symbols if needed.
2. Yes, it could trigger refactoring. But it seems only a few bits of code, like https://community.opengroup.org/osdu/platform/system/search-service/-/blob/master/search-core/src/main/java/org/opengroup/osdu/search/util/CrossTenantUtils.java, would need adjustments. Also, we probably don't need extra requests to the Schema service. Users will state which category they're interested in using "category": "configurations" and we'll combine it with "kind": "*.*.*.*".

FYI: @andrei_dalhikh

@Rustam_Lotsmanenko

We are talking about similar goals but different approaches. In both cases, some agreement is needed on the Schema side to identify system/metadata kinds, and code changes are needed in the Search service for filtering. In addition to those two changes, the second approach also requires updates to the Indexer service to interpret and act on the x-osdu-category attribute.

We would also like this to be transparent, but at the same time our intention is not to rely on any special character combinations for system/metadata index names. We would be assuming that whatever character prefix we choose will not be used by the Search backend (Elasticsearch) now or in the near future.

We also consciously want to stay away from queries that target all kinds ("kind": "*.*.*.*"). We have observed that, as the number of kinds in a partition grows, you may see higher latency.

Also, just to be clear, this use case is not limited to SLB; we have seen other requirements and issues in the community that can benefit from this feature.

Thanks, @nthakur, good point.

Yes, this will require changes in the Indexer and could result in overhead at this moment. If there is no need for high-level segregation now, we could return to this discussion later if the need arises.

As mentioned, we're okay with proceeding with the proposed solution. However, we need to choose the reserved name carefully, as it also introduces the assumption that it will not be misused now or in the future.

Regarding the wildcard kind, that was just an example. It seems more beneficial when there is no prior knowledge of the required kinds, but the search can still be limited to specific categories.

Thanks @Rustam_Lotsmanenko for the agreement.

We don't want to make the decision on what the reserved authority/source should be. It should come from the Data Definition team so that the expected usage is clear in the community.

As part of this ADR, we should also document the recommendation on the reserved authority/source in the Schema service.

I propose system-meta-data as the reserved authority for schema kinds.

The name is long enough;
The name is explicit enough to describe the purpose.
I'll add this to the M20 Schema Usage Guide

@Rustam_Lotsmanenko @nthakur @gehrmann It is great that we have come to an agreement. Thank you all!

I will follow up to add this minor change in the Search service.

mentioned in issue osdu/platform/security-and-compliance/legal#38 (closed)

changed title from Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query to ADR: Exclude indices of the system/meta data from the search results unless the indices (kinds) of the system/meta data are explicitly specified in the search query

@ydzeng @Srinivasan_Narayanan @vikasrana @Java1Guy - Please do provide your feedback.

cc: @omprakash_epam @thulasi_dass @deepapathak For your review and comments

mentioned in merge request !543 (merged)

Should be fine from Azure's side. I see most of the doubts were clarified. I hope this will be socialized well, so that the change in search results does not surprise users who already expect system metadata under the parent kind. Thanks!

Thanks @zhibinmai for taking care of this. As a consequence of this issue, a new reserved authority is introduced.

Can we update the issue with the OSDU DD team decision for tracking?

We should also update the Schema service tutorial/documentation on the reserved authority.

There are two sections in the Schema Usage Guide, which explain the structure and limitations of the kind values:

Section 6.1.2 Record kind:
https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/Chapters/06-LifecycleProperties.md#612-record-kind
Appendix D 1.3 Schema Identifier kind Limitations:
https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/Chapters/93-OSDU-Schemas.md#appendix-d13-schema-identifier-kind-limitations

Thanks @gehrmann. I wasn't even aware we have such good documentation on Schema & Kind.

These are separate documents/tutorials and are not included in the core-service documentation. Should they be linked back from the Schema service tutorial? That would be pretty useful.

Approved by GC. cc: @andrei_dalhikh

closed

added ADRApproved label and removed ADRProposed label

added KBDone label

mentioned in issue osdu/platform/security-and-compliance/legal#47 (closed)

Field	Pro	Con
authority	Those indices can be filtered precisely.	It could cause name conflicts among tenants in a multi-tenant environment when they share the same services.
source	It should not cause name conflicts among tenants in a multi-tenant environment if each tenant has its own authority for its kinds.	Those indices may be impossible to filter precisely: if the entity type field contains the same keyword, those indices will be filtered out too.
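
The precision trade-off in the table can be checked with a small experiment. This is an approximation only: fnmatch stands in for Elasticsearch wildcard matching, the index names are invented, and the two patterns are one plausible way to express each filter.

```python
from fnmatch import fnmatchcase
from typing import List

RESERVED = "system-meta-data"

indices = [
    # Reserved word used as the authority.
    "system-meta-data-wks-reference-data--config-1.0.0",
    # Reserved word used as the source, with a tenant-specific authority.
    "tenant1-system-meta-data-reference-data--config-1.0.0",
    # Domain kind whose entity type happens to start with the same keyword.
    "osdu-wks-system-meta-data--report-1.0.0",
]


def excluded(pattern: str) -> List[str]:
    """Return the indices that an exclusion pattern would filter out."""
    return [name for name in indices if fnmatchcase(name, pattern)]


by_authority = excluded(f"{RESERVED}-*")   # prefix match: only the authority position
by_source = excluded(f"*-{RESERVED}-*")    # infix match: also catches the entity type
print(by_authority)
print(by_source)
```

In this sketch the authority pattern removes only the first index, while the source pattern removes the second and also the third, which is ordinary domain data.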


