[ADR] Advanced filters for dataset search
Introduction
We need additional filtering support to be able to filter the POST /dataset/tenant/{tenantid}/subproject/{subprojectid} and PUT /operation/bulk-delete (added in !891 (merged)) operations by metadata fields with more complex expressions than a single key-value match.
Status
-
Initiated -
Proposed -
Under Review -
Approved -
Rejected
Problem statement
The SDMS API POST /dataset/tenant/{tenantid}/subproject/{subprojectid} currently accepts the following body parameters, among others:
-
search, a single SQL-like search parameter, for example:search=name=file% -
gtags, an array of strings matching tags associated with dataset metadata.
The search field does not support more than one field, or more than one possible value for a field.
The SDMS API PUT /operation/bulk-delete (added in !891 (merged)) requires a path parameter containing tenantid, subprojectid and path but does not support filtering by metadata fields or tags.
For both search and delete, we need to be able to filter by more than one field, or more than one possible value for a field.
Furthermore, we expect a need for more complex filter solutions, such as combining AND, OR and NOT operators. The proposed solution should ideally be extensible to support additional expressions and operators in the future if needed.
Proposed solution
Add an optional filter parameter to the POST /dataset/tenant/{tenantid}/subproject/{subprojectid} and PUT /operation/bulk-delete API endpoints.
The search and gtags parameters are to be deprecated.
Overview
The filter parameter can take a payload with a variable format, allowing expressing a simple filter on a single field, as well as logical combinations of filters with arbitrary complexity.
The POST /dataset/tenant/{tenantid}/subproject/{subprojectid} operation has been selected for extension because:
- Advanced metadata filtering, encompassing select and search functionalities, has already been incorporated into that operation.
- The SDMS API also accepts the
GETmethod for the operation with parameters provided in the query string, as a legacy endpoint. ThePOSTversion of the endpoints has been introduced to address issues related to handling large request parameters, where sending the cursor as a query parameter can lead to oversized requests and subsequent failures.
Examples
Example value for the filter parameter:
{
"and": [
{
"not": {
"property": "gtags",
"operator": "CONTAINS",
"value": "tagA"
}
},
{
"or": [
{
"property": "name",
"operator": "LIKE",
"value": "test.%"
},
{
"property": "name",
"operator": "=",
"value": "dataset.sgy"
}
]
}
]
}
This is equivalent to the following pseudo-SQL statement:
SELECT * FROM datasets d WHERE
NOT (EXISTS (SELECT VALUE 1 FROM t IN d.data.gtags WHERE t = 'tagA')
OR (IS_STRING(d.data.gtags) AND STRINGEQUALS(d.data.gtags, 'tagA')))
AND (
d.name LIKE 'test.%'
OR d.name = 'dataset.sgy'
)
Details
The filter parameter can be:
-
A property match filter:
{ "property": "...", "operator": "...", "value": "..." }The implementation will be extensible with additional keys if needed in the future, e.g. to specify case sensitivity.
-
An
andororfilter, i.e. an object containing only the keyandoror, of which the value is an array of one or more filters (i.e. a property match filter or anand,orornotfilter){ "and": [...] } -
A
notfilter, i.e. an object containing only the keynot, of which the value is a filter (i.e. a property match filter or anand,orornotfilter){ "not": ... }
Out of scope / limitations
The operations at GET /utility/ls and POST /utility/ls can also be used for retrieving datasets, but will not be extended with advanced filtering at the moment. That functionality can be added later if required.