[ADR] Advanced filters for dataset search
Introduction
We need additional filtering support to be able to filter the POST /dataset/tenant/{tenantid}/subproject/{subprojectid}
and PUT /operation/bulk-delete
(added in !891 (merged)) operations by metadata fields with more complex expressions than a single key-value match.
Status
-
Initiated -
Proposed -
Under Review -
Approved -
Rejected
Problem statement
The SDMS API POST /dataset/tenant/{tenantid}/subproject/{subprojectid}
currently accepts the following body parameters, among others:
-
search
, a single SQL-like search parameter, for example:search=name=file%
-
gtags
, an array of strings matching tags associated with dataset metadata.
The search
field does not support more than one field, or more than one possible value for a field.
The SDMS API PUT /operation/bulk-delete
(added in !891 (merged)) requires a path
parameter containing tenantid
, subprojectid
and path
but does not support filtering by metadata fields or tags.
For both search and delete, we need to be able to filter by more than one field, or more than one possible value for a field.
Furthermore, we expect a need for more complex filter solutions, such as combining AND
, OR
and NOT
operators. The proposed solution should ideally be extensible to support additional expressions and operators in the future if needed.
Proposed solution
Add an optional filter
parameter to the POST /dataset/tenant/{tenantid}/subproject/{subprojectid}
and PUT /operation/bulk-delete
API endpoints.
The search
and gtags
parameters are to be deprecated.
Overview
The filter
parameter can take a payload with a variable format, allowing expressing a simple filter on a single field, as well as logical combinations of filters with arbitrary complexity.
The POST /dataset/tenant/{tenantid}/subproject/{subprojectid}
operation has been selected for extension because:
- Advanced metadata filtering, encompassing select and search functionalities, has already been incorporated into that operation.
- The SDMS API also accepts the
GET
method for the operation with parameters provided in the query string, as a legacy endpoint. ThePOST
version of the endpoints has been introduced to address issues related to handling large request parameters, where sending the cursor as a query parameter can lead to oversized requests and subsequent failures.
Examples
Example value for the filter
parameter:
{
"and": [
{
"not": {
"property": "gtags",
"operator": "CONTAINS",
"value": "tagA"
}
},
{
"or": [
{
"property": "name",
"operator": "LIKE",
"value": "test.%"
},
{
"property": "name",
"operator": "=",
"value": "dataset.sgy"
}
]
}
]
}
This is equivalent to the following pseudo-SQL statement:
SELECT * FROM datasets d WHERE
NOT (EXISTS (SELECT VALUE 1 FROM t IN d.data.gtags WHERE t = 'tagA')
OR (IS_STRING(d.data.gtags) AND STRINGEQUALS(d.data.gtags, 'tagA')))
AND (
d.name LIKE 'test.%'
OR d.name = 'dataset.sgy'
)
Details
The filter
parameter can be:
-
A property match filter:
{ "property": "...", "operator": "...", "value": "..." }
The implementation will be extensible with additional keys if needed in the future, e.g. to specify case sensitivity.
-
An
and
oror
filter, i.e. an object containing only the keyand
oror
, of which the value is an array of one or more filters (i.e. a property match filter or anand
,or
ornot
filter){ "and": [...] }
-
A
not
filter, i.e. an object containing only the keynot
, of which the value is a filter (i.e. a property match filter or anand
,or
ornot
filter){ "not": ... }
Out of scope / limitations
The operations at GET /utility/ls
and POST /utility/ls
can also be used for retrieving datasets, but will not be extended with advanced filtering at the moment. That functionality can be added later if required.