[ADR] Advanced filters for dataset search

Introduction

We need additional filtering support to be able to filter the POST /dataset/tenant/{tenantid}/subproject/{subprojectid} and PUT /operation/bulk-delete (added in !891 (merged)) operations by metadata fields with more complex expressions than a single key-value match.

Status

  • Initiated
  • Proposed
  • Under Review
  • Approved
  • Rejected

Problem statement

The SDMS API POST /dataset/tenant/{tenantid}/subproject/{subprojectid} currently accepts the following body parameters, among others:

  • search, a single SQL-like search parameter, for example: search=name=file%
  • gtags, an array of strings matching tags associated with dataset metadata.

The search field does not support more than one field, or more than one possible value for a field.

The SDMS API PUT /operation/bulk-delete (added in !891 (merged)) requires a path parameter containing tenantid, subprojectid and path but does not support filtering by metadata fields or tags.

For both search and delete, we need to be able to filter by more than one field, or more than one possible value for a field.

Furthermore, we expect a need for more complex filter solutions, such as combining AND, OR and NOT operators. The proposed solution should ideally be extensible to support additional expressions and operators in the future if needed.

Proposed solution

Add an optional filter parameter to the POST /dataset/tenant/{tenantid}/subproject/{subprojectid} and PUT /operation/bulk-delete API endpoints.

The search and gtags parameters are to be deprecated.

Overview

The filter parameter can take a payload with a variable format, allowing expressing a simple filter on a single field, as well as logical combinations of filters with arbitrary complexity.

The POST /dataset/tenant/{tenantid}/subproject/{subprojectid} operation has been selected for extension because:

  • Advanced metadata filtering, encompassing select and search functionalities, has already been incorporated into that operation.
  • The SDMS API also accepts the GET method for the operation with parameters provided in the query string, as a legacy endpoint. The POST version of the endpoints has been introduced to address issues related to handling large request parameters, where sending the cursor as a query parameter can lead to oversized requests and subsequent failures.

Examples

Example value for the filter parameter:

{
  "and": [
    {
      "not": {
        "property": "gtags",
        "operator": "CONTAINS",
        "value": "tagA"
      }
    },
    {
      "or": [
        {
          "property": "name",
          "operator": "LIKE",
          "value": "test.%"
        },
        {
          "property": "name",
          "operator": "=",
          "value": "dataset.sgy"
        }
      ]
    }
  ]
}

This is equivalent to the following pseudo-SQL statement:

SELECT * FROM datasets d WHERE
NOT (EXISTS (SELECT VALUE 1 FROM t IN d.data.gtags WHERE t = 'tagA') 
   OR (IS_STRING(d.data.gtags) AND STRINGEQUALS(d.data.gtags, 'tagA')))
AND (
  d.name LIKE 'test.%'
  OR d.name = 'dataset.sgy'
)

Details

The filter parameter can be:

  • A property match filter:

    {
      "property": "...",
      "operator": "...",
      "value":  "..."
    }

    The implementation will be extensible with additional keys if needed in the future, e.g. to specify case sensitivity.

  • An and or or filter, i.e. an object containing only the key and or or, of which the value is an array of one or more filters (i.e. a property match filter or an and, or or not filter)

    {
      "and": [...]
    }
  • A not filter, i.e. an object containing only the key not, of which the value is a filter (i.e. a property match filter or an and, or or not filter)

    {
      "not": ...
    }

Out of scope / limitations

The operations at GET /utility/ls and POST /utility/ls can also be used for retrieving datasets, but will not be extended with advanced filtering at the moment. That functionality can be added later if required.

Edited by Laura Damian