[ADR] Hierarchical data distribution statistics based on path - API endpoint

Introduction

We need a way to retrieve dataset statistics, which currently consist only of dataset sizes.

The purpose of this ADR is to define the approach for retrieving the hierarchical data distribution statistics based on a path.

Status

Problem statement

The SDMS API currently exposes the following endpoints for managing dataset sizes:

POST /dataset/tenant/{tenantid}/subproject/{subprojectid}/dataset/{datasetid}/size - computes the actual dataset size and updates the dataset metadata computed_size field.
(deprecated) GET /dataset/tenant/{tenantid}/subproject/{subprojectid}/sizes - fetches the sizes of the datasets based on the metadata field filemetadata.size.

Proposed solution

Create a new API endpoint for retrieving the total size of a dataset, a subfolder, or a subproject. The new endpoint would require the viewer or admin role.

Overview

GET /dataset/tenant/{tenant}/subproject/{subproject}/size?path={path}&datasetid={datasetname}

Path parameters:

tenant - name of the tenant
subproject - name of the subproject

Query parameters:

path - folder path for which the statistics are retrieved [optional; mandatory if the query parameter datasetid is specified]
datasetid - name of the dataset for which the statistics are retrieved [optional]

Response:

HTTP 200

{
  "dataset_count": 9999,
  "size_bytes": 1024
}

dataset_count - number of datasets under the specified subproject or folder
size_bytes - sum of the sizes, in bytes, of all datasets under the specified subproject or folder, or the size of the specified dataset

Examples:

GET /dataset/tenant/tenant1/subproject/subproject1/size - fetch and sum the sizes of all datasets in subproject subproject1
GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB - fetch and sum the sizes of all datasets under the folder path folderA/folderB in subproject subproject1
GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB&datasetid=file.txt - fetch the size of the dataset named file.txt that resides under the folder path folderA/folderB in subproject subproject1
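The three request shapes above can be sketched from the client side as a small URL builder. This is only an illustration: the function name build_size_url and the base URL are hypothetical, and the check that datasetid requires path mirrors the rule stated under "Query parameters" rather than any confirmed server behaviour.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_size_url(base_url: str, tenant: str, subproject: str,
                   path: Optional[str] = None,
                   datasetid: Optional[str] = None) -> str:
    """Build the proposed size endpoint URL (hypothetical helper)."""
    # Rule from the ADR: path is mandatory when datasetid is specified.
    if datasetid is not None and path is None:
        raise ValueError("path is mandatory when datasetid is specified")

    url = (f"{base_url.rstrip('/')}/dataset/tenant/{quote(tenant, safe='')}"
           f"/subproject/{quote(subproject, safe='')}/size")

    # Only include the query parameters that were actually supplied.
    params = {}
    if path is not None:
        params["path"] = path
    if datasetid is not None:
        params["datasetid"] = datasetid
    if params:
        # Keep "/" readable in the folder path, as in the examples above.
        url += "?" + urlencode(params, safe="/")
    return url


if __name__ == "__main__":
    base = "https://sdms.example.com/api"  # assumed base URL
    print(build_size_url(base, "tenant1", "subproject1"))
    print(build_size_url(base, "tenant1", "subproject1", path="folderA/folderB"))
    print(build_size_url(base, "tenant1", "subproject1",
                         path="folderA/folderB", datasetid="file.txt"))
```

Rejecting a datasetid without a path on the client is a design choice for this sketch; the service would still need to validate the same rule.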

Details

Currently, two fields in the dataset metadata record can store information about the dataset size: filemetadata.size and computed_size. filemetadata.size is used by the SDK on the client side, whereas computed_size is intended to be computed and ingested on the server side. To make sure the chosen field can be a reliable source of truth, the API endpoint implementation will calculate the sum of dataset sizes based on the computed_size field.
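A minimal sketch of the summation, assuming the metadata records are available in memory as dictionaries with the hypothetical keys path, name and computed_size. The record shape, the recursive prefix match on the folder path, the treatment of a missing computed_size as 0, and a dataset_count of 1 for a single dataset are all assumptions made for illustration; the real service would most likely run the aggregation as a query against its metadata store.

```python
from typing import Dict, Iterable, Mapping, Optional


def _normalize(path: str) -> str:
    # Compare folder paths without leading or trailing slashes.
    return path.strip("/")


def aggregate_sizes(records: Iterable[Mapping],
                    path: Optional[str] = None,
                    datasetid: Optional[str] = None) -> Dict[str, int]:
    """Sum computed_size over the datasets selected by path and datasetid."""
    if datasetid is not None and path is None:
        raise ValueError("path is mandatory when datasetid is specified")

    prefix = _normalize(path) if path else ""
    count = 0
    total = 0
    for rec in records:
        rec_path = _normalize(rec["path"])
        if datasetid is not None:
            # Single dataset: exact folder and exact name.
            if rec_path != prefix or rec["name"] != datasetid:
                continue
        elif prefix:
            # Folder: the folder itself or any folder nested below it.
            if rec_path != prefix and not rec_path.startswith(prefix + "/"):
                continue
        count += 1
        # Assumption: a dataset without computed_size contributes 0 bytes.
        total += rec.get("computed_size") or 0
    return {"dataset_count": count, "size_bytes": total}


if __name__ == "__main__":
    sample = [
        {"path": "folderA/folderB", "name": "file.txt", "computed_size": 100},
        {"path": "folderA/folderB/sub", "name": "a.bin", "computed_size": 50},
        {"path": "folderAB", "name": "b.bin", "computed_size": 7},
        {"path": "folderA", "name": "c.bin"},  # computed_size never calculated
    ]
    print(aggregate_sizes(sample))
    print(aggregate_sizes(sample, path="folderA/folderB"))
    print(aggregate_sizes(sample, path="folderA/folderB", datasetid="file.txt"))
```

Matching on the prefix followed by "/" keeps a sibling folder such as folderAB out of the result for path=folderA.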

Out of scope / limitations

A challenge with using the computed_size field as the source of truth is that some datasets may not have this property calculated, because currently the only way to update the value is to call the Compute Size POST endpoint manually.
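To illustrate the limitation, the sketch below lists the datasets whose computed_size has never been calculated, so a caller can see how much of a reported total is backed by real values. The record keys name and computed_size and the function name are hypothetical, and this is not a proposal for how the gap should be closed.

```python
from typing import Iterable, List, Mapping


def datasets_missing_computed_size(records: Iterable[Mapping]) -> List[str]:
    """Return the names of datasets that have no computed_size value."""
    return [rec["name"] for rec in records if rec.get("computed_size") is None]


if __name__ == "__main__":
    sample = [
        {"name": "file.txt", "computed_size": 100},
        {"name": "c.bin"},                         # field absent
        {"name": "d.bin", "computed_size": None},  # field present but empty
    ]
    # Any total computed over these records reflects only file.txt.
    print(datasets_missing_computed_size(sample))
```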

The solution to ensure the reliability of the value of the computed_size field will be the subject of a separate ADR.

Edited Dec 14, 2023 by Sneha Poddar