[ADR] Hierarchical data distribution statistics based on path - API endpoint
Introduction
We need a solution for retrieving dataset statistics, which currently consist only of dataset sizes.
The purpose of this ADR is to define the approach for retrieving the hierarchical data distribution statistics based on a path.
Status
- Initiated
- Proposed
- Under Review
- Approved
- Rejected
Problem statement
The SDMS API currently exposes the following endpoints for managing dataset sizes:
- POST /dataset/tenant/{tenantid}/subproject/{subprojectid}/dataset/{datasetid}/size - computes the actual dataset size and updates the dataset metadata computed_size field.
- (deprecated) GET /dataset/tenant/{tenantid}/subproject/{subprojectid}/sizes - fetches the sizes of the datasets based on the metadata field filemetadata.size.
Proposed solution
Create a new API endpoint for retrieving the total size of a dataset, a subfolder, or a subproject. The new endpoint would require the viewer or admin role.
Overview
GET /dataset/tenant/{tenant}/subproject/{subproject}/size?path={path}&datasetid={datasetname}
Path parameters:
- tenant - tenant
- subproject - subproject
Query parameters:
- path - folder path for which the analytics are going to be retrieved [mandatory if query parameter {datasetid} is specified]
- datasetid - dataset name for which the analytics are going to be retrieved
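The dependency between the two query parameters can be sketched as a small check. This is a minimal illustration in Python, not the SDMS implementation; the function name validate_size_query and the use of ValueError are assumptions.

```python
from typing import Optional


def validate_size_query(path: Optional[str] = None,
                        datasetid: Optional[str] = None) -> None:
    """Reject a query that names a dataset without its folder path.

    Hypothetical helper: the name and the exception type are assumptions.
    """
    if datasetid is not None and path is None:
        raise ValueError("'path' is mandatory when 'datasetid' is specified")


# Valid combinations: no parameters, path only, or path together with datasetid.
validate_size_query()
validate_size_query(path="folderA/folderB")
validate_size_query(path="folderA/folderB", datasetid="file.txt")
```

Calling validate_size_query(datasetid="file.txt") without a path raises ValueError, which a service would presumably translate into a client error response.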
Response:
HTTP 200
{
"dataset_count": 9999,
"size_bytes": 1024
}
- dataset_count - number of datasets under a specific subproject/folder
- size_bytes - sum of sizes [B] of all datasets under a specific subproject/folder or for a specific dataset
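A client could read this response body as sketched below. The SizeResponse class and parse_size_response function are hypothetical names introduced for illustration; only the two field names come from the response shown above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeResponse:
    dataset_count: int  # number of datasets under the subproject/folder
    size_bytes: int     # sum of dataset sizes, in bytes


def parse_size_response(body: str) -> SizeResponse:
    """Parse the JSON body of an HTTP 200 response (hypothetical helper)."""
    payload = json.loads(body)
    return SizeResponse(
        dataset_count=int(payload["dataset_count"]),
        size_bytes=int(payload["size_bytes"]),
    )


response = parse_size_response('{"dataset_count": 9999, "size_bytes": 1024}')
```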
Examples:
- GET /dataset/tenant/tenant1/subproject/subproject1/size - fetch and sum sizes of all datasets in the subproject subproject1
- GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB - fetch and sum sizes of all datasets under the folder path folderA/folderB in subproject subproject1
- GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB&datasetid=file.txt - fetch the size of the dataset named file.txt that resides under the folder path folderA/folderB in subproject subproject1
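The three request shapes can be produced by one small URL builder. This is a sketch only: build_size_url is a hypothetical client-side helper, and it assumes the query string is introduced with the standard ? separator and that folder separators may stay unescaped.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_size_url(tenant: str, subproject: str,
                   path: Optional[str] = None,
                   datasetid: Optional[str] = None) -> str:
    """Build the relative URL of the proposed size endpoint (hypothetical helper)."""
    # Path parameters are percent-encoded as single URL segments.
    url = (f"/dataset/tenant/{quote(tenant, safe='')}"
           f"/subproject/{quote(subproject, safe='')}/size")
    # Only parameters that were actually supplied go into the query string.
    query = {}
    if path is not None:
        query["path"] = path
    if datasetid is not None:
        query["datasetid"] = datasetid
    if query:
        # safe="/" keeps folder separators readable (assumption: the service accepts them).
        url += "?" + urlencode(query, safe="/")
    return url


subproject_url = build_size_url("tenant1", "subproject1")
folder_url = build_size_url("tenant1", "subproject1", path="folderA/folderB")
dataset_url = build_size_url("tenant1", "subproject1",
                             path="folderA/folderB", datasetid="file.txt")
```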
Details
Currently, two fields in the dataset metadata record can store information about the dataset size: filemetadata.size and computed_size. filemetadata.size is used by the SDK on the client side, while computed_size is intended to be computed and ingested on the server side.
To make sure the chosen field can be a reliable source of truth, the API endpoint implementation will calculate the sum of dataset sizes based on the computed_size field.
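The aggregation could look roughly like the sketch below. It is not the SDMS implementation: the in-memory record shape (the path and name keys), the summarize_sizes name, and the choice to count a missing computed_size as 0 bytes are all assumptions made for illustration.

```python
from typing import Iterable, Mapping, Optional


def _matches(record: Mapping, folder: Optional[str], datasetid: Optional[str]) -> bool:
    record_path = record["path"].strip("/")
    if datasetid is not None:
        # A single dataset is identified by its exact folder and its name.
        return record_path == folder and record["name"] == datasetid
    if folder is None:
        return True  # no path given: the whole subproject
    # Hierarchical match: the folder itself or any folder nested below it.
    return record_path == folder or record_path.startswith(folder + "/")


def summarize_sizes(records: Iterable[Mapping], path: Optional[str] = None,
                    datasetid: Optional[str] = None) -> dict:
    """Sum computed_size over the matching records (hypothetical helper)."""
    folder = path.strip("/") if path else None
    selected = [r for r in records if _matches(r, folder, datasetid)]
    return {
        "dataset_count": len(selected),
        # Assumption: a record without computed_size contributes 0 bytes.
        "size_bytes": sum(r.get("computed_size") or 0 for r in selected),
    }


records = [
    {"path": "folderA/folderB", "name": "file.txt", "computed_size": 100},
    {"path": "folderA/folderB/folderC", "name": "deep.bin", "computed_size": 50},
    {"path": "folderA", "name": "top.bin", "computed_size": 7},
    {"path": "folderAB", "name": "other.bin", "computed_size": 1},  # not under folderA
]
folder_summary = summarize_sizes(records, path="folderA/folderB")
```

Appending "/" before the prefix comparison keeps a sibling folder such as folderAB from being counted under folderA.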
Out of scope / limitations
A challenge with using the computed_size field as a source of truth is that some datasets may not have this property calculated, because currently the only way to update this value is to manually call the Compute Size POST endpoint.
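To illustrate the limitation, the sketch below counts how many records lack computed_size; any such dataset would contribute nothing to a sum over that field. The record shape and the function name are hypothetical, and treating an explicit null as equivalent to a missing field is an assumption.

```python
from typing import Iterable, Mapping


def count_missing_computed_size(records: Iterable[Mapping]) -> int:
    """Count records whose computed_size was never calculated (hypothetical helper)."""
    return sum(1 for record in records if record.get("computed_size") is None)


records = [
    {"name": "a.bin", "computed_size": 2048},
    {"name": "b.bin"},                         # Compute Size endpoint never called
    {"name": "c.bin", "computed_size": None},  # assumed equivalent to missing
]
missing = count_missing_computed_size(records)
```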
A solution for ensuring the reliability of the computed_size field value will be the subject of a separate ADR.