Upgrading the getSignedUrl API to use newly created containers for data storage
Overview:
Currently we have a single container where the data for all DAG runs is stored. Data sharing across tasks is done by generating SAS tokens at the container level, which gives any DAG run access to the data of every other DAG run as well. This is the main concern motivating the change to the existing infrastructure.
We will handle this with a storage account in which a new container is created every time a DAG run is triggered. The SAS token is then generated for the newly created container (dedicated to storing the data of that run), and access is restricted accordingly.
With this change, hitting the getSignedUrl API will create a new container in the storage account and generate a SAS token scoped to that newly created container.
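A minimal sketch of the proposed flow, assuming the Node.js @azure/storage-blob SDK with shared-key credentials; the env-var wiring, the one-hour expiry, and the function shape here are placeholders for illustration, not the actual service code:

```ts
import {
  BlobServiceClient,
  StorageSharedKeyCredential,
  generateBlobSASQueryParameters,
  ContainerSASPermissions,
  SASProtocol,
} from "@azure/storage-blob";

// Assumed wiring: account name/key come from the environment.
const ACCOUNT_NAME = process.env.AZURE_STORAGE_ACCOUNT!;
const ACCOUNT_KEY = process.env.AZURE_STORAGE_KEY!;
const credential = new StorageSharedKeyCredential(ACCOUNT_NAME, ACCOUNT_KEY);
const serviceClient = new BlobServiceClient(
  `https://${ACCOUNT_NAME}.blob.core.windows.net`,
  credential
);

// Create a dedicated container for this run, then generate a SAS token
// scoped to that container only, so other runs cannot reach its data.
async function getSignedUrl(containerName: string): Promise<string> {
  const containerClient = serviceClient.getContainerClient(containerName);
  await containerClient.createIfNotExists();

  const sas = generateBlobSASQueryParameters(
    {
      containerName,
      permissions: ContainerSASPermissions.parse("racwl"), // read/add/create/write/list
      startsOn: new Date(),
      expiresOn: new Date(Date.now() + 60 * 60 * 1000), // 1h; actual policy TBD
      protocol: SASProtocol.Https,
    },
    credential
  ).toString();

  return `${containerClient.url}?${sas}`;
}
```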
The following queries need to be resolved -
- Currently, new containers are created on the fly when the getSignedUrl endpoint is hit. What should the behaviour be when a request is repeated for the same workflowId and runId? (It clearly makes no sense to create yet another container; one idempotent option is sketched after this list.)
- We can't use the workflowId as the container name, since it contains special characters that Azure container names don't allow. How should we go about this? Should we name the container after the runId or a uuid? Or should we map the workflowId to a new uuid, store that mapping in a database, and create the container with that uuid? (A third, stateless option is sketched after this list.)
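One option that would answer both queries at once (a sketch under assumptions, not a decision): derive the container name deterministically from the workflowId and runId by hashing, so no database mapping is needed, and combine it with `createIfNotExists` so a repeated request for the same pair is a no-op. The helper `containerNameFor` below is hypothetical:

```ts
import { createHash } from "crypto";

// Hypothetical helper: maps an arbitrary workflowId/runId pair to a valid
// Azure container name (3-63 chars, lowercase letters, digits, hyphens).
// Hashing sidesteps the special-character problem without a DB lookup, and
// determinism makes repeated getSignedUrl calls resolve to the same
// container, which createIfNotExists() then turns into a no-op.
function containerNameFor(workflowId: string, runId: string): string {
  const digest = createHash("sha256")
    .update(`${workflowId}:${runId}`)
    .digest("hex"); // lowercase hex is always within the allowed charset
  return `run-${digest.slice(0, 40)}`; // "run-" + 40 hex chars = 44 chars
}
```

The trade-off versus the uuid-in-database option: hashing keeps the API stateless but makes container names opaque, while a stored workflowId-to-uuid mapping keeps an auditable lookup at the cost of a new dependency.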
Notes:
- The container-creation logic currently lives in the service code (ingestion). Once we finalize the architectural changes, it will be moved to the azure core lib (blobStore).
cc: @kibattul