File and Dataset Access within DAGs
Creating cloud-agnostic DAGs enables workflows to be written once and run on any OSDU-certified platform via the Workflow Service. DAGs (directed acyclic graphs) perform operations via data operators. Ingestion may require interaction with files and datasets, so a uniform, cloud-agnostic mechanism must be available to preserve write-once-run-anywhere capability.
This issue opens the discussion on a few topics related to this requirement:
- Interaction with all OSDU services should leverage the OSDU Python SDK
- Where needed, the OSDU Python SDK should be expanded to support additional OSDU services
- All dataset and file interaction should occur via the OSDU Python SDK, which serves as a facade to the Dataset service (and to the File service, though the File service will be deprecated at some point)
- Some workflows require local file access (e.g., OpenVDS); we do not presently have a uniform, cloud-agnostic way to do this
- Airflow is the present workflow engine
- Airflow is deployed using containerization, which means the implementation and file system access are likely not uniform across CSP platforms
- Is local file access something that needs to be made uniform?
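
To make the local file access question concrete, here is a minimal sketch of one possible shape for a uniform mechanism: a facade that stages a dataset into a temporary local file for the duration of a task and then removes it. Everything named here is an assumption for discussion purposes. `DatasetClient`, `InMemoryDatasetClient`, `local_copy`, and the dataset id are hypothetical and are not part of the OSDU Python SDK; a real backend would call the Dataset service rather than read from a dictionary.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Dict, Iterator, Protocol


class DatasetClient(Protocol):
    """Hypothetical facade interface; NOT the actual OSDU Python SDK API."""

    def read_bytes(self, dataset_id: str) -> bytes:
        ...


class InMemoryDatasetClient:
    """Stand-in backend for illustration only.

    A real implementation would retrieve the content through the
    Dataset service on whichever CSP the DAG is running.
    """

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self._blobs = blobs

    def read_bytes(self, dataset_id: str) -> bytes:
        return self._blobs[dataset_id]


@contextmanager
def local_copy(client: DatasetClient, dataset_id: str) -> Iterator[str]:
    """Stage a dataset in a temporary local file and delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    try:
        # os.fdopen takes ownership of the descriptor and closes it.
        with os.fdopen(fd, "wb") as handle:
            handle.write(client.read_bytes(dataset_id))
        yield path
    finally:
        os.remove(path)


if __name__ == "__main__":
    # "example-dataset-id" is a made-up identifier.
    client = InMemoryDatasetClient({"example-dataset-id": b"sample bytes"})
    with local_copy(client, "example-dataset-id") as path:
        print("staged at", path, "size", os.path.getsize(path))
```

With this shape, a DAG task would only ever see a local path, so tools that need real files (such as OpenVDS) would not depend on how each CSP mounts storage into the Airflow containers. Whether temporary staging is acceptable for large datasets is part of the open question.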