Expand the list of valid dataset kinds through configuration to prevent validation failure for EDS datasets
Problem Statement
The _validate_dataset method in validate_file_source.py is called during ingestion to check that each dataset has an acceptable type. The logic appears to be hard-coded to accept only datasets of kind dataset--File and dataset--FileCollection. Any other dataset type, such as the proposed External Data Services (EDS) ConnectedSource.Generic.0.2.0, fails validation.
class DatasetType:
    FILE = ":dataset--File."
    FILE_COLLECTION = ":dataset--FileCollection."

def _validate_dataset(self, dataset: dict) -> dict:
    """
    :param dataset: A dataset to be validated
    :return: Dataset
    """
    is_file = DatasetType.FILE in dataset.get("kind", "")
    is_file_collection = DatasetType.FILE_COLLECTION in dataset.get("kind", "")
    is_valid_dataset = False
Impact
Manifests generated from EDS workflows fail validation during ingestion.
Options for Resolution
- Read the acceptable dataset types from a configuration file (preferred), or
- Add CONNECTED_SOURCE = ":dataset--ConnectedSource" to DatasetType, plus the matching logic in _validate_dataset, to support EDS datasets. The EDS team has made this change locally, and it works in our AWS EDS dev environment. I've asked an EDS dev to create a branch with the change for the ingestion team to review.
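As a rough illustration of the first option, the sketch below reads the accepted kind markers from a JSON configuration document and falls back to the two markers accepted today. It is not the ingestion service's actual code: the config key accepted_dataset_kinds, the helper names parse_kind_markers and is_accepted_kind, and the example kind strings are all assumptions made for this example.

```python
import json

# Hypothetical config key; not part of the existing code base.
CONFIG_KEY = "accepted_dataset_kinds"

# Defaults mirror the two markers that are hard-coded today.
DEFAULT_KIND_MARKERS = (":dataset--File.", ":dataset--FileCollection.")


def parse_kind_markers(config_text: str) -> tuple:
    """Return the accepted kind markers from a JSON config document.

    Falls back to the defaults when the key is absent.
    """
    config = json.loads(config_text)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    markers = config.get(CONFIG_KEY)
    if markers is None:
        return DEFAULT_KIND_MARKERS
    # Reject empty strings: an empty marker would match every kind.
    if not isinstance(markers, list) or not all(
        isinstance(m, str) and m for m in markers
    ):
        raise ValueError(f"{CONFIG_KEY} must be a list of non-empty strings")
    return tuple(markers)


def is_accepted_kind(dataset: dict, markers: tuple) -> bool:
    """Same substring test as the current code, but over a configurable list."""
    kind = dataset.get("kind", "")
    return any(marker in kind for marker in markers)


# Example configuration that adds the EDS marker from the second option.
example_config = json.dumps(
    {CONFIG_KEY: [":dataset--File.", ":dataset--FileCollection.", ":dataset--ConnectedSource"]}
)
markers = parse_kind_markers(example_config)

# Assumed kind string, for illustration only.
eds_dataset = {"kind": "osdu:wks:dataset--ConnectedSource.Generic:0.2.0"}
print(is_accepted_kind(eds_dataset, markers))
```

With this shape, supporting a new dataset type becomes a configuration change rather than a code change, and deployments that supply no configuration keep the current behaviour.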