[M18] EDS ingest fails with data of kind dataset--File.Generic when using an OSDU instance as a data provider
While testing EDS for OSDU-to-OSDU sync in M18, we found a bug: after specifying FetchKind as dataset--File.Generic, the EDS Ingest DAG runs successfully, but the OSDU_Ingest DAG fails when trying to ingest the new record by manifest:
[2023-10-20, 14:31:49 UTC] {{validate_schema.py:320}} ERROR - Error: 'DatasetProperties' is a required property
Failed validating 'required' in schema['properties']['data']['allOf'][1]:
{'$schema': 'http://json-schema.org/draft-07/schema#',
'description': 'Schema fragment holding properties common for all '
'datasets.',
'properties': {'DatasetProperties': {'description': 'Placeholder for '
'a '
'specialization.',
'example': {},
'title': 'Dataset Properties',
'type': 'object'},
'Description': {'description': 'An optional, textual '
'description of the '
'dataset.',
'example': 'As originally delivered by '
'ACME.com.',
'title': 'Description',
'type': 'string'},
'EncodingFormatTypeID': {'description': 'EncodingFormatType '
'ID reference '
'value '
'relationship. '
'It can be a '
'mime-type or '
'media-type.',
'example': 'namespace:reference-data--EncodingFormatType:text%2Fcsv:',
'pattern': '^[\\w\\-\\.]+:reference-data\\-\\-EncodingFormatType:[\\w\\-\\.\\:\\%]+:[0-9]*$',
'title': 'Encoding Format '
'Type ID',
'type': 'string',
'x-osdu-relationship': [{'EntityType': 'EncodingFormatType',
'GroupType': 'reference-data'}]},
'Endian': {'description': 'Endianness of binary '
'value. Enumeration: "BIG", '
'"LITTLE". If absent, '
'applications will need to '
'interpret from context '
'indicators.',
'enum': ['BIG', 'LITTLE'],
'type': 'string'},
'Name': {'description': 'An optional name of the '
'dataset, e.g. a user friendly '
'file or file collection name.',
'example': 'Dataset X221/15',
'title': 'Name',
'type': 'string'},
'SchemaFormatTypeID': {'description': 'Relationship to '
'the '
'SchemaFormatType '
'reference '
'value.',
'example': 'namespace:reference-data--SchemaFormatType:CWLS%20LAS3:',
'pattern': '^[\\w\\-\\.]+:reference-data\\-\\-SchemaFormatType:[\\w\\-\\.\\:\\%]+:[0-9]*$',
'title': 'Schema Format Type ID',
'type': 'string',
'x-osdu-relationship': [{'EntityType': 'SchemaFormatType',
'GroupType': 'reference-data'}]},
'TotalSize': {'description': 'Total size of the '
'dataset in bytes; for '
'files it is the same as '
'declared in '
'FileSourceInfo.FileSize '
'or the sum of all '
'individual files. '
'Implemented as string. '
'The value must be '
'convertible to a long '
'integer (sizes can '
'become very large).',
'example': 13245217273,
'format': 'integer',
'pattern': '^[0-9]+$',
'title': 'Total Size',
'type': 'string'}},
'required': ['DatasetProperties'],
'title': 'AbstractDataset',
'type': 'object',
'x-osdu-inheriting-from-kind': [],
'x-osdu-license': 'Copyright 2022, The Open Group \\nLicensed under '
'the Apache License, Version 2.0 (the "License"); '
'you may not use this file except in compliance '
'with the License. You may obtain a copy of the '
'License at '
'http://www.apache.org/licenses/LICENSE-2.0 . '
'Unless required by applicable law or agreed to in '
'writing, software distributed under the License is '
'distributed on an "AS IS" BASIS, WITHOUT '
'WARRANTIES OR CONDITIONS OF ANY KIND, either '
'express or implied. See the License for the '
'specific language governing permissions and '
'limitations under the License.',
'x-osdu-review-status': 'Accepted',
'x-osdu-schema-source': 'osdu:wks:AbstractDataset:1.0.0'}
On instance['data']:
{'DatasetProperties.FileSourceInfo.FileSource': 's3://***/r1/data/provided/markers/1072.csv',
'DatasetProperties.FileSourceInfo.Name': '1072.csv',
'DatasetProperties.FileSourceInfo.PreloadFilePath': 's3://***/r1/data/provided/markers/1072.csv',
'NameAliases': [{'AliasName': ':dataset--File.Generic:5a36f2ff77ccc189a4578044e2974cd011ede43457575279ace5dc33b00937e5',
'AliasNameTypeID': 'osdu:reference-data--AliasNameType:EDSConnectedSourceIdentifier:'}],
'ResourceSecurityClassification': 'osdu:reference-data--ResourceSecurityClassification:RESTRICTED:',
'SchemaFormatTypeID': 'osdu:reference-data--SchemaFormatType:TabSeparatedColumnarText:',
'Source': 'osdu:master-data--Organisation:AWS-PRESHIP:'}
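The cause appears to be that the Search service returns a flattened dictionary in the response's data field, so DatasetProperties arrives as dotted keys rather than as a nested object, and the 'required' check in the schema validation fails: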
{
"data": {
"DatasetProperties.FileSourceInfo.FileSource": "***",
"DatasetProperties.FileSourceInfo.Name": "***",
"DatasetProperties.FileSourceInfo.PreloadFilePath": "***"
}
...
}
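To illustrate the mismatch, here is a minimal sketch of how dotted keys like the ones above could be expanded back into the nested shape that the AbstractDataset schema expects. This is not the EDS code and not a proposed patch; the function name `unflatten` and the assumption that every dot in a key is a nesting separator are ours, and the sample values are taken from the log above.

```python
from typing import Any, Dict


def unflatten(data: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Expand keys such as 'A.B.C' into nested dicts {'A': {'B': {'C': ...}}}.

    Hypothetical helper: assumes every `sep` in a key marks one nesting level.
    """
    nested: Dict[str, Any] = {}
    for key, value in data.items():
        parts = key.split(sep)
        node = nested
        # Walk (or create) the intermediate dictionaries.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{key!r} conflicts with a scalar at {part!r}")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{key!r} conflicts with nested keys under {leaf!r}")
        node[leaf] = value
    return nested


# Sample taken from the failing instance['data'] in the log above.
flat = {
    "DatasetProperties.FileSourceInfo.FileSource": "s3://***/r1/data/provided/markers/1072.csv",
    "DatasetProperties.FileSourceInfo.Name": "1072.csv",
    "DatasetProperties.FileSourceInfo.PreloadFilePath": "s3://***/r1/data/provided/markers/1072.csv",
    "Source": "osdu:master-data--Organisation:AWS-PRESHIP:",
}
record_data = unflatten(flat)
```

After this step `record_data` has a top-level `DatasetProperties` object containing `FileSourceInfo`, which is what the 'required' check looks for. Whether the right fix is to unflatten on the EDS side or to fetch the record in nested form in the first place is an open question for this issue.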
Additionally, when we tried taking the raw response from the Storage GET records endpoint and ingesting it, it was also lacking some fields (see the picture below comparing the Search and Storage responses):