Data Ingestion issues
https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/issues
Updated: 2023-05-24T15:35:07Z

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/81
While performing the same workflow using the same data to ingest reference data type, the new record IDs are not getting created each time; instead the version is getting incremented
2023-05-24T15:35:07Z | Kamlesh Todai
When I try to ingest data where the entity type is master, I see a new record ID getting created each time I run the workflow, even though the collection and data file being used are the same.
To me this is the expected behavior.
When I try to do the same with entity type reference, the IDs that get generated are the same (not new) and only a new version is generated.
So, for example, if I get the count of records for the entity type before and after the ingestion, the count stays the same except for the first run (in my case the count goes up by 4 the first time and then stays the same, as my data has 4 records).
So I modified the file to have one more record (5 in total). When I ran the workflow again with the additional record, I saw the count go up by 1 and not by 5.
Before (inserting 5 records):
```
{
  "results": [
    {
      "id": "opendes:reference-data--ContractorType:LineClearing"
    }
  ],
  "aggregations": [
    {
      "key": "osdu:wks:reference-data--ContractorType:1.0.0",
      "count": 9
    }
  ],
  "totalCount": 9
}
```
After (inserting 5 records):
```
{
  "results": [
    {
      "id": "opendes:reference-data--ContractorType:LineClearing"
    }
  ],
  "aggregations": [
    {
      "key": "osdu:wks:reference-data--ContractorType:1.0.0",
      "count": 10
    }
  ],
  "totalCount": 10
}
```
Before (again inserting 5 records):
```
{
  "results": [
    {
      "id": "opendes:reference-data--ContractorType:LineClearing"
    }
  ],
  "aggregations": [
    {
      "key": "osdu:wks:reference-data--ContractorType:1.0.0",
      "count": 10
    }
  ],
  "totalCount": 10
}
```
After (again inserting 5 records):
```
{
  "results": [
    {
      "id": "opendes:reference-data--ContractorType:LineClearing"
    }
  ],
  "aggregations": [
    {
      "key": "osdu:wks:reference-data--ContractorType:1.0.0",
      "count": 10
    }
  ],
  "totalCount": 10
}
```
[CSVWorkflow__CI-CD_v2.0-ReferenceData.postman_collection.json](/uploads/3d0652f4e89be166a525d7ff18731c2d/CSVWorkflow__CI-CD_v2.0-ReferenceData.postman_collection.json)
[ReferenceData.csv](/uploads/a04f12cbeaacb300c4d23788032518ae/ReferenceData.csv) (with 5 records)
The environment file can be obtained from
https://community.opengroup.org/osdu/platform/pre-shipping/-/tree/main/R3-M16/QA_Artifacts_M16/envFilesAndCollections/envFiles
OR
https://community.opengroup.org/osdu/platform/testing/-/tree/master/Postman%20Collection/00_CICD_Setup_Environment
@tdixon @debasisc @chadM18
Milestone: Release 0.21

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/56
GSM Integration
2021-10-06T10:32:28Z | Fernando Nahu Cantera Rubio
CSV Parser integration with GSM: we can now get the details of failures for records as well as jobs for all CSV ingestion runs, with proper error messages and error codes.
Milestone: M9 - Release 0.12 | Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/54
Provide simple summary in XCom (from log of DAG)
2021-12-26T19:05:28Z | Debasis Chatterjee
Please provide a simple summary showing the list of IDs that succeeded and the list of records that failed.
(I ran my tests in Azure R3M7 environment). @harshit283
Otherwise, the user (Data Loader) needs to parse through lines and lines of Airflow log to understand what has worked and what has failed.
You may see a sample XCom summary from Manifest-based Ingestion (GCP/EPAM):
![GCP-Update_Status_Finished_task-XCom-summary](/uploads/62c0e3fe9fb5e57f32c0a4cc7699b8dd/GCP-Update_Status_Finished_task-XCom-summary.PNG)
cc - @ChrisZhang for information

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/51
Support for Nested Array for type-coercion, spatial, meta
2021-08-18T05:35:11Z | Fernando Nahu Cantera Rubio
This is to support nested arrays in the type-coercion, Spatial and meta handlers.
Type-coercion: data conversion for nested array attributes matching the schema; the default data type is String (see the sketch after this list).
Meta: support for nested array attributes in the FoR (frame of reference) block.
Spatial: support for nested array Latitude/Longitude information to create the Spatial Location block.
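As a rough illustration of the type-coercion item above, here is a minimal Java sketch (not the parser's actual code; the `coerce` helper and the schema-type strings are assumptions) showing a cell value being converted to the schema-declared type, with String as the fallback:
```
public final class TypeCoercionSketch {

    // Convert a raw CSV cell to the type declared in the target schema for a
    // nested-array attribute; anything unrecognised stays a String.
    static Object coerce(String rawValue, String schemaType) {
        if (rawValue == null) {
            return null;
        }
        switch (schemaType) {
            case "integer": return Long.parseLong(rawValue.trim());
            case "number":  return Double.parseDouble(rawValue.trim());
            case "boolean": return Boolean.parseBoolean(rawValue.trim());
            default:        return rawValue; // default data type is String
        }
    }

    public static void main(String[] args) {
        System.out.println(coerce("42", "integer"));    // 42
        System.out.println(coerce("3.14", "number"));   // 3.14
        System.out.println(coerce("001S0B0", "text"));  // falls back to String
    }
}
```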
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/50
Support for Nested Array Schema in CSV Ingestion
2021-08-18T05:35:50Z | Fernando Nahu Cantera Rubio
This is to support parsing of nested array attributes.
For parsing nested array attributes, we can now have nested array attributes in the CSV, for example A.[0].B. The CSV parser uses the delimiter in the metadata to identify the attributes and create them in the ingested records. A parsing sketch follows below.
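A minimal Java sketch of the idea, assuming a hypothetical `put` helper (this is not the parser's actual implementation): a flattened header such as `A.[0].B` is split on the configured delimiter and rebuilt as a nested structure in the record.
```
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public final class NestedHeaderSketch {

    // Place a CSV cell value into a nested structure, driven by a flattened
    // header such as "NestedArrayNaturalKey.[0].WB_NUMBER".
    @SuppressWarnings("unchecked")
    static void put(Map<String, Object> target, String header, String delimiter, String value) {
        String[] parts = header.split(Pattern.quote(delimiter));
        Map<String, Object> current = target;
        for (int i = 0; i < parts.length - 1; i++) {
            if (parts[i + 1].matches("\\[\\d+\\]")) {
                // the next token is an array index: keep a list of objects under this key
                int index = Integer.parseInt(parts[i + 1].replaceAll("[\\[\\]]", ""));
                List<Map<String, Object>> list =
                        (List<Map<String, Object>>) current.computeIfAbsent(parts[i], k -> new ArrayList<>());
                while (list.size() <= index) {
                    list.add(new LinkedHashMap<>());
                }
                current = list.get(index);
                i++; // the index token has been consumed
            } else {
                current = (Map<String, Object>) current.computeIfAbsent(parts[i], k -> new LinkedHashMap<>());
            }
        }
        current.put(parts[parts.length - 1], value);
    }

    public static void main(String[] args) {
        Map<String, Object> data = new LinkedHashMap<>();
        put(data, "NestedArrayNaturalKey.[0].WB_NUMBER", ".", "001S0B0");
        put(data, "NestedArrayNaturalKey.[1].WB_NUMBER", ".", "001S0B1");
        System.out.println(data);
        // {NestedArrayNaturalKey=[{WB_NUMBER=001S0B0}, {WB_NUMBER=001S0B1}]}
    }
}
```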
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/49
Support for array type attribute in schema for relationship
2021-08-18T05:36:15Z | Fernando Nahu Cantera Rubio
The attributes in the schema with the x-osdu-relationships key must be of array or string type.
This is to support array type attributes, i.e. relationship support for multiple parents.
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/48
Support for pattern matching in Schema for relationship
2021-08-18T05:36:33Z | Fernando Nahu Cantera Rubio
OSDU schemas support patterns for attributes; a pattern may or may not be present in the schema.
If a pattern is present and it matches the parent's ID, the relationship is represented in the schema attribute.
If a pattern is present and does not match the parent's ID, the relationship is represented in the relationships block.
If a pattern is not present, we don't do any pattern matching. A sketch of this decision follows below.
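A minimal Java sketch of that decision, under the assumption that a plain regex match is what decides placement (the pattern below is illustrative only, not taken from a real schema):
```
import java.util.regex.Pattern;

public final class RelationshipPlacementSketch {

    enum Placement { SCHEMA_ATTRIBUTE, RELATIONSHIPS_BLOCK, NO_MATCHING }

    // If the schema declares a pattern and the parent id matches it, the id goes into
    // the schema attribute; if it does not match, it goes into the relationships block;
    // with no pattern, no matching is attempted.
    static Placement decide(String schemaPattern, String parentId) {
        if (schemaPattern == null || schemaPattern.isEmpty()) {
            return Placement.NO_MATCHING;
        }
        return Pattern.matches(schemaPattern, parentId)
                ? Placement.SCHEMA_ATTRIBUTE
                : Placement.RELATIONSHIPS_BLOCK;
    }

    public static void main(String[] args) {
        String pattern = "^[\\w\\-\\.]+:master-data\\-\\-Well:[\\w\\-\\.\\:\\%]+$"; // illustrative only
        System.out.println(decide(pattern, "opendes:master-data--Well:1234"));       // SCHEMA_ATTRIBUTE
        System.out.println(decide(pattern, "opendes:work-product--Document:5678"));  // RELATIONSHIPS_BLOCK
        System.out.println(decide(null, "opendes:master-data--Well:1234"));          // NO_MATCHING
    }
}
```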
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/47
Foreign Key - Static relationship with parent IDs in CSV
2021-08-18T05:36:54Z | Fernando Nahu Cantera Rubio
Relationships get created with the parent IDs provided in the CSV file.
We provide the parent ID in one of the CSV columns and define it in the metadata file under relatedNaturalKey, so the CSV parser uses the source column to pick the parents and creates the relationship with the ingested child.
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/46
As part of GSM feature, Implement status publishing method in MSFT AZURE
2021-08-27T21:26:43Z | Mahesh Daksha
This is as per the GSM requirement to be implemented in each CSP. This issue has been created for the Microsoft Azure team to implement the publish method to publish the status events to the message queue.
Milestone: M8 - Release 0.11 | Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/45
As part of GSM feature, Implement status publishing method in GCP
2021-10-06T14:29:11Z | Mahesh Daksha
This is as per the GSM requirement to be implemented in each CSP. This issue has been created for the GCP team to implement the publish method to publish the status events to the message queue.
Milestone: M9 - Release 0.12 | Riabokon Stanislav (EPAM) [GCP]

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/36
Not able to ingest large file in IBM CSV ingestion
2021-07-20T12:07:14Z | Shrikant Garg
We are not able to ingest a large CSV file with 100k records (>25 MB) in IBM CSV Ingestion.
When we open a stream from the download URL and try to read the CSV file, the stream connection gets closed and only partial records get ingested (only those records which were read up to that point).
Error description:
```
Caused by: java.lang.IllegalStateException: ConnectionClosedException reading next record: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 37,181,768; received: 18,350,080)
```
Milestone: M7 - Release 0.10 | Shrikant Garg

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/33
CSV - Support Airflow 2.0
2021-09-01T19:20:01Z | Todd Dixon
Update the DAG in order to support Airflow 2.0 as per the [Airflow 2.0 ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/65).
Milestone: M8 - Release 0.11 | Swapnil, Kateryna Kurach (EPAM)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/29
CSV Enhancement - Nested arrays
2021-08-02T16:03:44Z | Fernando Nahu Cantera Rubio
Support for nested arrays
Examples:
* NestedArrayNaturalKey.[0].WB_NUMBER = 001S0B0
* NestedArrayNaturalKey.[1].WB_NUMBER = 001S0B1
Record:
```
"NestedArrayNaturalKey": [
{
"WB_NUMBER": "001S0B0"
},
{
"WB_NUMBER": "001S0B1"
}
]
```
Milestone: M8 - Release 0.11 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/26
CSV Enhancement - Multithread optimization
2021-07-08T16:33:39Z | Fernando Nahu Cantera Rubio
## Multithread optimization
Each record is read and added as a task in an executor service, to be enriched and stored in parallel with other records. A sketch of this pattern follows below.
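A minimal Java sketch of this pattern, with hypothetical names (`enrichAndStore` stands in for the real per-record enrichment and storage work, and is not the parser's actual API):
```
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class ParallelIngestSketch {

    // Stand-in for the real per-record enrichment and storage work.
    static void enrichAndStore(String[] row) {
        System.out.println("processed " + String.join(",", row));
    }

    // Each parsed CSV row becomes its own task so records are processed in parallel.
    static void ingest(List<String[]> csvRows, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String[] row : csvRows) {
                pool.submit(() -> enrichAndStore(row));
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ingest(List.of(new String[]{"001S0B0", "42"}, new String[]{"001S0B1", "17"}), 4);
    }
}
```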
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/25
CSV Enhancement - Id generation change
2021-07-08T16:33:30Z | Fernando Nahu Cantera Rubio
## Id generation change
Change in the ID generation to follow OSDU pattern ```<authority/data-partition-id>:<source>:<entity-type>:<base64-of-xosdu-natural-keys>```
* authority/data-partition-id is taken from the request triggering the workflow (a minimal sketch of the id pattern follows below)
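A minimal Java sketch of this pattern; the way natural keys are joined before encoding is an assumption, and the partition/source values are illustrative:
```
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class RecordIdSketch {

    // <authority/data-partition-id>:<source>:<entity-type>:<base64-of-xosdu-natural-keys>
    static String buildId(String partitionId, String source, String entityType, String... naturalKeys) {
        String joinedKeys = String.join("|", naturalKeys); // joining scheme is an assumption
        String encoded = Base64.getEncoder()
                .encodeToString(joinedKeys.getBytes(StandardCharsets.UTF_8));
        return String.join(":", partitionId, source, entityType, encoded);
    }

    public static void main(String[] args) {
        // prints opendes:wks:reference-data--ContractorType:TGluZUNsZWFyaW5n
        System.out.println(buildId("opendes", "wks", "reference-data--ContractorType", "LineClearing"));
    }
}
```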
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/24
CSV Enhancement - Relationships
2021-07-08T16:33:26Z | Fernando Nahu Cantera Rubio
## Relationships
* CSV ingestion supports two kinds of relationships:
1. **Deterministic (Schema-driven)**
These relationships require that the entity be referred to in the record's targetKind schema under an attribute having ```x-osdu-relationship``` tag. Because they are present in the schema, they are represented directly as attributes in the ```data``` block of the record.
2. **Non Deterministic (Data-driven)**
These relationships do not require any mention in the schema. They are represented within the ```data.relationships``` block of the record.
* The ExtensionProperties block in the file metadata record is used to provide additional information for ingestion. We can use this block to provide relationship information. This information can be provided in the following ways:
* In the ```relationships``` block, with the entity name and a list of parent record ID(s). The ID(s) provided here are directly used to establish relationships.
* In the ```relatedNaturalKey``` block, as an entity that requires a search of the targetKind using the natural keys provided to establish a relationship (a resolution sketch follows after the example below).
* _sourceColumn_: Column name of the CSV file which refers to the key parent attribute.
* _targetKind_: Schema ID of the parent record.
* _targetAttribute_: The key attribute of the parent record which is used to search the parent record.
* _**Pre-requisites**_: CSV file should have the key attributes of the parent records.
```
{
  "ExtensionProperties": {
    "relationships": {
      "project": {
        "ids": [
          "<recordId1>"
        ]
      },
      "well": {
        "ids": [
          "<recordId2>",
          "<recordId3>"
        ]
      }
    },
    "relatedNaturalKey": {
      "wellbore": {
        "targetKind": "<<authority>:<source>:<entityType>:<version>>",
        "keys": [
          {
            "sourceColumn": "UWI",
            "targetAttribute": "uwi"
          }
        ]
      }
    }
  }
}
```
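As a rough sketch of how a ```relatedNaturalKey``` entry could be resolved (the ```SearchClient``` interface here is a stand-in, not the parser's actual search integration): take the value of _sourceColumn_ from the CSV row, search the _targetKind_ on _targetAttribute_, and use the matching parent record's ID for the relationship.
```
import java.util.Map;
import java.util.Optional;

public final class RelatedNaturalKeySketch {

    // Stand-in for the real search integration.
    interface SearchClient {
        Optional<String> findRecordId(String kind, String attribute, String value);
    }

    // Take the value of sourceColumn from the CSV row, search targetKind on
    // targetAttribute, and return the id of the matching parent record.
    static Optional<String> resolveParentId(Map<String, String> csvRow,
                                            String sourceColumn,
                                            String targetKind,
                                            String targetAttribute,
                                            SearchClient search) {
        String keyValue = csvRow.get(sourceColumn); // e.g. the UWI column
        if (keyValue == null || keyValue.isEmpty()) {
            return Optional.empty();                // pre-requisite not met
        }
        return search.findRecordId(targetKind, targetAttribute, keyValue);
    }

    public static void main(String[] args) {
        SearchClient stub = (kind, attribute, value) ->
                Optional.of("opendes:master-data--Wellbore:" + value);
        Map<String, String> row = Map.of("UWI", "1001");
        System.out.println(resolveParentId(row, "UWI",
                "osdu:wks:master-data--Wellbore:1.0.0", "uwi", stub));
        // Optional[opendes:master-data--Wellbore:1001]
    }
}
```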
* The schema of the record should have information about attributes that contain deterministic relationships.
* The _EntityType_ field within the ```x-osdu-relationship``` block should contain the entity that needs to be matched from the ExtensionProperties block.
```
{
"properties": {
"wellId": {
"type":"string",
"pattern":"^[\\w\\-\\.]+:\\-\\-well:[\\w\\-\\.\\:\\%]+:[0-9]*$",
"x-osdu-relationship": [
{
"GroupType":"master-data",
"EntityType":"well"
}
]
},
"wellboreId": {
"type":"string",
"pattern":"^[\\w\\-\\.]+:\\-\\-wellbore:[\\w\\-\\.\\:\\%]+:[0-9]*$",
"x-osdu-relationship": [
{
"GroupType":"master-data",
"EntityType":"wellbore"
}
]
}
}
}
```
* The final record will then have the relationships defined as below:
```
{
"data": {
"relationships": {
"project": {
"ids": [
"<recordId1>"
]
}
},
"wellId":"<recordId2>",
"wellboreId":"<recordId5>"
}
}
```
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/23
CSV Parser Enhancement - Nested Schema
2021-07-08T16:33:21Z | Fernando Nahu Cantera Rubio
## Nested Schema
* To support the ingestion of data into nested attributes, the headers of the uploaded csv header should match the nested attributes of the target schemas, using the delimiter characters defined on the metadata file.
* The ```nestedFieldDelimiter``` attribute in file metadata is used to define which character is going to be used on the csv file header to describe the different levels of nested attributes while the ingestor parses the files.
* The delimiter character used to define nested structures on the csv file header must match the one defined by the ```nestedFieldDelimiter``` on the file metadata record, otherwise the attributes on the csv file will not be considered nested.
```
{
"ExtensionProperties": {
"FileContentsDetails": {
"TargetKind": "<<authority>:<source>:<entityType>:<version>>",
"nestedFieldDelimiter":".",
"FileType": "csv"
}
}
}
```
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/22
CSV Parser Enhancement - Spatial data handler
2021-07-08T16:33:16Z | Fernando Nahu Cantera Rubio
## Spatial data handler
### Pre-requisites:
* Schema used to ingest the data has Spatial reference.
* CSV file has the Spatial data attributes.
* The ExtensionProperties block is used to provide content details of the file; the Workflow Service uses this same block to provide Spatial data information.
* SpatialMapping: This section is used to create the Spatial data block in the ingested records (a mapping sketch follows after the example below).
* type: This field refers to the type of the Spatial data; currently the Workflow Service only supports point.
* latitude: This field refers to the Latitude of the point.
* longitude: This field refers to the Longitude of the point.
```
{
"ExtensionProperties": {
"FileContentsDetails": {
"TargetKind": "<<authority>:<source>:<entityType>:<version>>",
"FileType": "csv",
"SpatialMapping":{
"type": "point",
"latitude": "Column name of the CSV which contains the LATITUDE value",
"longitude": "Column name of the CSV which contains the LONGITUDE value"
},
"FrameOfReference": [
{
"kind": "CRS",
"name": "GCS_WGS_1984",
"persistableReference": "{\"wkt\":\"GEOGCS[\\\"GCS_WGS_1984\\\",DATUM[\\\"D_WGS_1984\\\",SPHEROID[\\\"WGS_1984\\\",6378137.0,298.257223563]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433],AUTHORITY[\\\"EPSG\\\",4326]]\",\"ver\":\"PE_10_3_1\",\"name\":\"GCS_WGS_1984\",\"authCode\":{\"auth\":\"EPSG\",\"code\":\"4326\"},\"type\":\"LBC\"}",
"propertyNames": [
"Column name of the CSV which contains the LATITUDE value",
"Column name of the CSV which contains the LONGITUDE value"
],
"propertyValues": [
"deg"
],
"uncertainty": 0
}
]
}
}
}
```
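A minimal Java sketch of the SpatialMapping step, assuming the latitude/longitude columns hold plain decimal degrees; the exact shape of the spatial block in real ingested records may differ:
```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SpatialMappingSketch {

    // Read the configured latitude/longitude columns from a CSV row and build a
    // simple point structure for the record's spatial block.
    static Map<String, Object> toPoint(Map<String, String> csvRow,
                                       String latitudeColumn,
                                       String longitudeColumn) {
        double latitude = Double.parseDouble(csvRow.get(latitudeColumn));
        double longitude = Double.parseDouble(csvRow.get(longitudeColumn));
        Map<String, Object> point = new LinkedHashMap<>();
        point.put("type", "point");
        point.put("coordinates", List.of(longitude, latitude));
        return point;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("LAT", "29.7604", "LON", "-95.3698");
        System.out.println(toPoint(row, "LAT", "LON"));
        // {type=point, coordinates=[-95.3698, 29.7604]}
    }
}
```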
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/21
CSV Parser Enhancement - Token generation for long-running jobs
2021-07-08T16:33:07Z | Fernando Nahu Cantera Rubio
## Token generation for long-running jobs
An interface AuthJwtToken was added for generating tokens; the following classes have dummy implementations of it, and until they are reworked the request token will be used (see the sketch after this list).
- AwsServiceAccountAuthToken
- ServiceAccountAuthToken
- IBMServicePrincipalAuthToken
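A minimal sketch of the idea; the interface name and the behaviour of returning the request token come from this issue, while the method signature and class below are assumptions, not the service's actual code:
```
public class AuthJwtTokenSketch {

    interface AuthJwtToken {
        String getAuthToken(); // signature is an assumption
    }

    // Dummy implementation in the spirit of AwsServiceAccountAuthToken /
    // ServiceAccountAuthToken / IBMServicePrincipalAuthToken: simply hand back
    // the token that arrived with the triggering request.
    static class RequestTokenPassThrough implements AuthJwtToken {
        private final String requestToken;

        RequestTokenPassThrough(String requestToken) {
            this.requestToken = requestToken;
        }

        @Override
        public String getAuthToken() {
            return requestToken;
        }
    }

    public static void main(String[] args) {
        AuthJwtToken token = new RequestTokenPassThrough("Bearer <request-token>");
        System.out.println(token.getAuthToken());
    }
}
```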
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/20
CSV Parser Enhancement - Improvement of search client to escape special characters
2021-07-08T16:33:45Z | Swapnil
## Improvement of search client to escape special characters
Change in the Search Client to escape the special characters reserved by the Search Service when building queries.
The special characters are: ~ ` ! @ # $ % ^ * ( ) - _ + = { } [ ] | \ / : ; ' < > , . ? (an escaping sketch follows below)
Milestone: M7 - Release 0.10 | Swapnil, Fernando Nahu Cantera Rubio
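A minimal Java sketch of the escaping described in this issue; prefixing reserved characters with a backslash is an assumption about the convention, while the character set is the one listed above:
```
public final class SearchQueryEscaper {

    // Character set taken from this issue; the escaping convention is an assumption.
    private static final String RESERVED = "~`!@#$%^*()-_+={}[]|\\/:;'<>,.?";

    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (RESERVED.indexOf(c) >= 0) {
                out.append('\\'); // prefix reserved characters with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("WELL-001 (NORTH)")); // WELL\-001 \(NORTH\)
    }
}
```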