M7 csv parser

Swapnil requested to merge m6-csv-parser into master

List of new features and improvements

Token generation for long-running jobs

An AuthJwtToken interface was added for generating tokens. The following classes currently provide dummy implementations of it; until they are reworked, the request token is used.

  • AwsServiceAccountAuthToken
  • ServiceAccountAuthToken
  • IBMServicePrincipalAuthToken

Spatial data handler

Prerequisites:

  • The schema used to ingest the data has a Spatial reference.
  • The CSV file has the Spatial data attributes.
  • The ExtensionProperties block is used to provide content details of the file; the Workflow Service uses this same block to provide Spatial data information.
  • SpatialMapping: This section is used to create the Spatial data block in the ingested records.
    • type: This field refers to the type of the Spatial data; currently the Workflow Service only supports point.
    • latitude: This field refers to the Latitude of the point.
    • longitude: This field refers to the Longitude of the point.
{
    "ExtensionProperties": {
        "FileContentsDetails": {
            "TargetKind": "<<authority>:<source>:<entityType>:<version>>",
            "FileType": "csv",
            "SpatialMapping": {
                "type": "point",
                "latitude": "Column name of the CSV which contains the LATITUDE value",
                "longitude": "Column name of the CSV which contains the LONGITUDE value"
            },
            "FrameOfReference": [
                {
                    "kind": "CRS",
                    "name": "GCS_WGS_1984",
                    "persistableReference": "{\"wkt\":\"GEOGCS[\\\"GCS_WGS_1984\\\",DATUM[\\\"D_WGS_1984\\\",SPHEROID[\\\"WGS_1984\\\",6378137.0,298.257223563]],PRIMEM[\\\"Greenwich\\\",0.0],UNIT[\\\"Degree\\\",0.0174532925199433],AUTHORITY[\\\"EPSG\\\",4326]]\",\"ver\":\"PE_10_3_1\",\"name\":\"GCS_WGS_1984\",\"authCode\":{\"auth\":\"EPSG\",\"code\":\"4326\"},\"type\":\"LBC\"}",
                    "propertyNames": [
                        "Column name of the CSV which contains the LATITUDE value",
                        "Column name of the CSV which contains the LONGITUDE value"
                    ],
                    "propertyValues": [
                        "deg"
                    ],
                    "uncertainty": 0
                }
            ]
        }
    }
}
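
As a rough illustration of how a SpatialMapping block could be applied to one CSV row, the Python sketch below reads the configured latitude and longitude columns and builds a point. It is not the parser's actual code: the function name build_spatial_block, the sample column names and the shape of the returned dictionary are all assumptions made for this example.

```python
def build_spatial_block(row, spatial_mapping):
    """Build a point from one CSV row using a SpatialMapping block.

    The returned structure is hypothetical; the real ingested record
    may represent Spatial data differently.
    """
    if spatial_mapping.get("type") != "point":
        # Only "point" is supported according to the description above.
        raise ValueError("unsupported spatial type: %r" % spatial_mapping.get("type"))
    return {
        "type": "point",
        "latitude": float(row[spatial_mapping["latitude"]]),
        "longitude": float(row[spatial_mapping["longitude"]]),
    }


# Hypothetical CSV row and mapping; "LAT" and "LON" are made-up column names.
row = {"WellName": "W-1", "LAT": "29.76", "LON": "-95.37"}
mapping = {"type": "point", "latitude": "LAT", "longitude": "LON"}
point = build_spatial_block(row, mapping)
```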

Nested Schema

  • To ingest data into nested attributes, the headers of the uploaded CSV file should match the nested attributes of the target schema, with the levels separated by the delimiter character defined in the file metadata.
  • The nestedFieldDelimiter attribute in the file metadata defines which character is used in the CSV file header to separate the levels of nested attributes when the ingestor parses the file.
  • The delimiter character used in the CSV file header must match the one defined by nestedFieldDelimiter in the file metadata record; otherwise the attributes in the CSV file are not treated as nested.
{
    "ExtensionProperties": {
        "FileContentsDetails": {
            "TargetKind": "<<authority>:<source>:<entityType>:<version>>",
            "nestedFieldDelimiter": ".",
            "FileType": "csv"
        }
    }
}
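
The effect of nestedFieldDelimiter on a CSV header can be sketched as follows. This is a minimal Python illustration, not the ingestor's implementation; the function name nest_row and the sample column names are assumptions.

```python
def nest_row(row, delimiter):
    """Expand flat CSV columns such as 'location.city' into nested dicts."""
    record = {}
    for header, value in row.items():
        # A header that does not contain the delimiter stays a flat attribute.
        parts = header.split(delimiter) if delimiter else [header]
        node = record
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return record


# Hypothetical CSV row whose headers use "." to mark nested levels.
row = {"name": "W-1", "location.city": "Houston", "location.country": "US"}
nested = nest_row(row, ".")   # delimiter matches the header
flat = nest_row(row, "|")     # delimiter does not match, nothing is nested
```

With a matching delimiter, location.city and location.country end up under a single location object; with a mismatched delimiter the headers are kept as plain attribute names.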

Relationships

  • CSV ingestion supports two kinds of relationships:

    1. Deterministic (schema-driven): These relationships require that the entity be referred to in the record's targetKind schema under an attribute that has the x-osdu-relationship tag. Because they are present in the schema, they are represented directly as attributes in the data block of the record.

    2. Non-deterministic (data-driven): These relationships do not require any mention in the schema. They are represented within the data.relationships block of the record.

  • The ExtensionProperties block in the file metadata record is used to provide additional information for ingestion, including relationship information. This information can be provided in the following ways:

    • In the relationships block, with the entity name and a list of parent record ID(s). The ID(s) provided here are directly used to establish relationships.

    • In the relatedNaturalKey block, as an entity that requires a search of the targetKind using the natural keys provided to establish a relationship.

      • sourceColumn: Column name of the CSV file which refers to the key parent attribute.
      • targetKind: Schema ID of the parent record.
      • targetAttribute: The key attribute of the parent record which is used to search the parent record.
      • Prerequisite: The CSV file must contain the key attributes of the parent records.
    {
        "ExtensionProperties": {
            "relationships": {
                "project": {
                    "ids": [
                        "<recordId1>"
                    ]
                },
                "well": {
                    "ids": [
                        "<recordId2>",
                        "<recordId3>"
                    ]
                }
            },
            "relatedNaturalKey": {
                "wellbore": {
                    "targetKind":"<<authority>:<source>:<entityType>:<version>>",
                    "keys": [
                        {
                            "sourceColumn":"UWI",
                            "targetAttribute":"uwi"
                        }
                    ]
                }
            }
        }
    }
  • The schema of the record should have information about attributes that contain deterministic relationships.
    • The EntityType field within the x-osdu-relationship block should contain the entity that needs to be matched from the ExtensionProperties block.
   {
       "properties": {
            "wellId": {
                "type":"string",
                "pattern":"^[\\w\\-\\.]+:\\-\\-well:[\\w\\-\\.\\:\\%]+:[0-9]*$",
                "x-osdu-relationship": [
                    {
                        "GroupType":"master-data",
                        "EntityType":"well"
                    }
                ]
            },
            "wellboreId": {
                "type":"string",
                "pattern":"^[\\w\\-\\.]+:\\-\\-wellbore:[\\w\\-\\.\\:\\%]+:[0-9]*$",
                "x-osdu-relationship": [
                    {
                        "GroupType":"master-data",
                        "EntityType":"wellbore"
                    }
                ]
            }
       }
   }
  • The final record will then have the relationships defined as below:
{
    "data": {
        "relationships": {
            "project": {
                "ids": [
                    "<recordId1>"
                ]
            }
        },
        "wellId": "<recordId2>",
        "wellboreId": "<recordId5>"
    }
}
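
The split between deterministic and non-deterministic relationships can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are hypothetical, the natural-key search for wellbore is assumed to have already returned <recordId5>, and using the first ID when several are listed for a schema-driven attribute is a guess based on the example above.

```python
def entity_attributes(schema):
    """Map EntityType -> attribute name for attributes tagged x-osdu-relationship."""
    mapping = {}
    for attribute, definition in schema.get("properties", {}).items():
        for tag in definition.get("x-osdu-relationship", []):
            mapping[tag["EntityType"]] = attribute
    return mapping


def apply_relationships(data, schema, resolved_ids):
    """Place parent IDs in the record; resolved_ids maps entity -> list of IDs."""
    attributes = entity_attributes(schema)
    for entity, ids in resolved_ids.items():
        if not ids:
            continue
        if entity in attributes:
            # Deterministic: stored directly as an attribute of the data block.
            # Assumption: the first ID is used when several are provided.
            data[attributes[entity]] = ids[0]
        else:
            # Non-deterministic: stored under data.relationships.
            data.setdefault("relationships", {})[entity] = {"ids": list(ids)}
    return data


schema = {
    "properties": {
        "wellId": {"x-osdu-relationship": [{"GroupType": "master-data", "EntityType": "well"}]},
        "wellboreId": {"x-osdu-relationship": [{"GroupType": "master-data", "EntityType": "wellbore"}]},
    }
}
resolved = {
    "project": ["<recordId1>"],
    "well": ["<recordId2>", "<recordId3>"],
    "wellbore": ["<recordId5>"],  # assumed result of the natural-key search
}
data = apply_relationships({}, schema, resolved)
```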

ID generation change

ID generation now follows the OSDU pattern <authority/data-partition-id>:<source>:<entity-type>:<base64-of-xosdu-natural-keys>

  • authority/data-partition-id is taken from the request that triggers the workflow.
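
A minimal Python sketch of building an ID in this pattern is shown below. The function name, the sample values, the ":" separator used to join several natural keys and the choice of the standard (not URL-safe) base64 alphabet are all assumptions; the merge request does not specify them.

```python
import base64


def generate_record_id(authority, source, entity_type, natural_keys):
    """Build <authority>:<source>:<entity-type>:<base64-of-natural-keys>."""
    # Assumption: multiple natural key values are joined with ":" before encoding.
    joined = ":".join(natural_keys)
    encoded = base64.b64encode(joined.encode("utf-8")).decode("ascii")
    return f"{authority}:{source}:{entity_type}:{encoded}"


# Hypothetical values; the authority would come from the triggering request.
record_id = generate_record_id("opendes", "wks", "wellbore", ["UWI-001"])
```

Because the last segment is derived only from the natural keys, the same input row produces the same ID on every run.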

Multithread optimization

Each record is read and submitted as a task to an executor service, so that it is enriched and stored in parallel with other records.
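
The pattern can be sketched with Python's ThreadPoolExecutor as a stand-in for the executor service. The enrich function, the store callback and the worker count are placeholders invented for this example, not the parser's actual steps.

```python
from concurrent.futures import ThreadPoolExecutor


def enrich(record):
    # Placeholder enrichment step (hypothetical).
    return {**record, "enriched": True}


def ingest(records, store, max_workers=4):
    """Submit one task per record; each task enriches and stores its record."""
    def task(record):
        store(enrich(record))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(task, record) for record in records]
        for future in futures:
            future.result()  # re-raises any exception from the task
    return len(futures)


stored = []
count = ingest([{"id": i} for i in range(10)], stored.append)
```

Records may be stored in a different order than they were read, so anything that depends on row order would need to carry its own index.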

Improvement of search client to escape special characters

The Search Client now escapes the special characters reserved by the Search Service when building queries. The special characters are: ~ ` ! @ # $ % ^ * ( ) - _ + = { } [ ] | \ / : ; ' < > , . ?
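
One way to escape these characters is sketched below in Python. Prefixing each reserved character with a backslash is an assumption, since the description does not state the escaping mechanism, and the function name is hypothetical.

```python
# Reserved characters as listed in the description above.
RESERVED = set("~`!@#$%^*()-_+={}[]|\\/:;'<>,.?")


def escape_query_value(value):
    """Prefix every reserved character with a backslash (assumed mechanism)."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in value)


escaped = escape_query_value("A-1:well.v2")
```

Values without reserved characters pass through unchanged, so the function can be applied to every value placed in a query.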

Edited by Abhishek Kumar
