
Add GSM integration & IT's tweaks

Maksim Malkov requested to merge m8-csv-parser-mr into master

issue#46 link

Contents

  • bugfixes
  • GSM integration

GSM implementation details

The ADR.
For a similar implementation approach, you can check out this MR.

Review hint

More than 100 files are changed, but only 10+ of them contain changes for GSM.
The rest mostly contain changes for ITs, so I think you can rely on the passing pipeline and not review them closely.
The GSM-specific changes are also available on their own in a Draft MR (only 12 files).

About GSM

Introduction

Global Status Monitoring (GSM) is a mechanism for tracking the status of data journeys (dataflows) on the data platform. The infrastructure tracks the status of a file, dataset, or record ingested through the File Service, the Storage API, or a specific DOMS until it is consumed by dependent services.

Status Data Model

The data model properties let a user search for a status by one specific property or by a combination of properties. Every request is tracked through a specific dataSetId and its associated correlationId.

The Status Data Model is split across multiple tables that track whether a dataflow has finished and, if so, whether it succeeded or failed. One table holds the DataSet Details and another holds the overall Status of that dataflow journey.

  • DataSet Details - A dataset can be anything that contains data. For example, a File is one type of dataset, and a File Collection is another type that contains a set of files.
  • Status - Holds the overall status of the dataflow. correlationId is the unique id that identifies a single request as it goes through the different stages of our Data Platform.
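To make the relation between the two tables concrete, here is a minimal in-memory sketch. StatusTablesSketch and its methods are hypothetical names used only for illustration; the real GSM storage and its schema are not described in this MR, so treat this as an assumption about how the two tables join through correlationId.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual GSM storage layer.
final class StatusTablesSketch {
    // "DataSet Details" table: datasetId -> correlationId of the request that produced it.
    private final Map<String, String> correlationIdByDatasetId = new HashMap<>();
    // "Status" table: correlationId -> "STAGE:STATUS" entries in arrival order.
    private final Map<String, List<String>> statusesByCorrelationId = new HashMap<>();

    void recordDatasetDetails(String datasetId, String correlationId) {
        correlationIdByDatasetId.put(datasetId, correlationId);
    }

    void recordStatus(String correlationId, String stage, String status) {
        statusesByCorrelationId
            .computeIfAbsent(correlationId, k -> new ArrayList<>())
            .add(stage + ":" + status);
    }

    // Joins the two tables through correlationId and returns a copy of the entries.
    List<String> statusesForDataset(String datasetId) {
        String correlationId = correlationIdByDatasetId.get(datasetId);
        if (correlationId == null) {
            return Collections.emptyList();
        }
        List<String> found = statusesByCorrelationId.get(correlationId);
        return found == null ? Collections.<String>emptyList() : new ArrayList<>(found);
    }
}
```

Looking up a dataset first resolves its correlationId and then returns every status recorded for that request, which is the search path the data model is meant to support.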

How to publish status and dataset details events

Any service that wants to publish status and dataset details has to follow the steps below:

  1. Add core common lib as a dependency - The models, classes, and interfaces are defined in the core common lib from Azure. Make sure to select a version of the library that includes these classes.
  2. Identify all scenarios that publish Status/Dataset Details - Find every scenario in which either Status or Dataset Details can be published. A service can publish multiple sets of both Status and Dataset Details.
  3. Provide a cloud implementation to publish Status/Dataset Details - Implement the IEventPublisher interface from core common lib. Its publish method accepts an array of Messages and a map of string attributes. Message is an interface implemented by both Status and Dataset Details, so the method expects an array of either Status or Dataset Details. The implementation has to contain the cloud-specific code that publishes the events to statuschangedtopic.
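Step 3 can be sketched as follows. The Message, IEventPublisher, and StatusDetails types below are simplified local stand-ins written for this example, and the publish signature is an assumption based on the description above; the real types in core common lib may differ. InMemoryEventPublisher is hypothetical and only collects what it receives instead of sending it to statuschangedtopic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the Message interface from core common lib (assumed shape).
interface Message {
    String getKind();
}

// Stand-in for IEventPublisher; the signature is an assumption.
interface IEventPublisher {
    void publish(Message[] messages, Map<String, String> attributes);
}

// Minimal status message carrying only a few of the sample properties.
final class StatusDetails implements Message {
    final String correlationId;
    final String stage;
    final String status;

    StatusDetails(String correlationId, String stage, String status) {
        this.correlationId = correlationId;
        this.stage = stage;
        this.status = status;
    }

    @Override
    public String getKind() {
        return "status";
    }
}

// A real implementation would hand the batch to the cloud provider's
// messaging client; this one just records it in a list.
final class InMemoryEventPublisher implements IEventPublisher {
    static final class Published {
        final List<Message> messages;
        final Map<String, String> attributes;

        Published(List<Message> messages, Map<String, String> attributes) {
            this.messages = messages;
            this.attributes = attributes;
        }
    }

    final List<Published> sent = new ArrayList<>();

    @Override
    public void publish(Message[] messages, Map<String, String> attributes) {
        if (messages == null || messages.length == 0) {
            return; // nothing to publish
        }
        Map<String, String> copy =
            attributes == null ? new HashMap<String, String>() : new HashMap<>(attributes);
        sent.add(new Published(new ArrayList<>(Arrays.asList(messages)), copy));
    }
}
```

A service would call publish with an array holding either Status or Dataset Details messages, plus the attributes map for that request.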

Sample of status and dataset details message

A message is one of two kinds, DataSet Details or Status, and both kinds are published to the same statuschangedtopic.

  • DataSet Details
[
  {
    "kind": "datasetDetails",
    "properties": {
      "correlationId": "12345",
      "datasetId": "12345",
      "datasetVersionId": "1",
      "datasetType": "FILE",
      "recordCount": 10,
      "timestamp": 1625221800
    }
  }
]
  • Status
[
  {
    "kind": "status",
    "properties": {
      "correlationId": "12345",
      "recordId": "12334",
      "recordIdVersion": "123ff",
      "stage": "STORAGE_SYNC",
      "status": "FAILED",
      "message": "acl is not valid",
      "errorCode": 400,
      "userEmail": "test@email.com",
      "timestamp": 1625221800
    }
  }
]
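Both samples share the same envelope: an array of objects with a kind and a properties map. The sketch below shows one way to render that envelope. StatusEventJson is a hypothetical helper, it escapes only quotes and backslashes, and a real service would presumably rely on a JSON library rather than string building.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of core common lib.
final class StatusEventJson {
    // Wraps one message as [{"kind": ..., "properties": {...}}].
    // Numbers are written bare, every other value is written as a string.
    static String toJson(String kind, Map<String, Object> properties) {
        StringBuilder sb = new StringBuilder("[{\"kind\":")
            .append(quote(kind))
            .append(",\"properties\":{");
        boolean first = true;
        for (Map.Entry<String, Object> e : properties.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(quote(e.getKey())).append(':');
            Object value = e.getValue();
            sb.append(value instanceof Number ? value.toString() : quote(String.valueOf(value)));
        }
        return sb.append("}}]").toString();
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Passing a LinkedHashMap keeps the properties in insertion order, so the output follows the same field order as the samples.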

Core Common Library contents for GSM

  1. Models - StatusDetails and DatasetDetails - These two models should be used to publish status and dataset details.
  2. Utility - AttributesBuilder - Creates the attributes map that the publish method of IEventPublisher requires. The attributes map consists of the data partition id and the correlation id.
  3. Publisher Interface - IEventPublisher - The interface that a cloud provider has to implement to produce status and dataset details events. Its publish method accepts the Message array and the attributes map. Message is an interface implemented by both Status and Dataset Details.
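The attributes map from item 2 can be pictured with the sketch below. AttributesBuilderSketch is a hypothetical stand-in, and both its constructor arguments and the key names "data-partition-id" and "correlation-id" are assumptions; check the actual AttributesBuilder in core common lib for the real ones.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the AttributesBuilder utility.
final class AttributesBuilderSketch {
    // Key names are assumptions made for this example.
    static final String DATA_PARTITION_ID = "data-partition-id";
    static final String CORRELATION_ID = "correlation-id";

    private final String dataPartitionId;
    private final String correlationId;

    AttributesBuilderSketch(String dataPartitionId, String correlationId) {
        this.dataPartitionId = Objects.requireNonNull(dataPartitionId, "dataPartitionId");
        this.correlationId = Objects.requireNonNull(correlationId, "correlationId");
    }

    // Returns a read-only map holding exactly the two attributes.
    Map<String, String> createAttributesMap() {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put(DATA_PARTITION_ID, dataPartitionId);
        attributes.put(CORRELATION_ID, correlationId);
        return Collections.unmodifiableMap(attributes);
    }
}
```

The resulting map is what gets passed as the second argument when publishing a batch of messages.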

Supported Stages and Statuses

Stages and Services Mapping

Stage          Service
DATASET_SYNC   File Service, Dataset
INGESTOR       All ingestors, e.g., CSV, LAS/DLIS/Document
INGESTOR_SYNC  All ingestors, e.g., CSV, LAS/DLIS/Document
STORAGE_SYNC   Storage Service
ES_SYNC        Indexer Service

Supported Statuses

Status
SUBMITTED
SUCCESS
FAILED
IN_PROGRESS
SKIPPED
PARTIAL_SUCCESS
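The supported stages and statuses map naturally onto enums. The sketch below uses the values listed above; the names GsmStage, GsmStatus, and StatusValidator are hypothetical and chosen for this example, so the library may model these values differently.

```java
// Values are taken from the lists above; the type names are hypothetical.
enum GsmStage { DATASET_SYNC, INGESTOR, INGESTOR_SYNC, STORAGE_SYNC, ES_SYNC }

enum GsmStatus { SUBMITTED, SUCCESS, FAILED, IN_PROGRESS, SKIPPED, PARTIAL_SUCCESS }

final class StatusValidator {
    // True only when both names match a supported stage and a supported status.
    static boolean isSupported(String stage, String status) {
        return contains(GsmStage.values(), stage) && contains(GsmStatus.values(), status);
    }

    // Exact, case-sensitive match against the enum constant names.
    private static boolean contains(Enum<?>[] values, String name) {
        if (name == null) {
            return false;
        }
        for (Enum<?> value : values) {
            if (value.name().equals(name)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this can reject a message with an unknown stage or status before it is published.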
Edited by Maksim Malkov
