Add GSM integration & IT tweaks
Contents
- bugfixes
- GSM integration
GSM implementation details
The ADR.
You can check out this MR, which takes a similar implementation approach.
Review hint
More than 100 files are changed, but only 10+ of them contain changes for GSM.
The rest mostly contain changes for ITs, so I think you can rely on the passing pipeline and skip reviewing them in detail.
You can also find the GSM-specific changes on their own, only 12 files, in this Draft MR.
About GSM
Introduction
Global Status Monitoring (GSM) is a mechanism to track the status of data journeys (dataflows) on the data platform. The infrastructure helps track the status of a file, data, or record ingested through File Service, Storage API, or a specific DOMS until it is consumed by dependent services.
Status Data Model
Data Model properties let a user search for a status by one or several properties. Every request is tracked through a specific dataSetId and its associated correlationId.
The Status Data Model is distributed across multiple tables to track whether a dataflow has finished and whether it succeeded or failed. One table holds DataSet Details and another holds the overall Status of that dataflow journey.
- DataSet Details - A dataset can be anything that contains data. For example, a File is one type of dataset, and a File Collection could be another dataset that contains a set of files.
- Status - Holds the overall status of that dataflow.

correlationId is used as a unique id to capture a single request going through different stages of our Data Platform.
How to publish status and dataset details events
Any service that wants to publish status and dataset details has to follow the steps below:
- Add core common lib as a dependency - The models, classes, and interfaces are defined in core common lib from Azure. Make sure to select a version of the library that includes these classes.
- All possible scenarios to publish Status/Dataset Details - It is advised to identify all scenarios in which either Status or Dataset Details can be published. A service can publish multiple sets of both Status and Dataset Details.
- Cloud implementation to publish Status/Dataset Details - You need to provide an implementation of the IEventPublisher interface from core common lib. The publish method in this interface accepts an array of Messages and a map of string attributes. Message is an interface implemented by both Status and Dataset Details, so the method expects an array of either Status or Dataset Details. This method of IEventPublisher has to be implemented with cloud-specific code that publishes events to statuschangedtopic.
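
The publishing steps can be illustrated with a minimal, hedged sketch. It does not use the real core common lib types: `SketchMessage`, `SketchEventPublisher`, `SketchStatus`, and `InMemoryEventPublisher` are hypothetical stand-ins, the `publish` signature is only assumed from the description above, and the `correlation-id` attribute key is an assumption. A cloud implementation would call the provider's messaging SDK where this sketch appends to a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Message interface from core common lib.
interface SketchMessage {
    String kind();
}

// Hypothetical stand-in for IEventPublisher; the real signature may differ.
interface SketchEventPublisher {
    void publish(SketchMessage[] messages, Map<String, String> attributes);
}

// Reduced status message carrying only three of the sample fields.
final class SketchStatus implements SketchMessage {
    final String correlationId;
    final String stage;
    final String status;

    SketchStatus(String correlationId, String stage, String status) {
        this.correlationId = correlationId;
        this.stage = stage;
        this.status = status;
    }

    @Override
    public String kind() {
        return "status";
    }
}

// In-memory publisher: records what would be sent instead of calling a cloud SDK.
final class InMemoryEventPublisher implements SketchEventPublisher {
    static final String TOPIC = "statuschangedtopic";
    private final List<String> sent = new ArrayList<>();

    @Override
    public void publish(SketchMessage[] messages, Map<String, String> attributes) {
        if (messages == null || messages.length == 0) {
            throw new IllegalArgumentException("at least one message is required");
        }
        for (SketchMessage message : messages) {
            // Record "topic:kind:correlation id" (assumed attribute key) per message.
            sent.add(TOPIC + ":" + message.kind() + ":" + attributes.get("correlation-id"));
        }
    }

    List<String> sent() {
        return Collections.unmodifiableList(sent);
    }
}
```

Keeping the publisher behind an interface lets the service code stay cloud-agnostic, while each provider supplies its own implementation.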
Sample of status and dataset details message
Status messages are one of two kinds, DataSet Details and Status, and both are published to the same statuschangedtopic.
- DataSet Details
[
  {
    "kind": "datasetDetails",
    "properties": {
      "correlationId": "12345",
      "datasetId": "12345",
      "datasetVersionId": "1",
      "datasetType": "FILE",
      "recordCount": 10,
      "timestamp": 1625221800
    }
  }
]
- Status
[
  {
    "kind": "status",
    "properties": {
      "correlationId": "12345",
      "recordId": "12334",
      "recordIdVersion": "123ff",
      "stage": "STORAGE_SYNC",
      "status": "FAILED",
      "message": "acl is not valid",
      "errorCode": 400,
      "userEmail": "test@email.com",
      "timestamp": 1625221800
    }
  }
]
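
Both samples share the same envelope: a one-element array holding an object with a kind and a flat properties object. The sketch below shows one way to produce that shape by hand. `StatusEnvelopeSketch` and its methods are hypothetical names, it only handles String and Number values, and it does not escape control characters, so a real service would more likely rely on the serialization that comes with the library models.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of core common lib.
final class StatusEnvelopeSketch {

    // Quotes a string and escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Serializes a flat map as a JSON object; numbers stay unquoted.
    static String toJsonObject(Map<String, Object> properties) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            Object value = entry.getValue();
            String rendered = (value instanceof Number)
                    ? value.toString()
                    : quote(String.valueOf(value));
            joiner.add(quote(entry.getKey()) + ":" + rendered);
        }
        return joiner.toString();
    }

    // Wraps the properties in the kind/properties envelope inside a one-element array.
    static String envelope(String kind, Map<String, Object> properties) {
        return "[{\"kind\":" + quote(kind)
                + ",\"properties\":" + toJsonObject(properties) + "}]";
    }
}
```

Passing a `LinkedHashMap` keeps the properties in insertion order, which makes the output easier to compare against the samples.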
Core Common Library contents for GSM
- Models - StatusDetails and DatasetDetails - These 2 models should be used to publish status and dataset details.
- Utility - AttributesBuilder - This helps create the attributes map required by the publish method of IEventPublisher to publish status or dataset details. The attributes map consists of data partition id and correlation id.
- Publisher Interface - IEventPublisher - This is the interface that a cloud provider has to implement to produce status and dataset details. It contains a method that accepts the Message array and the attributes map. Message is an interface implemented by both Status and Dataset Details.
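
To make the attributes map concrete, here is a rough sketch of what building it could look like. This is not the AttributesBuilder from core common lib: `AttributesSketch` is a hypothetical class, and the key names `data-partition-id` and `correlation-id` are assumptions that should be checked against the library.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the library's AttributesBuilder may look different.
final class AttributesSketch {
    // Assumed key names, not confirmed against core common lib.
    static final String DATA_PARTITION_ID = "data-partition-id";
    static final String CORRELATION_ID = "correlation-id";

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Builds a read-only map holding the two attributes every event carries.
    static Map<String, String> build(String dataPartitionId, String correlationId) {
        if (isBlank(dataPartitionId) || isBlank(correlationId)) {
            throw new IllegalArgumentException(
                    "data partition id and correlation id are both required");
        }
        Map<String, String> attributes = new HashMap<>();
        attributes.put(DATA_PARTITION_ID, dataPartitionId);
        attributes.put(CORRELATION_ID, correlationId);
        return Collections.unmodifiableMap(attributes);
    }
}
```

Rejecting blank values early means an event without a correlation id never reaches the topic, where it could not be tied back to its request.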
Supported Stages and Statuses
Stages and Services Mapping
Stage | Service |
---|---|
DATASET_SYNC | File Service, Dataset |
INGESTOR | All ingestors, e.g., CSV, LAS/DLIS/Document |
INGESTOR_SYNC | All ingestors, e.g., CSV, LAS/DLIS/Document |
STORAGE_SYNC | Storage Service |
ES_SYNC | Indexer Service |
Supported Statuses
Status |
---|
SUBMITTED |
SUCCESS |
FAILED |
IN_PROGRESS |
SKIPPED |
PARTIAL_SUCCESS |
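
The two tables map naturally onto enums, which lets a publisher reject unsupported values before sending an event. The sketch below is illustrative only: `GsmStage`, `GsmStatus`, and `GsmValidation` are hypothetical names, and core common lib may already ship its own types for this.

```java
// Stages from the mapping table, each with the service(s) that report it.
enum GsmStage {
    DATASET_SYNC("File Service, Dataset"),
    INGESTOR("All ingestors"),
    INGESTOR_SYNC("All ingestors"),
    STORAGE_SYNC("Storage Service"),
    ES_SYNC("Indexer Service");

    private final String service;

    GsmStage(String service) {
        this.service = service;
    }

    String service() {
        return service;
    }
}

// Statuses from the supported statuses table.
enum GsmStatus {
    SUBMITTED, SUCCESS, FAILED, IN_PROGRESS, SKIPPED, PARTIAL_SUCCESS
}

// Hypothetical helper that checks raw strings against both enums.
final class GsmValidation {
    static boolean isSupported(String stage, String status) {
        if (stage == null || status == null) {
            return false;
        }
        try {
            GsmStage.valueOf(stage);
            GsmStatus.valueOf(status);
            return true;
        } catch (IllegalArgumentException e) {
            // valueOf throws when the name matches no enum constant.
            return false;
        }
    }
}
```

For instance, `GsmValidation.isSupported("STORAGE_SYNC", "FAILED")` accepts the combination used in the sample status message, while an unknown status name is rejected.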