Add GSM
Contents
- bugfixes
- GSM integration
GSM implementation details
In the current iteration, we implemented both services for publishing `STATUS` types of messages. For actual use, we have integrated only the status part, in line with our architectural inputs.
After this MR is merged, we can expect the following messages to be sent (or at least attempted, if the required resources are absent):
- Once a user calls `triggerWorkflow`, the service will send a GSM with `SUBMITTED` status.
- Once a user calls `updateWorkflowRun`, the service will send a GSM with `<Any of logically correct and supported>` status retrieved from the update payload (we send a message only if the status has changed).
Currently, we supply all providers with a default dummy implementation of the message sender (`IEventPublisher`). It can easily be overridden, as we did for the Azure provider.
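The behavior described above (a default sender that does nothing, and an update message sent only when the status actually changes) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this MR: `StatusPublisher`, `NoOpStatusPublisher`, `RecordingStatusPublisher`, and `WorkflowStatusNotifier` are hypothetical names, and the real `IEventPublisher` has a different signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for the message sender; the real IEventPublisher
// in core common lib has a different signature.
interface StatusPublisher {
    void publish(String runId, String status);
}

// Default implementation for providers without a GSM backend: does nothing.
class NoOpStatusPublisher implements StatusPublisher {
    @Override
    public void publish(String runId, String status) {
        // Intentionally empty.
    }
}

// Demo implementation that records what would have been sent.
class RecordingStatusPublisher implements StatusPublisher {
    final List<String> sent = new ArrayList<>();

    @Override
    public void publish(String runId, String status) {
        sent.add(runId + ":" + status);
    }
}

class WorkflowStatusNotifier {
    private final StatusPublisher publisher;

    WorkflowStatusNotifier(StatusPublisher publisher) {
        this.publisher = publisher;
    }

    // triggerWorkflow path: a new run is always reported as SUBMITTED.
    void onTrigger(String runId) {
        publisher.publish(runId, "SUBMITTED");
    }

    // updateWorkflowRun path: publish only if the status really changed.
    // Returns true when a message was handed to the publisher.
    boolean onUpdate(String runId, String previousStatus, String newStatus) {
        if (newStatus == null || Objects.equals(previousStatus, newStatus)) {
            return false;
        }
        publisher.publish(runId, newStatus);
        return true;
    }
}
```

Swapping `NoOpStatusPublisher` for a provider-specific class is then the only change a provider needs to make to start emitting messages.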
About GSM
Introduction
Global Status Monitoring (GSM) is a mechanism for tracking the status of data journeys (dataflows) on the data platform. The infrastructure helps track the status of a file, data, or record ingested through File Service, Storage API, or a specific DOMS until it is consumed by dependent services.
Every stage publishes one status message to the message queue. From there, `Status Collector` picks up the messages and normalizes them so they can be kept in persistent storage for future reference. `Status Processor` then provides an API to query and check the status of past datasets.
Status Data Model
The data model properties let any user search for a status by one or several properties. Every request is tracked through a specific `dataSetId` and its associated `correlationId`.
The Status Data Model is distributed across multiple tables to track whether a dataflow has finished and whether it succeeded or failed. One table holds the DataSet Details and another holds the overall Status of that dataflow journey.
- DataSet Details - A dataset can be anything that contains data. For example, a File is one type of dataset, and a File Collection is another dataset that contains a set of files.
- Status - holds the overall status of that dataflow.

`correlationId` is used as a unique id to capture a single request going through the different stages of our Data Platform.
How to publish status and dataset details events
Any service that wants to publish status and dataset details has to follow the steps below:
- Add `core common lib` as a dependency - There are models, classes, and interfaces defined in `core common lib` from Azure. We have to make sure we have selected a version of the library that includes these classes.
- All possible scenarios to publish Status/Dataset Details - It is advised to identify every scenario in which either Status or Dataset Details can be published. A service can publish multiple sets of both Status and Dataset Details.
- Cloud implementation to publish Status/Dataset Details - You need to provide an implementation of the `IEventPublisher` interface from core common lib. The publish method in this interface accepts an array of Messages and a map of string attributes. Message is an interface implemented by both Status and Dataset Details, so this method expects an array of either Status or Dataset Details. This method of `IEventPublisher` has to be implemented with cloud-specific code that publishes events to `statuschangedtopic`.
Note: We have an Azure implementation of Global Status Monitoring. Services that are not part of the OSDU AKS cluster have to use the /status and /datasetDetails endpoints of the `Status Processor` service. The `Status Processor` service will then publish the status and dataset details to `statuschangedtopic`.
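As a rough sketch of the third step, a provider implementation might look like the following. The types here are simplified stand-ins written for this example (`GsmMessage`, `StatusEvent`, `DatasetDetailsEvent`, `GsmEventPublisher`, and `InMemoryGsmPublisher` are all hypothetical names). The only details taken from the description above are that the publish method receives an array of messages plus a map of string attributes, and that events go to `statuschangedtopic`. A real implementation would call the provider's messaging SDK where this one appends to an in-memory list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the Message interface (assumed shape).
interface GsmMessage {
    String kind();
}

// Stand-in for a Status message; only a few fields are shown.
class StatusEvent implements GsmMessage {
    final String correlationId;
    final String stage;
    final String status;

    StatusEvent(String correlationId, String stage, String status) {
        this.correlationId = correlationId;
        this.stage = stage;
        this.status = status;
    }

    @Override
    public String kind() {
        return "status";
    }
}

// Stand-in for a Dataset Details message.
class DatasetDetailsEvent implements GsmMessage {
    final String correlationId;
    final String datasetId;

    DatasetDetailsEvent(String correlationId, String datasetId) {
        this.correlationId = correlationId;
        this.datasetId = datasetId;
    }

    @Override
    public String kind() {
        return "datasetDetails";
    }
}

// Stand-in for IEventPublisher: an array of messages plus string attributes.
interface GsmEventPublisher {
    void publish(GsmMessage[] messages, Map<String, String> attributes);
}

// Provider-specific implementation. Both message kinds land on one topic.
class InMemoryGsmPublisher implements GsmEventPublisher {
    static final String TOPIC = "statuschangedtopic";
    private final Map<String, List<GsmMessage>> topics = new HashMap<>();

    @Override
    public void publish(GsmMessage[] messages, Map<String, String> attributes) {
        if (messages == null || messages.length == 0) {
            return; // nothing to send
        }
        // A real provider would attach `attributes` as message metadata;
        // this sketch ignores them.
        List<GsmMessage> queue = topics.computeIfAbsent(TOPIC, k -> new ArrayList<>());
        Collections.addAll(queue, messages);
    }

    // Read-only view of what has been published to a topic.
    List<GsmMessage> messagesOn(String topic) {
        List<GsmMessage> queue = topics.get(topic);
        if (queue == null) {
            return Collections.<GsmMessage>emptyList();
        }
        return Collections.unmodifiableList(queue);
    }
}
```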
Sample of status and dataset details message
Messages are one of two kinds, DataSet Details and Status, but both are published to the same `statuschangedtopic`.
- DataSet Details
[
  {
    "kind": "datasetDetails",
    "properties": {
      "correlationId": "12345",
      "datasetId": "12345",
      "datasetVersionId": "1",
      "datasetType": "FILE",
      "recordCount": 10,
      "timestamp": 1625221800
    }
  }
]
- Status
[
  {
    "kind": "status",
    "properties": {
      "correlationId": "12345",
      "recordId": "12334",
      "recordIdVersion": "123ff",
      "stage": "STORAGE_SYNC",
      "status": "FAILED",
      "message": "acl is not valid",
      "errorCode": 400,
      "userEmail": "test@email.com",
      "timestamp": 1625221800
    }
  }
]
Core Common Library contents for GSM
- Models - `StatusDetails` and `DatasetDetails` - These 2 models should be used to publish status and dataset details.
- Utility - `AttributesBuilder` - This helps create the attributes map that the publish method of `IEventPublisher` requires to publish status or dataset details. The attributes map consists of `data partition id` and `correlation id`.
- Publisher Interface - `IEventPublisher` - This is the interface that a cloud provider has to implement to produce status and dataset details. It contains a method that accepts the Message array and the attributes map. Message is an interface implemented by both Status and Dataset Details.
Supported Stages and Statuses
Stages and Services Mapping
Stage | Service |
---|---|
DATASET_SYNC | File Service, Dataset |
INGESTOR | All Ingestors for e.g., CSV, LAS/DLIS/Document |
INGESTOR_SYNC | All Ingestors for e.g., CSV, LAS/DLIS/Document |
WKS_SYNC | All those services that create WKS source records in the Data Platform, for e.g., WKS Transformation Service |
WKE_SYNC | WKE Service |
STORAGE_SYNC | Storage Service |
ES_SYNC | Indexer Service |
Supported Statuses
Status |
---|
SUBMITTED |
SUCCESS |
FAILED |
IN_PROGRESS |
SKIPPED |
PARTIAL_SUCCESS |
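
One way a service could check a status taken from an update payload against the values in the tables above is to mirror them as enums. This is a sketch only: the enum and class names (`GsmStage`, `GsmStatus`, `GsmStatusParser`) are hypothetical, and core common lib may already provide its own types for this.

```java
import java.util.Locale;

// Stages from the "Stages and Services Mapping" table.
enum GsmStage {
    DATASET_SYNC, INGESTOR, INGESTOR_SYNC, WKS_SYNC, WKE_SYNC, STORAGE_SYNC, ES_SYNC
}

// Statuses from the "Supported Statuses" table.
enum GsmStatus {
    SUBMITTED, SUCCESS, FAILED, IN_PROGRESS, SKIPPED, PARTIAL_SUCCESS
}

class GsmStatusParser {
    // Returns the matching status, or null when the value is not supported.
    // Matching ignores case and surrounding whitespace.
    static GsmStatus parseOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return GsmStatus.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

A caller can then skip publishing when `parseOrNull` returns null instead of sending an unsupported status.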