ADR: Producing Status Messages
Decision Title
Producing status messages to track the overall status of a dataflow (ingestion + enrichment).
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Overview
Currently there is no way to track the status of a data journey/dataflow on the Data Platform.
Users can interact with the Data Platform in multiple ways. Data can be ingested using:
- Dataset Service
- File Service
- Storage API
- Specific DOMS
There are multiple data/metadata ingestion, discovery, and enrichment applications (DAG, Elastic, WKS) that consume data from the Data Platform to make it available for other products. This means a record ingested through any of the above approaches goes through multiple workflows (stages).
This proposal defines a standard format and mechanism for producing status messages that can later be leveraged to track the data journey.
Context & Scope
Data model
We will track every request in the Status data model through its specific dataSetId and associated correlationId.
A dataflow could have millions of records, spread across multiple datasets, to ingest into the Data Platform. A consumer of this status model should be able to determine the following (a consumer-side sketch follows the samples below):
- whether the dataflow has finished or not
- whether it succeeded or failed
- if it failed, why (the failure reason)
- which stage it is currently in
dataset details
[{ "kind": "dataSetDetails", "properties": { "correlationId": "12345", "dataSetId": "12345", "dataSetVersionId": "1", "dataSetType": "FILE", "recordCount": 10, "timestamp": time() } }]
status
[{ "kind": "status", "properties": { "correlationId": "12345", "recordId": "12334", "recordIdVersion": "123ff", "stage": "STORAGE_SYNC", "status": "FAILED", "message": "acl is not valid", "errorCode": 400, "timestamp": time() } }]
Mechanism to produce status messages
Option 1: Publish Events
- Every stage (service) publishes its own status to a message queue, against a dataSetId (which can be the file metadata id in the case of File Service) and a correlationId, as sketched after this list.
- Consumers can simply subscribe to that message queue or a notification service to receive status-changed events for that dataSetId and correlationId.
- All our core services already use a correlation-id and propagate it on further REST calls and published events.
- We can leverage this to tie together all related status-changed notifications.
- A status data processor service can be built to listen to these status-changed events and write them to a persistent store, making them queryable for future reference.
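A minimal sketch of what producing one of these events could look like, assuming a hypothetical MessagePublisher abstraction and a "status-changed" topic name (neither is an existing platform API; each CSP would back this with its own message queue):

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction; CSP implementations would target Service Bus, Pub/Sub, SQS, etc.
interface MessagePublisher {
    void publish(String topic, Map<String, Object> payload);
}

class StatusEventProducer {
    private final MessagePublisher publisher;

    StatusEventProducer(MessagePublisher publisher) {
        this.publisher = publisher;
    }

    /** Publishes one status-changed event, carrying forward the caller's correlation-id. */
    void publishStatus(String correlationId, String recordId, String stage,
                       String status, String message, Integer errorCode) {
        Map<String, Object> properties = new HashMap<>();
        properties.put("correlationId", correlationId); // propagated from the incoming request
        properties.put("recordId", recordId);
        properties.put("stage", stage);                 // e.g. STORAGE_SYNC
        properties.put("status", status);               // e.g. SUCCESS or FAILED
        properties.put("message", message);
        properties.put("errorCode", errorCode);
        properties.put("timestamp", Instant.now().toEpochMilli());
        publisher.publish("status-changed", Map.of("kind", "status", "properties", properties));
    }
}
```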
Option 2: Write to a Persistent Store (NoSQL)
- Every stage (service) writes these status messages to a persistent store (two tables: DatasetDetails and Status), as sketched after this list.
- This persists all status events, which remain available for event-tracking purposes.
- We can build one service that exposes query capabilities to get the status of any past job.
- We can have rich query capabilities for filtering statuses at any stage, and we can filter failures at any level (record or file).
- We can configure the persistent store to trigger events whenever a new entry is added, if required.
- No status data processor is needed, unlike in the mechanism above.
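A minimal sketch of Option 2, assuming a hypothetical NoSqlStore abstraction (a concrete implementation would target Cosmos DB, DynamoDB, or a similar document store) over the two tables named above:

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical key/document store abstraction over the two tables.
interface NoSqlStore {
    void upsert(String table, String key, Map<String, Object> document);
}

class StatusWriter {
    private static final String DATASET_DETAILS_TABLE = "DatasetDetails";
    private static final String STATUS_TABLE = "Status";

    private final NoSqlStore store;

    StatusWriter(NoSqlStore store) {
        this.store = store;
    }

    /** Written once per dataset so consumers know how many records to expect. */
    void writeDataSetDetails(String correlationId, String dataSetId, int recordCount) {
        store.upsert(DATASET_DETAILS_TABLE, correlationId + ":" + dataSetId, Map.of(
                "correlationId", correlationId,
                "dataSetId", dataSetId,
                "recordCount", recordCount,
                "timestamp", Instant.now().toEpochMilli()));
    }

    /** Written by each stage as a record moves through the workflow. */
    void writeStatus(String correlationId, String recordId, String stage, String status) {
        store.upsert(STATUS_TABLE, correlationId + ":" + recordId + ":" + stage, Map.of(
                "correlationId", correlationId,
                "recordId", recordId,
                "stage", stage,
                "status", status,
                "timestamp", Instant.now().toEpochMilli()));
    }
}
```

A query service on top of these tables can then filter by correlationId, stage, or status to surface failures at the record or file level.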
Option 3: Emit Logs
- Instead of publishing events to Pub/Sub directly, services/stages emit logs containing the same status messages (see the sketch after this list).
- We can configure a log forwarder to look for those logs and publish them back to a message queue, so that consumers can subscribe to that queue for status-changed events.
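A minimal sketch of Option 3, assuming SLF4J for logging; the STATUS_EVENT marker that the log forwarder would match on before republishing to the message queue is an assumption, not an existing convention:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class StatusLogEmitter {
    private static final Logger LOG = LoggerFactory.getLogger(StatusLogEmitter.class);

    /** Emits the status payload as a structured log line instead of publishing it directly. */
    void emitStatus(String correlationId, String recordId, String stage,
                    String status, String message) {
        String payload = String.format(
                "{\"kind\":\"status\",\"correlationId\":\"%s\",\"recordId\":\"%s\","
                        + "\"stage\":\"%s\",\"status\":\"%s\",\"message\":\"%s\"}",
                correlationId, recordId, stage, status, message);
        LOG.info("STATUS_EVENT {}", payload);
    }
}
```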
Decision
- Every service should produce the specific status model.
- The logic to publish the event would be implemented by the CSPs.
- We can target the CSV workflow as the first one to produce status messages, so the services involved in that workflow should start producing them.
- From the SLB side, we can take care of the Azure implementation.
- File Service
- WKS Service
- CSV Parser (DAG)
- Workflow Service
Rationale
To be able to track the status of ingestion/enrichment of a dataset across multiple services.