Top-level issue to track requirements for metrics, tracing and audit logging
Log, Metric Aggregation
The OSDU Data Platform is a large, complex system with multiple services (SLIs) running against multiple infrastructure components (System Telemetry). Holistic monitoring and event correlation require aggregating logs and metrics from all of these sources.
- The OSDU operations readiness workstream recommends deploying and configuring a central logging service that is decoupled and isolated from the Data Platform implementation and that can act as a simple point of access for filtering, searching, alerting, notification and dashboards.
- Isolation is important so that the logging system is not subject to the same reliability issues as the OSDU Data Platform.
Alerts and Notification
- The system should be able to detect slow- and fast-burning issues based on thresholds and trends
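
One possible way to read "slow- and fast-burning" is in terms of error-budget burn rate: how quickly failures consume the budget allowed by an SLO. The sketch below is illustrative only. The `Window` type, the function names, the 99% target and both thresholds are assumptions chosen for the example, not requirements from the workstream.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one time window (hypothetical shape)."""
    total: int
    failed: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target).

    A value of 1.0 means the budget is being consumed exactly as fast as
    the SLO allows; larger values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = window.failed / window.total
    return error_rate / (1.0 - slo_target)


def classify(short: Window, long: Window, slo_target: float,
             fast_threshold: float = 14.4,   # assumed value, tune per service
             slow_threshold: float = 1.0) -> str:  # assumed value
    """Return 'fast-burn', 'slow-burn' or 'ok'.

    The short window catches sudden spikes; the long window catches a
    sustained trend that never crosses the fast threshold.
    """
    if burn_rate(short, slo_target) >= fast_threshold:
        return "fast-burn"
    if burn_rate(long, slo_target) >= slow_threshold:
        return "slow-burn"
    return "ok"


if __name__ == "__main__":
    # Sudden spike: 20% of requests failing in the short window.
    print(classify(Window(1000, 200), Window(100000, 2000), slo_target=0.99))
    # Sustained trend: short window looks healthy, long window does not.
    print(classify(Window(1000, 5), Window(100000, 2000), slo_target=0.99))
```

In practice the two windows would be filled from the central logging or metrics service, and each result would be routed to a different notification channel (for example, paging for a fast burn and a ticket for a slow burn).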
Metric Examples
- availability - the % of successful responses
- latency & performance - the % of requests that complete faster than a target
- freshness - the % of data that is up to date
- correctness - the % of requests that return the correct result
- durability - the % of data that is recorded that can be read successfully
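
Each metric above is a ratio of "good" events to all events, expressed as a percentage. A minimal sketch of that calculation follows, using availability and latency as examples. The function names, the rule that a status code below 500 counts as successful, and the sample data are assumptions for illustration only.

```python
from typing import Iterable, Optional


def sli_percent(good: int, total: int) -> Optional[float]:
    """Percentage of good events, or None when there is no data."""
    if total == 0:
        return None
    return 100.0 * good / total


def availability(status_codes: Iterable[int]) -> Optional[float]:
    """% of successful responses.

    Assumption: any status code below 500 counts as successful.
    """
    codes = list(status_codes)
    good = sum(1 for code in codes if code < 500)
    return sli_percent(good, len(codes))


def latency(durations_s: Iterable[float], target_s: float) -> Optional[float]:
    """% of requests that complete faster than target_s seconds."""
    durations = list(durations_s)
    good = sum(1 for d in durations if d < target_s)
    return sli_percent(good, len(durations))


if __name__ == "__main__":
    print(availability([200, 200, 500, 200]))           # 3 of 4 succeeded
    print(latency([0.1, 0.2, 0.9, 0.3], target_s=0.5))  # 3 of 4 under target
```

Freshness, correctness and durability can reuse `sli_percent` in the same way once "good" is defined for each: records updated within a target age, responses that match an expected result, and stored records that read back successfully.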
Overview from Ops Workstream
Please refer to the following links