Commit 3d00f7f8 authored by Raj Kannan
parent 89596253
# Audit and Metrics
This project aims to standardize the procedures for logging, metrics formulation and tracing support in the OSDU platform. It is a cross-cutting concern, designed primarily to address the operational rigor and cyber-security needs of the platform in a production setting.
- The project works in close collaboration with the Ops Procedures and Information Security workstreams.
- It embraces the shared responsibility model for operations between platform development, hosting providers and customers (operators).
- The logging and monitoring environment aims to support the system availability, reliability, performance and cyber-security incident detection/tracing requirements.
- It is expected that hosting providers and operators will use CSP-specific tooling such as Azure Monitor, App Insights or Stackdriver to manage the environment.
- The integration and operation of that tooling is beyond the scope of the current project.
- The project will both standardize and automate logging, with the goals of minimizing the support effort for OSDU in operations, improving traceability, protecting OSDU from a cyber-security perspective, and reducing turnaround time and manual dependencies.
## Shared Responsibility Model
| | | | | |
| :------------- | :----------: | -----------: | :----------: | -----------: |
| SERVICE LEVEL INDICATORS | Responsible | Accountable | Consulted | Consulted |
| SYSTEM TELEMETRY | Responsible | Accountable | Responsible | Consulted |
| CENTRALIZED LOGGING | Consulted | Accountable | Consulted | Responsible |
| ALERTS & NOTIFICATIONS | Consulted | Accountable | Consulted | Responsible |
| DASHBOARDS | Consulted | Accountable | Consulted | Responsible |
## Project Scope
- The scope of this project is to provide the standards and a common code implementation for services to report logs, metrics and trace information.
- The project will evaluate tools or libraries that can be used to generate and expose logs, metrics and tracing information for consumption.
Note: where a deployment is built up from multiple instances, the system should support monitoring, reporting and alerting in an aggregated and synchronised view. This might particularly be the case for larger operators.
## Definitions
- Logging
> This is what the OSDU platform does today (see Current Status below). Logs help to track errors and associated data in a centralized way. Different log levels and log types help identify the categories of messages, and a standardized structure makes it possible to track all requests to the platform and their status. Logging can help with security and debugging, but it is not really relied upon by SREs and operations engineers.
- Metrics
> A time-variant measurement, such as latency or requests per hour, that is correlated to the service level indicators which are the basis of user satisfaction with the platform. Metrics are useful for understanding the scalability, performance, availability and reliability of the OSDU platform. A metric can be a point value in time, a cumulative value, or a range of values over a time interval.
- Per the shared responsibility model, infrastructure-level metrics come from the CSPs; the platform focuses on metrics specific to the services delivered.
- Tracing
> Tracing helps cyber-security and operations monitoring to understand a user's journey, and the issues that may have been encountered, as a logical request is routed through multiple services within the OSDU platform. By tracing through the stack, site reliability engineers and developers can identify bottlenecks and focus on improving performance. For example, an end-to-end data flow may begin with the storage service, which triggers the indexing service to update the index, and end with a query served by the search service that retrieves the data that was just loaded.
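
The storage, indexing and search flow described in the Tracing definition can be sketched as a single trace id that travels in the request headers while each service records how long its step took. This is a minimal illustration only: the `trace-id` header name, the `TraceSketch` class and the in-memory span list are all assumptions made for the example and are not part of the OSDU code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: none of these names come from the OSDU code base.
class TraceSketch {
    // Assumed header name; a real deployment would have to agree on one.
    static final String TRACE_HEADER = "trace-id";

    // One recorded unit of work within a trace.
    static final class Span {
        final String traceId;
        final String service;
        final long durationNanos;

        Span(String traceId, String service, long durationNanos) {
            this.traceId = traceId;
            this.service = service;
            this.durationNanos = durationNanos;
        }
    }

    // In-memory stand-in for a trace collector.
    static final List<Span> collected = new ArrayList<>();

    // Reuse the caller's trace id when present, otherwise start a new trace.
    static String ensureTraceId(Map<String, String> headers) {
        return headers.computeIfAbsent(TRACE_HEADER, k -> UUID.randomUUID().toString());
    }

    // Run one service step, record its span, and leave the trace id
    // in the headers so the next hop joins the same trace.
    static void handle(String service, Map<String, String> headers, Runnable work) {
        String traceId = ensureTraceId(headers);
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            collected.add(new Span(traceId, service, System.nanoTime() - start));
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        handle("storage", headers, () -> { /* store the record */ });
        handle("indexer", headers, () -> { /* update the index */ });
        handle("search", headers, () -> { /* answer the query */ });
        for (Span s : collected) {
            System.out.println(s.traceId + " " + s.service + " " + s.durationNanos + "ns");
        }
    }
}
```

Because every span carries the same trace id, a collector could group the three steps into one user journey and compare their durations to find the bottleneck.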
## Current Status
At the moment the system uses the logging library to capture logs from core system services. The current OSDU Data Platform Log Library exposes the following logging methods.
```java
void audit(String logname, AuditPayload payload, Map<String, String> headers);
void request(String logname, Request request, Map<String, String> headers);
void info(String logname, String message, Map<String, String> headers);
void warning(String logname, String message, Map<String, String> headers);
void warning(String logname, String message, Exception ex, Map<String, String> headers);
void error(String logname, String message, Map<String, String> headers);
void error(String logname, String message, Exception ex, Map<String, String> headers);
```
On their own, however, these methods do not help with metrics collection, tracing a call sequence, or auditing the system; additional capabilities are needed. This project aims to bridge these gaps.
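
One way to picture the gap is a thin decorator that derives a simple metric (calls per log level) from the existing logging calls. The sketch below is an assumption-laden illustration, not the Log Library itself: `Log` is a reduced, hypothetical copy of the interface containing only the `info` and `error` signatures listed above, and `ConsoleLog` and `CountingLog` are invented names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Reduced, hypothetical copy of the logging interface: two of the listed methods only.
interface Log {
    void info(String logname, String message, Map<String, String> headers);
    void error(String logname, String message, Map<String, String> headers);
}

// Stand-in implementation that writes to standard output.
class ConsoleLog implements Log {
    public void info(String logname, String message, Map<String, String> headers) {
        System.out.println("INFO " + logname + " " + message + " " + headers);
    }

    public void error(String logname, String message, Map<String, String> headers) {
        System.out.println("ERROR " + logname + " " + message + " " + headers);
    }
}

// Decorator (hypothetical) that counts calls per level before delegating.
class CountingLog implements Log {
    private final Log delegate;
    private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

    CountingLog(Log delegate) {
        this.delegate = delegate;
    }

    private void bump(String level) {
        counts.computeIfAbsent(level, k -> new AtomicLong()).incrementAndGet();
    }

    public void info(String logname, String message, Map<String, String> headers) {
        bump("info");
        delegate.info(logname, message, headers);
    }

    public void error(String logname, String message, Map<String, String> headers) {
        bump("error");
        delegate.error(logname, message, headers);
    }

    // Number of calls seen so far for the given level; 0 if none.
    long count(String level) {
        AtomicLong c = counts.get(level);
        return c == null ? 0L : c.get();
    }

    public static void main(String[] args) {
        CountingLog log = new CountingLog(new ConsoleLog());
        Map<String, String> headers = new HashMap<>();
        headers.put("correlation-id", "example-1"); // assumed header name
        log.info("storage.app", "record stored", headers);
        log.info("storage.app", "record stored", headers);
        log.error("storage.app", "index update failed", headers);
        System.out.println("info=" + log.count("info") + " error=" + log.count("error"));
    }
}
```

A decorator keeps the calling code unchanged, which is why it is used here; whether the project adopts this approach or a dedicated metrics library is an open design choice.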
## Possible Technology Choices
As mentioned in the introduction, the goal of this project is to let CSP-specific monitoring tools and dashboards work with the OSDU platform. The goal is therefore not to look at alternatives on the consumption end, but to look for standards for producing logs, metrics and traces, so that they can be wired into any of the CSP toolsets. A few initial technical choices that can be evaluated include:
- Prometheus exporters and push gateway for jobs like ingestion/enrichment workflows
- OpenCensus - a set of libraries to collect metrics and distributed traces
- OpenTelemetry - the merger of OpenTracing and OpenCensus, covering both tracing and metrics
- Brokering solutions like Logstash, Fluentd, ...??
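
To make the idea of a production-side standard concrete, the sketch below renders a cumulative counter in the Prometheus text exposition format, which is what a Prometheus exporter serves for scraping. It is an illustration under assumptions: the metric name `osdu_requests_total`, the `service` label and the `RequestCounter` class are hypothetical, and the code assumes label values that need no escaping (the format requires escaping backslashes, double quotes and newlines). A real service would more likely use an existing client library than hand-rolled rendering.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a labelled counter rendered in Prometheus text format.
class RequestCounter {
    private final String name;
    private final String help;
    // Label value -> count; TreeMap gives a stable output order.
    private final Map<String, Long> byService = new TreeMap<>();

    RequestCounter(String name, String help) {
        this.name = name;
        this.help = help;
    }

    // Counters are cumulative: they only ever go up.
    synchronized void increment(String service) {
        byService.merge(service, 1L, Long::sum);
    }

    // Emit "# HELP", "# TYPE", then one sample line per label value.
    // Assumes label values contain no backslash, quote or newline.
    synchronized String render() {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP ").append(name).append(' ').append(help).append('\n');
        sb.append("# TYPE ").append(name).append(" counter\n");
        for (Map.Entry<String, Long> e : byService.entrySet()) {
            sb.append(name)
              .append("{service=\"").append(e.getKey()).append("\"} ")
              .append(e.getValue())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RequestCounter requests = new RequestCounter("osdu_requests_total", "Requests handled.");
        requests.increment("search");
        requests.increment("search");
        requests.increment("storage");
        System.out.print(requests.render());
    }
}
```

Because the output is plain text in a documented format, any CSP toolset that can scrape or ingest that format could consume it without service-specific integration.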
## Useful links
- [Logging Workshop Requirements capture](
- [Operations Procedures for OSDU workstream artifacts](
- [API Logging Requirements for Cyber Security](
- [Platform logging requirements for Cyber Security](