ADR - Metrics and Monitoring
Decision Title
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The GC Team encountered challenges related to multi-tenancy in its bare-metal monitoring infrastructure. Despite efforts to pinpoint specific issues using the /readiness_check and /liveness_check endpoints, we had difficulty isolating problems to particular partitions.
To address this issue effectively, we propose exploring the actuator/metrics and actuator/prometheus endpoints for deeper insights into system performance and tenant-specific metrics.
This proposal aims to improve our ability to diagnose and troubleshoot multi-tenancy issues by leveraging more comprehensive metrics provided by the Actuator framework. By doing so, we can gain better visibility into the behavior of individual partitions and enhance our overall monitoring capabilities.
Decision
First, we will integrate Spring Boot Actuator and expose the health and Prometheus endpoints. The standard liveness and readiness probes (/actuator/health/liveness and /actuator/health/readiness) will serve as common endpoints replacing the old /liveness_check and /readiness_check ones.
management.endpoints.web.exposure.include=health, prometheus
management.health.probes.enabled=true
Next, we will add a Prometheus MeterRegistry implementation (micrometer-registry-prometheus) so that custom and common metrics can be recorded in a Prometheus-compatible format.
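As a minimal sketch of that wiring (the class name PartitionMetrics and the metric name partition_requests are illustrative, not part of the existing codebase), a Spring component can receive the auto-configured MeterRegistry and register a common, partition-tagged counter:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PartitionMetrics {

    private final MeterRegistry meterRegistry;

    public PartitionMetrics(MeterRegistry meterRegistry) {
        // With micrometer-registry-prometheus on the classpath, Spring Boot
        // auto-configures a PrometheusMeterRegistry and injects it here.
        this.meterRegistry = meterRegistry;
    }

    // Example of a common metric: count incoming requests per data partition.
    public void countRequest(String partitionId) {
        Counter.builder("partition_requests")
                .tag("dataPartitionId", partitionId)
                .register(meterRegistry)
                .increment();
    }
}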
The code snippet below shows how such metrics improve troubleshooting: by capturing HTTP error responses and tagging them with the requesting partition ID, we gain valuable insight into per-tenant behavior.
// Inside the existing exception-handling advice; metricRegistry is the injected MeterRegistry.
@ExceptionHandler(AppException.class)
@ResponseBody
public ResponseEntity<Object> handleInternalError(AppException e) {
    // Read the tenant identifier from the current request, if one is bound to this thread.
    ServletRequestAttributes attributes =
            (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
    if (attributes != null) {
        HttpServletRequest request = attributes.getRequest();
        String partitionId = request.getHeader("data-partition-id");
        countHttpStatus(e.getError().getCode(), partitionId);
    }
    return this.getErrorResponse(e);
}

private void countHttpStatus(int status, String partitionId) {
    // Count the error, tagged with the partition and the HTTP status code,
    // so it can be filtered and grouped in Prometheus and Grafana.
    Counter.builder("http_error_response")
            .tag("headerValue", partitionId)
            .tag("httpCode", String.valueOf(status))
            .register(metricRegistry)
            .increment();
}
This approach ties each error response to the partition that produced it, which makes tenant-specific failures much easier to troubleshoot.
In Grafana, we can visualize these metrics using Prometheus as the data source. We can create dashboards that display the count of HTTP error responses, segmented by partition ID and HTTP status code. This visualization enables us to quickly identify patterns and potential issues within our system.
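As an illustration, such a panel could be driven by a PromQL query along the lines of sum by (headerValue, httpCode) (increase(http_error_response_total[5m])), which plots the number of error responses per partition and status code over the last five minutes. The exact metric and label names here are an assumption: they follow Micrometer's Prometheus naming conventions (for example, counters are exported with a _total suffix and tag keys may be rewritten to snake_case), so the query should be adjusted to the names actually exposed on /actuator/prometheus.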
Moreover, the Prometheus endpoint provides built-in metrics that are useful for investigating issues:
- spring_security_authorizations_seconds_count{error="AccessDeniedException"}: counts authorization attempts that ended in an "AccessDeniedException" and can therefore help identify potential DDoS attacks or unauthorized access attempts (see the example query after this list).
- jvm_memory_committed_bytes: reports memory usage in the Java Virtual Machine (JVM), covering both heap and non-heap areas such as "G1 Survivor Space," "G1 Old Gen" (Old Generation), "Metaspace," "CodeHeap 'non-nmethods'," "G1 Eden Space," "Compressed Class Space," and "CodeHeap 'non-profiled nmethods'." Each series represents the amount of memory committed (allocated) in bytes for a specific memory area, which makes these metrics valuable for spotting memory-related problems such as leaks or excessive consumption.
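As an example of how the security metric can be used (the exact PromQL is an assumption and should be adjusted to the series actually exported), a sudden rise in increase(spring_security_authorizations_seconds_count{error="AccessDeniedException"}[5m]) would point to a burst of rejected requests and is a good candidate for an alert rule.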
Rationale
Consequences
Some benefits:
- Cost Savings: By moving away from a cloud monitoring system, we can potentially reduce costs associated with cloud service subscriptions or usage-based pricing models. A cloud-agnostic solution may offer more flexibility and cost-effective pricing options.
- Flexibility and Independence: A cloud-agnostic monitoring solution lets us monitor resources and applications across different cloud providers or on-premises environments. This flexibility reduces vendor lock-in and gives us more control over our monitoring infrastructure.
- Customization and Control: Owning the metrics pipeline lets us proactively monitor system health, identify anomalies, and take timely corrective action to keep performance and reliability at the expected level.
When to revisit
Tradeoff Analysis - Input to decision
Alternatives and implications
Prometheus vs. OpenTelemetry solution
Choosing between Prometheus and OpenTelemetry depends on factors such as whether we need to capture traces and logs in addition to metrics, whether we prefer a mature, stable tool over a newer project that is still gaining production exposure, and whether we need a multi-step routing and transformation pipeline.
The general comparison
| Prometheus | OpenTelemetry |
| --- | --- |
| Observability tool for collection, storage, and query | Offers options for scalability and performance tuning but may introduce additional complexity |
| Straightforward model exposed in text | Uses a more intricate series of three models with a binary format for transmission |
| Provides metric collection, storage, and query using a scraping system | Provides collection without storage or query |
| Uses PromQL for querying | - |
| Mature and stable system with its own ecosystem of exporters, integrations, and alerting mechanisms | Supports multiple telemetry backends, including Prometheus |
| Known for simplicity, scalability, and efficiency | Benefits from a larger project but may require more production exposure for stability |
Decision criteria and tradeoffs
- Prometheus covers metrics only and is typically visualized with Grafana.
- OpenTelemetry covers traces, logs, and metrics and is typically paired with backends such as Zipkin or Jaeger.
The main question is which signals we actually need: if the monitoring system only needs metrics, Prometheus is sufficient; if we also need tracing and logs, OpenTelemetry is the better fit.