ADR - Metrics and Monitoring
Decision Title
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The GC Team encountered challenges related to multi-tenancy in its bare-metal monitoring infrastructure. Despite efforts to pinpoint specific issues using the /readiness_check and /liveness_check endpoints, we had difficulty isolating problems to particular partitions.
To address this issue effectively, we propose exploring the actuator/metrics and actuator/prometheus endpoints for deeper insights into system performance and tenant-specific metrics.
This proposal aims to improve our ability to diagnose and troubleshoot multi-tenancy issues by leveraging more comprehensive metrics provided by the Actuator framework. By doing so, we can gain better visibility into the behavior of individual partitions and enhance our overall monitoring capabilities.
Decision
First, we will integrate Spring Boot Actuator and expose the health and Prometheus endpoints. The standard liveness and readiness probes (/actuator/health/liveness and /actuator/health/readiness) will serve as common endpoints replacing the old /liveness_check and /readiness_check ones.
management.endpoints.web.exposure.include=health, prometheus
management.health.probes.enabled=true
Next, we will add a Prometheus MeterRegistry implementation (micrometer-registry-prometheus) so that custom and common metrics can be recorded in a Prometheus-compatible format.
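As a minimal sketch of that wiring (the class name PartitionMetrics and the metric name partition_requests are illustrative, not part of the existing codebase), a Spring component can receive the auto-configured MeterRegistry and register a common, partition-tagged counter:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PartitionMetrics {

    private final MeterRegistry meterRegistry;

    public PartitionMetrics(MeterRegistry meterRegistry) {
        // With micrometer-registry-prometheus on the classpath, Spring Boot
        // auto-configures a PrometheusMeterRegistry and injects it here.
        this.meterRegistry = meterRegistry;
    }

    // Example of a common metric: count incoming requests per data partition.
    public void countRequest(String partitionId) {
        Counter.builder("partition_requests")
                .tag("dataPartitionId", partitionId)
                .register(meterRegistry)
                .increment();
    }
}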
The code snippet below shows how such metrics improve troubleshooting: by capturing HTTP error responses and tagging them with the requesting partition ID, we gain valuable insight into per-tenant behavior.
// Inside the existing exception-handling advice; metricRegistry is the injected MeterRegistry.
@ExceptionHandler(AppException.class)
@ResponseBody
public ResponseEntity<Object> handleInternalError(AppException e) {
    // Read the tenant identifier from the current request, if one is bound to this thread.
    ServletRequestAttributes attributes =
            (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
    if (attributes != null) {
        HttpServletRequest request = attributes.getRequest();
        String partitionId = request.getHeader("data-partition-id");
        countHttpStatus(e.getError().getCode(), partitionId);
    }
    return this.getErrorResponse(e);
}

private void countHttpStatus(int status, String partitionId) {
    // Count the error, tagged with the partition and the HTTP status code,
    // so it can be filtered and grouped in Prometheus and Grafana.
    Counter.builder("http_error_response")
            .tag("headerValue", partitionId)
            .tag("httpCode", String.valueOf(status))
            .register(metricRegistry)
            .increment();
}
This approach ties each error response to the partition that produced it, which makes tenant-specific failures much easier to troubleshoot.
In Grafana, we can visualize these metrics using Prometheus as the data source. We can create dashboards that display the count of HTTP error responses, segmented by partition ID and HTTP status code. This visualization enables us to quickly identify patterns and potential issues within our system.
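As an illustration, such a panel could be driven by a PromQL query along the lines of sum by (headerValue, httpCode) (increase(http_error_response_total[5m])), which plots the number of error responses per partition and status code over the last five minutes. The exact metric and label names here are an assumption: they follow Micrometer's Prometheus naming conventions (for example, counters are exported with a _total suffix and tag keys may be rewritten to snake_case), so the query should be adjusted to the names actually exposed on /actuator/prometheus.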
Moreover, the Prometheus endpoint provides built-in metrics that are useful for investigating issues:
- spring_security_authorizations_seconds_count{error="AccessDeniedException"}: counts authorization attempts that ended in an "AccessDeniedException" and can therefore help identify potential DDoS attacks or unauthorized access attempts (see the example query after this list).
- jvm_memory_committed_bytes: reports memory usage in the Java Virtual Machine (JVM), covering both heap and non-heap areas such as "G1 Survivor Space," "G1 Old Gen" (Old Generation), "Metaspace," "CodeHeap 'non-nmethods'," "G1 Eden Space," "Compressed Class Space," and "CodeHeap 'non-profiled nmethods'." Each series represents the amount of memory committed (allocated) in bytes for a specific memory area, which makes these metrics valuable for spotting memory-related problems such as leaks or excessive consumption.
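As an example of how the security metric can be used (the exact PromQL is an assumption and should be adjusted to the series actually exported), a sudden rise in increase(spring_security_authorizations_seconds_count{error="AccessDeniedException"}[5m]) would point to a burst of rejected requests and is a good candidate for an alert rule.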
Rationale
Consequences
Some benefits:
- Cost Savings: By moving away from a cloud monitoring system, we can potentially reduce costs associated with cloud service subscriptions or usage-based pricing models. A cloud-agnostic solution may offer more flexibility and cost-effective pricing options.
- Flexibility and Independence: A cloud-agnostic monitoring solution lets us monitor resources and applications across different cloud providers or on-premises environments. This flexibility reduces vendor lock-in and gives us more control over our monitoring infrastructure.
- Customization and Control: Owning the metrics pipeline lets us proactively monitor system health, identify anomalies, and take timely corrective action to keep performance and reliability at the expected level.
When to revisit
Tradeoff Analysis - Input to decision
Alternatives and implications
Prometheus vs. OpenTelemetry solution
Choosing between Prometheus and OpenTelemetry depends on factors such as whether we need to capture traces and logs in addition to metrics, whether we prefer a mature, stable tool over a newer project that is still gaining production exposure, and whether we need a multi-step routing and transformation pipeline.
The general comparison
| Prometheus | OpenTelemetry |
| --- | --- |
| Observability tool for collection, storage, and query | Offers options for scalability and performance tuning but may introduce additional complexity |
| Straightforward model exposed in text | Uses a more intricate series of three models with a binary format for transmission |
| Provides metric collection, storage, and query using a scraping system | Provides collection without storage or query |
| Uses PromQL for querying | - |
| Mature and stable system with its own ecosystem of exporters, integrations, and alerting mechanisms | Supports multiple telemetry backends, including Prometheus |
| Known for simplicity, scalability, and efficiency | Benefits from a larger project but may require more production exposure for stability |
Decision criteria and tradeoffs
- Prometheus covers metrics only and is typically visualized with Grafana.
- OpenTelemetry covers traces, logs, and metrics and is typically paired with backends such as Zipkin or Jaeger.
The main question is which signals we actually need: if the monitoring system only needs metrics, Prometheus is sufficient; if we also need tracing and logs, OpenTelemetry is the better fit.