Optimize OSDU Helm Charts for Production Deployment - Resource Efficiency and Stability

Overview

This MR implements comprehensive optimizations for the OSDU Azure Helm charts to address resource over-provisioning, false HPA scaling during startup, and startup reliability issues. The changes were tested and validated on Standard_E4s_v3 Azure VMs with cluster autoscaling configured for a minimum of 14 and a maximum of 28 nodes.

Issues Addressed

1. Resource Over-Provisioning

  • Problem: Default memory requests (5Gi) and limits (8Gi) were significantly over-allocated for Spring Boot services
  • Impact: Cluster utilization at ~40%, high costs, poor pod density (8 pods/node)
  • Root Cause: Generic resource defaults not optimized for JVM-based microservices

2. HPA False Scaling During Startup

  • Problem: Spring Boot services consume 1000m+ CPU for 60-80s during startup, triggering unnecessary HPA scaling
  • Impact: 15-20 false scaling events per deployment, resource waste, operational complexity
  • Root Cause: 60% CPU threshold too sensitive for Spring Boot startup patterns

3. Inconsistent Probe Timing

  • Problem: Probe delays varied widely (40s-200s) across services, causing startup failures
  • Impact: SIGTERM restarts from failed liveness probes, deployment delays, service unavailability
  • Root Cause: Probe delays not aligned with actual Spring Boot startup times

4. Inefficient OPA Service Configuration

  • Problem: OPA service over-provisioned (3 replicas, 2GB memory) for lightweight policy engine
  • Impact: Unnecessary resource consumption for a non-critical service

Changes Implemented

Resource Optimization (osdu-partition_base/values.yaml)

# Before → After
defaultCpuRequests: "0.5" → "600m"     (+20% for startup stability)
defaultMemoryRequests: "5Gi" → "1Gi"   (-80% based on actual usage)
defaultCpuLimits: "1" → "1200m"        (+20% to prevent throttling)
defaultMemoryLimits: "8Gi" → "3Gi"     (-62.5% while preventing OOM)
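
For reference, these defaults render into a standard Kubernetes resources block on each service container, roughly as in the sketch below (the template wiring is assumed, not copied from the chart):

# Rendered container resources (illustrative)
resources:
  requests:
    cpu: 600m
    memory: 1Gi
  limits:
    cpu: 1200m
    memory: 3Gi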

HPA Configuration Optimization

# Optimized for Spring Boot startup behavior
cpuAvgUtilization: 60% → 75%                    # Prevent startup-induced scaling
maxReplicas: 20 → 10                            # Cost control
scaleDownStabilizationSeconds: 300 → 450        # Conservative scale-down
scaleUpStabilizationSeconds: 120 → 180          # Allow startup completion
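
Expressed as a standard autoscaling/v2 object, the new defaults correspond roughly to the following sketch; the HorizontalPodAutoscaler fields are standard Kubernetes, while the target name and minReplicas are illustrative rather than taken from _hpa.yaml:

# Illustrative HPA rendered with the new defaults
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: legal                          # example service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: legal
  minReplicas: 1                       # illustrative; unchanged by this MR
  maxReplicas: 10                      # was 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75       # was 60; tolerates startup CPU spikes
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 180  # was 120; lets startup complete
    scaleDown:
      stabilizationWindowSeconds: 450  # was 300; conservative scale-down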

Probe Timing Standardization

  • Legal Service: 40s → 60s (accounts for CPU throttling during startup)
  • Policy Service: 180s → 60s (previous delay was over-conservative)
  • Partition Service: 45s → 60s (startup consistency)
  • Notification/Storage/Schema: increased to 100s (higher-complexity services); probe shape sketched below
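
The standardized delays map onto probe definitions along these lines; the Spring Boot actuator paths and port are assumptions, since the actual endpoints vary per service:

# Illustrative probe configuration (60s delay; 100s for notification/storage/schema)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # assumed actuator endpoint
    port: 8080                         # assumed service port
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # assumed actuator endpoint
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10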

OPA Service Right-Sizing

replicaCount: 3 → 2                             # Cost efficiency
cpu: 500m → 50m (requests), 2000m → 200m (limits)
memory: 2Gi → 128Mi (requests), 8Gi → 512Mi (limits)
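
In values form, the right-sized OPA configuration comes out approximately as follows (the key layout is illustrative; osdu-opa/ may structure it differently):

# Illustrative OPA values after right-sizing
replicaCount: 2
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 512Mi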

Validation Results

Resource Utilization

  • Before: 40% cluster utilization, 8 pods/node average
  • After: 75% cluster utilization, 12 pods/node average
  • Memory Efficiency: 80% reduction in baseline allocation
  • Cost Impact: 60-70% reduction in resource costs

HPA Behavior

  • Before: 15-20 false scaling events per deployment
  • After: 1-2 legitimate scaling events per deployment
  • Startup Scaling: Eliminated startup-induced false scaling

Service Reliability

  • Before: 5-8 SIGTERM events during core service deployment
  • After: 0 SIGTERM events during normal operations
  • Probe Success: 100% success rate with optimized delays

Cluster Capacity

  • Peak Usage: 16-20 nodes during full service scaling
  • Scaling Headroom: 40% additional capacity for traffic spikes
  • Autoscaler Efficiency: Responsive scaling without over-provisioning

Testing Environment

Infrastructure:

  • VM Type: Standard_E4s_v3 (4 vCPU, 32 GiB RAM)
  • Cluster Scaling: Min 14 nodes, Max 28 nodes
  • Node Pools: services + internal for workload separation
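
Workload separation between the two pools follows the usual nodeSelector pattern; a minimal sketch, assuming the AKS pools expose the standard agentpool label:

# Illustrative pool pinning for a service workload
nodeSelector:
  agentpool: services    # use "internal" for internal workloads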

Files Modified

  • osdu-azure/osdu-partition_base/values.yaml: Core resource limit optimizations
  • osdu-helm-library/templates/_hpa.yaml: HPA defaults for Spring Boot services
  • osdu-azure/legal/values.yaml: Probe timing standardization
  • osdu-azure/policy/values.yaml: Probe timing standardization
  • osdu-azure/partition/values.yaml: Probe timing + documentation
  • osdu-azure/osdu-opa/: Resource right-sizing + node pool support
  • osdu-helm-library/README.md: Comprehensive HPA documentation

Backward Compatibility

  • All changes maintain existing functionality
  • No breaking changes to APIs or service behavior
  • Resource limits ensure security boundaries are preserved
  • Configuration values can be overridden per environment
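
For example, an environment that needs more memory headroom can override the new defaults in its own values file; the file name is illustrative, the keys are those listed above:

# values-prod-override.yaml (illustrative)
defaultMemoryRequests: "2Gi"    # raise the new 1Gi default
defaultMemoryLimits: "4Gi"      # raise the new 3Gi default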

Risk Assessment

Low Risk:

  • Changes based on observed production patterns
  • Conservative resource limits prevent resource starvation
  • Gradual scaling policies prevent service disruption
  • Rollback available through standard Helm procedures
