Optimize OSDU Helm Charts for Production Deployment - Resource Efficiency and Stability
Overview
This MR implements comprehensive optimizations for OSDU Azure Helm charts to address resource inefficiency, HPA false scaling, and startup reliability issues. These changes were tested and validated on Standard_E4s_v3 Azure VMs with cluster autoscaling configured for Min: 14 nodes, Max: 28 nodes.
Issues Addressed
1. Resource Over-Provisioning
- Problem: Default memory requests (5Gi) and limits (8Gi) were significantly over-allocated for Spring Boot services
- Impact: Cluster utilization at ~40%, high costs, poor pod density (8 pods/node)
- Root Cause: Generic resource defaults not optimized for JVM-based microservices
2. HPA False Scaling During Startup
- Problem: Spring Boot services consume 1000m+ CPU for 60-80s during startup, triggering unnecessary HPA scaling
- Impact: 15-20 false scaling events per deployment, resource waste, operational complexity
- Root Cause: 60% CPU threshold too sensitive for Spring Boot startup patterns
3. Inconsistent Probe Timing
- Problem: Probe delays varied widely (40s-200s) across services, causing startup failures
- Impact: SIGTERM restarts, deployment delays, service unavailability
- Root Cause: Probe delays not aligned with actual Spring Boot startup times
4. Inefficient OPA Service Configuration
- Problem: OPA service over-provisioned (3 replicas, 2GB memory) for lightweight policy engine
- Impact: Unnecessary resource consumption for non-critical service
Changes Implemented
Resource Optimization (osdu-partition_base/values.yaml)
# Before → After
defaultCpuRequests: "0.5" → "600m" (+20% for startup stability)
defaultMemoryRequests: "5Gi" → "1Gi" (-80% based on actual usage)
defaultCpuLimits: "1" → "1200m" (+20% to prevent throttling)
defaultMemoryLimits: "8Gi" → "3Gi" (-62.5% while preventing OOM)
HPA Configuration Optimization
# Optimized for Spring Boot startup behavior
cpuAvgUtilization: 60% → 75% # Prevent startup-induced scaling
maxReplicas: 20 → 10 # Cost control
scaleDownStabilizationSeconds: 300s → 450s # Conservative scale-down
scaleUpStabilizationSeconds: 120s → 180s # Allow startup completion
Probe Timing Standardization
- Legal Service: 40s → 60s (accounts for CPU throttling)
- Policy Service: 180s → 60s (was over-conservative)
- Partition Service: 45s → 60s (startup consistency)
- Notification/Storage/Schema: Increased to 100s (higher complexity services)
OPA Service Right-Sizing
replicaCount: 3 → 2 # Cost efficiency
cpu: 500m → 50m (requests), 2000m → 200m (limits)
memory: 2Gi → 128Mi (requests), 8Gi → 512Mi (limits)
Validation Results
Resource Utilization
- Before: 40% cluster utilization, 8 pods/node average
- After: 75% cluster utilization, 12 pods/node average
- Memory Efficiency: 80% reduction in baseline allocation
- Cost Impact: 60-70% reduction in resource costs
HPA Behavior
- Before: 15-20 false scaling events per deployment
- After: 1-2 legitimate scaling events per deployment
- Startup Scaling: Eliminated startup-induced false scaling
Service Reliability
- Before: 5-8 SIGTERM events during core service deployment
- After: 0 SIGTERM events during normal operations
- Probe Success: 100% success rate with optimized delays
Cluster Capacity
- Peak Usage: 16-20 nodes during full service scaling
- Scaling Headroom: 40% additional capacity for traffic spikes
- Autoscaler Efficiency: Responsive scaling without over-provisioning
Testing Environment
Infrastructure:
- VM Type: Standard_E4s_v3 (4 vCPU, 32 GiB RAM)
- Cluster Scaling: Min 14 nodes, Max 28 nodes
- Node Pools: services + internal for workload separation
Files Modified
| File | Primary Changes |
|---|---|
osdu-azure/osdu-partition_base/values.yaml |
Core resource limit optimizations |
osdu-helm-library/templates/_hpa.yaml |
HPA defaults for Spring Boot services |
osdu-azure/legal/values.yaml |
Probe timing standardization |
osdu-azure/policy/values.yaml |
Probe timing standardization |
osdu-azure/partition/values.yaml |
Probe timing + documentation |
osdu-azure/osdu-opa/ |
Resource right-sizing + nodepool support |
osdu-helm-library/README.md |
Comprehensive HPA documentation |
Backward Compatibility
- All changes maintain existing functionality
- No breaking changes to APIs or service behavior
- Resource limits ensure security boundaries are preserved
- Configuration values can be overridden per environment
Risk Assessment
Low Risk:
- Changes based on observed production patterns
- Conservative resource limits prevent resource starvation
- Gradual scaling policies prevent service disruption
- Rollback available through standard Helm procedures