Enable autoscaling
Status
- Proposed
- Under review
- Approved
- Retired
Context & Scope
The Azure OSDU AKS deployment and the services deployed on it do not currently use autoscaling. This is due to limitations in the Application Gateway Ingress Controller (AGIC), which populates the Application Gateway backend pools with Pod IPs so that requests can be routed.
Application Gateway is slow to update this information when Pod IPs change, for example during autoscaling, which can cause high error rates on client requests.
This is a simplified deployment view of the current AKS and Application Gateway setup.
To enable autoscaling we therefore need to replace AGIC with a different ingress controller. The Istio Ingress Controller is already used for east-west traffic within the cluster, so we want to extend it to handle north-south traffic as well.
This will allow us to enable cluster autoscaling and Horizontal Pod Autoscaling (HPA) of services.
Note that this is not the final solution: the OSDU infrastructure will be forked to redesign the solution for the needs of the PaaS deployment on Azure. The time frame for that work is approximately 6 months, so this should be seen as a temporary solution to enable autoscaling.
Trade-off Analysis
One approach is to expose the Istio Ingress Controller directly to external traffic. This simplifies the architecture and enables TLS for in-cluster traffic, whereas today TLS terminates at Application Gateway and plain HTTP is used from there on.
However, this would mean replicating in the Istio Ingress Controller both the WAF and the telemetry that Application Gateway provides today. We would also need to update the monitoring solutions to use the new telemetry.
Although possible, this is all extra work that will take time. Given that this is a temporary solution, an optimal implementation is not necessary as long as we can enable autoscaling.
Therefore, a compromise solution in which we keep Application Gateway and forward requests to the Istio Ingress Controller will be simpler to implement and will still achieve our goal. We would then keep the Istio Ingress Controller endpoint private and not expose it to external traffic.
We are also proposing to add a separate node pool to the AKS cluster, because the system node pool is currently used for the deployed services. Running user services on a dedicated node pool is the recommended best practice, as it prevents them from compromising critical system resources. This new node pool will be configured to autoscale.
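As a rough illustration of the node pool proposal, the sketch below assembles an `az aks nodepool add` invocation for a user-mode pool with the cluster autoscaler enabled. The resource group, cluster name, pool name, and count bounds are hypothetical placeholders, and the actual deployment may well manage the pool through infrastructure-as-code rather than the CLI.

```python
import shlex


def nodepool_add_command(resource_group, cluster_name, pool_name, min_count, max_count):
    """Build an `az aks nodepool add` command for an autoscaled user node pool.

    All argument values are supplied by the caller; nothing here is taken
    from the real OSDU deployment.
    """
    if not 0 <= min_count <= max_count:
        raise ValueError("expected 0 <= min_count <= max_count")
    return [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--mode", "User",                  # keep user services off the system pool
        "--enable-cluster-autoscaler",     # let AKS add and remove nodes
        "--min-count", str(min_count),
        "--max-count", str(max_count),
    ]


if __name__ == "__main__":
    # Hypothetical names and bounds, for illustration only.
    cmd = nodepool_add_command("example-rg", "example-aks", "services", 2, 10)
    print(shlex.join(cmd))
```

Suitable minimum and maximum counts depend on the node limits listed under future work.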
mTLS is required between the Istio Ingress Controller and Application Gateway to ensure that requests can reach the cluster only through Application Gateway.
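On the Istio side, the mTLS requirement could be expressed with a `Gateway` resource whose TLS mode is `MUTUAL`, so that the ingress rejects callers that do not present a trusted client certificate. The following is a minimal sketch that builds such a manifest as a Python dictionary; the gateway name, host, and secret name are hypothetical, and the Application Gateway side of the handshake is not covered.

```python
def ingress_gateway_manifest(name, hosts, credential_name):
    """Return an Istio Gateway manifest that requires client certificates.

    `credential_name` refers to a Kubernetes secret (assumed to exist) that
    holds the server certificate, its key, and the CA used to verify clients.
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Gateway",
        "metadata": {"name": name},
        "spec": {
            # Default label on the Istio ingress gateway pods; adjust if the
            # deployment uses a different selector.
            "selector": {"istio": "ingressgateway"},
            "servers": [
                {
                    "port": {"number": 443, "name": "https", "protocol": "HTTPS"},
                    # MUTUAL makes the gateway verify the caller's certificate.
                    "tls": {"mode": "MUTUAL", "credentialName": credential_name},
                    "hosts": list(hosts),
                }
            ],
        },
    }


if __name__ == "__main__":
    import json

    # Hypothetical values, for illustration only.
    manifest = ingress_gateway_manifest(
        "osdu-ingress", ["osdu.example.com"], "appgw-mtls-credential"
    )
    print(json.dumps(manifest, indent=2))
```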
After this change is applied, each service will need its own HPA, configured according to that service's needs.
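To make the per-service HPA configuration concrete, the sketch below builds an `autoscaling/v2` HorizontalPodAutoscaler manifest for a CPU utilization target and reproduces the basic replica calculation that Kubernetes documents, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The service name and thresholds are hypothetical, and the calculation deliberately ignores the tolerance band and stabilization behaviour of the real controller.

```python
import math


def hpa_manifest(deployment, min_replicas, max_replicas, cpu_utilization):
    """Return an autoscaling/v2 HPA manifest targeting a Deployment's CPU usage."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_utilization,
                        },
                    },
                }
            ],
        },
    }


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical service and thresholds, for illustration only.
    hpa = hpa_manifest("example-service", 2, 10, 60)
    print(hpa["metadata"]["name"])
    print(desired_replicas(4, 90, 60, 2, 10))
```

With four replicas averaging 90% CPU against a 60% target, the simplified rule asks for six replicas; the same load against a much higher target would be held at the configured minimum.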
Below is the simplified deployment view of the new solution.
Decision
- AGIC will be removed
- Istio ingress will be used to route requests to the services deployed in AKS
- Application Gateway will remain, with one backend pool per service. Each backend pool forwards requests to the same Istio Ingress Controller
- A new node pool will be added to the AKS deployment. Cluster autoscaling will be enabled on this node pool. All services will be deployed to this node pool
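Because every Application Gateway backend pool forwards to the same Istio Ingress Controller, the per-service routing has to happen inside Istio. One way this could look is a `VirtualService` per service that matches on a path prefix. The sketch below builds such a manifest as a Python dictionary; the service name, path prefix, port, and gateway name are all hypothetical and are not taken from the OSDU deployment.

```python
def virtual_service_manifest(service, path_prefix, gateway, port):
    """Return an Istio VirtualService routing one path prefix to one service."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": ["*"],
            # Bind the route to the shared ingress gateway.
            "gateways": [gateway],
            "http": [
                {
                    "match": [{"uri": {"prefix": path_prefix}}],
                    "route": [
                        {
                            "destination": {
                                "host": service,
                                "port": {"number": port},
                            }
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical services and prefixes, for illustration only.
    routes = {"service-a": "/api/service-a/", "service-b": "/api/service-b/"}
    for name, prefix in routes.items():
        vs = virtual_service_manifest(name, prefix, "osdu-ingress", 80)
        print(name, vs["spec"]["http"][0]["match"][0]["uri"]["prefix"])
```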
Future work / Out of scope
- Understand node limits for the autoscaled cluster
- Understand the HPA configuration needed for individual services
- Optimize the configuration of the cluster autoscaler
- Autoscale the cluster for burst traffic scenarios