Investigating the Side-Effects of AWS Load Balancer Controller Timeouts Due to API Server Throttling
When the AWS Load Balancer (LB) Controller in an EKS cluster times out (e.g., due to API server throttling or communication issues), it cannot properly create or update the Application Load Balancer and its associated target groups. Here’s why that directly impacts your main application pods:
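A quick first check is the controller's own logs, which surface throttling and timeout errors directly. This is a sketch assuming the default Helm install location (`kube-system` namespace, deployment named `aws-load-balancer-controller`); adjust both to match your cluster.

```shell
# Scan the last hour of controller logs for throttling/timeout symptoms.
# Namespace and deployment name assume the default install; adjust as needed.
kubectl logs -n kube-system deployment/aws-load-balancer-controller \
  --since=1h | grep -iE 'throttl|timeout|rate limit'
```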
Out-of-Date Load Balancer Rules
The LB Controller periodically reconciles Kubernetes Ingress or Service resources with the AWS ALB. If it times out and cannot complete this reconciliation, any new updates—like pod scaling events, changed ports, or new host/path rules—do not reach the ALB configuration.
Result: Pods might be running and healthy, but the ALB has stale or incomplete routing rules, causing traffic to go to the wrong targets or fail altogether.
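Failed reconciliations usually show up as Warning events on the Ingress itself, so inspecting those events is a reasonable way to confirm the ALB config is stale. In this sketch, `my-ingress` and `my-namespace` are placeholders for your own resources.

```shell
# The Events section at the bottom of the output shows reconcile failures.
kubectl describe ingress my-ingress -n my-namespace

# Or filter warning events for the Ingress directly:
kubectl get events -n my-namespace \
  --field-selector involvedObject.name=my-ingress,type=Warning
```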
Incomplete Target Group Registrations
Each application pod typically registers with an ALB target group. If the LB Controller can’t fully talk to the Kubernetes API (or is throttled and times out), it may fail to add newly launched pods into the target group.
Result: Even though the pods are healthy, the ALB remains unaware of them. Users hitting the load balancer could receive 503/504 errors because the LB has no valid targets or relies on outdated health checks.
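One way to confirm this mismatch is to compare what the ALB has registered against what Kubernetes considers ready. The target group ARN and resource names below are placeholders; the controller's `TargetGroupBinding` custom resources are a convenient place to find the real ARN.

```shell
# Find the target groups the controller manages:
kubectl get targetgroupbindings -A

# List registered targets and their health as the ALB sees them
# (the ARN here is a placeholder):
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/my-tg/id

# List the pod endpoints Kubernetes considers ready, for comparison:
kubectl get endpoints my-service -n my-namespace
```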
Misleading Health Checks
ALBs periodically check the health of pods (via node ports in instance target mode, or pod IPs directly in IP target mode). If the LB Controller times out while updating health-check paths or intervals, the ALB might mislabel healthy pods as unhealthy.
Result: Traffic is dropped or routed only to a smaller subset of pods, leading to uneven load distribution and possible overload of the pods that remain in the rotation.
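The health-check settings the controller pushes to the ALB come from Ingress annotations. The annotation keys below are the controller's own; the Ingress name, namespace, and values are placeholders for illustration.

```shell
# Set the health-check path and interval the controller should
# configure on the ALB target group (placeholder values):
kubectl annotate ingress my-ingress -n my-namespace \
  alb.ingress.kubernetes.io/healthcheck-path=/healthz \
  alb.ingress.kubernetes.io/healthcheck-interval-seconds=15 \
  --overwrite
```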
Deployment or Scaling Disruptions
During rolling updates or horizontal scaling, the LB Controller is responsible for ensuring the ALB routes traffic only to the updated pods. If these changes can’t be communicated (due to LB Controller timeouts), the application’s new replicas may never receive traffic, or old replicas might keep receiving traffic after they should have been decommissioned.
Result: The application experiences downtime or inconsistent availability despite the correct number of pods running.
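One mitigation for this failure mode is the controller's pod readiness gate feature, which makes rollouts wait until new pods are actually registered and healthy in the target group before old pods are terminated. It is enabled per namespace with a label; `my-namespace` is a placeholder.

```shell
# Opt the namespace into the controller's readiness-gate injection:
kubectl label namespace my-namespace \
  elbv2.k8s.aws/pod-readiness-gate-inject=enabled
```

With this label in place, the controller injects a readiness gate into new pods in that namespace, so a rolling update cannot outpace target group registration.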
User-Facing Errors
If the ALB is out of sync with the actual pod states, end-users might encounter higher latencies, 4xx/5xx errors, or 504 Gateway Timeout responses.
Result: Overall degraded user experience and potential loss of revenue or trust, especially if the system is consumer-facing.
In essence, the LB Controller is the link between your Kubernetes services/pods and the AWS ALB. Timeouts break this link, leaving your load balancer unaware of pod changes and thus unable to correctly route traffic to healthy, available instances.
You can find an article on how I troubleshot a recent issue of this nature here.