Investigating AWS Load Balancer Controller Timeouts Due to API Server Throttling

While investigating timeout errors in the AWS Load Balancer Controller logs, I observed connection failures when the controller attempted to communicate with the Kubernetes API server. Upon further inspection of the API server logs in Amazon CloudWatch, I noticed that a majority of pods in the kube-system namespace were receiving HTTP 429 (Too Many Requests) errors.

Given the nature of the errors and their widespread occurrence across system components, I suspected that the issue was related to API Priority and Fairness (APF), which governs how the API server classifies, queues, and rate-limits requests from different clients and workloads.
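
APF classifies incoming requests through FlowSchema objects and limits each class through PriorityLevelConfiguration objects; listing them shows how the cluster's clients are being grouped and throttled:

# kubectl get flowschemas
# kubectl get prioritylevelconfigurations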

The objective was to identify which workloads were generating excessive API requests, understand their impact on the API server, communicate findings to the customer, and implement an appropriate resolution to restore cluster stability.

1. Collected AWS Load Balancer Controller Logs

I started by querying CloudWatch Logs Insights for timeout errors in the AWS Load Balancer Controller logs:

fields @timestamp, @message
| filter @message like "timeout"
| sort @timestamp desc
| limit 50

This confirmed that the Load Balancer Controller was experiencing repeated i/o timeout errors when attempting to reach the API server.

2. Checked API Server Logs for 429 Errors

Since 429 errors indicate API throttling, I ran the following CloudWatch query to check for rejected requests:

fields @timestamp, requestURI, status, userAgent, verb
| filter status = 429
| sort @timestamp desc
| limit 100

The query results showed that a majority of the rejected API calls were originating from custom workloads deployed by the customer within the kube-system namespace.
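
To make the attribution explicit, the same rejected requests can be grouped by client. A follow-up aggregation along these lines (using the same log fields) ranks the noisiest user agents:

fields @timestamp, userAgent
| filter status = 429
| stats count(*) as throttled_requests by userAgent
| sort throttled_requests desc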

3. Identified the Offending Workloads Using Kubernetes Metrics

To pinpoint which workloads were consuming excessive API server capacity, I checked the API server's API Priority and Fairness metrics for rejected requests:

# kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total | sort -k2 -nr | head -10

The rejected-request counters, combined with the userAgent breakdown from step 2, revealed that several custom workloads (not native system components) were generating a high volume of API requests.
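
For a live view of how APF was queuing and rejecting these requests, the API server's flow-control debug endpoints can also be queried (their availability and output format depend on the Kubernetes version):

# kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
# kubectl get --raw /debug/api_priority_and_fairness/dump_requests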

4. Investigated Why These Workloads Were Making Excessive API Calls

I inspected the logs of the suspect workloads to understand their API usage patterns:

# kubectl logs <pod-name> -n kube-system

I found that these workloads were polling the API server too frequently, leading to excessive API request bursts that exceeded the allocated limits.

5. Explained the Issue to the Customer and Recommended Disabling Unnecessary Workloads

I engaged with the customer, explaining that their workloads in the kube-system namespace were making excessive API calls, causing API server throttling. I recommended either optimizing their API calls or disabling non-essential workloads.

6. Disabled the Problematic Workloads to Restore Stability

To immediately mitigate the issue, I temporarily scaled down or removed the offending pods:
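
For example, assuming the workloads ran as Deployments (the resource name below is a placeholder):

# kubectl scale deployment <deployment-name> -n kube-system --replicas=0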

7. Verified Fix and Monitored API Server Stability

After disabling the problematic workloads, I monitored API server performance using:

fields @timestamp, requestURI, status, userAgent, verb
| filter status = 429
| stats count(*) as throttled_requests by bin(5m)

The 429 errors dropped significantly, confirming that the API server was no longer experiencing excessive throttling.

I also provided the following longer-term recommendations to resolve the issue without disabling the workloads:

  • Optimize API Calls to Reduce Load

    • They should reduce unnecessary API calls by optimizing their workload behavior:

      • Reduce polling frequency:

        • If workloads are making frequent API calls (e.g., every second), increase the polling interval to a more reasonable value.

        • Instead of aggressive API polling, use watchers or informers.

        • Example: Instead of kubectl get pods -n kube-system every second, use watch:

          • kubectl get pods -n kube-system --watch

      • Use client-side caching:

        • Instead of repeatedly querying the API server for the same data, cache results and only request updates when necessary (see the sketch below).
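
The right implementation depends on how the customer's workloads are written. If they are Go programs built with client-go, a shared informer replaces periodic polling with a single watch connection plus a local cache; reads are then served from that cache instead of the API server. The following is a minimal sketch under that assumption (namespace and resync interval are illustrative):

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; the pod's service account needs list/watch on pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer keeps a local cache in sync over a single watch,
	// instead of hitting the API server on every poll.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 30*time.Minute, informers.WithNamespace("kube-system"))
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Printf("pod added: %s\n", obj.(*corev1.Pod).Name)
		},
		UpdateFunc: func(_, newObj interface{}) {
			fmt.Printf("pod updated: %s\n", newObj.(*corev1.Pod).Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	// Reads are served from the local cache (lister), not the API server.
	pods, err := factory.Core().V1().Pods().Lister().Pods("kube-system").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("cached pods in kube-system: %d\n", len(pods))

	select {} // keep running; further updates arrive as watch events
}

With this pattern the API server sees one initial LIST and a long-lived WATCH per resource type, rather than a fresh LIST on every polling tick, which is exactly the behavior APF is designed to reward.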