Ensuring 99.99% Uptime with EKS: A Comprehensive Failover Strategy
In one of my previous projects, we had to design a highly available system on Amazon EKS (spanning multiple Availability Zones) backed by a PostgreSQL database. Our key metrics were:
Recovery Time Objective (RTO): We aimed for less than 2 minutes of downtime in the worst-case scenario.
Recovery Point Objective (RPO): We aimed for zero to minimal data loss, ideally backed by near real-time replication.
To achieve these objectives, we combined active-passive failover across regions with multi-AZ resilience within a primary region. Let me break down the approach:
EKS Multi-AZ Setup
Our EKS cluster was deployed in us-east-1 with worker nodes spread across us-east-1a, us-east-1b, and us-east-1c.
We leveraged Kubernetes Deployments with PodAntiAffinity (or topology spread constraints) to ensure pods were evenly distributed across the AZs. If us-east-1a lost capacity or experienced an outage, pods running in the other AZs could keep handling traffic.
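To give a concrete picture, a Deployment using topology spread constraints looks roughly like the sketch below; the web-api name, labels, and image are placeholders rather than our actual services.

```yaml
# Sketch of a Deployment that spreads replicas evenly across AZs.
# "web-api", the labels, and the image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread across us-east-1a/1b/1c
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: web-api:latest    # placeholder image
          ports:
            - containerPort: 8080
```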
Database Failover (Intra-Region)
We used Amazon RDS for PostgreSQL in a Multi-AZ configuration. This provided a synchronous standby in a separate AZ, ensuring minimal data loss (near-zero RPO).
If the primary database (in AZ 1a) failed, RDS automatically promoted the standby (in AZ 1b) within a minute or so, which kept the system within our RTO target.
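As a rough CloudFormation-style sketch (identifiers, instance sizes, and the secret reference are illustrative placeholders, not our production values), the Multi-AZ setup boils down to a single flag on the DB instance:

```yaml
# Minimal sketch of an RDS PostgreSQL instance with a synchronous
# Multi-AZ standby. All names and values are placeholders.
Resources:
  AppDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.r5.large
      AllocatedStorage: "100"
      MultiAZ: true                      # synchronous standby in a second AZ
      MasterUsername: appadmin
      MasterUserPassword: '{{resolve:secretsmanager:app-db-secret:SecretString:password}}'
      BackupRetentionPeriod: 7           # automated backups, enables PITR
      DeletionProtection: true
```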
Cross-Region Failover (Active-Passive)
To mitigate full regional outages, we deployed a standby EKS cluster in us-west-2. It had the same Helm charts, Kubernetes manifests, and microservices, but scaled down to a minimal footprint (just enough to accept traffic if needed).
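The footprint difference was purely configuration; a hypothetical per-region Helm values override (the key names depend on the actual charts, so treat these as placeholders) might look like:

```yaml
# Hypothetical values-us-west-2.yaml for the standby cluster.
# Key names are chart-specific placeholders.
replicaCount: 1          # minimal footprint, ready to scale up on failover
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
resources:
  requests:
    cpu: 100m
    memory: 128Mi
```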
Our RDS database in us-east-1 replicated asynchronously to an RDS read replica in us-west-2. If us-east-1 suffered a catastrophic failure, we could promote the read replica to primary.
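In CloudFormation terms, the DR-region replica is simply an RDS instance whose source is the primary's ARN; the ARN and identifiers below are placeholders, and promotion itself happens out-of-band during a failover.

```yaml
# Sketch of the us-west-2 read replica (deployed in the DR-region stack).
# The source ARN and identifiers are placeholders. Promotion during a
# regional failover is an out-of-band step, e.g.
#   aws rds promote-read-replica --db-instance-identifier app-db-replica
Resources:
  AppDatabaseReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: arn:aws:rds:us-east-1:123456789012:db:app-db
      DBInstanceIdentifier: app-db-replica
      DBInstanceClass: db.r5.large
```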
DNS failover happened through Amazon Route 53. We used health checks and a failover routing policy so that if the primary endpoint was deemed unhealthy, traffic would automatically route to the standby cluster in us-west-2.
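A simplified sketch of that Route 53 setup, with placeholder domain names, zone ID, and load balancer endpoints, looks like this:

```yaml
# Sketch of Route 53 failover routing: a health check on the primary
# endpoint plus PRIMARY/SECONDARY records. All names and IDs are placeholders.
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary-alb.us-east-1.example.com
        ResourcePath: /healthz
        RequestInterval: 30
        FailureThreshold: 3

  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z0000000000EXAMPLE
      Name: api.example.com
      Type: CNAME
      TTL: "60"
      SetIdentifier: primary-us-east-1
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - primary-alb.us-east-1.example.com

  SecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z0000000000EXAMPLE
      Name: api.example.com
      Type: CNAME
      TTL: "60"
      SetIdentifier: secondary-us-west-2
      Failover: SECONDARY
      ResourceRecords:
        - standby-alb.us-west-2.example.com
```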
Backup & Recovery Strategy
While real-time replication handled the typical failover scenario, we also scheduled daily snapshot backups to Amazon S3 for long-term retention.
If there was data corruption, we had point-in-time recovery available. This let us restore the database up to the last committed transaction before corruption occurred, again helping us maintain a very tight RPO.
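Concretely, this came down to backup settings on the database resource plus an out-of-band restore procedure; the properties and values below are illustrative.

```yaml
# Backup-related properties on the AWS::RDS::DBInstance resource from the
# earlier sketch (values illustrative). Automated backups are what make
# point-in-time recovery possible; a corrupted database can be restored
# to a fresh instance with, for example:
#   aws rds restore-db-instance-to-point-in-time \
#     --source-db-instance-identifier app-db \
#     --target-db-instance-identifier app-db-restored \
#     --restore-time 2024-01-15T03:04:05Z
BackupRetentionPeriod: 7              # days of automated backups (PITR window)
PreferredBackupWindow: "03:00-04:00"  # daily backup window, UTC
CopyTagsToSnapshot: true
```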
Kubernetes & SRE Tooling
We used Prometheus for cluster and application metrics, with Alertmanager hooked into PagerDuty for quick response.
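A minimal Alertmanager sketch for that PagerDuty hookup (the routing key and grouping labels are placeholders) looks something like:

```yaml
# Minimal Alertmanager sketch: route alerts to PagerDuty.
# The routing key is a placeholder secret.
route:
  receiver: pagerduty-oncall
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-events-v2-routing-key>
        severity: critical
```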
The AWS Load Balancer Controller provisioned ALBs in front of our microservices, performing health checks on pod endpoints to ensure traffic was only routed to healthy pods.
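A typical Ingress handled by the controller looked roughly like the sketch below; the host, service name, and health check path are placeholders.

```yaml
# Sketch of an Ingress backed by the AWS Load Balancer Controller.
# Host, service name, and health check path are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip            # route straight to pod IPs
    alb.ingress.kubernetes.io/healthcheck-path: /healthz # ALB health check per target
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-api
                port:
                  number: 80
```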
Chaos Engineering: We regularly introduced controlled failures, such as shutting down a complete AZ or simulating DB unavailability, using tools like Litmus. This helped us verify that our failover mechanisms were working as expected and stayed within our RTO/RPO requirements.
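As an example of the kind of experiment we ran, here is a sketch of a Litmus ChaosEngine that kills pods of a target app; the namespace, labels, and service account are placeholders and assume the pod-delete experiment is installed in the cluster.

```yaml
# Sketch of a Litmus ChaosEngine that deletes pods of the target app to
# rehearse pod/AZ loss. Namespace, labels, and the service account are
# placeholders; the pod-delete ChaosExperiment must already be installed.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: web-api-pod-delete
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=web-api
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
```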
Example of a Real Failure
During a peak load test, we intentionally killed the primary database instance in us-east-1a. Within about 30–60 seconds, RDS promoted the standby in us-east-1b with minimal data loss (basically none) and only a brief slowdown in service response times.
For region-level failures, we tested a scenario where us-east-1 was entirely unavailable. Route 53 automatically failed over to us-west-2, and we promoted the read replica there to be the new primary. This entire process took under 2 minutes, meeting our internal SLA for high availability.
Summary
By distributing EKS nodes across multiple AZs, relying on RDS Multi-AZ for near-zero data loss, and configuring a standby environment in a separate region for disaster recovery, we effectively removed single points of failure. Regular chaos engineering drills confirmed that our RTO and RPO objectives remained within acceptable limits, ensuring our critical services would remain operational even in severe failure scenarios.