Overview of designing a highly available system using Amazon EKS and complementary AWS services

1. Multi-AZ EKS Cluster Architecture

EKS Control Plane

  • AWS-Managed Control Plane: EKS automatically provisions and manages the Kubernetes control plane across multiple Availability Zones (AZs) within a region. This means that even if one AZ goes down, the control plane continues functioning from other AZs.

  • API Endpoints: EKS exposes a single endpoint that automatically balances requests across these control plane instances. No manual configuration is required for high availability at the control plane level.

Worker Nodes (Node Groups)

  • Managed Node Groups: Use EKS-managed node groups and spread the nodes across at least three AZs in the region (e.g., us-east-1a, us-east-1b, us-east-1c). This ensures that even if one AZ has issues, nodes in the remaining AZs can continue serving traffic.

  • Auto Scaling: Managed node groups are backed by an Auto Scaling Group (ASG), so the cluster can quickly replace failed nodes or scale out to handle load spikes. Configure min/max sizes to cover both typical and peak workloads (see the sketch below).
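
As a rough illustration, here is a minimal eksctl ClusterConfig sketch for such a node group spread across three AZs; the cluster name, instance type, and sizes are placeholders to adapt to your workloads.

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: ha-cluster                 # placeholder cluster name
      region: us-east-1
    availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    managedNodeGroups:
      - name: general-workloads
        instanceType: m5.large         # placeholder instance type
        minSize: 3                     # at least one node per AZ under normal load
        maxSize: 9                     # headroom for peaks and node replacement
        desiredCapacity: 3
        availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]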

Pod Placement

  • Pod Anti-Affinity: Use Kubernetes podAntiAffinity rules (or topology spread constraints) to distribute pods across multiple nodes and AZs, so that no single node or AZ failure can take down an entire service (see the sketch after this list).

  • Node Taints and Tolerations (if needed): Isolate specialized workloads on dedicated nodes using taints and tolerations, but keep enough capacity in each AZ to preserve resilience.
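
A sketch of how a Deployment might combine topology spread constraints (across AZs) with preferred pod anti-affinity (across nodes); the app name, replica count, and image are placeholders.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web                              # hypothetical service
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          # Spread replicas evenly across AZs...
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchLabels:
                  app: web
          # ...and prefer not to co-locate replicas on the same node.
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    topologyKey: kubernetes.io/hostname
                    labelSelector:
                      matchLabels:
                        app: web
          containers:
            - name: web
              image: nginx:stable            # placeholder image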

2. Multi-AZ Data Layer and Persistent Storage

RDS (Relational Database Service) with Multi-AZ

  • Multi-AZ Deployments: For a relational database (e.g., PostgreSQL, MySQL), enable Multi-AZ support on RDS. In the event of an AZ failure, RDS automatically fails over to the standby instance in another AZ with minimal downtime (see the snippet after this list).

  • Aurora: For even faster failover, you can use Amazon Aurora with multiple read replicas across AZs. Aurora also allows for cross-region replicas.
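
A minimal CloudFormation sketch of a Multi-AZ RDS instance; the engine, instance class, and storage size are assumptions, and the referenced DB subnet group is presumed to be defined elsewhere in the template.

    Resources:
      AppDatabase:
        Type: AWS::RDS::DBInstance
        Properties:
          Engine: postgres
          DBInstanceClass: db.r6g.large          # placeholder instance class
          AllocatedStorage: "100"
          MultiAZ: true                           # synchronous standby in a second AZ
          DBSubnetGroupName: !Ref AppDbSubnetGroup  # subnet group spanning private subnets (defined elsewhere)
          MasterUsername: appadmin                # placeholder admin user
          ManageMasterUserPassword: true          # let RDS manage the password in Secrets Manager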

Amazon EFS (Elastic File System)

  • If containers require a shared file system, use Amazon EFS, which is inherently multi-AZ within a region. With the EFS CSI driver, pods in different AZs can mount the same file system (see the example below).
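
One way to wire this up with the EFS CSI driver's dynamic provisioning: a StorageClass backed by EFS access points plus a ReadWriteMany claim. The file system ID is a placeholder.

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: efs-shared
    provisioner: efs.csi.aws.com
    parameters:
      provisioningMode: efs-ap              # dynamic provisioning via EFS access points
      fileSystemId: fs-0123456789abcdef0    # placeholder EFS file system ID
      directoryPerms: "700"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-data
    spec:
      accessModes: ["ReadWriteMany"]        # pods in any AZ can mount the same file system
      storageClassName: efs-shared
      resources:
        requests:
          storage: 5Gi                      # EFS does not enforce the size, but the field is required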

S3 for Static Assets and Backups

  • Store static data or backups on Amazon S3, which stores objects redundantly across multiple AZs in a region by default. Integrate these backups into your Disaster Recovery (DR) plan.

3. Multi-AZ Load Balancing and Ingress

AWS Load Balancer Controller

  • Deploy the AWS Load Balancer Controller in your EKS cluster. It automatically provisions Application Load Balancers (ALBs) for your Kubernetes Ingress resources and Network Load Balancers (NLBs) for Services of type LoadBalancer.

  • The load balancers themselves are multi-AZ: an ALB distributes traffic only to healthy targets across the enabled AZs.

Ingress Configuration

  • Use an ALB-backed Ingress in front of your Kubernetes services. The ALB automatically health-checks your targets, ensuring traffic is only sent to healthy endpoints (see the example after this list).

  • Optionally, use an NLB for TCP/UDP workloads that need a lower-level, pass-through load balancer (for example, to terminate mTLS in the pods themselves).
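
A sketch of an ALB-backed Ingress for the AWS Load Balancer Controller; the service name, port, and health-check path are assumptions.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: web-ingress                    # hypothetical service name
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip               # route directly to pod IPs
        alb.ingress.kubernetes.io/healthcheck-path: /healthz    # assumption: app exposes this path
    spec:
      ingressClassName: alb
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: web              # hypothetical Service
                    port:
                      number: 80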

Route 53 DNS

  • Route traffic to your ALB using Amazon Route 53 Alias records. Route 53 can also serve as your first line of failover and global traffic management if you go multi-region (a minimal record definition follows).
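
A minimal CloudFormation sketch of an alias record; it assumes the ALB is defined in the same template as a resource named AppLoadBalancer (if the AWS Load Balancer Controller creates the ALB instead, plug in the ALB's DNS name and canonical hosted zone ID directly). The domain names are placeholders.

    Resources:
      AppAliasRecord:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.         # placeholder hosted zone
          Name: app.example.com.
          Type: A
          AliasTarget:
            DNSName: !GetAtt AppLoadBalancer.DNSName                   # assumed ALB resource in this template
            HostedZoneId: !GetAtt AppLoadBalancer.CanonicalHostedZoneID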

4. Multi-Region Strategy

Moving from a single-region, multi-AZ approach to a multi-region design pushes you closer to 99.99% availability. However, this adds complexity in terms of data replication and global traffic routing.

Active-Active vs. Active-Passive

  • Active-Active: Two or more EKS clusters in different AWS regions (e.g., us-east-1 and us-west-2). Traffic is balanced via Route 53 latency-based or geo-based routing. The data layer (e.g., Aurora Global Database) replicates across regions.

  • Active-Passive: Run primary services in the main region while another region stands by with a replicated database (Aurora cross-region read replica, or cross-region S3 replication for backups). Route 53 health checks automatically shift traffic to the standby region if the primary fails (see the failover records sketched below).
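
A sketch of the active-passive pattern expressed as Route 53 failover records with a health check on the primary endpoint; all domain names and endpoints are placeholders.

    Resources:
      PrimaryHealthCheck:
        Type: AWS::Route53::HealthCheck
        Properties:
          HealthCheckConfig:
            Type: HTTPS
            FullyQualifiedDomainName: primary.example.com   # placeholder primary endpoint
            ResourcePath: /healthz                          # assumption: app health endpoint
            FailureThreshold: 3
      PrimaryRecord:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.
          Name: app.example.com.
          Type: CNAME
          TTL: "60"
          SetIdentifier: primary
          Failover: PRIMARY
          HealthCheckId: !Ref PrimaryHealthCheck
          ResourceRecords:
            - primary.example.com            # e.g., the primary region's ALB DNS name
      SecondaryRecord:
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneName: example.com.
          Name: app.example.com.
          Type: CNAME
          TTL: "60"
          SetIdentifier: secondary
          Failover: SECONDARY
          ResourceRecords:
            - standby.example.com            # the standby region's endpoint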

Data Replication

  • Aurora Global Database: Provides near real-time replication between regions and allows for fast failover.

  • S3 Cross-Region Replication (CRR): Replicates objects to a bucket in a second region (see the bucket configuration after this list).

  • ElastiCache Global Datastore (Redis): If you rely on Redis for caching, you can replicate across regions.
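
A minimal CloudFormation sketch of CRR on a source bucket; the replication role and destination bucket ARN are placeholders, and both buckets need versioning enabled.

    Resources:
      AssetsBucket:
        Type: AWS::S3::Bucket
        Properties:
          VersioningConfiguration:
            Status: Enabled                  # CRR requires versioning on source and destination
          ReplicationConfiguration:
            Role: arn:aws:iam::123456789012:role/s3-replication-role   # placeholder IAM role
            Rules:
              - Status: Enabled
                Prefix: ""                   # replicate all objects
                Destination:
                  Bucket: arn:aws:s3:::assets-replica-us-west-2        # placeholder destination bucket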

Global Load Balancing

  • Use Route 53 with Health Checks: If the primary region becomes unhealthy (an application-level or region-wide outage), Route 53 automatically diverts traffic to the healthy region.

  • If you prefer a more advanced global load balancing setup, you can explore solutions like AWS Global Accelerator (for TCP/UDP-based traffic).

5. High Availability and Zero-Downtime Deployments

Rolling Updates in EKS

  • Kubernetes' RollingUpdate strategy brings up new pod versions before terminating old ones (governed by maxSurge and maxUnavailable); combined with readiness probes, this prevents downtime during deployments (see the example below).

  • For safer deployments, consider Canary or Blue/Green releases with service mesh solutions like Istio or continuous delivery tools like Argo CD.
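
A sketch of a Deployment using the RollingUpdate strategy together with a readiness probe; the image, port, and health endpoint are placeholders.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api                            # hypothetical service
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: api
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1                      # bring up one extra pod before removing an old one
          maxUnavailable: 0                # never drop below the desired replica count
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
            - name: api
              image: registry.example.com/api:1.2.3   # placeholder image
              readinessProbe:              # traffic shifts only to pods that pass this check
                httpGet:
                  path: /healthz           # assumption: app exposes this endpoint
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 10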

Autoscaling

  • Horizontal Pod Autoscaler (HPA): Scales pods based on CPU/memory or custom metrics (see the example after this list).

  • Cluster Autoscaler: Automatically adds or removes EC2 worker nodes in the node groups to match pending pod demand.
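
A minimal autoscaling/v2 HPA sketch targeting the hypothetical Deployment from the rolling-update example; the CPU target and replica bounds are assumptions.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api                          # hypothetical Deployment name
      minReplicas: 4
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60       # scale out when average CPU exceeds 60%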

Chaos Engineering

  • Tools like Chaos Mesh or Litmus can systematically inject failures (e.g., killing pods, adding network latency, simulating an AZ outage) to test your system's resilience and validate that the design meets its uptime SLAs (a sample experiment follows).
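
As one illustration, a Chaos Mesh PodChaos experiment (2.x-style CRD) that kills a single pod so you can verify the service self-heals; namespaces and labels are placeholders.

    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: kill-one-api-pod
      namespace: chaos-testing             # assumption: namespace where chaos experiments are allowed
    spec:
      action: pod-kill                     # randomly kill one matching pod
      mode: one
      selector:
        namespaces:
          - default                        # placeholder application namespace
        labelSelectors:
          app: api                         # hypothetical label from the earlier examples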

6. Observability and Monitoring

Amazon CloudWatch

  • Monitor underlying EC2 node performance metrics, load balancer metrics (e.g., 5xx errors), and cluster-level metrics (CPU, memory, disk).

  • Configure CloudWatch Alarms to proactively alert on anomalies (see the example after this list).
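
A CloudFormation sketch of an alarm on ALB 5xx errors that notifies an SNS topic; the load balancer dimension value and topic ARN are placeholders.

    Resources:
      Alb5xxAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmDescription: ALB is returning elevated 5xx responses
          Namespace: AWS/ApplicationELB
          MetricName: HTTPCode_ELB_5XX_Count
          Dimensions:
            - Name: LoadBalancer
              Value: app/my-alb/0123456789abcdef    # placeholder ALB dimension value
          Statistic: Sum
          Period: 60
          EvaluationPeriods: 5
          Threshold: 10
          ComparisonOperator: GreaterThanThreshold
          TreatMissingData: notBreaching
          AlarmActions:
            - arn:aws:sns:us-east-1:123456789012:ops-alerts   # placeholder SNS topic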

Prometheus and Grafana

  • Deploy Prometheus and Grafana inside EKS for Kubernetes-level metrics.

  • Use Alertmanager integrated with Slack, PagerDuty, or Amazon SNS for timely incident response (a minimal routing configuration follows).
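
A minimal Alertmanager routing sketch that sends alerts to Slack; the webhook URL and channel are placeholders.

    # alertmanager.yml
    route:
      receiver: slack-oncall
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      repeat_interval: 4h
    receivers:
      - name: slack-oncall
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
            channel: "#oncall"
            send_resolved: true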

AWS X-Ray / OpenTelemetry

  • Use AWS X-Ray or OpenTelemetry for distributed tracing to quickly diagnose issues in microservices architectures.