General interview questions
1. Data Consistency and Replication
Q: In an active-active setup across multiple regions, how do you handle data consistency and potential conflicts if the same data can be written in more than one region?
A:
In our e-commerce platform, we operate in an active-passive configuration for writes to avoid write conflicts. The primary region (us-east-1) handles all write traffic to an Aurora cluster, while the secondary region (us-west-2) hosts a read replica that is continuously fed by low-lag asynchronous replication. This gives us near-real-time data in the standby region without risking conflicts from concurrent writes.
If we ever needed active-active writes, we’d look at Aurora Global Database, which keeps cross-region replication lag low. However, accepting writes in more than one region introduces complexities like conflict detection and resolution. For our scenario, we determined that active-passive provides sufficient RPO (near zero) and RTO (2 minutes) without the overhead of handling multi-writer conflicts.
2. RPO vs. RTO Trade-offs
Q: You mentioned RPO and RTO goals. Could you explain how you decide when to use synchronous versus asynchronous replication, and what the cost or architectural implications are?
A:
Our target RPO is near zero, meaning we can’t afford to lose more than a few seconds of data. Within the region, we rely on Aurora’s storage layer, which synchronously replicates every write across multiple AZs, so a write is acknowledged only after a quorum of storage nodes has persisted it.
Across regions, however, fully synchronous replication would add the cross-region round trip to every write and increase networking costs. Since our RTO is 2 minutes, we’re willing to accept a small window (seconds) of asynchronous replication lag to the secondary region. By keeping cross-region replication asynchronous, we balance cost, performance, and near-zero data loss.
3. Cost vs. Availability
Q: How do you justify or optimize costs for maintaining multiple clusters, cross-region replicas, and large amounts of infrastructure that may be idle in an active-passive setup?
A:
For our e-commerce platform, downtime during peak seasons (like Black Friday) would be extremely costly—much higher than the infrastructure spend. However, we do optimize costs by running the secondary EKS cluster and read replica in a scaled-down state when not under load.
EKS Cluster: We keep only minimal worker nodes in us-west-2 (e.g., 2 t3.medium instances).
Read Replica: Aurora’s read replica in us-west-2 scales automatically based on read workload.
Auto Scaling: If a failover triggers, we rapidly scale up in the secondary region to handle production traffic.
This approach balances high availability with cost-effectiveness.
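To make the failover scale-up concrete, here is a minimal HorizontalPodAutoscaler sketch for a Deployment in the secondary cluster (the resource names, namespace, and thresholds are placeholders, not our production values); node capacity would still need to grow alongside it, e.g., via a cluster autoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 2        # small standby footprint in us-west-2
  maxReplicas: 30       # headroom for full production traffic after failover
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60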
4. Monitoring and Observability
Q: Which specific metrics or alerts do you configure to detect partial failures before they become full-blown outages?
A:
We have a Prometheus + Alertmanager stack running in the primary EKS cluster. Key metrics and alerts include:
Pod and Node Health: Alerts if more than 20% of pods in a Deployment are unavailable.
Database Latency / Error Rates: We track read/write latencies from Aurora and set thresholds for critical alerts.
Network Connectivity: We have a CronJob that pings external services and the secondary cluster to detect any networking issues.
ALB 5xx Errors: If the Application Load Balancer sees a spike in 5xx errors, we get paged immediately.
Additionally, Amazon CloudWatch monitors underlying infrastructure (EC2 node status, EBS health, etc.). This layered approach gives us an early warning if any component is failing.
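As an illustration, two of the alerts above expressed as a PrometheusRule for the Prometheus Operator (the pod-availability rule assumes kube-state-metrics is scraped; the latency rule assumes the apps expose a http_request_duration_seconds histogram, and thresholds here are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: availability-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: DeploymentPodsUnavailable
          # Fires when more than 20% of a Deployment's desired pods are unavailable.
          expr: |
            kube_deployment_status_replicas_unavailable
              / kube_deployment_spec_replicas > 0.2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "More than 20% of pods unavailable in {{ $labels.deployment }}"
        - alert: HighRequestLatencyP99
          # Assumes an application-level http_request_duration_seconds histogram.
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "P99 latency above 1s for {{ $labels.service }}"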
5. Chaos Engineering
Q: How would you implement chaos engineering to test these failover strategies in a pre-production environment?
A:
We run monthly chaos experiments on our staging EKS cluster using Litmus. Example experiments:
AZ Outage Simulation: We intentionally cordon and drain all nodes in us-east-1a to verify that pods are rescheduled onto nodes in us-east-1b and us-east-1c.
Database Failover Test: We force a failover of the Aurora writer instance so that a reader in another AZ is promoted.
Network Partitions: We inject iptables rules via chaos scripts to block traffic between pods and the database.
We measure how quickly services recover, compare results to our RTO objectives, and refine our runbooks accordingly.
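For reference, a trimmed Litmus ChaosEngine for a pod-delete experiment might look like this (assuming Litmus CRDs, the pod-delete ChaosExperiment, and a chaos service account are already installed; names and durations are illustrative):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"       # run chaos for two minutes
            - name: CHAOS_INTERVAL
              value: "15"        # delete a pod every 15 seconds
            - name: FORCE
              value: "false"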
6. Disaster Recovery Drills
Q: How often do you run disaster recovery drills, and what does your runbook look like for recovering from a complete regional outage?
A:
We conduct quarterly DR drills where we mimic a full us-east-1 outage:
Disable Route 53 Health Checks for the primary region, forcing traffic to us-west-2.
Promote the Aurora read replica in us-west-2 to primary (automated by a script where possible, otherwise following the manual runbook).
Scale Up the secondary EKS cluster with a predefined Terraform script or AWS Auto Scaling.
Our runbook includes step-by-step instructions, estimated times for each action, and fallback procedures if something unexpected happens. We measure recovery time to ensure it meets our 2-minute RTO.
7. DNS Failover Mechanisms
Q: You mentioned using Route 53 DNS failover. What factors do you consider regarding DNS TTLs and the potential propagation delay?
A:
We set our DNS TTL to 60 seconds. Shorter TTLs mean more frequent DNS lookups and slightly higher query costs, but they accelerate failover. We also rely on Route 53 health checks that poll our primary ALB endpoints every few seconds. If multiple consecutive checks fail, Route 53 switches the DNS answers to point to the secondary region.
In practice, global DNS propagation can sometimes exceed 60 seconds, but we still target sub-2-minute failovers for the majority of clients.
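In infrastructure-as-code terms, the failover records could be sketched roughly like this CloudFormation snippet (the hosted zone, domain, and ALB DNS names are placeholders):

Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary-alb.us-east-1.example.com   # placeholder
        ResourcePath: /health
        RequestInterval: 10      # seconds between health checks
        FailureThreshold: 3      # consecutive failures before marking unhealthy

  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: shop.example.com.
      Type: CNAME
      TTL: "60"                  # short TTL to speed up client failover
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - primary-alb.us-east-1.example.com

  SecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: shop.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: secondary
      Failover: SECONDARY
      ResourceRecords:
        - secondary-alb.us-west-2.example.com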
8. Managed vs. Self-Hosted Databases
Q: If you needed to run your own database (e.g., self-hosted Postgres) inside EKS instead of using RDS/Aurora, how would you approach replication, failover, and backups?
A:
For a self-managed Postgres:
StatefulSets: We’d create a multi-AZ StatefulSet for the primary Postgres node and one or more replicas.
Replication: We’d use streaming replication with WAL (Write-Ahead Log) shipping.
Failover Orchestration: Tools like Patroni or Stolon handle leader election if the primary fails.
Backups: Automated backups and WAL archiving (e.g., with pgBackRest) to S3 for a short RPO.
Complexity: We’d still need to manage OS patches, replication config, and DR across regions. Using RDS/Aurora offloads much of this overhead.
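A stripped-down sketch of the StatefulSet piece is shown below; a real Patroni-based deployment needs considerably more configuration (typically installed via an operator or Helm chart), and the names, image, and sizes here are placeholders:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres          # headless Service used for stable pod DNS
  replicas: 3                    # one leader plus two replicas
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        podAntiAffinity:         # spread database pods across AZs
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: "topology.kubernetes.io/zone"
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi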
9. Security Considerations
Q: During failover, how do you handle secrets management and encryption so the system remains secure while switching from one region to another?
A:
We leverage AWS Secrets Manager to store and rotate secrets (DB credentials, API keys). The same secret is replicated across regions:
Encryption: We use KMS (Key Management Service) to encrypt data at rest in Aurora, and TLS for data in transit.
Failover: The EKS cluster in us-west-2 uses the same IAM roles and Secrets Manager configuration. Once traffic switches, pods can retrieve secrets from the new region’s Secrets Manager endpoint.
Zero Trust: We also enforce mutual TLS (mTLS) between services when we deploy a service mesh (e.g., Istio).
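One common way to surface those replicated secrets to pods is the External Secrets Operator; a minimal sketch, assuming ESO is installed and a ClusterSecretStore named aws-secretsmanager points at the region-local Secrets Manager endpoint (secret names are placeholders):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h                # re-sync so rotated secrets propagate
  secretStoreRef:
    name: aws-secretsmanager         # hypothetical region-local ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials             # Kubernetes Secret created in-cluster
  data:
    - secretKey: password
      remoteRef:
        key: prod/aurora/credentials # replicated secret name in Secrets Manager
        property: password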
10. Versioning and Rollbacks
Q: In an outage scenario, it’s possible that a bad deploy triggered cascading failures. How do you roll back application versions safely while managing the failover process?
A:
We manage our microservices using Helm and store versions in Git (GitOps). If we suspect a bad deployment:
Pause Deployments: Temporarily scale down or pause the faulty Deployment.
Rollback: Check out the last known good revision from our Git repository, run helm rollback (or revert the Git commit and let Argo CD do the sync).
Failover: If the region itself is also in trouble, we fail over to us-west-2. The secondary cluster is pinned to the last stable release (see the Application sketch below) until we confirm the fix.
Test: We run smoke tests on the stable version before restoring normal routing.
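With Argo CD, pinning a cluster to the last stable release comes down to the targetRevision on the Application; a minimal sketch (the repo URL, path, and tag are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/charts.git   # placeholder repo
    path: charts/payment-service
    targetRevision: v1.42.0       # last known good release tag
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true              # drift back to the pinned revision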
11. Testing Partial Outages
Q: Sometimes an entire AZ doesn’t fail, but becomes partially impaired. How do you detect and respond to partial degradations?
A:
We rely on Kubernetes readiness/liveness probes to detect slow or failing pods within an AZ. For example, if pods in us-east-1a start timing out:
K8s Self-Healing: The Deployment automatically spins up new pods in healthy AZs (us-east-1b, us-east-1c).
Prometheus Alerts: We have alerts on P95/P99 latency. If they spike significantly in a single AZ, we investigate.
Proactive Drain: We may manually cordon and drain that AZ if partial impairment worsens, ensuring traffic goes to healthy AZs.
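A guardrail we’d pair with such drains is a PodDisruptionBudget, so evicting the impaired AZ’s pods can never drop a service below a safe floor (the service name and threshold are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2                 # never evict below two ready pods
  selector:
    matchLabels:
      app: payment-service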
12. Performance Implications
Q: For cross-region active-active setups, how do you handle latency for end users who might be routed to a distant region?
A:
In our current e-commerce scenario, we’re primarily active-passive, so requests usually flow to the primary region (lowest latency for North American customers). If we did need active-active:
Latency-Based Routing: Route 53 can direct users to the closest region.
Edge Distribution: We serve static content (images, CSS) via CloudFront to reduce latency globally.
Read-Local, Write-Global: For globally distributed traffic, we could serve reads from the nearest region via local read replicas, while all writes are routed to the single writer region.
13. Implementation Details
Q: Which Kubernetes resources do you rely on most for building high availability, and could you provide a YAML snippet?
A:
Deployments with multiple replicas and readiness/liveness probes.
PodAntiAffinity rules in the affinity section of the Deployment spec to spread pods across AZs.
A simplified example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: payment-service
              topologyKey: "topology.kubernetes.io/zone"
      containers:
        - name: payment-container
          image: myregistry/payment:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
This ensures pods are balanced across different zones and are regularly health-checked.