☁️ AWS Interview Questions
40 in-depth questions covering EC2, S3, Lambda, VPC, IAM, RDS, DynamoDB, CloudFormation, CloudFront, Auto Scaling, security, cost optimization, and performance — with theory, real configs, real-world scenarios, and common mistakes.
AWS Global Infrastructure is the physical foundation of all AWS services, organized into three tiers:
Regions:
- A geographic area (e.g., us-east-1, ap-south-1) containing multiple data centers.
- Each Region is completely independent — data does not replicate between Regions unless you configure it.
- Choose a Region based on: latency (proximity to users), compliance (data residency laws), service availability (not all services are in all Regions), and cost (pricing varies by Region).
Availability Zones (AZs):
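- Each Region has multiple AZs (three or more in most Regions, up to six in us-east-1). An AZ is one or more physically separate data centers within the Region.
- Connected by low-latency, high-bandwidth private fiber (typically single-digit-millisecond latency between AZs).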
- Designed for fault isolation — separate power, cooling, networking. A fire/flood in one AZ doesn't affect others.
- Multi-AZ deployment is the foundation of high availability on AWS.
Edge Locations:
- 400+ locations worldwide used by CloudFront (CDN), Route 53 (DNS), and AWS WAF.
- Cache content close to end users for lower latency.
- Separate from Regions — there are many more Edge Locations than Regions.
# ── List all Regions ──
aws ec2 describe-regions --query "Regions[].RegionName" --output table
# ── List AZs in current Region ──
aws ec2 describe-availability-zones \
--query "AvailabilityZones[].{Zone:ZoneName,State:State,Type:ZoneType}" \
--output table
# Output:
# | Zone | State | Type |
# | us-east-1a | available | availability-zone |
# | us-east-1b | available | availability-zone |
# | us-east-1c | available | availability-zone |
# ── CloudFormation: Multi-AZ deployment ──
# Resources:
#   MyVPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#
#   SubnetAZ1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref MyVPC
#       CidrBlock: 10.0.1.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#
#   SubnetAZ2:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref MyVPC
#       CidrBlock: 10.0.2.0/24
#       AvailabilityZone: !Select [1, !GetAZs ""]
# ── Check Edge Location count ──
# As of 2026: 400+ Edge Locations, 13+ Regional Edge Caches
# Edge Locations are used by CloudFront, Route 53, AWS Shield, WAF
# ── Python boto3: Get AZs programmatically ──
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# azs = ec2.describe_availability_zones()
# for az in azs["AvailabilityZones"]:
#     print(f"{az['ZoneName']} - {az['State']}")
A startup deployed their entire application in a single AZ (us-east-1a). When that AZ experienced a network partition, the app was completely down for 4 hours. After the incident, they redesigned for Multi-AZ: the database moved to RDS Multi-AZ (automatic failover), web servers spanned two AZs behind an ALB, and S3 (inherently multi-AZ) stored static assets. The next AZ outage caused zero downtime.
What is the difference between a Region, an AZ, and a Local Zone? When would you use Local Zones or Wavelength Zones?
EC2 (Elastic Compute Cloud) provides virtual servers. Instance types define the hardware profile — CPU, memory, storage, and networking.
Instance family naming: m7g.xlarge = m (family) + 7 (generation) + g (Graviton/ARM) + xlarge (size).
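As a rough illustration of the naming scheme above, the following Python sketch splits a type name into its parts. The helper `parse_instance_type` and its regular expression are hypothetical (not an AWS API) and only cover common names such as m7g.xlarge or t3.medium.

```python
import re

# Hypothetical pattern: family letters, generation digits, optional attribute
# letters (e.g. "g" for Graviton), then the size after the dot.
# It is not an exhaustive grammar for every EC2 instance type.
_INSTANCE_TYPE = re.compile(
    r"^(?P<family>[a-z]+?)(?P<generation>\d+)(?P<attributes>[a-z\-]*)\.(?P<size>[a-z0-9\-]+)$"
)


def parse_instance_type(name: str) -> dict:
    """Split a name like 'm7g.xlarge' into family, generation, attributes, size."""
    match = _INSTANCE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    parts = match.groupdict()
    parts["generation"] = int(parts["generation"])
    return parts


if __name__ == "__main__":
    print(parse_instance_type("m7g.xlarge"))
    print(parse_instance_type("t3.medium"))
```

An empty `attributes` value (as for t3.medium) means the name carries no processor or feature suffix.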
Key families:
- T3/T4g — Burstable. Earns CPU credits when idle, spends them during spikes. Cheapest. Good for dev/test, low-traffic web servers.
- M7i/M7g — General Purpose. Balanced CPU/memory. Good for most workloads (web apps, small databases).
- C7i/C7g — Compute Optimized. High CPU-to-memory ratio. Good for batch processing, ML inference, gaming servers.
- R7i/R7g — Memory Optimized. High memory-to-CPU ratio. Good for in-memory caches (Redis), real-time analytics.
- I4i — Storage Optimized. High IOPS NVMe SSDs. Good for databases, data warehouses.
- P5/G5 — Accelerated Computing. GPUs for ML training, video encoding, HPC.
Burstable (T-family): Uses a CPU credit system. Below baseline → earns credits. Above baseline → spends credits. When credits run out, the instance is either throttled to baseline (standard mode, the default for T2) or keeps bursting and is charged per vCPU-hour for the surplus (unlimited mode, the default for T3 and T4g). Example baseline: t3.medium is 20% per vCPU.
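The credit mechanics can be sketched as a small simulation. This is a simplified model of standard mode, not AWS's exact accounting. It assumes t3.medium-like figures (2 vCPUs, 24 credits earned per hour, a 576-credit cap) and that one credit equals one vCPU at 100% for one minute. `simulate_credits` is a hypothetical helper.

```python
def simulate_credits(hourly_utilization, vcpus=2, credits_per_hour=24.0,
                     max_balance=576.0, start_balance=0.0):
    """Return the CPU credit balance after each hour.

    hourly_utilization: average CPU utilization per hour as a fraction (0.0-1.0).
    Assumed figures resemble a t3.medium in standard mode; verify real values
    for your instance type before relying on them.
    """
    balance = start_balance
    history = []
    for utilization in hourly_utilization:
        # One credit = one vCPU at 100% for one minute.
        spent = utilization * vcpus * 60
        balance = min(max_balance, max(0.0, balance + credits_per_hour - spent))
        history.append(round(balance, 1))
    return history


if __name__ == "__main__":
    # Four quiet hours at 5% CPU, then three busy hours at 50% CPU.
    print(simulate_credits([0.05] * 4 + [0.5] * 3))
```

At exactly 20% utilization the instance spends 24 credits per hour, matching what it earns, which is why 20% is the break-even baseline in this model. Once the balance hits zero in standard mode, the instance cannot exceed that baseline.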
Graviton (g suffix): ARM-based processors by AWS. Up to 40% better price/performance vs Intel. Not all software supports ARM.
# ── List available instance types in a Region ──
aws ec2 describe-instance-types \
--filters "Name=current-generation,Values=true" \
--query "InstanceTypes[].{Type:InstanceType,vCPUs:VCpuInfo.DefaultVCpus,MemMiB:MemoryInfo.SizeInMiB}" \
--output table | head -20
# ── Launch an EC2 instance ──
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type m7g.large \
--key-name my-key \
--subnet-id subnet-abc123 \
--security-group-ids sg-abc123 \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=WebServer}]"
# ── Check T3 CPU credit balance ──
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUCreditBalance \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2026-05-29T00:00:00Z \
--end-time 2026-05-30T00:00:00Z \
--period 3600 --statistics Average
# ── CloudFormation: EC2 with instance type ──
# Resources:
#   WebServer:
#     Type: AWS::EC2::Instance
#     Properties:
#       InstanceType: m7g.large # Graviton for cost savings
#       ImageId: ami-0abcdef1234567890
#       SubnetId: !Ref PrivateSubnet
#       SecurityGroupIds:
#         - !Ref WebSG
#       # CreditSpecification applies to T-family only. Add it when
#       # InstanceType is burstable (e.g. t3.medium); omit it for m7g.large:
#       # CreditSpecification:
#       #   CPUCredits: unlimited
#       Tags:
#         - Key: Name
#           Value: WebServer
# ── Instance type comparison (common choices; approximate us-east-1 Linux On-Demand prices, which change over time) ──
# t3.medium: 2 vCPU, 4 GB RAM, $0.0416/hr (burstable, dev/test)
# m7g.large: 2 vCPU, 8 GB RAM, $0.0816/hr (general, Graviton)
# c7g.large: 2 vCPU, 4 GB RAM, $0.0725/hr (compute-heavy)
# r7g.large: 2 vCPU, 16 GB RAM, $0.1071/hr (memory-heavy)
A team ran their production API on t3.large instances in standard credit mode. During a traffic spike, the CPU credit balance dropped to zero and the instances were throttled to their 30% baseline; response times jumped from 50ms to 2 seconds. They switched to m7g.large (fixed performance, Graviton), which cost roughly the same per hour and provided consistent CPU. For their dev environment, they kept T3 with unlimited mode enabled: cost-effective with no throttling risk.
What is AWS Compute Optimizer and how does it recommend the right instance type?
Amazon S3 offers multiple storage classes optimized for different access patterns and cost requirements:
- S3 Standard — frequently accessed data. 99.99% availability, 11 nines durability. Highest cost per GB but no retrieval fees.
- S3 Intelligent-Tiering — automatically moves objects between tiers based on access patterns. Small monthly monitoring fee. Best when access patterns are unpredictable.
- S3 Standard-IA (Infrequent Access) — data accessed less than once a month. Lower storage cost, but per-GB retrieval fee. Minimum 30-day charge. Min object size: 128 KB.
- S3 One Zone-IA — same as IA but stored in a single AZ. 20% cheaper. Use for re-creatable data (thumbnails, transcoded media).
- S3 Glacier Instant Retrieval — archive data needing millisecond access (quarterly reports). Cheapest with instant access.
- S3 Glacier Flexible Retrieval — archive with retrieval in minutes to hours (1-5 min expedited, 3-5 hr standard, 5-12 hr bulk).
- S3 Glacier Deep Archive — cheapest storage. Retrieval in 12-48 hours. For compliance archives, 7-10 year retention.
Lifecycle Policies automate transitions between classes and expiration of objects based on age or other criteria.
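To make the age-based transitions concrete, here is a minimal Python sketch that maps an object's age to the storage class a lifecycle rule would leave it in. The thresholds (30, 90, and 365 days, with expiration at 2,555 days) are assumptions chosen for illustration, and `storage_class_for_age` is a hypothetical helper, not an S3 API.

```python
# Assumed rule set for illustration: (minimum age in days, storage class),
# ordered from the oldest threshold to the youngest.
TRANSITIONS = [
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER_IR"),
    (30, "STANDARD_IA"),
]
EXPIRATION_DAYS = 2555  # roughly 7 years


def storage_class_for_age(age_days: int):
    """Return the storage class for an object of the given age, or None if expired."""
    if age_days >= EXPIRATION_DAYS:
        return None  # the expiration action would have deleted the object
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            return storage_class
    return "STANDARD"


if __name__ == "__main__":
    for age in (0, 45, 120, 400, 3000):
        print(age, storage_class_for_age(age))
```

In practice S3 evaluates lifecycle rules asynchronously, so a transition may take effect some time after the object crosses the threshold.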
# ── Upload with specific storage class ──
aws s3 cp backup.tar.gz s3://my-bucket/backups/ \
--storage-class GLACIER_IR
# ── Check current storage class of an object ──
aws s3api head-object --bucket my-bucket --key data/report.csv \
--query "StorageClass"
# ── CloudFormation: S3 Lifecycle Policy ──
# Resources:
#   DataBucket:
#     Type: AWS::S3::Bucket
#     Properties:
#       BucketName: my-data-bucket
#       LifecycleConfiguration:
#         Rules:
#           - Id: TransitionToIA
#             Status: Enabled
#             Transitions:
#               # After 30 days → Infrequent Access
#               - TransitionInDays: 30
#                 StorageClass: STANDARD_IA
#               # After 90 days → Glacier Instant
#               - TransitionInDays: 90
#                 StorageClass: GLACIER_IR
#               # After 365 days → Deep Archive
#               - TransitionInDays: 365
#                 StorageClass: DEEP_ARCHIVE
#             ExpirationInDays: 2555 # Delete after 7 years
#
#           - Id: CleanupIncompleteUploads
#             Status: Enabled
#             AbortIncompleteMultipartUpload:
#               DaysAfterInitiation: 7 # Clean up failed uploads
# ── AWS CLI: Set lifecycle policy ──
aws s3api put-bucket-lifecycle-configuration \
--bucket my-bucket \
--lifecycle-configuration file://lifecycle.json
# lifecycle.json example:
# {
#   "Rules": [{
#     "ID": "ArchiveOldLogs",
#     "Status": "Enabled",
#     "Filter": { "Prefix": "logs/" },
#     "Transitions": [
#       { "Days": 30, "StorageClass": "STANDARD_IA" },
#       { "Days": 90, "StorageClass": "GLACIER" }
#     ],
#     "Expiration": { "Days": 365 }
#   }]
# }
# ── Storage class cost comparison (us-east-1, per GB/month) ──
# Standard: $0.023
# Intelligent-Tier: $0.023 (+ $0.0025/1000 objects monitoring)
# Standard-IA: $0.0125 (+ $0.01/GB retrieval)
# One Zone-IA: $0.01 (+ $0.01/GB retrieval)
# Glacier Instant: $0.004 (+ $0.03/GB retrieval)
# Glacier Flexible: $0.0036 (+ $0.01-$0.03/GB retrieval)
# Deep Archive: $0.00099 (+ $0.02/GB retrieval)
A healthcare company stored 50TB of patient records in S3 Standard, costing $1,150/month. Analysis showed 90% of records were accessed only during the first 30 days. They implemented a lifecycle policy: Standard for 30 days → Standard-IA for 30-90 days → Glacier Instant Retrieval after 90 days. Monthly cost dropped to $320, a 72% reduction. Compliance-required 7-year retention records moved to Deep Archive at about $1/TB/month.
What is S3 Intelligent-Tiering and when does it make more sense than manual lifecycle policies?
IAM (Identity and Access Management) controls who (authentication) can do what (authorization) in your AWS account.
Core components:
- Users — individual identities with long-term credentials (password + access keys). Map to a person or application. Best practice: minimize IAM users, use IAM Identity Center (SSO) instead.
- Groups — collections of users. Attach policies to groups, not individual users. E.g., "Developers" group, "DBAdmins" group.
- Roles — temporary credentials assumed by users, services, or accounts. No long-term credentials. EC2 instances, Lambda functions, and cross-account access use roles. Most important IAM concept.
- Policies — JSON documents that define permissions. Attached to users, groups, or roles.
Policy structure: Effect (Allow/Deny) + Action (e.g., s3:GetObject) + Resource (ARN of the resource). An explicit Deny always wins over any Allow, and anything not explicitly allowed is denied by default.
Policy types:
- AWS Managed — predefined by AWS (e.g., AmazonS3ReadOnlyAccess).
- Customer Managed — created by you, reusable across entities.
- Inline — embedded directly in a single user/group/role. Avoid when possible.
Least Privilege Principle: Grant only the minimum permissions needed. Start with zero permissions and add as needed.
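The "explicit Deny wins, everything else is denied by default" logic can be sketched in a few lines of Python. This is a deliberately simplified model: it ignores conditions, principals, SCPs, permission boundaries, and resource-based policies, and it matches case-sensitively. The `evaluate` function is a hypothetical helper, not the real IAM engine.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    allowed = False
    for statement in statements:
        action_match = any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        resource_match = any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"]))
        if not (action_match and resource_match):
            continue
        if statement["Effect"] == "Deny":
            return "ExplicitDeny"  # a matching Deny ends evaluation immediately
        allowed = True
    return "Allow" if allowed else "ImplicitDeny"


# Illustrative policy: read-only access to one bucket, plus a blanket Deny.
POLICY = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]},
    {"Effect": "Deny", "Action": "s3:DeleteBucket", "Resource": "*"},
]

if __name__ == "__main__":
    print(evaluate(POLICY, "s3:GetObject", "arn:aws:s3:::my-bucket/report.csv"))
    print(evaluate(POLICY, "s3:DeleteBucket", "arn:aws:s3:::my-bucket"))
    print(evaluate(POLICY, "s3:PutObject", "arn:aws:s3:::my-bucket/report.csv"))
```

The third request matches no statement at all, so it falls through to the implicit deny, which is the starting point that least privilege builds on.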
# ── IAM Policy JSON structure ──
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "AllowS3ReadOnly",
#       "Effect": "Allow",
#       "Action": [
#         "s3:GetObject",
#         "s3:ListBucket"
#       ],
#       "Resource": [
#         "arn:aws:s3:::my-bucket",
#         "arn:aws:s3:::my-bucket/*"
#       ]
#     },
#     {
#       "Sid": "DenyDeleteBucket",
#       "Effect": "Deny",
#       "Action": "s3:DeleteBucket",
#       "Resource": "*"
#     }
#   ]
# }
# ── Create an IAM Role for EC2 ──
aws iam create-role --role-name EC2-S3-Reader \
--assume-role-policy-document \
'{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
# Attach a managed policy
aws iam attach-role-policy --role-name EC2-S3-Reader \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# Create instance profile (required for EC2)
aws iam create-instance-profile --instance-profile-name EC2-S3-Reader
aws iam add-role-to-instance-profile \
--instance-profile-name EC2-S3-Reader \
--role-name EC2-S3-Reader
# ── CloudFormation: IAM Role for Lambda ──
# Resources:
#   LambdaExecutionRole:
#     Type: AWS::IAM::Role
#     Properties:
#       AssumeRolePolicyDocument:
#         Version: "2012-10-17"
#         Statement:
#           - Effect: Allow
#             Principal:
#               Service: lambda.amazonaws.com
#             Action: sts:AssumeRole
#       ManagedPolicyArns:
#         - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
#       Policies:
#         - PolicyName: DynamoDBAccess
#           PolicyDocument:
#             Version: "2012-10-17"
#             Statement:
#               - Effect: Allow
#                 Action:
#                   - dynamodb:GetItem
#                   - dynamodb:PutItem
#                   - dynamodb:Query
#                 Resource: !GetAtt MyTable.Arn
A developer stored AWS access keys in their application code and pushed it to a public GitHub repo. Within 20 minutes, crypto miners had spun up 50 expensive GPU instances. The fix: rotated all credentials, enabled MFA, switched to IAM Roles (no access keys needed for EC2/Lambda), and enabled AWS CloudTrail to audit all API calls. They also set up billing alarms and AWS Organizations SCPs to restrict instance types.
What is the difference between IAM Roles and IAM Identity Center (SSO)? When do you use each?
A VPC (Virtual Private Cloud) is your isolated virtual network in AWS. You control the IP range, subnets, routing, and security.
Key components:
- CIDR Block — the IP address range of your VPC (e.g., 10.0.0.0/16 = 65,536 IPs). Cannot be changed after creation (but you can add secondary CIDRs).
- Subnets — subdivisions of the VPC CIDR, each in a single AZ. Two types:
- Public subnet — has a route to an Internet Gateway. Resources get public IPs.
- Private subnet — no direct internet access. Resources communicate via NAT Gateway or VPC endpoints.
- Route Tables — rules that determine where traffic goes. Each subnet is associated with one route table.
- Public subnet route: 0.0.0.0/0 → igw-xxx (Internet Gateway).
- Private subnet route: 0.0.0.0/0 → nat-xxx (NAT Gateway).
- Internet Gateway (IGW) — allows resources in public subnets to reach the internet (and be reached from the internet). One per VPC.
- NAT Gateway — allows resources in private subnets to reach the internet (for updates, API calls) but prevents inbound connections. Deployed in a public subnet. Costs: hourly + per-GB data processed.
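The CIDR arithmetic behind VPC and subnet sizing can be checked with Python's standard `ipaddress` module. This is a small sketch: the /24 subnet size is an assumed layout, and the figure of five reserved addresses per subnet reflects AWS's documented behavior for IPv4 subnets.

```python
import ipaddress

# The VPC range used as the example in this section.
vpc = ipaddress.ip_network("10.0.0.0/16")
print("VPC addresses:", vpc.num_addresses)

# Assumed layout: carve the VPC into /24 subnets.
subnets = list(vpc.subnets(new_prefix=24))
print("Number of /24 subnets:", len(subnets))

# AWS reserves 5 addresses in every IPv4 subnet (network address, VPC router,
# DNS, one reserved for future use, and broadcast).
AWS_RESERVED_PER_SUBNET = 5
usable = subnets[1].num_addresses - AWS_RESERVED_PER_SUBNET
print(f"Usable addresses in {subnets[1]}:", usable)

# Check which subnet an address falls into.
address = ipaddress.ip_address("10.0.10.25")
owner = next(s for s in subnets if address in s)
print(f"{address} belongs to {owner}")
```

Leaving gaps between subnet ranges (for example 10.0.1.0/24 for public and 10.0.10.0/24 for private) keeps room for more subnets per tier later.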
# ── CloudFormation: Complete VPC setup ──
# Resources:
#   VPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#       EnableDnsSupport: true
#       EnableDnsHostnames: true
#       Tags: [{Key: Name, Value: MyVPC}]
#
#   InternetGateway:
#     Type: AWS::EC2::InternetGateway
#   AttachGateway:
#     Type: AWS::EC2::VPCGatewayAttachment
#     Properties:
#       VpcId: !Ref VPC
#       InternetGatewayId: !Ref InternetGateway
#
#   PublicSubnet1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref VPC
#       CidrBlock: 10.0.1.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#       MapPublicIpOnLaunch: true
#
#   PrivateSubnet1:
#     Type: AWS::EC2::Subnet
#     Properties:
#       VpcId: !Ref VPC
#       CidrBlock: 10.0.10.0/24
#       AvailabilityZone: !Select [0, !GetAZs ""]
#
#   NatGateway:
#     Type: AWS::EC2::NatGateway
#     Properties:
#       SubnetId: !Ref PublicSubnet1 # NAT lives in PUBLIC subnet
#       AllocationId: !GetAtt NatEIP.AllocationId
#   NatEIP:
#     Type: AWS::EC2::EIP
#     DependsOn: AttachGateway # IGW must be attached before allocating
#     Properties:
#       Domain: vpc
#
#   PublicRouteTable:
#     Type: AWS::EC2::RouteTable
#     Properties:
#       VpcId: !Ref VPC
#   PublicRoute:
#     Type: AWS::EC2::Route
#     DependsOn: AttachGateway
#     Properties:
#       RouteTableId: !Ref PublicRouteTable
#       DestinationCidrBlock: 0.0.0.0/0
#       GatewayId: !Ref InternetGateway # → Internet
#   PublicSubnet1Association: # without this the subnet uses the main route table
#     Type: AWS::EC2::SubnetRouteTableAssociation
#     Properties:
#       SubnetId: !Ref PublicSubnet1
#       RouteTableId: !Ref PublicRouteTable
#
#   PrivateRouteTable:
#     Type: AWS::EC2::RouteTable
#     Properties:
#       VpcId: !Ref VPC
#   PrivateRoute:
#     Type: AWS::EC2::Route
#     Properties:
#       RouteTableId: !Ref PrivateRouteTable
#       DestinationCidrBlock: 0.0.0.0/0
#       NatGatewayId: !Ref NatGateway # → NAT (outbound only)
#   PrivateSubnet1Association:
#     Type: AWS::EC2::SubnetRouteTableAssociation
#     Properties:
#       SubnetId: !Ref PrivateSubnet1
#       RouteTableId: !Ref PrivateRouteTable
# ── AWS CLI: Create VPC ──
aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
--tag-specifications "ResourceType=vpc,Tags=[{Key=Name,Value=MyVPC}]"
A company put their database EC2 instance in a public subnet with a public IP. A port scan found the open MySQL port and brute-forced the weak password. After the breach, they redesigned: databases moved to private subnets (no public IP, no IGW route). Application servers in private subnets accessed the internet via NAT Gateway for package updates. Only the ALB sat in public subnets. NAT Gateway cost ($0.045/hr + $0.045/GB) was trivial compared to the breach cost.
What is a VPC endpoint and how does it avoid NAT Gateway costs for AWS service access?
AWS provides two layers of network security that work together:
Security Groups (SGs) — instance-level firewall:
- Stateful — if you allow inbound traffic, the response is automatically allowed outbound (and vice versa).
- Attached to ENIs (network interfaces) on EC2, RDS, Lambda VPC, etc.
- Allow rules only — no explicit deny. Anything not allowed is implicitly denied.
- Can reference other Security Groups as source/destination (e.g., "allow traffic from the ALB SG").
- All rules evaluated together — if any rule allows, traffic passes.
- Default: all outbound allowed, all inbound denied.
NACLs (Network Access Control Lists) — subnet-level firewall:
- Stateless — you must explicitly allow both inbound AND outbound (including ephemeral ports for responses).
- Applied to subnets — affects all resources in the subnet.
- Allow AND deny rules — can explicitly block specific IPs.
- Rules evaluated in order by rule number — first match wins.
- Default NACL: allows all traffic. Custom NACL: denies all by default.
Evaluation order: Inbound traffic hits NACL first → then Security Group. Outbound: Security Group first → then NACL.
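The "first match wins" behavior of NACLs is easy to get wrong, so here is a minimal Python sketch of ordered rule evaluation. The rule format and the `evaluate_nacl` helper are hypothetical simplifications (TCP only, inbound only), and the IP ranges are documentation example addresses.

```python
import ipaddress


def evaluate_nacl(rules, source_ip, port):
    """Return 'allow' or 'deny' using lowest-rule-number-first matching."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if address in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]  # first matching rule decides; later rules are ignored
    return "deny"  # the implicit final "*" rule denies anything unmatched


# Assumed inbound rules: block one range, then allow HTTPS from everywhere.
INBOUND_RULES = [
    {"number": 100, "cidr": "0.0.0.0/0", "ports": (443, 443), "action": "allow"},
    {"number": 50, "cidr": "203.0.113.0/24", "ports": (0, 65535), "action": "deny"},
]

if __name__ == "__main__":
    print(evaluate_nacl(INBOUND_RULES, "203.0.113.7", 443))
    print(evaluate_nacl(INBOUND_RULES, "198.51.100.4", 443))
    print(evaluate_nacl(INBOUND_RULES, "198.51.100.4", 22))
```

If the deny rule were numbered 150 instead of 50, rule 100 would match first and the blocked range would get through, which is why deny rules need lower numbers than the allows they override. A Security Group has no such ordering because it contains only allow rules.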
# ── Security Group: Web server ──
aws ec2 create-security-group \
--group-name WebSG --description "Web server SG" \
--vpc-id vpc-abc123
# Allow HTTPS from anywhere
aws ec2 authorize-security-group-ingress \
--group-id sg-abc123 \
--protocol tcp --port 443 --cidr 0.0.0.0/0
# Allow app traffic from ALB Security Group (SG reference!)
aws ec2 authorize-security-group-ingress \
--group-id sg-abc123 \
--protocol tcp --port 8080 \
--source-group sg-alb456 # ← Reference another SG
# ── NACL: Block a specific IP range ──
aws ec2 create-network-acl-entry \
--network-acl-id acl-abc123 \
--rule-number 50 --protocol tcp \
--port-range From=0,To=65535 \
--cidr-block 203.0.113.0/24 \
--rule-action deny --ingress
# Allow HTTPS inbound (rule 100 — evaluated after rule 50)
aws ec2 create-network-acl-entry \
--network-acl-id acl-abc123 \
--rule-number 100 --protocol tcp \
--port-range From=443,To=443 \
--cidr-block 0.0.0.0/0 \
--rule-action allow --ingress
# Allow ephemeral ports outbound (NACL is STATELESS!)
aws ec2 create-network-acl-entry \
--network-acl-id acl-abc123 \
--rule-number 100 --protocol tcp \
--port-range From=1024,To=65535 \
--cidr-block 0.0.0.0/0 \
--rule-action allow --egress
# ── Key comparison ──
# Feature | Security Group | NACL
# Level | Instance (ENI) | Subnet
# Stateful? | Yes | No
# Rules | Allow only | Allow + Deny
# Evaluation | All rules together | Ordered by rule #
# Default inbound | Deny all | Allow all (default NACL)
# SG references | Yes | No (CIDR only)
A web app was under a DDoS attack from a specific IP range (203.0.113.0/24). Security Groups couldn't help because they don't have deny rules — they could only allow legitimate traffic. The team added a NACL deny rule (rule number 50, lower than the allow rules) to block the attacking IP range at the subnet level. Traffic from those IPs was dropped before reaching any EC2 instance. For ongoing protection, they also enabled AWS Shield and WAF.
Can you use Security Group references across VPCs? What about across accounts with VPC Peering?
EBS (Elastic Block Store) provides persistent block storage volumes for EC2 instances. Different volume types optimize for different workloads:
SSD-backed (random I/O):
- gp3 (General Purpose SSD) — baseline 3,000 IOPS + 125 MB/s throughput, independently scalable to 16,000 IOPS and 1,000 MB/s. Best default choice. 20% cheaper than gp2.
- gp2 (previous gen) — IOPS scales with volume size (3 IOPS/GB, burst to 3,000). Being replaced by gp3.
- io2 Block Express — provisioned IOPS up to 256,000 IOPS. Sub-millisecond latency. For critical databases (Oracle, SAP HANA). 99.999% durability (vs 99.8-99.9% for others).
HDD-backed (sequential I/O):
- st1 (Throughput Optimized HDD) — max 500 MB/s throughput. For big data, data warehouses, log processing. Cannot be a boot volume.
- sc1 (Cold HDD) — cheapest EBS. Max 250 MB/s. For infrequently accessed data. Cannot be a boot volume.
Key concepts:
- IOPS = Input/Output Operations Per Second — measures random read/write speed (databases).
- Throughput = MB/s — measures sequential read/write speed (big data, streaming).
- gp3 advantage: you can provision IOPS and throughput independently of volume size (unlike gp2).
- Snapshots: point-in-time backups stored in S3. Incremental — only changed blocks are saved.
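The difference between gp2's size-coupled IOPS and gp3's flat baseline can be shown with a short calculation. This sketch assumes the commonly documented gp2 formula (3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000) and the gp3 baseline of 3,000 IOPS quoted above. The function name is illustrative.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, clamped to the 100-16,000 range."""
    return max(100, min(16000, 3 * size_gib))


if __name__ == "__main__":
    for size in (20, 100, 500, 1000, 6000):
        gp2 = gp2_baseline_iops(size)
        verdict = "gp3 baseline is higher" if gp2 < GP3_BASELINE_IOPS else "gp2 baseline is equal or higher"
        print(f"{size:>5} GiB  gp2={gp2:>6} IOPS  gp3={GP3_BASELINE_IOPS} IOPS  ({verdict})")
```

Below 1,000 GiB a gp2 volume's baseline sits under 3,000 IOPS and depends on burst credits to go higher, which is the usual reason small database volumes benefit from moving to gp3.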
# ── Create a gp3 volume with custom IOPS ──
aws ec2 create-volume \
--volume-type gp3 \
--size 500 \
--iops 10000 \
--throughput 500 \
--availability-zone us-east-1a \
--tag-specifications "ResourceType=volume,Tags=[{Key=Name,Value=AppData}]"
# ── Modify existing volume (no downtime!) ──
aws ec2 modify-volume \
--volume-id vol-abc123 \
--volume-type gp3 \
--iops 10000 \
--throughput 500 \
--size 1000
# ── CloudFormation: EBS volume for database ──
# Resources:
#   DatabaseVolume:
#     Type: AWS::EC2::Volume
#     Properties:
#       VolumeType: io2
#       Iops: 50000
#       Size: 500 # 500 GB
#       AvailabilityZone: !GetAtt DBInstance.AvailabilityZone
#       Encrypted: true
#       KmsKeyId: !Ref MyKMSKey
#       Tags:
#         - Key: Name
#           Value: DatabaseVolume
# ── Create a snapshot (backup) ──
aws ec2 create-snapshot \
--volume-id vol-abc123 \
--description "Pre-upgrade backup" \
--tag-specifications "ResourceType=snapshot,Tags=[{Key=Name,Value=PreUpgrade}]"
# ── Volume type comparison ──
# Type | Max IOPS | Max Throughput | Use Case
# gp3 | 16,000 | 1,000 MB/s | Default choice, boot volumes
# io2 | 256,000 | 4,000 MB/s | Critical databases
# st1 | 500 | 500 MB/s | Big data, logs (sequential)
# sc1 | 250 | 250 MB/s | Cold storage (cheapest)
# ── Monitor IOPS usage ──
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS \
--metric-name VolumeReadOps \
--dimensions Name=VolumeId,Value=vol-abc123 \
--start-time 2026-05-29T00:00:00Z \
--end-time 2026-05-30T00:00:00Z \
--period 300 --statistics Sum
A PostgreSQL database on gp2 (100GB = 300 baseline IOPS) suffered from burst credit exhaustion during batch processing. Response times spiked when credits ran out. Migrating to gp3 with 10,000 provisioned IOPS (independent of size) solved the problem and was actually cheaper — gp3 base price is 20% lower than gp2, and they only paid for the IOPS they needed. The volume modification was done live with no downtime using modify-volume.
What are EBS Multi-Attach and io2 Block Express? When would you use them?
Route 53 is AWS's managed DNS service. It provides domain registration, DNS routing, and health checking.
Hosted Zones:
- Public Hosted Zone — resolves domain names from the internet (e.g., www.example.com → ALB IP).
- Private Hosted Zone — resolves names only within your VPC (e.g., db.internal → RDS private IP).
Routing Policies:
- Simple — one record, one or more values. No health checks. Good for single resources.
- Weighted — distribute traffic by percentage (e.g., 90% to v1, 10% to v2). Great for canary deployments.
- Latency-based — routes to the Region with lowest latency for the user. Best for multi-region apps.
- Failover — primary/secondary setup. If primary health check fails → routes to secondary. Active-passive DR.
- Geolocation — routes based on user's geographic location (continent, country). For compliance, localization.
- Geoproximity — routes based on geographic distance with bias to shift traffic between regions.
- Multivalue Answer — returns multiple healthy IPs (up to 8). Client-side load balancing with health checks.
Health Checks: Route 53 monitors endpoint health (HTTP/HTTPS/TCP). Unhealthy records are removed from DNS responses. Can trigger CloudWatch alarms.
Alias Records: Route 53-specific feature — point to AWS resources (ALB, CloudFront, S3) without a CNAME. Free of charge. Works at the zone apex (example.com, not just www.example.com).
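The Weighted policy combined with health checks can be pictured as a weighted random choice over the healthy records. The sketch below is a simplified model rather than Route 53's actual implementation: `pick_weighted` and the record format are hypothetical, and the all-unhealthy edge case is simply reported as `None` here, whereas the real service has its own documented fallback behavior.

```python
import random


def pick_weighted(records, rng):
    """Pick one record name in proportion to its weight, skipping unhealthy records."""
    healthy = [r for r in records if r["healthy"] and r["weight"] > 0]
    if not healthy:
        return None  # simplification; check the Route 53 docs for the real fallback
    names = [r["name"] for r in healthy]
    weights = [r["weight"] for r in healthy]
    return rng.choices(names, weights=weights, k=1)[0]


# Assumed canary setup: 90% of answers to v1, 10% to v2.
CANARY_RECORDS = [
    {"name": "v1", "weight": 90, "healthy": True},
    {"name": "v2", "weight": 10, "healthy": True},
]

if __name__ == "__main__":
    rng = random.Random(7)  # seeded only to make the demo repeatable
    picks = [pick_weighted(CANARY_RECORDS, rng) for _ in range(10_000)]
    print("share of v1:", picks.count("v1") / len(picks))
```

Because resolvers cache answers for the record's TTL, the split seen by real traffic only approximates the configured weights, so short TTLs are common during canary rollouts.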
# ── Create a Hosted Zone ──
aws route53 create-hosted-zone \
--name example.com \
--caller-reference "2026-05-30"
# ── Create an Alias record pointing to ALB ──
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z1234567890 \
#   --change-batch '{
#     "Changes": [{
#       "Action": "CREATE",
#       "ResourceRecordSet": {
#         "Name": "www.example.com",
#         "Type": "A",
#         "AliasTarget": {
#           "HostedZoneId": "Z35SXDOTRQ7X7K",
#           "DNSName": "my-alb-1234567890.us-east-1.elb.amazonaws.com",
#           "EvaluateTargetHealth": true
#         }
#       }
#     }]
#   }'
# ── Health Check ──
aws route53 create-health-check --caller-reference "web-hc-2026" \
--health-check-config \
Type=HTTPS,FullyQualifiedDomainName=www.example.com,\
Port=443,ResourcePath=/health,RequestInterval=30,FailureThreshold=3
# ── CloudFormation: Failover routing ──
# Resources:
#   PrimaryRecord:
#     Type: AWS::Route53::RecordSet
#     Properties:
#       HostedZoneId: !Ref MyHostedZone
#       Name: api.example.com
#       Type: A
#       AliasTarget:
#         HostedZoneId: !GetAtt PrimaryALB.CanonicalHostedZoneID
#         DNSName: !GetAtt PrimaryALB.DNSName
#         EvaluateTargetHealth: true
#       Failover: PRIMARY
#       SetIdentifier: primary
#       HealthCheckId: !Ref PrimaryHealthCheck
#
#   SecondaryRecord:
#     Type: AWS::Route53::RecordSet
#     Properties:
#       HostedZoneId: !Ref MyHostedZone
#       Name: api.example.com
#       Type: A
#       AliasTarget:
#         HostedZoneId: !GetAtt SecondaryALB.CanonicalHostedZoneID
#         DNSName: !GetAtt SecondaryALB.DNSName
#       Failover: SECONDARY
#       SetIdentifier: secondary
# ── Routing policy comparison ──
# Policy | Use Case | Health Check?
# Simple | Single resource | No
# Weighted | Canary, A/B testing | Yes
# Latency | Multi-region, lowest ping | Yes
# Failover | Active-passive DR | Yes (primary)
# Geolocation | Compliance, localization | Yes
# Multivalue | Client-side load balancing | Yes
A global SaaS app deployed in us-east-1 and eu-west-1. Initially, all users hit us-east-1 — European users experienced 200ms latency. After switching to Route 53 latency-based routing with health checks, European users were automatically routed to eu-west-1 (30ms latency). When eu-west-1 had an outage, health checks detected it within 30 seconds, and Route 53 automatically routed all traffic to us-east-1 — no manual intervention needed.
What is the difference between a CNAME record and an Alias record? Why can't you use CNAME at the zone apex?
AWS offers three types of Elastic Load Balancers (ELB):
Application Load Balancer (ALB) — Layer 7 (HTTP/HTTPS):
- Routes based on URL path (/api/* → backend, /images/* → static), hostname (api.example.com vs www.example.com), HTTP headers, and query strings.
- Supports WebSocket, HTTP/2, gRPC.
- Target types: EC2 instances, IP addresses, Lambda functions, containers (ECS/EKS).
- Built-in features: sticky sessions, authentication (Cognito/OIDC), request/response modification, WAF integration.
- Best for: web applications, microservices, container-based architectures.
Network Load Balancer (NLB) — Layer 4 (TCP/UDP/TLS):
- Routes based on IP + port only. Does not inspect HTTP content.
- Ultra-low latency (on the order of 100 microseconds, versus milliseconds for ALB).
- Handles millions of requests/sec with static IP addresses.
- Supports TLS termination, TCP pass-through, and UDP.
- Preserves the client source IP; an ALB instead forwards it in the X-Forwarded-For header.
- Best for: TCP/UDP services, gaming, IoT, extreme performance, static IPs.
Classic Load Balancer (CLB) — Layer 4 + basic Layer 7 (legacy):
- Do not use for new projects. AWS recommends migrating to ALB or NLB.
- Limited feature set. No path-based routing, no host-based routing.
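ALB path-based routing boils down to checking listener rules in priority order and falling back to the default action. The Python sketch below models only that: the `route` helper, rule format, and target group names are hypothetical. It uses `fnmatchcase`, whose `*` and `?` wildcards resemble ALB path patterns, although `fnmatch` also accepts `[...]` sets that ALB does not.

```python
from fnmatch import fnmatchcase


def route(rules, default_target, path):
    """Return the target group for a request path; lowest priority number is checked first."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if fnmatchcase(path, rule["path_pattern"]):
            return rule["target"]
    return default_target  # the listener's default action


# Assumed rules for illustration.
LISTENER_RULES = [
    {"priority": 20, "path_pattern": "/images/*", "target": "StaticTG"},
    {"priority": 10, "path_pattern": "/api/*", "target": "APITG"},
]

if __name__ == "__main__":
    for path in ("/api/users", "/images/logo.png", "/index.html", "/api"):
        print(path, "->", route(LISTENER_RULES, "WebTG", path))
```

Note that `/api` without a trailing segment does not match `/api/*` and falls through to the default target; covering both usually takes a second pattern.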
# ── CloudFormation: ALB with path-based routing ──
# Resources:
#   ALB:
#     Type: AWS::ElasticLoadBalancingV2::LoadBalancer
#     Properties:
#       Type: application
#       Scheme: internet-facing
#       Subnets: [!Ref PublicSubnet1, !Ref PublicSubnet2]
#       SecurityGroups: [!Ref ALBSG]
#
#   Listener:
#     Type: AWS::ElasticLoadBalancingV2::Listener
#     Properties:
#       LoadBalancerArn: !Ref ALB
#       Port: 443
#       Protocol: HTTPS
#       Certificates:
#         - CertificateArn: !Ref SSLCert
#       DefaultActions:
#         - Type: forward
#           TargetGroupArn: !Ref WebTG
#
#   APIRule:
#     Type: AWS::ElasticLoadBalancingV2::ListenerRule
#     Properties:
#       ListenerArn: !Ref Listener
#       Priority: 10
#       Conditions:
#         - Field: path-pattern
#           Values: ["/api/*"]
#       Actions:
#         - Type: forward
#           TargetGroupArn: !Ref APITG
#
#   WebTG:
#     Type: AWS::ElasticLoadBalancingV2::TargetGroup
#     Properties:
#       VpcId: !Ref VPC
#       Port: 80
#       Protocol: HTTP
#       TargetType: instance
#       HealthCheckPath: /health
#       HealthCheckIntervalSeconds: 15
#       HealthyThresholdCount: 2
#       UnhealthyThresholdCount: 3
# ── AWS CLI: Create NLB with static IP ──
aws elbv2 create-load-balancer \
--name my-nlb \
--type network \
--subnets subnet-aaa subnet-bbb
# NLB gets one static IP per AZ
# Useful for: IP allowlisting in firewall rules, clients that can't do DNS
# ── Comparison ──
# Feature | ALB | NLB | CLB
# Layer | 7 (HTTP/HTTPS) | 4 (TCP/UDP) | 4 + basic 7
# Latency | ~ms | ~μs | ~ms
# Path routing | ✅ | ❌ | ❌
# WebSocket | ✅ | ✅ (TCP) | ❌
# Static IP | ❌ (use GA) | ✅ | ❌
# Lambda target | ✅ | ❌ | ❌
# Client IP | X-Forwarded-For | Preserved | X-Forwarded-For
# WAF | ✅ | ❌ | ❌
# Cost | $$ | $$ | $
A microservices app needed path-based routing (/api → API service, /auth → auth service, / → frontend) with WAF protection. ALB handled this perfectly with listener rules. Later, they added a real-time gaming service that needed TCP connections with ultra-low latency and static IPs for firewall whitelisting. An NLB was added for that service. Both load balancers ran in parallel — ALB for HTTP traffic, NLB for TCP.
What is cross-zone load balancing? How does it differ between ALB and NLB?
EC2 Auto Scaling automatically adjusts the number of EC2 instances based on demand, ensuring availability and cost optimization.
Components:
- Launch Template — defines the instance configuration (AMI, instance type, key pair, security groups, user data). Replaces the older Launch Configuration. Supports versioning.
- Auto Scaling Group (ASG) — manages the fleet. Defines min, max, and desired capacity. Spans multiple AZs for HA.
- Scaling Policies — rules that trigger scaling actions.
Scaling Policy Types:
- Target Tracking — simplest. Set a target (e.g., "keep average CPU at 50%"). ASG adds/removes instances to maintain the target. Recommended for most cases.
- Step Scaling — define steps: if CPU > 70% add 2 instances, if CPU > 90% add 4. More control than target tracking.
- Scheduled — scale at specific times (e.g., scale up at 9 AM, down at 6 PM). For predictable traffic patterns.
- Predictive — uses ML to forecast traffic and pre-scale. Combines with target tracking.
Cooldown: After a simple scaling action, ASG waits (default 300 seconds) before acting again, which prevents rapid scaling oscillation. Target tracking and step scaling rely on the instance warmup time instead.
Health Checks: ASG uses EC2 status checks (default) or ELB health checks. Unhealthy instances are terminated and replaced.
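The intuition behind target tracking can be sketched as a proportional calculation: scale the fleet by the ratio of the observed metric to the target, then clamp to the group's limits. This is a simplified mental model, not the service's actual algorithm (which, among other things, scales in more conservatively than it scales out). `desired_capacity` is a hypothetical helper.

```python
import math


def desired_capacity(current, metric, target, min_size, max_size):
    """Estimate the instance count needed to bring the average metric back to target."""
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))


if __name__ == "__main__":
    # 3 instances at 80% average CPU, target 50%, group limits 2-10.
    print(desired_capacity(current=3, metric=80.0, target=50.0, min_size=2, max_size=10))
    # 4 instances at 25% average CPU: the estimate shrinks toward the minimum.
    print(desired_capacity(current=4, metric=25.0, target=50.0, min_size=2, max_size=10))
```

The clamp is why a sensible max size matters: if the ratio calls for more instances than the maximum allows, the group stays at the maximum and the metric stays above target.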
# ── Create Launch Template ──
aws ec2 create-launch-template \
--launch-template-name WebServerTemplate \
--version-description "v1" \
--launch-template-data '{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "m7g.large",
"KeyName": "my-key",
"SecurityGroupIds": ["sg-abc123"],
"UserData": "BASE64_ENCODED_USER_DATA"
}'
# User data script (base64-encode before passing):
# #!/bin/bash
# yum update -y
# yum install -y httpd
# systemctl start httpd
# ── Create Auto Scaling Group ──
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name WebASG \
--launch-template LaunchTemplateName=WebServerTemplate,Version=\$Latest \
--min-size 2 --max-size 10 --desired-capacity 3 \
--vpc-zone-identifier "subnet-aaa,subnet-bbb" \
--target-group-arns arn:aws:elasticloadbalancing:...:targetgroup/WebTG/... \
--health-check-type ELB \
--health-check-grace-period 300
# ── Target Tracking Policy (recommended) ──
aws autoscaling put-scaling-policy \
--auto-scaling-group-name WebASG \
--policy-name CPUTargetTracking \
--policy-type TargetTrackingScaling \
--target-tracking-configuration '{
"TargetValue": 50.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ASGAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
# ── Scheduled Scaling (predictable traffic) ──
aws autoscaling put-scheduled-update-group-action \
--auto-scaling-group-name WebASG \
--scheduled-action-name ScaleUpMorning \
--recurrence "0 9 * * MON-FRI" \
--desired-capacity 8
aws autoscaling put-scheduled-update-group-action \
--auto-scaling-group-name WebASG \
--scheduled-action-name ScaleDownEvening \
--recurrence "0 18 * * MON-FRI" \
--desired-capacity 3
# ── CloudFormation: ASG with Target Tracking ──
# Resources:
#   ASG:
#     Type: AWS::AutoScaling::AutoScalingGroup
#     Properties:
#       LaunchTemplate:
#         LaunchTemplateId: !Ref LaunchTemplate
#         Version: !GetAtt LaunchTemplate.LatestVersionNumber
#       MinSize: 2
#       MaxSize: 10
#       DesiredCapacity: 3
#       VPCZoneIdentifier: [!Ref SubnetA, !Ref SubnetB]
#       TargetGroupARNs: [!Ref WebTG]
#       HealthCheckType: ELB
#       HealthCheckGracePeriod: 300
An e-commerce site ran 4 EC2 instances 24/7 — overprovisioned for 80% of the day, underprovisioned during flash sales. After implementing Auto Scaling with target tracking (CPU target 50%) + scheduled scaling (pre-scale for known sales), the fleet ranged from 3 instances at night to 15 during sales. Monthly EC2 cost dropped 40% while performance improved during peaks.
What is a warm pool in Auto Scaling? How does it reduce scale-out latency?
RDS (Relational Database Service) provides managed databases (MySQL, PostgreSQL, SQL Server, Oracle, MariaDB). Two key high-availability features:
Multi-AZ Deployment:
- Creates a synchronous standby replica in another AZ.
- Purpose: high availability and failover — NOT for read scaling.
- Automatic failover in 60-120 seconds if primary fails (AZ outage, hardware failure, patching).
- Standby is not accessible for reads (standby only).
- Same endpoint — DNS automatically switches to standby on failover.
Read Replicas:
- Creates asynchronous copies for read scaling.
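- Up to 15 Aurora Replicas per Aurora cluster. For RDS the limit depends on the engine: up to 15 Read Replicas for MySQL, MariaDB, and PostgreSQL, and 5 for Oracle and SQL Server.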
- Each replica has its own endpoint — application must direct read traffic to replicas.
- Can be in the same Region, cross-Region, or cross-account.
- Can be promoted to standalone database (for migration or DR).
- Replication lag: typically seconds, but can increase under heavy write load.
Amazon Aurora:
- AWS-designed, cloud-native database (MySQL/PostgreSQL compatible).
- Storage: auto-scales up to 128 TB, replicated 6 ways across 3 AZs.
- AWS cites up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL; actual gains depend on the workload.
- Aurora Replicas share the same storage — near-zero replication lag.
- Aurora Serverless v2: auto-scales compute (ACUs) based on load. Pay per second.
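Because each Read Replica has its own endpoint and replication is asynchronous, the application has to decide where each query goes. A minimal sketch of that routing logic follows; the class name, the hostnames in the usage example, and the prefix-based read detection are illustrative assumptions, not part of any AWS SDK.

```python
import itertools


class EndpointRouter:
    """Send writes to the primary and spread reads across replica endpoints."""

    READ_PREFIXES = ("select", "show", "explain")

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; with no replicas everything uses the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql, needs_fresh_read=False):
        is_read = sql.lstrip().lower().startswith(self.READ_PREFIXES)
        # Replicas can lag behind the primary, so reads that must see the
        # latest write (read-your-own-writes) are sent to the primary.
        if is_read and not needs_fresh_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    # Hypothetical endpoint names for illustration only.
    router = EndpointRouter("my-db.example", ["replica-1.example", "replica-2.example"])
    print(router.endpoint_for("SELECT * FROM orders"))
    print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
    print(router.endpoint_for("SELECT * FROM orders WHERE id = 1", needs_fresh_read=True))
```

With Aurora, the reader endpoint does the load balancing for you, so the router would only need to choose between the writer and reader endpoints.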
# ── Create RDS Multi-AZ instance ──
aws rds create-db-instance \
--db-instance-identifier my-db \
--db-instance-class db.r7g.large \
--engine postgres \
--master-username dbadmin \
--master-user-password "****" \
--allocated-storage 100 \
--multi-az \
--storage-encrypted \
--vpc-security-group-ids sg-abc123 \
--db-subnet-group-name my-db-subnets
# ── Create Read Replica ──
aws rds create-db-instance-read-replica \
--db-instance-identifier my-db-replica \
--source-db-instance-identifier my-db \
--db-instance-class db.r7g.large \
--availability-zone us-east-1b
# ── Create Aurora Cluster ──
aws rds create-db-cluster \
--db-cluster-identifier my-aurora \
--engine aurora-postgresql \
--engine-version 15.4 \
--master-username dbadmin \
--master-user-password "****" \
--vpc-security-group-ids sg-abc123 \
--db-subnet-group-name my-db-subnets \
--storage-encrypted
# Add Aurora instances (writer + reader)
aws rds create-db-instance \
--db-instance-identifier my-aurora-writer \
--db-cluster-identifier my-aurora \
--db-instance-class db.r7g.large \
--engine aurora-postgresql
aws rds create-db-instance \
--db-instance-identifier my-aurora-reader \
--db-cluster-identifier my-aurora \
--db-instance-class db.r7g.large \
--engine aurora-postgresql
# ── Aurora endpoints ──
# Writer endpoint: my-aurora.cluster-xxxx.us-east-1.rds.amazonaws.com
# Reader endpoint: my-aurora.cluster-ro-xxxx.us-east-1.rds.amazonaws.com
# Reader endpoint auto-load-balances across all Aurora Replicas
# ── Comparison ──
# Feature      | Multi-AZ            | Read Replica       | Aurora
# Purpose      | High availability   | Read scaling       | Both
# Replication  | Synchronous         | Asynchronous       | Shared storage
# Failover     | Automatic (60-120s) | Manual promotion   | Automatic (~30s)
# Read traffic | No (standby)        | Yes (own endpoint) | Yes (reader EP)
# Cross-Region | No                  | Yes                | Yes (Global DB)
A SaaS app hit a database bottleneck: the primary PostgreSQL RDS instance was at 90% CPU, with read-heavy analytics queries competing with transactional writes. They created 2 Read Replicas and directed analytics queries to the replica endpoints (RDS replicas each have their own endpoint; only Aurora provides a shared reader endpoint). Primary CPU dropped to 35%. They also enabled Multi-AZ for the primary to survive AZ failures. Six months later, they migrated to Aurora PostgreSQL and got auto-scaling storage, 30-second failover (vs 120 seconds), and near-zero replication lag.
What is Aurora Global Database? How does it provide cross-region disaster recovery?
AWS Lambda runs your code without provisioning servers. You pay only for compute time consumed.
Execution model:
- Event triggers Lambda (API Gateway, S3, SQS, CloudWatch, etc.).
- Lambda creates an execution environment (container) with your code + runtime.
- Your handler function runs and returns a response.
- The environment is then frozen and kept for potential reuse. AWS does not guarantee how long; idle environments are typically reclaimed after some minutes.
- Next invocation may reuse the warm environment (warm start) or create a new one (cold start).
Cold Start:
- Time to create a new execution environment: download code, start runtime, run initialization.
- Adds 100ms-2s+ latency depending on runtime (Python/Node fastest, Java/C# slowest) and package size.
- Mitigations: Provisioned Concurrency (pre-warm environments), SnapStart (Java snapshot), smaller packages, keep functions warm.
Concurrency:
- Each concurrent invocation uses one execution environment.
- Account limit: 1,000 concurrent executions per Region (default, can be increased).
- Reserved Concurrency: guarantees capacity for a function (but limits it too).
- Provisioned Concurrency: pre-creates warm environments — no cold starts. Costs money even when idle.
Layers: Shared code/libraries packaged separately. Up to 5 layers per function. Useful for common dependencies (numpy, SDK, custom utils).
Limits: 15 minutes max timeout, 10 GB memory, 250 MB deployment package (unzipped), 512 MB /tmp storage (configurable to 10 GB).
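A quick way to reason about the concurrency limit is the steady-state relationship: concurrent executions are roughly the request rate multiplied by the average duration. The sketch below applies that arithmetic; the function names are hypothetical, and the 1,000 default is the account limit quoted above.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Steady-state concurrent executions = arrival rate x average duration."""
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round(requests_per_second * avg_duration_seconds, 6))


def fits_limit(requests_per_second, avg_duration_seconds, limit=1000):
    """True if the estimated concurrency stays within the concurrency limit."""
    return required_concurrency(requests_per_second, avg_duration_seconds) <= limit


if __name__ == "__main__":
    # 200 requests/sec, each taking 0.5 s, keeps about 100 environments busy.
    print(required_concurrency(200, 0.5))
    # 3,000 requests/sec at 0.5 s needs about 1,500, above the default limit.
    print(fits_limit(3000, 0.5))
```

The same number is a reasonable starting point when sizing Provisioned Concurrency, although bursts above the average still cause cold starts.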
# ── Create a Lambda function ──
aws lambda create-function \
--function-name ProcessOrder \
--runtime python3.12 \
--handler app.handler \
--role arn:aws:iam::123456789012:role/LambdaExecRole \
--zip-file fileb://function.zip \
--timeout 30 \
--memory-size 512 \
--environment Variables="{DB_HOST=mydb.cluster-xxx.rds.amazonaws.com}"
# ── Python Lambda handler ──
# import json
# from decimal import Decimal
# import boto3
#
# # Initialization code runs ONCE per cold start (reused on warm starts)
# dynamodb = boto3.resource("dynamodb")
# table = dynamodb.Table("Orders")
#
# def handler(event, context):
#     """Triggered by API Gateway POST /orders"""
#     # parse_float=Decimal: the boto3 resource API rejects Python floats
#     body = json.loads(event["body"], parse_float=Decimal)
#     table.put_item(Item={
#         "orderId": body["id"],
#         "amount": body["amount"],
#         "status": "pending"
#     })
#     return {
#         "statusCode": 201,
#         "body": json.dumps({"message": "Order created"})
#     }
# ── Set Provisioned Concurrency (no cold starts) ──
aws lambda put-provisioned-concurrency-config \
--function-name ProcessOrder \
--qualifier prod \
--provisioned-concurrent-executions 50
# ── Set Reserved Concurrency (limit + guarantee) ──
aws lambda put-function-concurrency \
--function-name ProcessOrder \
--reserved-concurrent-executions 100
# ── Create a Layer (shared dependencies) ──
# zip -r layer.zip python/ # python/lib/python3.12/site-packages/...
aws lambda publish-layer-version \
--layer-name common-utils \
--zip-file fileb://layer.zip \
--compatible-runtimes python3.12
# Attach layer to function
aws lambda update-function-configuration \
--function-name ProcessOrder \
--layers arn:aws:lambda:us-east-1:123456789012:layer:common-utils:1
# ── CloudFormation: Lambda + API Gateway ──
# Resources:
#   ProcessOrderFn:
#     Type: AWS::Lambda::Function
#     Properties:
#       FunctionName: ProcessOrder
#       Runtime: python3.12
#       Handler: app.handler
#       Code:
#         S3Bucket: my-deployment-bucket
#         S3Key: function.zip
#       MemorySize: 512
#       Timeout: 30
#       Role: !GetAtt LambdaRole.Arn
#       Environment:
#         Variables:
#           TABLE_NAME: !Ref OrdersTable
A payment processing Lambda had 2-second cold starts on Java runtime with a 50MB package. P99 latency was 3 seconds — unacceptable for checkout. The team applied three fixes: (1) switched to Python for the API handler (cold start dropped to 200ms), (2) moved heavy shared libraries to a Layer (reduced package size), (3) enabled Provisioned Concurrency with 50 instances during business hours. P99 dropped to 80ms.
What is Lambda SnapStart and how does it differ from Provisioned Concurrency?
S3 security operates at multiple layers:
Block Public Access (account or bucket level):
- Four settings that override any policy or ACL that would make a bucket/object public.
- Always enable all four at the account level unless you specifically need public access (like a static website).
- This is the #1 S3 security setting — prevents accidental public exposure.
Bucket Policies (resource-based):
- JSON policies attached to the bucket. Control who can access the bucket and its objects.
- Can grant cross-account access, require encryption, restrict by IP, enforce HTTPS.
IAM Policies (identity-based):
- Attached to IAM users/roles. Define what S3 actions they can perform.
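- Within the same account, access is granted if either the IAM policy or the bucket policy allows it and nothing explicitly denies it. For cross-account access, both the caller's IAM policy and the bucket policy must allow it. An explicit Deny always wins.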
Encryption:
- SSE-S3 — AWS manages keys. Simplest. Default for new buckets.
- SSE-KMS — AWS KMS manages keys. Audit trail in CloudTrail. Key rotation. Cross-account control.
- SSE-C — you provide the key with each request. You manage key storage.
- Client-side — encrypt before uploading. Maximum control.
Presigned URLs: temporary URLs that grant time-limited access to private objects. Generated by the server, shared with clients. No AWS credentials needed by the client.
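How the layers above combine can be sketched as a small decision function. This is a deliberately simplified model, assuming only an IAM policy and a bucket policy are in play; it ignores SCPs, permission boundaries, session policies, ACLs, Block Public Access, and KMS key policies, and the function name is hypothetical.

```python
def s3_access_allowed(iam_allows, bucket_policy_allows, explicit_deny, same_account):
    """Simplified S3 authorization decision for a single request."""
    # An explicit Deny in any applicable policy always wins.
    if explicit_deny:
        return False
    if same_account:
        # Same account: an Allow from either policy is enough.
        return iam_allows or bucket_policy_allows
    # Cross-account: the caller's IAM policy AND the bucket policy must allow.
    return iam_allows and bucket_policy_allows


if __name__ == "__main__":
    # Same-account role with an IAM Allow and no bucket policy statement.
    print(s3_access_allowed(True, False, explicit_deny=False, same_account=True))
    # Cross-account role with an IAM Allow but no bucket policy grant.
    print(s3_access_allowed(True, False, explicit_deny=False, same_account=False))
```

The cross-account case is the one that most often surprises people: granting permissions on only one side is not enough.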
# ── Enable Block Public Access (account level) ──
aws s3control put-public-access-block \
--account-id 123456789012 \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,\
BlockPublicPolicy=true,RestrictPublicBuckets=true
# ── Bucket Policy: Enforce HTTPS + encryption ──
# {
# "Version": "2012-10-17",
# "Statement": [
# {
# "Sid": "DenyHTTP",
# "Effect": "Deny",
# "Principal": "*",
# "Action": "s3:*",
# "Resource": [
# "arn:aws:s3:::my-bucket",
# "arn:aws:s3:::my-bucket/*"
# ],
# "Condition": {
# "Bool": { "aws:SecureTransport": "false" }
# }
# },
# {
# "Sid": "DenyUnencrypted",
# "Effect": "Deny",
# "Principal": "*",
# "Action": "s3:PutObject",
# "Resource": "arn:aws:s3:::my-bucket/*",
# "Condition": {
# "StringNotEquals": {
# "s3:x-amz-server-side-encryption": "aws:kms"
# }
# }
# }
# ]
# }
# ── Enable default encryption ──
aws s3api put-bucket-encryption \
--bucket my-bucket \
--server-side-encryption-configuration '{
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": "arn:aws:kms:us-east-1:123:key/abc-123"
},
"BucketKeyEnabled": true
}]
}'
# ── Generate Presigned URL (temporary access) ──
aws s3 presign s3://my-bucket/reports/q4-2025.pdf \
--expires-in 3600 # 1 hour
# Python boto3:
# s3 = boto3.client("s3")
# url = s3.generate_presigned_url(
# "get_object",
# Params={"Bucket": "my-bucket", "Key": "reports/q4.pdf"},
# ExpiresIn=3600 # seconds
# )
# print(url) # Share this URL — no AWS credentials needed
# ── Enable versioning (protection against accidental deletes) ──
aws s3api put-bucket-versioning \
--bucket my-bucket \
--versioning-configuration Status=Enabled
# ── Enable access logging ──
aws s3api put-bucket-logging --bucket my-bucket \
--bucket-logging-status '{
"LoggingEnabled": {
"TargetBucket": "my-logs-bucket",
"TargetPrefix": "s3-access-logs/"
}
}'
A company's S3 bucket containing customer PII was found publicly accessible — a developer had set a bucket policy with Principal: * to test and forgot to remove it. Block Public Access was not enabled. After the incident: (1) Block Public Access enabled at the account level for all buckets, (2) SCPs in AWS Organizations prevented any user from disabling it, (3) SSE-KMS encryption enforced via bucket policy, (4) S3 Access Analyzer enabled to detect any future public or cross-account access.
What is S3 Access Analyzer and how does it detect unintended public or cross-account access?
CloudWatch is AWS's monitoring and observability service with four key components:
Metrics:
- Time-series data points from AWS services (CPU, network, disk) and your applications (custom metrics).
- Standard resolution: 1-minute granularity. EC2 basic monitoring (free) publishes metrics every 5 minutes; detailed monitoring (paid) publishes every minute.
- High resolution: up to 1-second intervals (custom metrics, extra cost).
- Namespaces: AWS/EC2, AWS/RDS, AWS/Lambda, or custom (e.g., MyApp/Orders).
- Dimensions: key-value pairs to filter metrics (InstanceId, LoadBalancer, FunctionName).
Alarms:
- Watch a metric and trigger actions when thresholds are crossed.
- States: OK, ALARM, or INSUFFICIENT_DATA (not enough data to evaluate). An alarm can move between any of these.
- Actions: SNS notification, Auto Scaling policy, EC2 action (stop/terminate/reboot), Lambda invocation.
- Composite Alarms: combine multiple alarms with AND/OR logic to reduce alarm noise.
Logs:
- Log Groups: container for log streams (e.g., /aws/lambda/ProcessOrder).
- Log Streams: individual sources (each Lambda container, each EC2 instance).
- Agents: CloudWatch Agent (EC2), automatic (Lambda, ECS).
- Retention: configurable 1 day to 10 years (or never expire). Default: never expire (costly!).
Logs Insights: a purpose-built, pipe-based query language (fields, filter, stats, sort) for searching and analyzing logs. Much faster than manual searching.
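Alarm evaluation is easy to misjudge: the time before an alarm fires is the period multiplied by the number of evaluation periods. The sketch below models a basic alarm under simplifying assumptions: every evaluated period must breach the threshold, and any missing datapoint yields INSUFFICIENT_DATA. Real alarms also support "M out of N" datapoints and configurable missing-data treatment. The function names are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a GreaterThanThreshold alarm.

    datapoints: one value per period, most recent last; None = missing data.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"


def seconds_to_alarm(period_seconds, evaluation_periods):
    """Minimum sustained breach time before the alarm can fire."""
    return period_seconds * evaluation_periods


if __name__ == "__main__":
    # Average CPU per 5-minute period, threshold 80, 2 evaluation periods.
    print(alarm_state([50, 85, 90], threshold=80, evaluation_periods=2))
    # Period 300 s x 2 evaluation periods = 600 s (10 minutes).
    print(seconds_to_alarm(300, 2))
```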
# ── Put custom metric ──
aws cloudwatch put-metric-data \
--namespace "MyApp/Orders" \
--metric-name OrderCount \
--value 1 \
--unit Count \
--dimensions Environment=prod,Service=checkout
# ── Python boto3: Custom metric ──
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_data(
# Namespace="MyApp/Orders",
# MetricData=[{
# "MetricName": "OrderProcessingTime",
# "Value": 245.5,
# "Unit": "Milliseconds",
# "Dimensions": [
# {"Name": "Service", "Value": "checkout"},
# {"Name": "Environment", "Value": "prod"}
# ]
# }]
# )
# ── Create alarm: CPU > 80% for 10 minutes (2 periods × 300 seconds) ──
aws cloudwatch put-metric-alarm \
--alarm-name HighCPU \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-abc123 \
--alarm-actions arn:aws:sns:us-east-1:123:ops-alerts
# ── CloudWatch Logs Insights: Find errors ──
# fields @timestamp, @message
# | filter @message like /ERROR|Exception/
# | sort @timestamp desc
# | limit 50
# ── Logs Insights: P99 latency by API path ──
# (Logs Insights uses pct() for percentiles)
# fields @timestamp, path, latency
# | stats pct(latency, 99) as p99,
#         pct(latency, 50) as p50,
#         count() as requests
#   by path
# | sort p99 desc
# ── Set log retention (default is "never expire"!) ──
aws logs put-retention-policy \
--log-group-name /aws/lambda/ProcessOrder \
--retention-in-days 30
# ── CloudFormation: Alarm + SNS ──
# Resources:
#   HighCPUAlarm:
#     Type: AWS::CloudWatch::Alarm
#     Properties:
#       AlarmName: HighCPU
#       MetricName: CPUUtilization
#       Namespace: AWS/EC2
#       Statistic: Average
#       Period: 300
#       EvaluationPeriods: 2
#       Threshold: 80
#       ComparisonOperator: GreaterThanThreshold
#       AlarmActions:
#         - !Ref OpsAlertsTopic
#       Dimensions:
#         - Name: AutoScalingGroupName
#           Value: !Ref ASG
A team had no log retention policy — Lambda log groups grew to 2TB over 18 months, costing $1,200/month in storage. Most logs were older than 30 days and never looked at. Setting retention to 30 days across all log groups reduced storage costs by 95%. They also created a Logs Insights dashboard for real-time error monitoring, replacing the old manual grep-through-console approach.
What is CloudWatch Contributor Insights and how does it help identify top contributors to operational issues?
AWS provides three core messaging/event services for decoupling architectures:
SQS (Simple Queue Service) — message queue:
- Pull-based: consumers poll the queue for messages.
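- Point-to-point: each message is meant to be processed by a single consumer. A visibility timeout hides it from other consumers while it is being processed.
- Standard Queue: at-least-once delivery (duplicates are possible, so consumers should be idempotent), best-effort ordering, nearly unlimited throughput.
- FIFO Queue: exactly-once processing through deduplication, strict ordering within a message group (300 API calls/sec per action, or 3,000 msg/sec with batching; high-throughput mode raises this).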
- Retention: 1 minute to 14 days (default 4 days).
- Dead Letter Queue (DLQ): failed messages go here after X retries.
- Best for: decoupling, work queues, buffering, rate limiting.
SNS (Simple Notification Service) — pub/sub:
- Push-based: SNS pushes messages to all subscribers.
- Fan-out: one message → multiple subscribers (SQS, Lambda, HTTP, email, SMS).
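- No message retention: SNS retries delivery according to its retry policy, but once retries are exhausted the message is lost unless the subscription has a dead-letter queue or the subscriber is a durable SQS queue.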
- Best for: fan-out pattern, notifications, alerts, event broadcasting.
EventBridge — event bus:
- Event-driven architecture: routes events based on content (rules with patterns).
- Integrates with 200+ AWS services and SaaS partners (Zendesk, Shopify, Datadog).
- Schema Registry: auto-discovers event schemas for type safety.
- Scheduling: cron/rate-based triggers (replacing CloudWatch Events).
- Best for: event-driven microservices, SaaS integrations, complex routing rules.
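The fan-out and dead-letter behavior described above can be illustrated with an in-memory simulation. This is only a sketch of the semantics, not AWS code: the `Topic` and `Queue` classes are hypothetical stand-ins for SNS and SQS, and a handler that raises an exception stands in for a failed consumer.

```python
from collections import deque


class Queue:
    """In-memory stand-in for an SQS queue with a dead-letter queue."""

    def __init__(self, name, max_receive_count=3):
        self.name = name
        self.max_receive_count = max_receive_count
        self.messages = deque()   # entries are [body, receive_count]
        self.dead_letters = []

    def send(self, body):
        self.messages.append([body, 0])

    def poll(self, handler):
        """Deliver each queued message once; failed messages stay for retry."""
        for _ in range(len(self.messages)):
            body, count = self.messages.popleft()
            count += 1
            try:
                handler(body)     # success: the message is deleted
            except Exception:
                if count >= self.max_receive_count:
                    self.dead_letters.append(body)   # retries exhausted
                else:
                    self.messages.append([body, count])


class Topic:
    """In-memory stand-in for an SNS topic: each subscriber gets a copy."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, body):
        for queue in self.subscribers:
            queue.send(body)


if __name__ == "__main__":
    topic = Topic()
    inventory, email = Queue("InventoryQueue"), Queue("EmailQueue")
    topic.subscribe(inventory)
    topic.subscribe(email)
    topic.publish('{"orderId": "123"}')
    inventory.poll(lambda body: None)   # inventory succeeds
    print(len(inventory.messages), len(email.messages))
```

The point of the design is isolation: a failing email consumer only affects the email queue, while the inventory queue drains normally.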
# ── SQS: Create queue + send message ──
aws sqs create-queue --queue-name OrderQueue
aws sqs send-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123/OrderQueue \
--message-body '{"orderId":"123","amount":99.99}'
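# Receive + process + delete (Python boto3; long polling, tolerates an empty response)
# sqs = boto3.client("sqs")
# msgs = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
# for msg in msgs.get("Messages", []):
#     process(msg["Body"])
#     sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])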
# ── SNS → SQS Fan-out pattern ──
# SNS Topic: "OrderEvents"
# ├── SQS: InventoryQueue (update stock)
# ├── SQS: EmailQueue (send confirmation)
# ├── SQS: AnalyticsQueue (track metrics)
# └── Lambda: FraudCheck (real-time)
#
# One publish → 4 subscribers process independently
aws sns create-topic --name OrderEvents
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123:OrderEvents \
--protocol sqs \
--notification-endpoint arn:aws:sqs:us-east-1:123:InventoryQueue
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:123:OrderEvents \
--message '{"orderId":"123","status":"placed"}'
# ── EventBridge: Content-based routing ──
# Rule: Route "OrderPlaced" events with amount > 100 to Lambda
aws events put-rule \
--name ProcessNewOrders \
--event-pattern '{
"source": ["com.myapp.orders"],
"detail-type": ["OrderPlaced"],
"detail": {
"amount": [{"numeric": [">", 100]}]
}
}'
aws events put-targets --rule ProcessNewOrders \
--targets "Id=1,Arn=arn:aws:lambda:us-east-1:123:function:ProcessOrder"
# ── Comparison ──
# Feature   | SQS            | SNS            | EventBridge
# Model     | Queue (pull)   | Pub/Sub (push) | Event Bus (push)
# Consumers | 1 per message  | Many (fan-out) | Many (rules)
# Retention | Up to 14 days  | None           | Optional archive + replay
# Ordering  | FIFO available | FIFO available | Not guaranteed
# Routing   | None           | Topic filter   | Content-based rules
# Best for  | Work queues    | Fan-out/alerts | Event architecture
An e-commerce app had a monolithic order handler that sent emails, updated inventory, charged payments, and logged analytics — all synchronously. Any failure caused the entire order to fail. They decoupled it: order service publishes to SNS topic → SNS fans out to 4 SQS queues (email, inventory, payment, analytics). Each queue is processed independently by its own Lambda. If email service is down, orders still complete — email queue buffers messages for retry.
What is the SNS+SQS fan-out pattern? Why is it preferred over direct SNS → Lambda fan-out?
DynamoDB is a fully managed NoSQL key-value and document database. Single-digit millisecond latency at any scale.
Primary Key Design:
- Partition Key (PK) only — simple primary key. PK determines the physical partition where data is stored.
- Partition Key + Sort Key (SK) — composite primary key. Same PK = same partition, SK orders items within. Enables range queries.
- Key design is the most critical DynamoDB decision — it determines query patterns and performance.
Secondary Indexes:
- GSI (Global Secondary Index): different partition key + optional sort key. Separate throughput. Eventually consistent. Up to 20 per table.
- LSI (Local Secondary Index): same partition key, different sort key. Shares table throughput. Strongly consistent option. Must be created at table creation. Up to 5 per table.
Capacity Modes:
- On-Demand: pay-per-request. Auto-scales instantly. Best for unpredictable or new workloads. More expensive per-request.
- Provisioned: you set Read/Write Capacity Units (RCUs/WCUs). Cheaper at steady-state. Use Auto Scaling. 1 WCU = 1 write/sec (up to 1 KB). 1 RCU = 1 strongly consistent read/sec (up to 4 KB) or 2 eventually consistent.
Single-Table Design: store multiple entity types in one table using generic PK/SK names. Reduces joins (which DynamoDB doesn't support).
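The capacity unit definitions above translate directly into arithmetic for provisioned mode. The sketch below applies them; item sizes round up to the next 1 KB for writes and the next 4 KB for reads. The function names are hypothetical, and transactional operations (which cost more) are not modeled.

```python
import math


def write_capacity_units(item_size_bytes, writes_per_second):
    """1 WCU = one write per second for an item of up to 1 KB."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """1 RCU = one strongly consistent read (or two eventually consistent
    reads) per second for an item of up to 4 KB."""
    total = math.ceil(item_size_bytes / 4096) * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


if __name__ == "__main__":
    # 2,500-byte items round up to 3 KB: 3 WCUs each, 10 writes/sec = 30 WCUs.
    print(write_capacity_units(2500, 10))
    # 6,000-byte items round up to 8 KB: 2 RCUs each, 100 reads/sec = 200 RCUs.
    print(read_capacity_units(6000, 100))
    # The same reads with eventual consistency need half as many.
    print(read_capacity_units(6000, 100, strongly_consistent=False))
```

Because of the rounding, keeping items just under a size boundary can noticeably cut provisioned capacity.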
# ── Create DynamoDB table with composite key ──
aws dynamodb create-table \
--table-name Orders \
--key-schema \
AttributeName=customerId,KeyType=HASH \
AttributeName=orderId,KeyType=RANGE \
--attribute-definitions \
AttributeName=customerId,AttributeType=S \
AttributeName=orderId,AttributeType=S \
AttributeName=status,AttributeType=S \
--billing-mode PAY_PER_REQUEST \
--global-secondary-indexes '[{
"IndexName": "StatusIndex",
"KeySchema": [
{"AttributeName": "status", "KeyType": "HASH"},
{"AttributeName": "orderId", "KeyType": "RANGE"}
],
"Projection": {"ProjectionType": "ALL"}
}]'
# ── Write an item ──
aws dynamodb put-item --table-name Orders --item '{
"customerId": {"S": "CUST-001"},
"orderId": {"S": "ORD-2026-001"},
"amount": {"N": "99.99"},
"status": {"S": "shipped"},
"items": {"L": [{"S": "Widget"}, {"S": "Gadget"}]}
}'
# ── Query: Get all orders for a customer ──
aws dynamodb query --table-name Orders \
--key-condition-expression "customerId = :cid" \
--expression-attribute-values '{":cid": {"S": "CUST-001"}}'
# ── Query: Get orders in a sort-key range (order IDs compare as strings) ──
aws dynamodb query --table-name Orders \
--key-condition-expression "customerId = :cid AND orderId BETWEEN :start AND :end" \
--expression-attribute-values '{
":cid": {"S": "CUST-001"},
":start": {"S": "ORD-2026-001"},
":end": {"S": "ORD-2026-100"}
}'
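# ── Query GSI: Get all shipped orders ──
# "status" is a DynamoDB reserved word, so it needs an expression attribute name
aws dynamodb query --table-name Orders \
--index-name StatusIndex \
--key-condition-expression "#st = :s" \
--expression-attribute-names '{"#st": "status"}' \
--expression-attribute-values '{":s": {"S": "shipped"}}'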
# ── Python boto3: Single-table design pattern ──
# from decimal import Decimal  # boto3 requires Decimal (not float) for numbers
# table.put_item(Item={
# "PK": "CUSTOMER#C001",
# "SK": "ORDER#2026-01-15#O001",
# "type": "order",
# "amount": Decimal("99.99")
# })
# table.put_item(Item={
# "PK": "CUSTOMER#C001",
# "SK": "PROFILE",
# "type": "customer",
# "name": "Alice",
# "email": "alice@example.com"
# })
An app used a customer email as the partition key. One enterprise customer had 5 million records — creating a "hot partition" that throttled the entire table. The fix: changed PK to customerId (UUID, even distribution) + SK for time-based ordering. Added a GSI on email for lookup queries. Hot partition problem disappeared. They also switched from Provisioned to On-Demand mode during migration since traffic was unpredictable.
What is DynamoDB single-table design? Why is it recommended over multiple tables?
AWS offers two container orchestration services:
ECS (Elastic Container Service):
- AWS-native container orchestration. Simpler, tightly integrated with AWS services.
- Uses Task Definitions (JSON) to define containers — image, CPU, memory, ports, env vars, logging.
- Services: manage desired count, rolling updates, load balancing.
- Deep ALB integration, CloudWatch logging, IAM task roles.
- Best for: teams that want simplicity and are AWS-centric.
EKS (Elastic Kubernetes Service):
- Managed Kubernetes control plane. Uses standard K8s APIs, manifests, tools (kubectl, Helm).
- Portable — same manifests work on any Kubernetes cluster (GKE, AKS, on-prem).
- Larger ecosystem — thousands of K8s tools, operators, CRDs.
- More complex to operate. More expensive ($0.10/hr for control plane).
- Best for: teams with K8s expertise, multi-cloud, complex orchestration needs.
Launch Types (both ECS and EKS):
- Fargate (serverless): AWS manages the underlying compute. You define CPU/memory per task. No patching, no capacity planning. Billed per second for the vCPU and memory each task requests.
- EC2: you manage EC2 instances in an ASG. More control (GPU, custom AMI, lower cost for steady-state). You handle patching, scaling.
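With Fargate, CPU and memory cannot be chosen independently: each CPU value only supports a range of memory values. The sketch below checks a task size against the commonly documented combinations for sizes up to 4 vCPU. Treat the table as an assumption to verify against the current ECS documentation, since larger sizes exist and limits change over time; the names are hypothetical.

```python
# CPU units -> allowed memory values in MiB (assumed from ECS documentation;
# sizes above 4 vCPU are intentionally left out of this sketch).
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """True if the CPU/memory pair appears in the table above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, [])


if __name__ == "__main__":
    # A task with "cpu": "512" and "memory": "1024" is a valid pair.
    print(is_valid_fargate_size(512, 1024))
    # 0.25 vCPU with 4 GB is rejected: 256 CPU units allow at most 2 GB.
    print(is_valid_fargate_size(256, 4096))
```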
# ── ECS Task Definition (simplified) ──
# {
# "family": "web-app",
# "networkMode": "awsvpc",
# "requiresCompatibilities": ["FARGATE"],
# "cpu": "512",
# "memory": "1024",
# "executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole",
# "taskRoleArn": "arn:aws:iam::123:role/appTaskRole",
# "containerDefinitions": [{
# "name": "web",
# "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
# "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
# "logConfiguration": {
# "logDriver": "awslogs",
# "options": {
# "awslogs-group": "/ecs/web-app",
# "awslogs-region": "us-east-1",
# "awslogs-stream-prefix": "web"
# }
# },
# "environment": [
# {"name": "DB_HOST", "value": "mydb.cluster-xxx.rds.amazonaws.com"}
# ]
# }]
# }
# ── Create ECS Fargate Service ──
aws ecs create-service \
--cluster my-cluster \
--service-name web-service \
--task-definition web-app:1 \
--desired-count 3 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-aaa", "subnet-bbb"],
"securityGroups": ["sg-abc123"],
"assignPublicIp": "DISABLED"
}
}' \
--load-balancers '{
"targetGroupArn": "arn:aws:elasticloadbalancing:...",
"containerName": "web",
"containerPort": 8080
}'
# ── EKS: Deploy with kubectl ──
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: web-app
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: web
#   template:
#     metadata:
#       labels:
#         app: web   # required: must match spec.selector.matchLabels
#     spec:
#       containers:
#         - name: web
#           image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest
#           ports:
#             - containerPort: 8080
#           resources:
#             requests:
#               cpu: "256m"
#               memory: "512Mi"
# ── Comparison ──
# Feature | ECS | EKS | Fargate | EC2 Launch
# Complexity | Low | High | Lowest | Medium
# Portability | AWS only | Multi-cloud | - | -
# Cost | No control plane| $0.10/hr/cluster| Pay per task | Pay per instance
# GPU | Yes (EC2) | Yes (EC2) | No | Yes
# Scaling | Auto (service) | HPA/Karpenter | Per task | ASG
A startup began with ECS Fargate — zero infrastructure management, deployed in a day. As they grew to 50+ microservices and hired Kubernetes engineers, they migrated to EKS for the richer ecosystem (Helm charts, service mesh, custom operators). They kept Fargate as the compute layer for EKS (EKS on Fargate) for dev/test environments, and used EC2 managed node groups for production (better cost control, GPU support for ML services).
What is AWS App Runner and how does it compare to ECS Fargate for simple web applications?
CloudFormation is AWS's Infrastructure as Code (IaC) service. You define resources in YAML/JSON templates, and CloudFormation provisions and manages them.
Core concepts:
- Template: YAML/JSON file describing resources, parameters, outputs, mappings, and conditions.
- Stack: a collection of AWS resources created from a template. Managed as a single unit — create, update, or delete the entire stack.
- Stack operations: Create → Update → Delete. On failure, automatic rollback to the previous state.
Key features:
- Change Sets: preview changes before applying. Shows what will be added, modified, or deleted. Prevents surprises (like accidentally replacing a database).
- Drift Detection: detects when actual resource configuration differs from the template (someone manually changed a security group in the console). Reports "IN_SYNC" or "DRIFTED".
- Nested Stacks: reusable template components. A "parent" stack includes "child" stacks. DRY principle for common infrastructure (VPC, security groups).
- Stack Sets: deploy stacks across multiple accounts and Regions from a single template. For Organizations-wide infrastructure.
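Deletion Policy: controls what happens to a resource when it is removed from the template or its stack is deleted. Options: Delete (the default for most resource types), Retain (keep the resource), Snapshot (create a backup before deleting; supported for resources such as RDS and EBS). The separate UpdateReplacePolicy attribute controls what happens to the old resource when an update replaces it.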
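The idea behind a Change Set can be sketched as a diff between the resources in the current template and those in the proposed one. This is an illustrative simplification with a hypothetical function name: a real Change Set is computed by CloudFormation and also reports whether a modification requires replacing the resource, which this sketch does not attempt.

```python
def preview_changes(current_resources, proposed_resources):
    """Return (action, logical_id) pairs describing the template difference."""
    changes = []
    for logical_id in sorted(set(current_resources) | set(proposed_resources)):
        old = current_resources.get(logical_id)
        new = proposed_resources.get(logical_id)
        if old is None:
            changes.append(("Add", logical_id))
        elif new is None:
            changes.append(("Remove", logical_id))
        elif old != new:
            changes.append(("Modify", logical_id))
    return changes


if __name__ == "__main__":
    current = {
        "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
        "Database": {"Type": "AWS::RDS::DBInstance",
                     "Properties": {"DBInstanceClass": "db.t4g.medium"}},
    }
    proposed = {
        "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
        "Database": {"Type": "AWS::RDS::DBInstance",
                     "Properties": {"DBInstanceClass": "db.r7g.xlarge"}},
    }
    for action, logical_id in preview_changes(current, proposed):
        print(action, logical_id)
```

A "Modify" on a database is exactly the kind of entry worth inspecting before execution, because some property changes trigger a replacement.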
# ── CloudFormation Template (YAML) ──
# AWSTemplateFormatVersion: "2010-09-09"
# Description: Web application infrastructure
#
# Parameters:
#   Environment:
#     Type: String
#     AllowedValues: [dev, staging, prod]
#     Default: dev
#   InstanceType:
#     Type: String
#     Default: m7g.large
#
# Conditions:
#   IsProd: !Equals [!Ref Environment, prod]
#
# Resources:
#   VPC:
#     Type: AWS::EC2::VPC
#     Properties:
#       CidrBlock: 10.0.0.0/16
#       Tags:
#         - Key: Name
#           Value: !Sub "${Environment}-vpc"
#
#   Database:
#     Type: AWS::RDS::DBInstance
#     DeletionPolicy: Snapshot # ← Take snapshot before delete!
#     Properties:
#       DBInstanceClass: !If [IsProd, db.r7g.xlarge, db.t4g.medium]
#       Engine: postgres
#       MultiAZ: !If [IsProd, true, false]
#       StorageEncrypted: true
#
# Outputs:
#   VPCId:
#     Value: !Ref VPC
#     Export:
#       Name: !Sub "${Environment}-VPCId"
# ── Create stack ──
aws cloudformation create-stack \
--stack-name my-app-prod \
--template-body file://template.yaml \
--parameters ParameterKey=Environment,ParameterValue=prod \
--capabilities CAPABILITY_IAM
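# ── Create Change Set (preview before update) ──
# Carry Environment over explicitly; an omitted parameter falls back to its template default (dev)
aws cloudformation create-change-set \
--stack-name my-app-prod \
--change-set-name update-instance-type \
--template-body file://template-v2.yaml \
--parameters ParameterKey=Environment,UsePreviousValue=true \
ParameterKey=InstanceType,ParameterValue=m7g.xlarge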
# Review changes
aws cloudformation describe-change-set \
--stack-name my-app-prod \
--change-set-name update-instance-type
# Execute if safe
aws cloudformation execute-change-set \
--stack-name my-app-prod \
--change-set-name update-instance-type
# ── Drift Detection ──
aws cloudformation detect-stack-drift --stack-name my-app-prod
# Wait, then check results:
aws cloudformation describe-stack-resource-drifts \
--stack-name my-app-prod \
--stack-resource-drift-status-filters MODIFIED DELETED
# ── Nested Stack (reuse VPC template) ──
# Resources:
#   NetworkStack:
#     Type: AWS::CloudFormation::Stack
#     Properties:
#       TemplateURL: https://s3.amazonaws.com/templates/vpc.yaml
#       Parameters:
#         Environment: !Ref Environment
A team updated their CloudFormation template and ran an update without a Change Set. The update replaced their production RDS instance (a "replacement" update because they changed the engine version incorrectly). Data was lost because the old instance was deleted without a snapshot: no UpdateReplacePolicy was set, and its default is "Delete". After the incident: (1) mandatory Change Sets for all production updates, (2) DeletionPolicy: Snapshot and UpdateReplacePolicy: Snapshot on all databases and EBS volumes, (3) weekly drift detection to catch manual console changes.
What are the differences between CloudFormation and Terraform? When would you choose one over the other?
AWS provides three ways to connect VPCs and services privately:
VPC Peering:
- Direct, one-to-one connection between two VPCs using private IPs.
- Works cross-account and cross-Region.
- No single point of failure — uses AWS backbone, not the internet.
- Non-transitive: VPC-A peers with VPC-B, VPC-B peers with VPC-C → A cannot reach C through B.
- CIDR ranges cannot overlap.
- Best for: small number of VPCs (< 10). N VPCs need N×(N-1)/2 peering connections (full mesh).
Transit Gateway (TGW):
- Hub-and-spoke network — one TGW connects hundreds of VPCs, VPNs, and Direct Connect.
- Transitive routing: all attached VPCs can communicate through the TGW.
- Supports route tables for segmentation (prod VPCs can't reach dev VPCs).
- Cross-Region peering between Transit Gateways.
- Best for: large organizations with many VPCs. Replaces the full mesh of VPC Peering.
PrivateLink (VPC Endpoints):
- Expose a specific service (not the whole VPC) to other VPCs or accounts.
- Interface Endpoint: ENI with private IP in your VPC → accesses AWS services (S3, DynamoDB, SQS) or your own services privately.
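- Gateway Endpoint: route table entry for S3 and DynamoDB only (free). Strictly speaking it is a VPC endpoint type that does not use PrivateLink, but it serves the same goal of keeping traffic off the internet.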
- No VPC CIDR overlap issues. Unidirectional: consumer → provider.
- Best for: SaaS providers exposing services, accessing AWS services without NAT Gateway.
# ── VPC Peering ──
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-aaa \
--peer-vpc-id vpc-bbb \
--peer-owner-id 987654321012 # Cross-account
# Accept in the other account
aws ec2 accept-vpc-peering-connection \
--vpc-peering-connection-id pcx-abc123
# Add routes in BOTH VPCs (requester side shown; repeat in the peer VPC's route table with the requester's CIDR)
aws ec2 create-route --route-table-id rtb-aaa \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-abc123
# ── Transit Gateway ──
aws ec2 create-transit-gateway \
--description "Central Hub" \
--options DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable
# Attach VPCs
aws ec2 create-transit-gateway-vpc-attachment \
--transit-gateway-id tgw-abc123 \
--vpc-id vpc-aaa \
--subnet-ids subnet-aaa1 subnet-aaa2
# All attached VPCs can now communicate through TGW
# Use TGW route tables for segmentation
# ── Gateway Endpoint: Access S3 without NAT (route-table based, not PrivateLink) ──
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-private # Free! No NAT needed for S3
# ── PrivateLink: Interface Endpoint for SQS ──
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.sqs \
--subnet-ids subnet-private1 \
--security-group-ids sg-endpoint
# ── Comparison ──
# Feature | VPC Peering | Transit Gateway | PrivateLink
# Topology | 1:1 | Hub-and-spoke | Service endpoint
# Transitive | No | Yes | N/A
# Scale | < 10 VPCs | Hundreds of VPCs | Per-service
# Overlap CIDRs | No | No | Yes
# Cross-Region | Yes | Yes (peering) | No (same Region)
# Cost | Data transfer | $0.05/hr + data | $0.01/hr + data
A company with 5 VPCs started with VPC Peering (10 peering connections for full mesh). When they grew to 30 VPCs, managing 435 peering connections was impossible. They migrated to Transit Gateway — one hub connecting all VPCs. They segmented traffic using TGW route tables: production VPCs could reach shared services but not development VPCs. They also added Gateway Endpoints for S3 and DynamoDB, saving $2,000/month in NAT Gateway data processing fees.
What is AWS Direct Connect and when would you use it instead of VPN over the internet?
Multi-region architecture deploys your application across two or more AWS Regions for low latency, disaster recovery, or compliance.
Active-Passive (Pilot Light / Warm Standby):
- Primary Region: handles all traffic.
- Secondary Region: infrastructure exists but receives no traffic until failover.
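- Pilot Light: only critical components running (database replica). Compute scaled to zero. Cheaper than Warm Standby but slower to recover (typically tens of minutes, because compute must be provisioned and scaled before taking traffic).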
- Warm Standby: reduced-capacity copy of production. Can take traffic within minutes.
- Use Route 53 Failover routing to switch DNS on primary failure.
Active-Active:
- Both Regions serve live traffic simultaneously.
- Use Route 53 Latency-based routing — users go to the nearest Region.
- Requires data replication: DynamoDB Global Tables, Aurora Global Database, S3 Cross-Region Replication.
- Conflict resolution: "last writer wins" (DynamoDB) or application-level logic.
- Most resilient but most complex and expensive.
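The "last writer wins" rule can be illustrated with a small in-memory sketch. It assumes each replica stores a value together with a write timestamp and that the later timestamp wins during replication. `Item` and `merge_last_writer_wins` are hypothetical names, and this is only a model of the idea, not how DynamoDB is implemented internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    value: str
    written_at: float  # seconds since epoch, assigned by the writing Region


def merge_last_writer_wins(local: Item, remote: Item) -> Item:
    """Keep whichever version carries the later write timestamp."""
    return remote if remote.written_at > local.written_at else local


# Two Regions update the same key almost simultaneously.
us = Item("cart=[book]", written_at=100.0)
eu = Item("cart=[book, pen]", written_at=100.4)

winner = merge_last_writer_wins(us, eu)
print(winner.value)
```

Note that the earlier write is silently discarded, which is why workloads that cannot tolerate lost updates usually need application-level conflict handling instead.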
Key services for multi-region:
- DynamoDB Global Tables: multi-region, multi-master. Sub-second replication.
- Aurora Global Database: 1 primary Region (read/write) + up to 5 secondary Regions (read-only, < 1s lag). Failover in < 1 minute.
- S3 Cross-Region Replication: async copy of objects to another Region.
- CloudFront: global CDN, caches in 400+ Edge Locations.
# ── DynamoDB Global Table (active-active database) ──
aws dynamodb create-table \
--table-name UserSessions \
--key-schema AttributeName=userId,KeyType=HASH \
--attribute-definitions AttributeName=userId,AttributeType=S \
--billing-mode PAY_PER_REQUEST \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
# Add replica in eu-west-1
aws dynamodb update-table --table-name UserSessions \
--replica-updates '[{"Create":{"RegionName":"eu-west-1"}}]'
# ── Aurora Global Database ──
aws rds create-global-cluster \
--global-cluster-identifier my-global-db \
--source-db-cluster-identifier arn:aws:rds:us-east-1:123:cluster:my-aurora \
--engine aurora-postgresql
# Add secondary Region
aws rds create-db-cluster \
--db-cluster-identifier my-aurora-secondary \
--engine aurora-postgresql \
--global-cluster-identifier my-global-db \
--region eu-west-1
# ── S3 Cross-Region Replication (requires versioning on both buckets) ──
aws s3api put-bucket-replication --bucket source-bucket \
--replication-configuration '{
"Role": "arn:aws:iam::123:role/S3ReplicationRole",
"Rules": [{
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"DeleteMarkerReplication": {"Status": "Disabled"},
"Destination": {
"Bucket": "arn:aws:s3:::destination-bucket-eu",
"StorageClass": "STANDARD_IA"
}
}]
}'
# ── Route 53: Latency-based routing (active-active) ──
# US users → us-east-1 ALB (30ms)
# EU users → eu-west-1 ALB (25ms)
# Health checks on both → automatic failover if one Region fails
# ── Route 53: Failover routing (active-passive) ──
# Primary: us-east-1 ALB + health check
# Secondary: eu-west-1 ALB (standby)
# Primary fails → DNS switches to secondary
# ── Architecture diagram ──
# Active-Active:
# Users → Route 53 (Latency) → us-east-1 ALB / eu-west-1 ALB
# ↓ ↓
# Aurora Primary Aurora Secondary
# ← replication →
# DynamoDB Global Table (both write)
A fintech company required < 100ms latency globally and < 1 minute RTO (Recovery Time Objective). They deployed active-active in us-east-1 and eu-west-1: DynamoDB Global Tables for session data (multi-master, sub-second replication), Aurora Global Database for transactional data (primary in us-east-1, read-only secondary in eu-west-1 with < 1s lag). Route 53 latency-based routing sent users to the nearest Region. During a us-east-1 outage, Aurora Global Database promoted the eu-west-1 secondary to primary in 45 seconds — users experienced only brief read-only mode.
What is the difference between RPO and RTO? How do you choose a DR strategy based on these requirements?
AWS Organizations centrally manages multiple AWS accounts as a single unit.
Key concepts:
- Management Account (root): creates the organization, manages billing, applies policies. Should NOT run workloads.
- Member Accounts: individual AWS accounts for workloads, environments, or teams.
- Organizational Units (OUs): hierarchical grouping of accounts (like folders). SCPs cascade down.
Recommended OU structure:
- Security OU: log archive account, security tooling account (GuardDuty, Config).
- Infrastructure OU: shared services (DNS, networking, CI/CD).
- Workloads OU: prod, staging, dev sub-OUs with separate accounts.
- Sandbox OU: experimentation accounts with spending limits.
Service Control Policies (SCPs):
- Guardrails — define the maximum permissions for accounts in an OU.
- SCPs don't grant permissions — they restrict what IAM policies can do.
- Applied to OUs or accounts. Cascade to all child OUs/accounts.
- The management account is never affected by SCPs.
- Common SCPs: deny Region access (restrict to specific Regions), deny root user actions, deny disabling CloudTrail/GuardDuty, deny public S3.
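One way to picture how SCPs and IAM policies combine: an action succeeds only if nothing explicitly denies it, every SCP level above the account allows it, and an IAM policy also allows it. The sketch below models that with plain sets of action names. It deliberately ignores resources, conditions, wildcards, and permission boundaries, and every name in it is hypothetical.

```python
def is_allowed(action, iam_allow, scp_allow_levels, explicit_deny=frozenset()):
    """Return True only if nothing denies the action and every layer allows it."""
    if action in explicit_deny:
        return False  # an explicit Deny always wins
    if not all(action in level for level in scp_allow_levels):
        return False  # every SCP level (root, OU, account) must allow the action
    return action in iam_allow  # SCPs never grant; IAM must still allow it


root_scp = {"s3:GetObject", "ec2:RunInstances", "dynamodb:DeleteTable"}
prod_ou_scp = {"s3:GetObject", "ec2:RunInstances"}
iam_policy = {"s3:GetObject", "dynamodb:DeleteTable"}

for action in ("s3:GetObject", "dynamodb:DeleteTable", "ec2:RunInstances"):
    print(action, is_allowed(action, iam_policy, [root_scp, prod_ou_scp]))
```

In this toy setup the IAM policy allows `dynamodb:DeleteTable`, but the Production OU guardrail does not, so the call is refused regardless of what IAM says.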
# ── SCP: Deny all Regions except us-east-1 and eu-west-1 ──
# Global services such as IAM are evaluated as us-east-1, so keep us-east-1 allowed or exempt them with NotAction.
# {
# "Version": "2012-10-17",
# "Statement": [{
# "Sid": "DenyOtherRegions",
# "Effect": "Deny",
# "Action": "*",
# "Resource": "*",
# "Condition": {
# "StringNotEquals": {
# "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
# },
# "ArnNotLike": {
# "aws:PrincipalArn": [
# "arn:aws:iam::*:role/OrganizationAdmin"
# ]
# }
# }
# }]
# }
# ── SCP: Prevent disabling CloudTrail ──
# {
# "Version": "2012-10-17",
# "Statement": [{
# "Sid": "ProtectCloudTrail",
# "Effect": "Deny",
# "Action": [
# "cloudtrail:StopLogging",
# "cloudtrail:DeleteTrail",
# "cloudtrail:UpdateTrail"
# ],
# "Resource": "*"
# }]
# }
# ── SCP: Prevent public S3 (lock account-level Block Public Access) ──
# Enable Block Public Access on each account first; this SCP then denies changing the setting.
# {
# "Version": "2012-10-17",
# "Statement": [{
# "Sid": "DenyS3PublicAccess",
# "Effect": "Deny",
# "Action": "s3:PutAccountPublicAccessBlock",
# "Resource": "*"
# }]
# }
# ── Create OU structure ──
aws organizations create-organizational-unit \
--parent-id r-xxxx --name "Security"
aws organizations create-organizational-unit \
--parent-id r-xxxx --name "Workloads"
aws organizations create-organizational-unit \
--parent-id ou-workloads --name "Production"
aws organizations create-organizational-unit \
--parent-id ou-workloads --name "Development"
# ── Attach SCP to an OU ──
aws organizations attach-policy \
--policy-id p-abc123 \
--target-id ou-workloads
# ── OU hierarchy ──
# Root
# ├── Security OU (log archive, security tools)
# ├── Infrastructure OU (networking, CI/CD)
# ├── Workloads OU
# │ ├── Production OU (SCP: deny risky actions)
# │ ├── Staging OU
# │ └── Development OU (SCP: restrict instance types)
# └── Sandbox OU (SCP: Region restrict; spending limit via AWS Budgets, not SCP)
A company had all teams sharing a single AWS account — 200 developers. A junior developer accidentally deleted a production DynamoDB table. After the incident, they moved to AWS Organizations: separate accounts for prod, staging, dev, and security. SCPs on the Production OU prevented deleting databases, disabling CloudTrail, or launching instances in unauthorized Regions. A Sandbox OU let developers experiment freely with a $100/month budget. Security account centralized CloudTrail logs from all accounts.
What is AWS Control Tower and how does it automate multi-account setup with Organizations?
API Gateway is a fully managed service for creating, publishing, and managing APIs at any scale.
API Types:
- REST API: full-featured. Supports API keys, usage plans, resource policies, request/response transformation, caching, WAF, private APIs. ~$3.50/million requests.
- HTTP API: simpler, faster, cheaper. Supports JWT authorization, CORS. No caching, no usage plans, no WAF. ~$1.00/million requests, roughly 70% cheaper than REST API.
- WebSocket API: real-time two-way communication (chat, gaming, notifications).
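To make the price gap concrete, here is a small arithmetic sketch using the per-million request prices quoted above. It ignores tiered pricing, caching charges, and data transfer, and the 50 million requests per month figure is an assumed example workload.

```python
REST_PRICE_PER_MILLION = 3.50  # USD, figure quoted above
HTTP_PRICE_PER_MILLION = 1.00  # USD, figure quoted above


def monthly_cost(requests: int, price_per_million: float) -> float:
    """Request charges only, for a given monthly request count."""
    return requests / 1_000_000 * price_per_million


requests = 50_000_000  # assumed example workload
rest = monthly_cost(requests, REST_PRICE_PER_MILLION)
http = monthly_cost(requests, HTTP_PRICE_PER_MILLION)
savings_pct = (rest - http) / rest * 100

print(f"REST ${rest:.2f}, HTTP ${http:.2f}, savings {savings_pct:.1f}%")
```

The saving only materializes if the workload does not need REST-only features such as caching or usage plans.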
Throttling:
- Account-level: 10,000 requests/sec across all APIs in a Region, with a burst of 5,000 requests (soft limit).
- Stage-level: configurable per API stage (e.g., prod vs dev).
- Method-level: throttle specific routes (e.g., POST /orders at 100 req/sec).
- Usage Plans + API Keys: rate limit per customer (free tier: 100 req/day, paid: 10,000 req/day).
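API Gateway describes its throttling in terms of a token bucket: the rate limit is how fast tokens refill and the burst limit is the bucket size. The class below is a simplified, single-process sketch of that idea with an injected clock so it is deterministic. `TokenBucket` and its parameters are illustrative names, not an AWS API.

```python
class TokenBucket:
    """Minimal token bucket: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would answer with HTTP 429


bucket = TokenBucket(rate=5, burst=10)
results = [bucket.allow(now=0.0) for _ in range(12)]  # 12 requests at once
print(results.count(True), "admitted,", results.count(False), "throttled")
```

A burst of up to 10 requests passes immediately, after which requests are admitted at the steady refill rate of 5 per second.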
Caching (REST API only):
- Caches API responses for a configurable TTL (300 seconds default).
- Reduces backend calls and improves latency.
- Cache size: 0.5 GB to 237 GB. Costs $0.02-$3.80/hr.
Authorization: IAM, Cognito User Pools, Lambda Authorizer (custom logic), JWT (HTTP API).
# ── CloudFormation: REST API with Lambda backend ──
# Resources:
# MyAPI:
# Type: AWS::ApiGateway::RestApi
# Properties:
# Name: OrderAPI
# Description: Order management API
#
# OrdersResource:
# Type: AWS::ApiGateway::Resource
# Properties:
# RestApiId: !Ref MyAPI
# ParentId: !GetAtt MyAPI.RootResourceId
# PathPart: orders
#
# PostOrder:
# Type: AWS::ApiGateway::Method
# Properties:
# RestApiId: !Ref MyAPI
# ResourceId: !Ref OrdersResource
# HttpMethod: POST
# AuthorizationType: COGNITO_USER_POOLS
# AuthorizerId: !Ref CognitoAuth
# Integration:
# Type: AWS_PROXY
# IntegrationHttpMethod: POST
# Uri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${OrderFn.Arn}/invocations"
# ── HTTP API (simpler, cheaper) ──
aws apigatewayv2 create-api \
--name OrderAPI \
--protocol-type HTTP \
--target arn:aws:lambda:us-east-1:123:function:ProcessOrder
# ── Enable caching (REST API only) ──
aws apigateway update-stage \
--rest-api-id abc123 \
--stage-name prod \
--patch-operations \
op=replace,path=/cacheClusterEnabled,value=true \
op=replace,path=/cacheClusterSize,value=0.5
# ── Usage Plan + API Key (per-customer rate limiting) ──
aws apigateway create-usage-plan --name "FreeTier" \
--throttle burstLimit=10,rateLimit=5 \
--quota limit=1000,period=MONTH \
--api-stages apiId=abc123,stage=prod
aws apigateway create-api-key --name "customer-001" --enabled
aws apigateway create-usage-plan-key \
--usage-plan-id plan123 \
--key-id key123 \
--key-type API_KEY
# ── Comparison ──
# Feature | REST API | HTTP API
# Cost | $3.50/million | $1.00/million
# Caching | ✅ | ❌
# Usage Plans | ✅ | ❌
# WAF | ✅ | ❌
# Transformation | ✅ (VTL templates) | ❌
# Auth | IAM, Cognito, Lambda | JWT, IAM, Lambda
# Private API | ✅ | ❌
# Performance | ~29ms overhead | ~10ms overhead
A SaaS company needed to expose their API to customers with different pricing tiers. They used REST API with Usage Plans: free tier (100 requests/day, 5 req/sec burst), basic ($29/month, 10K requests/day), and enterprise ($299/month, 100K requests/day). API caching with 60-second TTL reduced Lambda invocations by 70% for read-heavy endpoints. They added a Lambda Authorizer to validate JWT tokens and inject tenant context into requests.
What is a Lambda Authorizer and how does it compare to Cognito User Pools for API authorization?
ElastiCache is a managed in-memory data store for caching, session management, and real-time analytics.
Redis vs Memcached:
- Redis:
- Rich data structures: strings, lists, sets, sorted sets, hashes, streams, geospatial.
- Persistence: snapshots (RDB) and append-only file (AOF).
- Replication: primary-replica with automatic failover (Multi-AZ).
- Pub/Sub, Lua scripting, transactions.
- Cluster mode: shards data across up to 500 nodes (up to 340 TB).
- Best for: most use cases — caching, sessions, leaderboards, rate limiting, real-time analytics.
- Memcached:
- Simple key-value only. Multi-threaded (better per-node throughput for simple operations).
- No persistence, no replication, no failover.
- Node failure = data loss.
- Best for: simple caching where data loss is acceptable. Horizontally scalable (add/remove nodes).
Caching strategies:
- Lazy Loading (Cache-Aside): application checks cache → miss → reads from DB → writes to cache. Pros: only caches what's needed. Con: initial miss penalty, stale data possible.
- Write-Through: every write goes to cache AND DB. Pros: cache always fresh. Con: write latency, caches unused data.
- Write-Behind: write to cache → async write to DB later. Fastest writes but risk of data loss if cache node fails.
- TTL (Time-to-Live): always set TTL — prevents stale data and manages memory.
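The write-through and TTL ideas can be sketched without any infrastructure. The example below uses a tiny in-memory class as a stand-in for Redis and a plain dict as a stand-in for the database. `TTLCache` and `write_through` are hypothetical names, and the injected clock exists only to make expiry deterministic.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache such as Redis (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a cache miss
            return None
        return value


def write_through(cache, database, key, value, ttl_seconds=3600):
    """Write to the database and the cache in the same operation."""
    database[key] = value
    cache.set(key, value, ttl_seconds)


now = [0.0]                      # fake clock for the demo
cache = TTLCache(clock=lambda: now[0])
db = {}

write_through(cache, db, "product:1", {"name": "Widget"}, ttl_seconds=60)
print(cache.get("product:1"))    # served from cache
now[0] = 61.0
print(cache.get("product:1"))    # TTL elapsed, so the cache reports a miss
```

After the TTL elapses the entry disappears from the cache while the database still holds it, so a lazy-loading read path would repopulate it on the next request.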
# ── Create Redis cluster (cluster mode disabled) ──
aws elasticache create-replication-group \
--replication-group-id my-redis \
--replication-group-description "App Cache" \
--engine redis \
--cache-node-type cache.r7g.large \
--num-cache-clusters 3 \
--multi-az-enabled \
--automatic-failover-enabled \
--at-rest-encryption-enabled \
--transit-encryption-enabled \
--cache-subnet-group-name my-cache-subnets \
--security-group-ids sg-redis
# ── Create Redis cluster (cluster mode enabled — sharding) ──
aws elasticache create-replication-group \
--replication-group-id my-redis-cluster \
--replication-group-description "Sharded Cache" \
--engine redis \
--cache-node-type cache.r7g.large \
--num-node-groups 3 \
--replicas-per-node-group 2 \
--cluster-enabled
# ── Python: Lazy Loading (Cache-Aside) pattern ──
# import redis, json, boto3
#
# r = redis.Redis(host="my-redis.xxx.cache.amazonaws.com", port=6379, ssl=True)
# dynamodb = boto3.resource("dynamodb")
# table = dynamodb.Table("Products")
#
# def get_product(product_id):
#     # 1. Check cache
#     cached = r.get(f"product:{product_id}")
#     if cached:
#         return json.loads(cached)  # Cache HIT
#
#     # 2. Cache MISS: read from DB
#     response = table.get_item(Key={"productId": product_id})
#     product = response.get("Item")
#
#     # 3. Write to cache with TTL (default=str because DynamoDB returns numbers as Decimal)
#     if product:
#         r.setex(f"product:{product_id}", 3600, json.dumps(product, default=str))
#
#     return product
#
# def update_product(product_id, data):
#     # Update the DB, then invalidate the cache entry (the next read repopulates it)
#     table.put_item(Item=data)
#     r.delete(f"product:{product_id}")  # Invalidate, not update
# ── Comparison ──
# Feature | Redis | Memcached
# Data types | Rich (lists,sets) | String only
# Persistence | Yes (RDB+AOF) | No
# Replication | Yes (Multi-AZ) | No
# Failover | Automatic | None (data lost)
# Pub/Sub | Yes | No
# Cluster mode | Yes (sharding) | Yes (hash-based)
# Threading | Single-threaded | Multi-threaded
# Max data | 340 TB (cluster) | Per-node only
A product catalog API had 500ms response times due to repeated DynamoDB queries. They added ElastiCache Redis with lazy loading: cache product data with 1-hour TTL. Cache hit rate reached 95% within a day, and cached reads returned in about 5ms. For flash sales, they used write-through caching to pre-warm the cache before the sale started. Redis sorted sets powered a real-time leaderboard for a gamified promotion with zero additional database load.
What is ElastiCache Serverless and how does it differ from provisioned clusters?
AWS KMS (Key Management Service) manages encryption keys for encrypting your data across AWS services.
Key types:
- AWS Managed Keys (aws/s3, aws/ebs): AWS creates and rotates them. Free. Limited control.
- Customer Managed Keys (CMK): you create, manage, define policies. $1/month + $0.03/10K API calls. Full control over rotation, deletion, cross-account access.
- AWS Owned Keys: used internally by AWS services. Not visible in your account.
Envelope Encryption:
- KMS generates a Data Key (plaintext + encrypted copy).
- Your app encrypts data with the plaintext data key (locally, fast).
- Stores the encrypted data key alongside the ciphertext.
- Plaintext data key is discarded from memory.
- To decrypt: send encrypted data key to KMS → get plaintext data key → decrypt data locally.
- Benefit: KMS only handles the small data key (< 4 KB), not your large data. Faster, cheaper.
Key Rotation: optional automatic rotation for customer managed keys (once a year by default; it must be enabled per key). Old key material is kept for decryption of previously encrypted data. No re-encryption needed.
Secrets Manager:
- Stores database passwords, API keys, tokens, certificates securely.
- Automatic rotation: rotates secrets on a schedule (e.g., every 30 days) using a Lambda function.
- Built-in rotation for RDS, Redshift, DocumentDB credentials.
- SDKs retrieve secrets at runtime — no secrets in code or config files.
# ── Create a KMS key ──
aws kms create-key \
--description "Application data encryption" \
--key-usage ENCRYPT_DECRYPT \
--origin AWS_KMS
# Enable automatic key rotation
aws kms enable-key-rotation --key-id abc-123-def
# ── Envelope encryption with boto3 ──
# import boto3, base64
# from cryptography.fernet import Fernet
#
# kms = boto3.client("kms")
#
# # 1. Generate data key
# response = kms.generate_data_key(
# KeyId="alias/my-app-key",
# KeySpec="AES_256"
# )
# plaintext_key = response["Plaintext"]
# encrypted_key = response["CiphertextBlob"]
#
# # 2. Encrypt data locally (fast, no KMS API call)
# f = Fernet(base64.urlsafe_b64encode(plaintext_key[:32]))
# ciphertext = f.encrypt(b"sensitive data here")
#
# # 3. Store encrypted_key + ciphertext together
# # 4. Drop the reference to the plaintext key as soon as possible
# #    (Python cannot guarantee the bytes are wiped from memory)
# del plaintext_key
#
# # To decrypt:
# # 1. Send encrypted_key to KMS → get plaintext key
# # 2. Decrypt ciphertext locally with plaintext key
# ── Secrets Manager: Store a database password ──
aws secretsmanager create-secret \
--name prod/myapp/db-password \
--description "Production DB credentials" \
--secret-string '{"username":"admin","password":"SuperS3cret!","host":"mydb.cluster-xxx.rds.amazonaws.com","port":"5432","dbname":"appdb"}'
# ── Enable automatic rotation (every 30 days) ──
aws secretsmanager rotate-secret \
--secret-id prod/myapp/db-password \
--rotation-lambda-arn arn:aws:lambda:us-east-1:123:function:SecretRotation \
--rotation-rules AutomaticallyAfterDays=30
# ── Python: Retrieve secret at runtime ──
# import boto3, json
# import psycopg2  # third-party PostgreSQL driver
#
# def get_db_credentials():
#     client = boto3.client("secretsmanager")
#     response = client.get_secret_value(SecretId="prod/myapp/db-password")
#     return json.loads(response["SecretString"])
#
# creds = get_db_credentials()
# conn = psycopg2.connect(
#     host=creds["host"],
#     port=creds["port"],
#     user=creds["username"],
#     password=creds["password"],
#     dbname=creds["dbname"]
# )
# ── KMS Key Policy: Cross-account access ──
# {
# "Statement": [{
# "Sid": "AllowCrossAccountDecrypt",
# "Effect": "Allow",
# "Principal": {"AWS": "arn:aws:iam::987654321:root"},
# "Action": ["kms:Decrypt", "kms:DescribeKey"],
# "Resource": "*"
# }]
# }
A company stored database passwords in environment variables on EC2 instances. When an instance was compromised, the attacker found credentials in /proc/self/environ. After the incident: (1) passwords moved to Secrets Manager with automatic 30-day rotation, (2) application retrieves credentials at runtime via SDK, (3) EC2 instance profile has secretsmanager:GetSecretValue permission only for its own secrets, (4) all data at rest encrypted with customer-managed KMS keys (audit trail in CloudTrail). A leaked credential is now useless within 30 days.
What is the difference between Secrets Manager and Systems Manager Parameter Store for storing secrets?
AWS provides a fully managed CI/CD pipeline using three services:
CodePipeline — orchestrator:
- Defines the stages of your pipeline: Source → Build → Test → Deploy.
- Integrates with GitHub, CodeCommit, S3 (source), CodeBuild (build/test), CodeDeploy, ECS, Lambda, CloudFormation (deploy).
- Triggers automatically on code push.
- Supports manual approval stages (for production deployments).
CodeBuild — build/test service:
- Fully managed build environment. No servers to manage.
- Uses a buildspec.yml file defining phases: install, pre_build, build, post_build.
- Supports any language/framework. Uses Docker containers for builds.
- Produces artifacts (JAR, ZIP, Docker image) stored in S3 or ECR.
- Pay per build minute.
CodeDeploy — deployment service:
- Deploys to EC2, ECS, Lambda, or on-premises.
- Deployment strategies:
- In-place (EC2): update instances one by one. Downtime risk.
- Blue/Green (EC2/ECS): create new environment → switch traffic → terminate old. Zero downtime.
- Canary (Lambda/ECS): send 10% traffic to new version → wait → shift 100%.
- Linear (Lambda/ECS): shift traffic in equal increments every N minutes.
- Automatic rollback: if CloudWatch alarms fire during deployment → auto-rollback.
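The canary and linear strategies differ only in the shape of the traffic-shift schedule. The sketch below generates that schedule as `(minute, percent_on_new_version)` pairs. Both function names are hypothetical, and real deployments would also evaluate alarms between steps.

```python
def canary_schedule(canary_percent: int, wait_minutes: int):
    """Shift a small slice first, then everything after the wait period."""
    return [(0, canary_percent), (wait_minutes, 100)]


def linear_schedule(step_percent: int, interval_minutes: int):
    """Shift traffic in equal increments at a fixed interval until 100%."""
    schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


print("canary:", canary_schedule(10, 5))
print("linear:", linear_schedule(25, 10))
```

Canary exposes a small slice of users and then commits fully, while linear spreads the risk over several equal steps at the cost of a longer rollout.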
# ── buildspec.yml (CodeBuild) ──
# version: 0.2
# phases:
# install:
# runtime-versions:
# nodejs: 20
# pre_build:
# commands:
# - npm ci
# - echo "Running tests..."
# - npm test
# build:
# commands:
# - npm run build
# - echo "Building Docker image..."
# - docker build -t $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION .
# post_build:
# commands:
# - docker push $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION
# - printf '{"ImageURI":"%s"}' $ECR_REPO:$CODEBUILD_RESOLVED_SOURCE_VERSION > imageDetail.json
# artifacts:
# files:
# - imageDetail.json
# - appspec.yaml
# - taskdef.json
# ── appspec.yaml (CodeDeploy for ECS Blue/Green) ──
# version: 0.0
# Resources:
# - TargetService:
# Type: AWS::ECS::Service
# Properties:
# TaskDefinition: <TASK_DEFINITION>
# LoadBalancerInfo:
# ContainerName: "web"
# ContainerPort: 8080
# Hooks:
# - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
# - AfterAllowTestTraffic: "LambdaFunctionToValidateTestTraffic"
# - AfterAllowTraffic: "LambdaFunctionToValidateAfterTraffic"
# ── CloudFormation: CodePipeline ──
# Resources:
# Pipeline:
# Type: AWS::CodePipeline::Pipeline
# Properties:
# Stages:
# - Name: Source
# Actions:
# - Name: GitHub
# ActionTypeId:
# Category: Source
# Provider: CodeStarSourceConnection
# Configuration:
# ConnectionArn: !Ref GitHubConnection
# FullRepositoryId: "myorg/myapp"
# BranchName: main
# OutputArtifacts: [{Name: SourceOutput}]
#
# - Name: Build
# Actions:
# - Name: BuildAndTest
# ActionTypeId:
# Category: Build
# Provider: CodeBuild
# Configuration:
# ProjectName: !Ref BuildProject
# InputArtifacts: [{Name: SourceOutput}]
# OutputArtifacts: [{Name: BuildOutput}]
#
# - Name: Approval
# Actions:
# - Name: ManualApproval
# ActionTypeId:
# Category: Approval
# Provider: Manual
#
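# - Name: Deploy
# Actions:
# - Name: DeployToECS
# ActionTypeId:
# Category: Deploy
# Provider: ECS # Rolling update; reads imagedefinitions.json. For Blue/Green with appspec.yaml + taskdef.json, use Provider: CodeDeployToECS
# Configuration:
# ClusterName: !Ref Cluster
# ServiceName: !Ref Service
# InputArtifacts: [{Name: BuildOutput}]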
A team deployed to production by SSH-ing into servers and running git pull. Deployment took 2 hours, had no rollback mechanism, and caused 30 minutes of downtime every release. They implemented CodePipeline: push to main → CodeBuild runs tests + builds Docker image → pushes to ECR → CodeDeploy does Blue/Green ECS deployment. Deployment time dropped to 15 minutes with zero downtime. A bad deployment auto-rolled back when the ALB health check alarm fired in CloudWatch.
How does CodePipeline compare to GitHub Actions or Jenkins for AWS-based CI/CD?
CloudFront is AWS's Content Delivery Network (CDN) with 400+ Edge Locations worldwide.
How it works:
- User requests content (e.g., https://cdn.example.com/image.jpg).
- Request goes to the nearest Edge Location.
- Cache hit: return cached content immediately (< 10ms).
- Cache miss: Edge Location fetches from Origin (S3, ALB, custom HTTP server), caches it, returns to user.
Origins: S3 bucket, ALB, API Gateway, custom HTTP server, MediaStore.
OAC (Origin Access Control):
- Replaces the older OAI (Origin Access Identity).
- Ensures S3 bucket is only accessible through CloudFront, not directly.
- S3 bucket policy allows only the CloudFront distribution's OAC.
Cache Behaviors:
- Rules that match URL patterns (e.g., /api/*, /images/*, /static/*).
- Each behavior can have a different origin, TTL, and caching policy.
- /api/* → ALB (no caching), /static/* → S3 (cache 1 year).
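Behavior selection can be pictured as "first matching path pattern wins, otherwise fall back to the default behavior". The sketch below approximates this with Python's `fnmatchcase`, which is case-sensitive and supports `*` and `?`. It is only an approximation (for example, `fnmatch` also understands `[...]` character sets), and `pick_behavior` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def pick_behavior(path, behaviors, default):
    """Return the target of the first behavior whose pattern matches the path."""
    for pattern, target in behaviors:
        if fnmatchcase(path, pattern):
            return target
    return default  # the default behavior catches everything else


behaviors = [
    ("/api/*", "ALB (no caching)"),
    ("/static/*", "S3 (cache 1 year)"),
]

for path in ("/api/orders", "/static/css/app.css", "/index.html"):
    print(path, "->", pick_behavior(path, behaviors, default="S3 (default)"))
```

Because the first match wins, more specific patterns should be listed ahead of broader ones.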
Lambda@Edge / CloudFront Functions:
- Run code at the edge in response to CloudFront events.
- CloudFront Functions: lightweight (< 1ms), viewer request/response only, for header manipulation, URL rewrites, redirects.
- Lambda@Edge: full Lambda on viewer or origin request/response (up to 5s for viewer triggers, 30s for origin triggers), for auth, A/B testing, dynamic content generation.
Invalidation: removes cached content before TTL expires. Use sparingly (costs $0.005/path after 1,000 free). Better: use versioned file names (/css/style.v2.css).
# ── CloudFormation: CloudFront + S3 with OAC ──
# Resources:
# Distribution:
# Type: AWS::CloudFront::Distribution
# Properties:
# DistributionConfig:
# Origins:
# - Id: S3Origin
# DomainName: !GetAtt AssetsBucket.RegionalDomainName
# OriginAccessControlId: !GetAtt OAC.Id
# S3OriginConfig:
# OriginAccessIdentity: ""
# - Id: ALBOrigin
# DomainName: !GetAtt ALB.DNSName
# CustomOriginConfig:
# OriginProtocolPolicy: https-only
# DefaultCacheBehavior:
# TargetOriginId: S3Origin
# ViewerProtocolPolicy: redirect-to-https
# CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6 # CachingOptimized
# Compress: true
# CacheBehaviors:
# - PathPattern: "/api/*"
# TargetOriginId: ALBOrigin
# ViewerProtocolPolicy: https-only
# CachePolicyId: 4135ea2d-6df8-44a3-9df3-4b5a84be39ad # CachingDisabled
# Aliases: [cdn.example.com]
# ViewerCertificate:
# AcmCertificateArn: !Ref SSLCert
# SslSupportMethod: sni-only
#
# OAC:
# Type: AWS::CloudFront::OriginAccessControl
# Properties:
# OriginAccessControlConfig:
# Name: S3OAC
# OriginAccessControlOriginType: s3
# SigningBehavior: always
# SigningProtocol: sigv4
# ── S3 Bucket Policy allowing only CloudFront OAC ──
# {
# "Statement": [{
# "Effect": "Allow",
# "Principal": {"Service": "cloudfront.amazonaws.com"},
# "Action": "s3:GetObject",
# "Resource": "arn:aws:s3:::my-bucket/*",
# "Condition": {
# "StringEquals": {
# "AWS:SourceArn": "arn:aws:cloudfront::123:distribution/E1234"
# }
# }
# }]
# }
# ── Invalidation ──
aws cloudfront create-invalidation \
--distribution-id E1234ABCDEF \
--paths "/index.html" "/css/*"
# ── CloudFront Function: Add security headers ──
# function handler(event) {
# var response = event.response;
# response.headers["x-frame-options"] = {value: "DENY"};
# response.headers["x-content-type-options"] = {value: "nosniff"};
# response.headers["strict-transport-security"] = {
# value: "max-age=63072000; includeSubdomains; preload"
# };
# return response;
# }
An e-commerce site served images directly from S3 — first-time load took 800ms for users in Asia (S3 bucket in us-east-1). After adding CloudFront: first request still went to origin (cache miss), but subsequent requests from the same region returned in < 20ms (cache hit). They set up cache behaviors: /static/* → S3 with 1-year TTL, /api/* → ALB with no caching, /* → S3 (default). OAC blocked direct S3 access. Monthly S3 request costs dropped 80% because CloudFront absorbed the traffic.
What is CloudFront Functions vs Lambda@Edge? When would you use one over the other?
AWS provides three patterns for coordinating distributed systems:
Step Functions — orchestration (centralized control):
- Visual state machine that coordinates Lambda, ECS, Glue, SQS, and 200+ AWS services.
- States: Task, Choice (if/else), Parallel, Map (loop), Wait, Pass, Fail, Succeed.
- Built-in error handling: Retry with exponential backoff, Catch for fallback paths.
- Maintains execution state — you can see exactly where a workflow is at any time.
- Standard Workflows: up to 1 year, exactly-once, auditable ($0.025/1K transitions).
- Express Workflows: up to 5 minutes, at-least-once, high-volume ($0.000001/request plus duration charges).
- Best for: complex, stateful workflows — order processing, ETL pipelines, human approval flows.
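The exponential backoff used by a Retry block can be worked out by hand: each wait is the initial interval multiplied by the backoff rate raised to the number of retries already made. The sketch below computes those waits using the Amazon States Language defaults (`IntervalSeconds` 1, `MaxAttempts` 3, `BackoffRate` 2.0). The helper name `retry_delays` is hypothetical, and optional fields such as a maximum delay or jitter are ignored.

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Seconds waited before each retry attempt, in order."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


print(retry_delays())                    # defaults
print(retry_delays(2, 4, 1.5))           # custom policy
print(sum(retry_delays()), "seconds of waiting before the Catch path is taken")
```

Summing the list gives the total added latency before a persistently failing state falls through to its Catch handler.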
SQS — choreography (decoupled, point-to-point):
- Simple queue — no workflow state, no branching, no parallel execution.
- Consumer processes messages independently. Dead Letter Queue for failures.
- Best for: simple task queues, decoupling, buffering between services.
EventBridge — choreography (event-driven, many-to-many):
- Events route to targets based on content rules. Loose coupling — producers don't know about consumers.
- No workflow state. Each target acts independently.
- Best for: event-driven architectures, SaaS integrations, decoupled microservices.
Orchestration vs Choreography: orchestration = one central coordinator (Step Functions). Choreography = services react to events independently (SQS/EventBridge).
# ── Step Functions: Order processing workflow (ASL) ──
# {
# "StartAt": "ValidateOrder",
# "States": {
# "ValidateOrder": {
# "Type": "Task",
# "Resource": "arn:aws:lambda:...:ValidateOrderFn",
# "Next": "CheckInventory",
# "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
# "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "OrderFailed"}]
# },
# "CheckInventory": {
# "Type": "Task",
# "Resource": "arn:aws:lambda:...:CheckInventoryFn",
# "Next": "ProcessPayment"
# },
# "ProcessPayment": {
# "Type": "Task",
# "Resource": "arn:aws:lambda:...:ProcessPaymentFn",
# "Next": "ParallelNotifications"
# },
# "ParallelNotifications": {
# "Type": "Parallel",
# "Branches": [
# {"StartAt": "SendEmail", "States": {"SendEmail": {"Type": "Task", "Resource": "arn:aws:lambda:...:SendEmailFn", "End": true}}},
# {"StartAt": "UpdateAnalytics", "States": {"UpdateAnalytics": {"Type": "Task", "Resource": "arn:aws:lambda:...:AnalyticsFn", "End": true}}}
# ],
# "Next": "OrderComplete"
# },
# "OrderComplete": {"Type": "Succeed"},
# "OrderFailed": {"Type": "Fail", "Error": "OrderProcessingFailed"}
# }
# }
# ── Create State Machine ──
aws stepfunctions create-state-machine \
--name OrderProcessing \
--definition file://workflow.json \
--role-arn arn:aws:iam::123:role/StepFunctionsRole
# ── Start execution ──
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123:stateMachine:OrderProcessing \
--input '{"orderId":"ORD-001","amount":99.99}'
# ── Comparison ──
# Feature | Step Functions | SQS | EventBridge
# Pattern | Orchestration | Choreography | Choreography
# State | Full state machine | None | None
# Branching | Choice, Parallel | None | Rule-based routing
# Error handling| Retry + Catch | DLQ | DLQ
# Visibility | Visual workflow | Queue depth | Events log
# Max duration | 1 year (Standard) | 14-day retention | Retries up to 24 h
# Cost model | Per transition | Per message | Per event
An ETL pipeline was built with chained Lambda functions triggering each other via SQS. When step 3 of 7 failed, there was no way to retry from step 3 — the entire pipeline had to restart from step 1 (re-processing 2 hours of work). After migrating to Step Functions, each step was a state with built-in retry (3 attempts, exponential backoff) and a catch block for notification. A failure at step 3 retried automatically, and the visual console showed exactly where the workflow was stuck.
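The retry behaviour in this scenario follows the `IntervalSeconds`, `BackoffRate`, and `MaxAttempts` fields of an ASL `Retry` block. Below is a minimal Python sketch of that schedule, assuming the documented defaults (1 second, 2.0, 3 attempts) and ignoring jitter and `MaxDelaySeconds`. `retry_delays` and `run_with_retry` are illustrative helpers, not part of any AWS SDK.

```python
import time
from typing import Callable, Optional


def retry_delays(interval_seconds: float = 1.0,
                 backoff_rate: float = 2.0,
                 max_attempts: int = 3) -> list[float]:
    """Wait before retry n (0-based): interval_seconds * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def run_with_retry(task: Callable[[], object],
                   delays: list[float],
                   sleep: Callable[[float], None] = time.sleep) -> object:
    """Run task once, retry after each delay, re-raise when retries run out."""
    schedule: list[Optional[float]] = [*delays, None]
    for delay in schedule:
        try:
            return task()
        except Exception:
            if delay is None:
                raise  # retries exhausted; in ASL a Catch block would take over
            sleep(delay)
    raise RuntimeError("unreachable")


# A step that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_step() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "step 3 done"


slept: list[float] = []  # record the waits instead of actually sleeping
outcome = run_with_retry(flaky_step, retry_delays(), sleep=slept.append)
print(outcome, slept)
```

Only the failing step is re-run; earlier steps keep their results, which is the property the chained-Lambda design lacked.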
What are Step Functions Express Workflows and when should you use them instead of Standard Workflows?
AWS WAF (Web Application Firewall) protects web applications from common web exploits at Layer 7.
Components:
- Web ACL: the main resource. Contains rules that are evaluated in order. Associated with CloudFront, ALB, API Gateway, or AppSync.
- Rules:
- Regular rules: match conditions (IP, header, body, URI) → Allow, Block, Count, or CAPTCHA.
- Rate-based rules: block IPs exceeding a threshold (e.g., > 2,000 requests in 5 minutes).
- Rule Groups:
- AWS Managed Rules: pre-built by AWS — Core Rule Set (CRS), SQL injection, XSS, bad bots, known bad inputs.
- Marketplace Rules: from third-party vendors (F5, Fortinet, Imperva).
- Custom Rule Groups: your own rules for application-specific logic.
- WCUs (Web ACL Capacity Units): each rule costs WCUs. Web ACL limit: 5,000 WCUs.
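To make the rate-based rule concrete, here is a simplified Python model of a per-IP limit over a trailing 5-minute window. It is a sketch of the idea only: the class name and the `ALLOW`/`BLOCK` strings are made up for illustration, and the real service evaluates request rates on its own schedule rather than on every request.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Block an IP once it exceeds `limit` requests in the trailing window."""

    def __init__(self, limit: int = 2000, window_seconds: int = 300) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def evaluate(self, ip: str, now: float) -> str:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the trailing window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        hits.append(now)
        return "BLOCK" if len(hits) > self.limit else "ALLOW"


# Tiny limit so the behaviour is visible: 3 requests per 300 seconds.
rule = RateBasedRule(limit=3, window_seconds=300)
decisions = [rule.evaluate("203.0.113.7", t) for t in (0, 1, 2, 3)]
after_cooldown = rule.evaluate("203.0.113.7", 400)
other_client = rule.evaluate("198.51.100.9", 3)
print(decisions, after_cooldown, other_client)
```

Each IP has its own counter, so one noisy client does not affect others, and a blocked client is allowed again once its old requests age out of the window.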
AWS Shield — DDoS protection:
- Shield Standard: free, automatic. Protects against Layer 3/4 DDoS (SYN floods, UDP reflection). Applied to all AWS resources.
- Shield Advanced: $3,000/month with a 1-year commitment. Layer 3/4/7 protection, real-time visibility, access to the Shield Response Team (SRT, formerly the DDoS Response Team), cost protection (credits for scaling charges incurred during an attack), health-based detection.
# ── Create Web ACL with managed rules ──
aws wafv2 create-web-acl \
--name MyAppWAF \
--scope REGIONAL \
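--default-action Allow={} \
--visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=MyAppWAF \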
--rules '[
{
"Name": "AWSManagedRulesCommonRuleSet",
"Priority": 1,
"OverrideAction": {"None": {}},
"Statement": {
"ManagedRuleGroupStatement": {
"VendorName": "AWS",
"Name": "AWSManagedRulesCommonRuleSet"
}
},
"VisibilityConfig": {
"SampledRequestsEnabled": true,
"CloudWatchMetricsEnabled": true,
"MetricName": "CommonRuleSet"
}
},
{
"Name": "RateLimitRule",
"Priority": 2,
"Action": {"Block": {}},
"Statement": {
"RateBasedStatement": {
"Limit": 2000,
"AggregateKeyType": "IP"
}
},
"VisibilityConfig": {
"SampledRequestsEnabled": true,
"CloudWatchMetricsEnabled": true,
"MetricName": "RateLimit"
}
}
]'
# ── CloudFormation: WAF for ALB ──
# (VisibilityConfig is required on the Web ACL and on every rule; omitted here for brevity)
# Resources:
#   WebACL:
#     Type: AWS::WAFv2::WebACL
#     Properties:
#       Name: MyAppWAF
#       Scope: REGIONAL
#       DefaultAction: {Allow: {}}
#       Rules:
#         - Name: AWSManagedRulesCommonRuleSet
#           Priority: 1
#           OverrideAction: {None: {}}
#           Statement:
#             ManagedRuleGroupStatement:
#               VendorName: AWS
#               Name: AWSManagedRulesCommonRuleSet
#         - Name: SQLiProtection
#           Priority: 2
#           OverrideAction: {None: {}}
#           Statement:
#             ManagedRuleGroupStatement:
#               VendorName: AWS
#               Name: AWSManagedRulesSQLiRuleSet
#         - Name: RateLimit
#           Priority: 3
#           Action: {Block: {}}
#           Statement:
#             RateBasedStatement:
#               Limit: 2000
#               AggregateKeyType: IP
#
#   WebACLAssociation:
#     Type: AWS::WAFv2::WebACLAssociation
#     Properties:
#       ResourceArn: !Ref ALB
#       WebACLArn: !GetAtt WebACL.Arn
# ── Common AWS Managed Rule Groups ──
# AWSManagedRulesCommonRuleSet — general protection (XSS, SSRF, etc.)
# AWSManagedRulesSQLiRuleSet — SQL injection
# AWSManagedRulesKnownBadInputsRuleSet — log4j, Spring exploits
# AWSManagedRulesBotControlRuleSet — bot detection (paid add-on with extra per-request fees)
# AWSManagedRulesATPRuleSet — account takeover protection
An e-commerce site was hit by a credential stuffing attack: bots tried thousands of username/password combinations on the login page. The team deployed WAF with: (1) AWS Managed Rules Common Rule Set (blocked common exploits), (2) a rate-based rule at 100 requests per 5 minutes per IP on /login, (3) the Bot Control managed rule group to detect automated tools. Login abuse dropped by 99%. They also enabled Shield Advanced for their peak sales events; during a DDoS attack, the Shield Response Team helped mitigate it within minutes.
What is AWS Firewall Manager and how does it centrally manage WAF rules across multiple accounts?
The AWS Well-Architected Framework provides best practices for building secure, high-performing, resilient, and efficient cloud architectures. It has 6 pillars:
1. Operational Excellence:
- Run and monitor systems, continuously improve processes.
- Key practices: IaC (CloudFormation/Terraform), CI/CD pipelines, runbooks, observability (CloudWatch, X-Ray), small frequent changes, anticipate failure.
2. Security:
- Protect data, systems, and assets. Principle of least privilege.
- Key practices: IAM roles (not users), encryption at rest and in transit, MFA, security groups, WAF, GuardDuty, detective controls, incident response plan.
3. Reliability:
- Recover from failures, meet demand. Design for failure.
- Key practices: Multi-AZ, Auto Scaling, health checks, circuit breakers, backups, DR testing, chaos engineering.
4. Performance Efficiency:
- Use computing resources efficiently as demand changes.
- Key practices: right-sizing (Compute Optimizer), serverless (Lambda, Fargate), caching (ElastiCache, CloudFront), global infrastructure, performance testing.
5. Cost Optimization:
- Avoid unnecessary costs. Pay only for what you use.
- Key practices: Reserved Instances, Savings Plans, Spot, right-sizing, lifecycle policies, Cost Explorer, budgets and alerts.
6. Sustainability (newest pillar):
- Minimize environmental impact of cloud workloads.
- Key practices: Graviton (energy-efficient), serverless, efficient code, data retention policies, Region selection (renewable energy).
# ── Well-Architected Tool: Create a workload review ──
aws wellarchitected create-workload \
--workload-name "E-Commerce Platform" \
--description "Production e-commerce application" \
--environment PRODUCTION \
--lenses wellarchitected \
--aws-regions us-east-1
# ── Answer a question in the review ──
aws wellarchitected update-answer \
--workload-id w-abc123 \
--lens-alias wellarchitected \
--question-id "ops-1" \
--selected-choices "ops_1_aws_cloud_ops_1" \
--notes "We use CloudFormation for all infrastructure"
# ── Pillar checklist (key questions per pillar) ──
# OPERATIONAL EXCELLENCE:
# □ Do you use IaC for all infrastructure?
# □ Do you have a CI/CD pipeline with automated testing?
# □ Do you have runbooks for common operational tasks?
# □ Do you have observability (metrics, logs, traces)?
#
# SECURITY:
# □ Is MFA enabled for all human users?
# □ Are IAM roles used instead of access keys?
# □ Is encryption enabled at rest and in transit?
# □ Are security groups following least privilege?
# □ Is CloudTrail enabled for audit logging?
#
# RELIABILITY:
# □ Are workloads deployed across multiple AZs?
# □ Is Auto Scaling configured with health checks?
# □ Do you have automated backups and tested restores?
# □ Is there a disaster recovery plan with defined RPO/RTO?
#
# PERFORMANCE EFFICIENCY:
# □ Are instances right-sized (Compute Optimizer)?
# □ Is caching used (CloudFront, ElastiCache)?
# □ Are you using serverless where appropriate?
#
# COST OPTIMIZATION:
# □ Are Reserved Instances/Savings Plans used for steady-state?
# □ Are unused resources identified and removed?
# □ Are S3 lifecycle policies configured?
# □ Are budgets and billing alerts set up?
#
# SUSTAINABILITY:
# □ Are you using Graviton instances (up to 60% less energy for the same performance, per AWS)?
# □ Are you using serverless to minimize idle resources?
A fintech startup's first Well-Architected Review surfaced 15 high-risk issues (HRIs). Top findings: (1) no Multi-AZ for the database, a single point of failure, (2) no encryption on S3 buckets, (3) IAM users with admin access and no MFA, (4) no budget alerts, which had led to an $8K surprise bill, (5) no DR plan. Over 3 months, they remediated all HRIs: RDS Multi-AZ, S3 SSE-KMS, IAM roles with least privilege, billing alerts, and quarterly DR drills. Their next review had zero HRIs.
What are Well-Architected Lenses and how do they extend the framework for specific workloads (serverless, SaaS, ML)?
AWS cost optimization is a continuous process with multiple strategies:
Pricing Models:
- On-Demand: pay per second/hour. No commitment. Full price. Use for variable, short-term workloads.
- Reserved Instances (RIs): 1 or 3-year commitment for a specific instance type in a specific Region. Up to 72% discount. Standard (fixed type) or Convertible (can change type).
- Savings Plans: 1 or 3-year commitment to a dollar amount per hour of compute. More flexible than RIs. Applies to EC2, Lambda, and Fargate. Recommended over RIs for most cases.
- Spot Instances: unused EC2 capacity at up to 90% discount. Can be interrupted with 2-minute notice. Best for batch processing, CI/CD, data analysis, fault-tolerant workloads.
Right-Sizing:
- Use AWS Compute Optimizer to identify over-provisioned instances.
- Common finding: instances running at 5-10% CPU → downsize to save 50%.
- Review monthly. Right-size before buying RIs/Savings Plans.
Other strategies:
- S3 Lifecycle Policies: move to cheaper tiers automatically.
- Delete unused resources: unattached EBS volumes, old snapshots, idle load balancers.
- Graviton instances: up to 40% better price/performance than comparable x86 instances.
- Cost Explorer + Budgets: visibility and alerting.
# ── Buy a Savings Plan ──
aws savingsplans create-savings-plan \
--savings-plan-offering-id offering-abc123 \
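--commitment 10.00
# The term (1 or 3 years) and payment option are fixed by the offering ID.
# List offerings with: aws savingsplans describe-savings-plans-offerings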
# ── Request Spot Instances ──
aws ec2 run-instances \
--instance-type m7g.xlarge \
--instance-market-options MarketType=spot \
--count 5 \
--image-id ami-abc123 \
--tag-specifications "ResourceType=instance,Tags=[{Key=Purpose,Value=BatchProcessing}]"
# ── Right-sizing: Get Compute Optimizer recommendations ──
aws compute-optimizer get-ec2-instance-recommendations \
--filters name=Finding,values=Overprovisioned \
--query "instanceRecommendations[].{Instance:instanceArn,Current:currentInstanceType,Recommended:recommendationOptions[0].instanceType,Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value}"
# ── Find unused EBS volumes ──
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query "Volumes[].{VolumeId:VolumeId,Size:Size,Created:CreateTime}" \
--output table
# ── Set up billing alarm (billing metrics exist only in us-east-1; enable billing alerts in the Billing console first) ──
aws cloudwatch put-metric-alarm \
--alarm-name BillingAlarm \
--metric-name EstimatedCharges \
--namespace AWS/Billing \
--statistic Maximum --period 21600 \
--evaluation-periods 1 --threshold 1000 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123:billing-alerts \
--dimensions Name=Currency,Value=USD
# ── Cost comparison (m7g.large, us-east-1) ──
# Pricing Model | $/hour | Monthly (730hr) | Savings
# On-Demand | $0.0816 | $59.57 | 0%
# 1yr Savings Plan | $0.0518 | $37.81 | 37%
# 3yr Savings Plan | $0.0332 | $24.24 | 59%
# Spot Instance | ~$0.024 | ~$17.52 | ~70%
#
# ── When to use each ──
# Steady-state production → Savings Plan (1yr or 3yr)
# Variable production → On-Demand + Auto Scaling
# Batch processing → Spot Instances (fault-tolerant)
# Dev/Test → Spot or On-Demand + shutdown schedule
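The comparison table above can be reproduced with a few lines of Python. The hourly rates are the ones quoted in the table. The break-even calculation is an added illustration; it assumes a Savings Plan commitment is billed for every hour of the term, while On-Demand is billed only for hours actually used.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 0.0816  # m7g.large, us-east-1, as quoted in the table

RATES = {
    "On-Demand": 0.0816,
    "1yr Savings Plan": 0.0518,
    "3yr Savings Plan": 0.0332,
    "Spot (approx.)": 0.024,
}


def monthly_cost(hourly_rate: float) -> float:
    """Cost of running one instance for a full 730-hour month."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


def savings_pct(hourly_rate: float) -> int:
    """Discount relative to On-Demand, rounded to a whole percent."""
    return round((1 - hourly_rate / ON_DEMAND_RATE) * 100)


def break_even_utilization(committed_rate: float) -> float:
    """Fraction of hours the instance must run before a commitment beats On-Demand."""
    return committed_rate / ON_DEMAND_RATE


for name, rate in RATES.items():
    print(f"{name:18} ${monthly_cost(rate):6.2f}/month  {savings_pct(rate)}% off")

print(f"1yr plan pays off above {break_even_utilization(0.0518):.0%} utilization")
```

The last line is the reason to right-size and measure usage before committing: an instance that runs only during business hours sits well below that break-even point.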
A company spent $50K/month on AWS. Cost optimization analysis found: (1) 40% of EC2 instances were running at < 10% CPU → right-sized, saving $8K, (2) 200 unattached EBS volumes → deleted, saving $1.5K, (3) S3 lifecycle policies → saved $3K, (4) Compute Savings Plans for steady-state workloads → saved $12K, (5) Spot Instances for nightly batch jobs → saved $4K. Total monthly savings: $28.5K (57% reduction). They set up Cost Explorer dashboards and $40K monthly budget alerts.
What is the difference between Reserved Instances and Savings Plans? Which should you choose?
Disaster Recovery (DR) ensures business continuity when infrastructure fails. Two key metrics define your DR requirements:
RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. If RPO = 1 hour, you can afford to lose up to 1 hour of data.
RTO (Recovery Time Objective): Maximum acceptable downtime. If RTO = 15 minutes, the system must be back online within 15 minutes.
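A small worked example may help separate the two metrics. The sketch below measures data loss (failure time minus last recovery point) and downtime (restore time minus failure time) for one incident and compares them with the targets. The function name and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def check_recovery(last_backup: datetime, failure: datetime, restored: datetime,
                   rpo: timedelta, rto: timedelta) -> dict:
    """Compare one incident's actual data loss and downtime with the RPO/RTO targets."""
    data_loss = failure - last_backup  # work done since the last recovery point
    downtime = restored - failure      # how long the system was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


incident = check_recovery(
    last_backup=datetime(2026, 5, 30, 3, 0),
    failure=datetime(2026, 5, 30, 3, 45),
    restored=datetime(2026, 5, 30, 4, 30),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=15),
)
print(incident)
```

Here 45 minutes of data were lost, which satisfies a 1-hour RPO, but the 45-minute outage misses a 15-minute RTO. The two objectives are independent and usually call for different fixes (more frequent replication versus faster failover).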
Four DR strategies (ordered by cost and recovery time):
- 1. Backup & Restore: cheapest. Backups stored in S3/Glacier. On disaster: restore from backup, provision infrastructure. RPO: hours. RTO: hours. Cost: very low.
- 2. Pilot Light: critical components running (database replica), compute scaled to zero. On disaster: start compute, scale up. RPO: minutes. RTO: tens of minutes. Cost: low.
- 3. Warm Standby: scaled-down copy of production always running. On disaster: scale up to full production capacity. RPO: seconds. RTO: minutes. Cost: medium.
- 4. Multi-Site Active-Active: full production in 2+ Regions handling live traffic. On disaster: traffic shifts to surviving Region. RPO: near-zero. RTO: near-zero. Cost: high (2x+ infrastructure).
Choosing a strategy: based on business impact of downtime. A banking app (RTO < 1 min) needs active-active. A reporting dashboard (RTO < 4 hrs) can use pilot light.
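The selection logic can be sketched as "pick the cheapest strategy whose worst case still meets both targets". The numeric bounds below are illustrative assumptions that put rough numbers on the ranges in the list above (hours, minutes, seconds, near-zero); they are not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rpo: timedelta  # assumed worst-case data loss
    rto: timedelta  # assumed worst-case downtime


# Ordered cheapest first. All bounds are assumptions for illustration.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(minutes=15), timedelta(minutes=30)),
    Strategy("Warm Standby", timedelta(minutes=1), timedelta(minutes=15)),
    Strategy("Multi-Site Active-Active", timedelta(seconds=1), timedelta(minutes=1)),
]


def cheapest_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[str]:
    """Return the first (cheapest) strategy that satisfies both targets, if any."""
    for strategy in STRATEGIES:
        if strategy.rpo <= rpo_target and strategy.rto <= rto_target:
            return strategy.name
    return None


# Reporting dashboard: can lose a day of data, must be back within 4 hours.
dashboard = cheapest_strategy(timedelta(hours=24), timedelta(hours=4))
# Banking app: at most a minute of data loss and a minute of downtime.
banking = cheapest_strategy(timedelta(minutes=1), timedelta(minutes=1))
print(dashboard, "|", banking)
```

With these assumed bounds the dashboard lands on Pilot Light and the banking app on Active-Active, matching the examples in the text. Changing the bounds to reflect measured drill results would change the answers.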
# ── Strategy 1: Backup & Restore ──
# Automated backups:
aws rds modify-db-instance \
--db-instance-identifier mydb \
--backup-retention-period 7 \
--preferred-backup-window "03:00-04:00"
# S3 cross-region backup
aws s3 sync s3://prod-data s3://dr-backup-eu --storage-class STANDARD_IA
# To recover: create new infrastructure from CloudFormation,
# restore DB from latest snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier mydb-restored \
--db-snapshot-identifier rds:mydb-2026-05-30-03-00
# ── Strategy 2: Pilot Light ──
# Aurora read replica in DR Region (always running)
# ASG with min=0, max=10 in DR Region (no instances running)
# Route 53 failover with health checks
# On disaster: scale up ASG
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name DR-WebServers \
--min-size 3 --desired-capacity 6
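# Promote the Aurora secondary cluster to primary.
# With the primary Region down, --allow-data-loss requests an unplanned failover;
# without it the command performs a planned switchover that needs a healthy primary.
aws rds failover-global-cluster \
--global-cluster-identifier my-global-db \
--target-db-cluster-identifier arn:aws:rds:eu-west-1:123:cluster:my-aurora-secondary \
--allow-data-loss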
# ── Strategy 3: Warm Standby ──
# Scaled-down infra in DR Region always running
# Primary: 10 instances, DR: 2 instances
# Route 53 weighted: 100% primary, 0% DR (or failover)
# ── Strategy 4: Active-Active ──
# DynamoDB Global Tables + Aurora Global DB
# Route 53 latency-based routing
# Both Regions serve live traffic simultaneously
# ── DR Strategy comparison ──
# Strategy | RPO | RTO | Cost
# Backup/Restore| Hours | Hours | $
# Pilot Light | Minutes | 10-30 min | $$
# Warm Standby | Seconds | Minutes | $$$
# Active-Active | Near-zero | Near-zero | $$$$
# ── Key AWS services for DR ──
# Data: Aurora Global DB, DynamoDB Global Tables, S3 CRR
# Compute: ASG, Launch Templates, AMIs (copy cross-region)
# DNS: Route 53 failover/latency routing + health checks
# IaC: CloudFormation StackSets (deploy to multiple Regions)
A healthcare SaaS application needed < 15-minute RTO and < 1-minute RPO for compliance. They implemented warm standby: Aurora Global Database with a secondary cluster in eu-west-1 (replication lag typically under 1 second), a scaled-down ECS service (2 tasks vs 10 in production), and Route 53 failover routing with health checks. Quarterly DR drills showed they could fail over in 8 minutes. During an actual us-east-1 outage, the automated failover completed in 6 minutes, well within their 15-minute RTO commitment.
What is AWS Elastic Disaster Recovery (DRS) and how does it simplify DR for on-premises workloads?
AWS Control Tower automates the setup and governance of a secure, multi-account AWS environment (a "landing zone").
What it provides:
- Account Factory: automated account provisioning with predefined configurations (VPC, subnets, IAM, logging). Create new accounts in minutes via Service Catalog.
- Guardrails (Controls):
- Preventive: SCPs that prevent non-compliant actions (e.g., "disallow public S3 buckets").
- Detective: AWS Config rules that detect non-compliance (e.g., "S3 bucket without encryption").
- Proactive: CloudFormation hooks that block non-compliant resource creation before deployment.
- Dashboard: centralized view of compliance across all accounts.
- Log Archive Account: centralized CloudTrail and Config logs (immutable).
- Audit Account: security team access to all accounts for investigation.
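The difference between preventive and detective controls is mostly about when the check runs. The Python sketch below models that timing with plain dictionaries. It is a conceptual illustration only: real preventive controls are SCPs and real detective controls are AWS Config rules, and every name here is hypothetical.

```python
class ControlViolation(Exception):
    """Raised when a preventive control rejects a request."""


def preventive_require_encryption(request: dict) -> dict:
    """Preventive: evaluated BEFORE the action; a non-compliant request never happens."""
    if not request.get("encrypted", False):
        raise ControlViolation(f"{request['id']}: unencrypted volumes are not allowed")
    return request


def detective_find_unencrypted(resources: list[dict]) -> list[str]:
    """Detective: evaluated AFTER the fact; existing resources are scanned and flagged."""
    return [r["id"] for r in resources if not r.get("encrypted", False)]


existing = [
    {"id": "vol-001", "encrypted": True},
    {"id": "vol-002", "encrypted": False},  # created before any control existed
]
flagged = detective_find_unencrypted(existing)

try:
    preventive_require_encryption({"id": "vol-003", "encrypted": False})
    blocked = False
except ControlViolation:
    blocked = True

print(flagged, blocked)
```

The detective check reports `vol-002` but leaves it in place, while the preventive check stops `vol-003` from being created at all. Most landing zones use both, because preventive rules cannot cover resources that already exist.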
Landing Zone Architecture:
- Management Account: Organizations root. Billing. No workloads.
- Log Archive Account: centralized CloudTrail, Config, VPC Flow Logs. S3 bucket with object lock.
- Audit/Security Account: GuardDuty admin, Security Hub, IAM Access Analyzer.
- Shared Services Account: Transit Gateway, DNS, CI/CD pipelines.
- Workload Accounts: separate accounts per environment (prod, staging, dev) and per team.
# ── Enable Control Tower (usually via the Console; a CreateLandingZone API/CLI also exists) ──
# 1. Go to AWS Control Tower in the management account
# 2. Set up landing zone → creates:
# - Log Archive account
# - Audit account
# - Security OU, Sandbox OU
# - 20+ mandatory guardrails
# ── Account Factory: Create a new workload account ──
# Via Service Catalog or CLI (also pass --provisioning-artifact-name or
# --provisioning-artifact-id for the current Account Factory version):
aws servicecatalog provision-product \
--product-name "AWS Control Tower Account Factory" \
--provisioned-product-name "team-alpha-prod" \
--provisioning-parameters \
Key=AccountName,Value=team-alpha-prod \
Key=AccountEmail,Value=team-alpha-prod@company.com \
Key=SSOUserEmail,Value=admin@company.com \
Key=SSOUserFirstName,Value=Admin \
Key=SSOUserLastName,Value=User \
Key=ManagedOrganizationalUnit,Value="Workloads/Production"
# ── Guardrails (Controls) ──
# Mandatory (always on):
# - Disallow changes to CloudTrail configuration
# - Disallow changes to AWS Config rules
# - Disallow deletion of log archive
# Strongly recommended:
# - Enable encryption for EBS volumes
# - Disallow public S3 buckets
# - Disallow internet access for RDS instances
# - Enable MFA for root user
# ── Enable additional guardrails ──
aws controltower enable-control \
--control-identifier arn:aws:controltower:us-east-1::control/AWS-GR_ENCRYPTED_VOLUMES \
--target-identifier arn:aws:organizations::123:ou/o-xxx/ou-xxx
# ── Landing Zone structure ──
# Root (Management Account — billing only)
# ├── Security OU
# │ ├── Log Archive Account (CloudTrail, Config, immutable S3)
# │ └── Audit Account (GuardDuty, Security Hub, IAM Analyzer)
# ├── Infrastructure OU
# │ └── Shared Services (Transit Gateway, DNS, CI/CD)
# ├── Workloads OU
# │ ├── Production OU
# │ │ ├── team-alpha-prod
# │ │ └── team-beta-prod
# │ ├── Staging OU
# │ └── Development OU
# └── Sandbox OU (experimentation, budget limits)
# ── Customizations for Control Tower (CfCT) ──
# Deploy additional CloudFormation stacks to new accounts:
# - VPC with standard subnets
# - Security groups baseline
# - IAM roles for CI/CD
# - CloudWatch alarm baseline
A growing startup with 3 teams needed separate AWS environments. They set up Control Tower: 8 accounts (mgmt, log archive, audit, shared services, team-a-prod, team-a-dev, team-b-prod, team-b-dev). Account Factory created each account in 15 minutes with standardized VPCs, IAM roles, and guardrails. Detective guardrails caught a developer who created an unencrypted RDS instance in prod — automatically flagged in the dashboard. Centralized CloudTrail logs in the log archive account made security audits trivial.
What is AWS Config and how does it complement Control Tower for continuous compliance monitoring?
Observability = Metrics + Logs + Traces. The ability to understand system behavior from external outputs.
AWS X-Ray — distributed tracing:
- Traces requests across multiple services (API Gateway → Lambda → DynamoDB → SQS → Lambda).
- Generates a Service Map: visual graph of service dependencies with latency and error rates.
- Traces: end-to-end path of a request with timing for each segment.
- Sampling: by default, traces the first request each second plus 5% of any additional requests, to manage cost.
- Integrates with: Lambda, API Gateway, ECS, EKS, EC2, Elastic Beanstalk.
- Use X-Ray SDK in your code to add custom subsegments and annotations.
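The default sampling behaviour (one request per second from a reservoir, then a fixed 5% of the rest) can be modelled in a few lines. This is a simplified single-process sketch, not the SDK's implementation; the `Sampler` class and its parameters are illustrative.

```python
import random
from typing import Callable, Optional


class Sampler:
    """Reservoir-plus-fixed-rate sampling decision, one instance per service."""

    def __init__(self, reservoir_per_second: int = 1, fixed_rate: float = 0.05,
                 rng: Callable[[], float] = random.random) -> None:
        self.reservoir_per_second = reservoir_per_second
        self.fixed_rate = fixed_rate
        self.rng = rng
        self._second: Optional[int] = None
        self._used = 0

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._second:  # a new second refills the reservoir
            self._second, self._used = second, 0
        if self._used < self.reservoir_per_second:
            self._used += 1
            return True
        return self.rng() < self.fixed_rate  # beyond the reservoir: sample ~5%


# Deterministic demo: an rng that never falls under the 5% threshold.
sampler = Sampler(rng=lambda: 0.99)
timestamps = [0.0, 0.2, 0.4, 1.0, 1.5, 2.1]  # six requests across three seconds
sampled = [t for t in timestamps if sampler.should_sample(t)]
print(sampled)
```

The reservoir guarantees that low-traffic services still produce at least one trace per second, while the fixed rate keeps tracing cost roughly proportional to traffic on busy services.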
CloudWatch Container Insights:
- Collects and aggregates metrics from ECS and EKS: CPU, memory, disk, network per cluster, service, task, pod.
- Pre-built dashboards for container performance.
- Uses CloudWatch Agent (EC2 launch type) or Fluent Bit (sidecar or daemonset).
CloudWatch Application Insights: auto-detects application components and sets up monitoring for .NET, Java, and SQL Server workloads.
Amazon Managed Grafana: managed Grafana for custom dashboards. Sources: CloudWatch, X-Ray, Prometheus, Elasticsearch.
Amazon Managed Service for Prometheus: managed, Prometheus-compatible monitoring for Kubernetes metrics. Works with EKS.
# ── Enable X-Ray for Lambda ──
aws lambda update-function-configuration \
--function-name ProcessOrder \
--tracing-config Mode=Active
# ── Python: X-Ray SDK for custom subsegments ──
# from aws_xray_sdk.core import xray_recorder, patch_all
#
# patch_all()  # Auto-instrument boto3, requests, etc.
#
# @xray_recorder.capture("process_payment")
# def process_payment(order):
#     # Annotations are indexed, so traces can be filtered on them in the X-Ray console
#     subsegment = xray_recorder.current_subsegment()
#     subsegment.put_annotation("orderId", order["id"])
#     subsegment.put_annotation("amount", order["amount"])
#     # ... payment logic
#     return {"status": "success"}
#
# def handler(event, context):
#     # The context manager closes the subsegment even if validate() raises
#     with xray_recorder.in_subsegment("validate_input"):
#         order = validate(event)
#
#     return process_payment(order)
# ── Enable Container Insights for ECS ──
aws ecs update-cluster-settings \
--cluster my-cluster \
--settings name=containerInsights,value=enabled
# ── EKS: Install CloudWatch agent for Container Insights ──
# kubectl apply -f https://raw.githubusercontent.com/aws-samples/\
# amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/\
# deployment-mode/daemonSet/container-insights-monitoring/quickstart/\
# cwagent-fluentd-quickstart.yaml
# ── X-Ray: Get Service Map ──
aws xray get-service-graph \
--start-time 2026-05-30T00:00:00Z \
--end-time 2026-05-30T23:59:59Z
# ── Observability stack architecture ──
# Metrics: CloudWatch Metrics + Container Insights → Grafana
# Logs: CloudWatch Logs + Fluent Bit → Logs Insights
# Traces: X-Ray + SDK instrumentation → Service Map
#
# The three pillars together:
# "API latency increased" → (Metrics: P99 spike)
# "Which requests?" → (Traces: X-Ray shows DB calls slow)
# "What error?" → (Logs: CloudWatch Logs Insights query)
# ── CloudWatch dashboard (IaC) ──
# Resources:
#   ObservabilityDashboard:
#     Type: AWS::CloudWatch::Dashboard
#     Properties:
#       DashboardName: AppOverview
#       DashboardBody: !Sub |
#         {"widgets": [
#           {"type": "metric", "properties": {
#             "metrics": [["AWS/Lambda","Duration","FunctionName","ProcessOrder"]],
#             "region": "${AWS::Region}",
#             "title": "Lambda Duration"
#           }},
#           {"type": "metric", "properties": {
#             "metrics": [["AWS/Lambda","Errors","FunctionName","ProcessOrder"]],
#             "region": "${AWS::Region}",
#             "title": "Lambda Errors"
#           }}
#         ]}
A microservices team had 15 services on ECS. When users reported slow checkout, the team spent 3 hours checking each service's logs individually. After enabling X-Ray with SDK instrumentation and Container Insights, they could: (1) see the Service Map showing the checkout flow: API → OrderService → PaymentService → DynamoDB, (2) find the bottleneck: PaymentService → DynamoDB query taking 2 seconds (should be 10ms), (3) drill into the trace and see a missing GSI causing a table scan. Total debugging time: 5 minutes.
What is OpenTelemetry (OTEL) and how does AWS Distro for OpenTelemetry (ADOT) work with X-Ray?
A data lake stores structured, semi-structured, and unstructured data in its raw form for analytics and ML.
Key components:
- Amazon S3 — the storage layer. Data stored in open formats (Parquet, ORC, JSON, CSV). Organized by zones:
- Raw/Landing: original data as received (JSON, CSV).
- Cleansed/Processed: cleaned, validated, converted to Parquet.
- Curated/Analytics: aggregated, enriched, ready for queries.
- AWS Glue — ETL (Extract, Transform, Load):
- Glue Crawlers: auto-discover schema and populate the Glue Data Catalog.
- Glue Jobs: serverless Spark/Python ETL to transform data.
- Glue Data Catalog: centralized metadata store (like Apache Hive Metastore).
- Amazon Athena — serverless SQL query engine:
- Query S3 data directly using SQL. No infrastructure to manage.
- Pay per query ($5/TB scanned). Use Parquet + partitioning to minimize cost.
- Integrates with Glue Data Catalog for table definitions.
- AWS Lake Formation — governance:
- Centralized permissions for the data lake (row-level, column-level, cell-level security).
- Simplifies data sharing across accounts.
- Data lineage and audit trails.
# ── Data Lake S3 structure ──
# s3://my-data-lake/
# ├── raw/ # Landing zone (original data)
# │ ├── orders/2026/05/30/ # Partitioned by date
# │ │ └── orders.json
# │ └── clickstream/2026/05/30/
# │ └── events.json
# ├── processed/ # Cleaned, Parquet format
# │ └── orders/year=2026/month=05/day=30/
# │ └── part-00000.snappy.parquet
# └── curated/ # Analytics-ready
# └── daily_revenue/year=2026/month=05/
# └── revenue.parquet
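The `year=/month=/day=` layout in the processed zone is Hive-style partitioning, which lets query engines skip whole prefixes. A short Python sketch of building and parsing such prefixes follows; the helper names are hypothetical and the zone and dataset names simply mirror the tree above.

```python
from datetime import date


def partition_prefix(dataset: str, day: date, zone: str = "processed") -> str:
    """Build the Hive-style S3 prefix for one day's data."""
    return (f"{zone}/{dataset}/"
            f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/")


def parse_partitions(key: str) -> dict:
    """Extract partition columns (key=value path segments) from an object key."""
    return dict(part.split("=", 1) for part in key.split("/") if "=" in part)


prefix = partition_prefix("orders", date(2026, 5, 30))
columns = parse_partitions(prefix + "part-00000.snappy.parquet")
print(prefix, columns)
```

Zero-padding the month and day keeps prefixes sortable as strings, and the parsed values come back as strings, which is why the Athena query later in this section compares `year='2026'` rather than a number.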
# ── Glue Crawler: Auto-discover schema ──
aws glue create-crawler \
--name orders-crawler \
--role GlueServiceRole \
--database-name datalake \
--targets '{"S3Targets":[{"Path":"s3://my-data-lake/processed/orders/"}]}'
aws glue start-crawler --name orders-crawler
# Creates table "orders" in Glue Data Catalog with schema
# ── Athena: Query S3 data with SQL ──
aws athena start-query-execution \
--query-string "SELECT product, SUM(amount) as revenue
FROM datalake.orders
WHERE year='2026' AND month='05'
GROUP BY product
ORDER BY revenue DESC
LIMIT 10" \
--result-configuration OutputLocation=s3://athena-results/
# ── Glue ETL Job (PySpark) ──
# import sys
# from awsglue.transforms import *
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
#
# sc = SparkContext()
# glueContext = GlueContext(sc)
#
# # Read from raw zone
# raw_df = glueContext.create_dynamic_frame.from_catalog(
# database="datalake", table_name="raw_orders"
# )
#
# # Transform: clean, filter, flatten
# cleaned = raw_df.filter(lambda x: x["amount"] > 0)
#
# # Write to processed zone in Parquet (partitioned)
# glueContext.write_dynamic_frame.from_options(
# frame=cleaned,
# connection_type="s3",
# connection_options={
# "path": "s3://my-data-lake/processed/orders/",
# "partitionKeys": ["year", "month", "day"]
# },
# format="parquet"
# )
# ── Lake Formation: Grant permissions ──
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::123:role/AnalystRole \
--resource '{"Table":{"DatabaseName":"datalake","Name":"orders"}}' \
--permissions SELECT
# ── Cost optimization ──
# Raw JSON: Athena scans 100 GB → $0.50/query
# Parquet + Snappy: scans 10 GB → $0.05/query (90% cheaper!)
# Add partitioning: scans 1 GB → $0.005/query (99% cheaper!)
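The arithmetic behind those figures is just data scanned times the per-TB price. A minimal Python sketch, using the $5/TB price quoted earlier and assuming 1 TB = 1,000 GB to match the rounded numbers above (per-query minimums are ignored):

```python
PRICE_PER_TB = 5.00  # USD per TB scanned, as quoted in this section
GB_PER_TB = 1000     # assumption: decimal TB, matching the rounded figures above


def query_cost(gb_scanned: float) -> float:
    """Cost in USD of one Athena query that scans `gb_scanned` gigabytes."""
    return gb_scanned / GB_PER_TB * PRICE_PER_TB


def reduction_pct(before_gb: float, after_gb: float) -> float:
    """Percentage saved by scanning less data for the same query."""
    return (1 - after_gb / before_gb) * 100


scenarios = {"raw JSON": 100, "Parquet + Snappy": 10, "Parquet + partitioning": 1}
for label, gb in scenarios.items():
    saved = reduction_pct(100, gb)
    print(f"{label:24} {gb:>4} GB  ${query_cost(gb):.3f}/query  ({saved:.0f}% less)")
```

Because cost scales linearly with bytes scanned, every technique that lets the engine read less (columnar format, compression, partition pruning) shows up directly on the bill.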
A retail company had data in 15 siloed databases. They built a data lake: Kinesis Firehose streamed clickstream data to S3 raw zone, Glue ETL jobs transformed and converted to Parquet in the processed zone, Glue Crawlers kept the Data Catalog updated, Athena powered a QuickSight dashboard for business analytics. Lake Formation enforced column-level security — marketing could see purchase data but not PII. Athena query costs dropped 95% by switching from CSV to partitioned Parquet.
What is the difference between Athena, Redshift, and EMR for data analytics? When do you use each?
The 7 Rs of migration define strategies for moving workloads to AWS:
- 1. Retire: decommission applications that are no longer needed. Saves cost immediately. Typically 10-20% of a portfolio.
- 2. Retain: keep on-premises for now. Not ready to migrate (compliance, dependency, technical debt). Revisit later.
- 3. Rehost (Lift & Shift): move as-is to AWS (EC2). No code changes. Fastest migration. Use AWS Application Migration Service (MGN) for automated rehosting. Good for quick wins and getting out of data center leases.
- 4. Relocate: move to AWS with minimal changes (e.g., VMware Cloud on AWS, container lift-and-shift to ECS).
- 5. Replatform (Lift & Optimize): make minimal changes for cloud benefits. Examples: database → RDS, app server → Elastic Beanstalk, Windows → Linux. Low effort, meaningful improvement.
- 6. Repurchase (Drop & Shop): switch to a SaaS alternative. On-prem CRM → Salesforce, on-prem email → Microsoft 365, on-prem HR → Workday.
- 7. Refactor/Re-architect: redesign the application to be cloud-native. Monolith → microservices, serverless, containers. Most effort but most benefit. Use for strategic applications that need to scale.
Migration services:
- AWS MGN (Application Migration Service): automated lift-and-shift. Continuous replication → cutover with minimal downtime.
- AWS DMS (Database Migration Service): migrate databases to RDS/Aurora/DynamoDB. Supports heterogeneous migration (Oracle → PostgreSQL) using SCT (Schema Conversion Tool).
- AWS Migration Hub: central dashboard to track migration progress across tools.
# ── AWS DMS: Migrate Oracle to Aurora PostgreSQL ──
# 1. Create replication instance
aws dms create-replication-instance \
--replication-instance-identifier oracle-to-aurora \
--replication-instance-class dms.r5.large \
--allocated-storage 100
# 2. Create source endpoint (Oracle)
aws dms create-endpoint \
--endpoint-identifier oracle-source \
--endpoint-type source \
--engine-name oracle \
--server-name oracle.onprem.company.com \
--port 1521 \
--username dms_user \
--password "****" \
--database-name ORCL
# 3. Create target endpoint (Aurora PostgreSQL)
aws dms create-endpoint \
--endpoint-identifier aurora-target \
--endpoint-type target \
--engine-name aurora-postgresql \
--server-name myaurora.cluster-xxx.rds.amazonaws.com \
--port 5432 \
--username dms_user \
--password "****" \
--database-name appdb
# 4. Create migration task (full load + CDC)
aws dms create-replication-task \
--replication-task-identifier full-migration \
--source-endpoint-arn arn:aws:dms:...:endpoint:oracle-source \
--target-endpoint-arn arn:aws:dms:...:endpoint:aurora-target \
--replication-instance-arn arn:aws:dms:...:rep:oracle-to-aurora \
--migration-type full-load-and-cdc \
--table-mappings file://table-mappings.json
# ── 7 Rs decision matrix ──
# Application Profile | Strategy | AWS Tool
# Legacy, unused | Retire | N/A
# Not ready / compliance hold | Retain | N/A
# Standard VM workload | Rehost | MGN (lift-and-shift)
# VMware workloads | Relocate | VMware Cloud on AWS
# Database to managed | Replatform | DMS + RDS/Aurora
# COTS software available as SaaS| Repurchase | SaaS vendor
# Strategic app, needs scale | Refactor | Containers, Lambda, etc.
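The decision matrix can be read as an ordered set of rules. The sketch below encodes it in Python so a portfolio can be tallied. The attribute names and the order in which the rules are checked are assumptions made for illustration; a real assessment weighs many more factors.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class App:
    name: str
    in_use: bool = True
    ready: bool = True           # no compliance hold or blocking dependency
    strategic: bool = False      # needs to scale, worth redesigning
    saas_available: bool = False
    on_vmware: bool = False
    is_database: bool = False


def choose_strategy(app: App) -> str:
    """Map an application profile to one of the 7 Rs (first matching rule wins)."""
    if not app.in_use:
        return "Retire"
    if not app.ready:
        return "Retain"
    if app.strategic:
        return "Refactor"
    if app.saas_available:
        return "Repurchase"
    if app.on_vmware:
        return "Relocate"
    if app.is_database:
        return "Replatform"
    return "Rehost"  # default: standard VM workload, lift and shift


portfolio = [
    App("legacy-reports", in_use=False),
    App("mainframe-billing", ready=False),
    App("checkout", strategic=True),
    App("crm", saas_available=True),
    App("vdi-farm", on_vmware=True),
    App("orders-db", is_database=True),
    App("intranet-wiki"),
]
plan = Counter(choose_strategy(app) for app in portfolio)
print(dict(plan))
```

Running the rules over the whole inventory gives the kind of per-strategy breakdown that migration plans are usually built around.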
# ── MGN: Application Migration Service ──
# 1. Install replication agent on source server
# 2. Agent replicates to AWS (continuous block-level replication)
# 3. Test: launch test instance from replicated data
# 4. Cutover: launch final instance, update DNS
# Minimal downtime: only the final cutover (minutes)
# ── Migration Hub: Track progress ──
aws migrationhub notify-migration-task-state \
--progress-update-stream my-migration \
--migration-task-name "Oracle DB Migration" \
--task Status=IN_PROGRESS \
--update-date-time 2026-05-30T12:00:00Z \
--next-update-seconds 300
A company migrated 200 applications from their data center to AWS over 18 months. Assessment phase: 30 applications retired (unused), 20 repurchased (moved to SaaS), 100 rehosted with MGN (2-4 weeks each), 30 replatformed (databases to RDS, middleware to managed services), 20 refactored to serverless/containers (strategic apps). DMS migrated 15 databases including a critical Oracle-to-Aurora PostgreSQL migration with zero downtime using Change Data Capture (CDC). Total data center cost savings: $2.5M/year.
How does AWS Schema Conversion Tool (SCT) work for heterogeneous database migrations?
EC2 performance optimization involves networking, storage, and instance placement:
Placement Groups:
- Cluster: instances packed close together inside a single AZ. Lowest network latency and highest per-flow throughput between instances; AWS publishes no fixed latency figure, so measure it for your instance type. For HPC and tightly coupled workloads.
- Spread: each instance on different hardware. Max 7 instances per AZ. For critical instances that must not fail together.
- Partition: instances grouped into partitions on separate racks. For large distributed systems (Hadoop, Cassandra, Kafka). Up to 7 partitions per AZ.
Enhanced Networking:
- ENA (Elastic Network Adapter): up to 200 Gbps network bandwidth. SR-IOV (hardware-level virtualization bypass). Lower latency, higher PPS (packets per second).
- EFA (Elastic Fabric Adapter): OS-bypass networking for HPC and ML. Enables MPI and NCCL communication. For GPU clusters (P5, G5).
- Most current-gen instances have ENA enabled by default.
Instance Store:
- NVMe SSDs physically attached to the host. Ephemeral — data lost on stop/terminate.
- Extremely fast: up to 7.5 million IOPS (i4i.metal), microsecond latency.
- Use for: temp data, caches, buffers, scratch space, data replicated elsewhere.
- Available on specific instance types (i4i, c5d, m5d, r5d).
Nitro System: custom hardware + lightweight hypervisor. Nearly all CPU/memory available to the instance. Better security (hardware root of trust) and performance.
# ── Create a Cluster Placement Group ──
aws ec2 create-placement-group \
--group-name hpc-cluster \
--strategy cluster
# Launch instances in the cluster
aws ec2 run-instances \
--instance-type c7gn.16xlarge \
--placement GroupName=hpc-cluster \
--count 8 \
--image-id ami-abc123
# ── Verify Enhanced Networking (ENA) ──
aws ec2 describe-instances \
--instance-ids i-abc123 \
--query "Reservations[].Instances[].EnaSupport"
# Output: true
# ── Instance Store: Check available NVMe drives ──
# lsblk (on the instance)
# NAME SIZE TYPE
# nvme0n1 1.9T disk ← Instance store (ephemeral!)
# nvme1n1 1.9T disk ← Instance store
# nvme2n1 500G disk ← EBS root volume
# ── Instance Store: Format and mount ──
# mkfs.xfs /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/scratch
# Warning: data lost on stop/terminate!
# ── CloudFormation: Cluster Placement Group ──
# Resources:
#   HPCPlacementGroup:
#     Type: AWS::EC2::PlacementGroup
#     Properties:
#       Strategy: cluster
#
#   HPCInstance:
#     Type: AWS::EC2::Instance
#     Properties:
#       ImageId: ami-abc123              # placeholder AMI, as in the CLI example
#       InstanceType: c7gn.16xlarge
#       PlacementGroupName: !Ref HPCPlacementGroup
#       NetworkInterfaces:
#         - DeviceIndex: "0"
#           SubnetId: !Ref HPCSubnet
#           GroupSet: [!Ref HPCSG]
#       # For EFA, set InterfaceType: efa on an AWS::EC2::LaunchTemplate network interface
# ── Performance comparison ──
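# Networking (illustrative orders of magnitude, not AWS-published guarantees):
# Standard: up to 25 Gbps, ~100μs latency
# ENA: up to 200 Gbps, ~25μs latency
# EFA: up to 400 Gbps, ~5μs latency (OS bypass)
# Cluster PG: lowest inter-instance latency (instances placed close together in one AZ)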
#
# Storage:
# gp3 EBS: 16,000 IOPS, ~1ms latency
# io2 EBS: 256,000 IOPS, ~sub-ms latency
# Instance Store: 7.5M IOPS (i4i.metal), ~μs latency
A genomics company ran HPC workloads processing DNA sequences. Initial setup: 16 × c5.4xlarge in random placement → inter-node MPI communication took 200μs, job completed in 8 hours. After optimization: 16 × c7gn.16xlarge in a cluster placement group with EFA enabled → inter-node latency dropped to 5μs, job completed in 2.5 hours. They used instance store NVMe for scratch data (4x faster than EBS) and EBS only for final results that needed persistence.
What is the difference between ENA and EFA? When would you use EFA over ENA?
S3 is designed for high throughput, but optimization is still important for large-scale workloads:
Request Rate Performance:
- S3 supports 5,500 GET/HEAD and 3,500 PUT/DELETE requests per second per prefix.
- A prefix is the leading part of the object key, for example prefix1/ in s3://bucket/prefix1/key.
- To scale beyond these limits, distribute objects across prefixes.
- S3 automatically partitions prefixes that receive high request rates (no manual intervention needed since 2018).
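The per-prefix limits above translate into simple arithmetic. The following is a minimal sketch, assuming the request rates quoted in this section; `prefixes_needed` and `hashed_key` are hypothetical helper names, and MD5 is used only to spread keys, not for security.

```python
import hashlib
import math

GET_PER_PREFIX = 5_500  # GET/HEAD requests per second per prefix (figure quoted above)
PUT_PER_PREFIX = 3_500  # PUT/DELETE requests per second per prefix (figure quoted above)


def prefixes_needed(target_rps: int, per_prefix_limit: int) -> int:
    """Smallest number of prefixes whose combined limit covers target_rps."""
    return max(1, math.ceil(target_rps / per_prefix_limit))


def hashed_key(key: str, hex_chars: int = 2) -> str:
    """Prepend a short deterministic hash so keys spread across prefixes.

    hex_chars=2 yields 16**2 = 256 possible prefixes.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:hex_chars]}/{key}"
```

For example, a workload of 50,000 GET/s needs at least `prefixes_needed(50_000, GET_PER_PREFIX)` = 10 prefixes. Hashing spreads load evenly but gives up the ability to list objects in key order, so it suits workloads that fetch objects by known key.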
Multipart Upload:
- Upload large objects in parallel parts. Each part is uploaded independently.
- Mandatory for objects > 5 GB. Recommended for objects > 100 MB.
- Benefits: parallel uploads (faster), retry individual parts (resilient), pause/resume.
- Part size: 5 MB to 5 GB. Maximum 10,000 parts.
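The part-size and part-count limits interact: a very large object forces a larger part size. A minimal sketch of that calculation, assuming the limits listed above (treated as MiB/GiB); `choose_part_size` and `part_count` are hypothetical helper names, not part of boto3.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB          # minimum part size (the last part may be smaller)
MAX_PART = 5 * 1024 * MIB   # maximum part size: 5 GiB
MAX_PARTS = 10_000          # maximum number of parts per upload


def choose_part_size(object_size: int, preferred: int = 25 * MIB) -> int:
    """Return a part size in bytes that keeps the upload within MAX_PARTS."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Integer ceiling division avoids float rounding on very large sizes.
    smallest_allowed = (object_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(preferred, smallest_allowed, MIN_PART)
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    return part_size


def part_count(object_size: int, part_size: int) -> int:
    """Number of parts needed to cover object_size."""
    return (object_size + part_size - 1) // part_size
```

A 1 GiB object with the 25 MiB default splits into 41 parts, while a 1 TiB object needs parts of roughly 105 MiB to stay at or below 10,000 parts. The result can be passed as `multipart_chunksize` in a `TransferConfig`.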
S3 Transfer Acceleration:
- Uses CloudFront Edge Locations to accelerate uploads from distant locations.
- Client uploads to the nearest Edge Location → AWS backbone → S3 bucket.
- 50-500% improvement for long-distance uploads (e.g., Asia → us-east-1).
- Adds $0.04-0.08/GB cost. Use the speed comparison tool to verify benefit.
S3 Select / Glacier Select:
- Filter data server-side using SQL. Only transfer the rows/columns you need.
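- Less data leaves S3, so queries are faster and cheaper than downloading entire objects. AWS cites up to a 400% performance improvement (a reduction in data transferred cannot exceed 100%).
- Availability: AWS closed S3 Select to new customers in 2024; existing customers can continue to use it.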
Byte-Range Fetches: download specific byte ranges of an object in parallel. Useful for large files where you need only a portion.
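To fetch an object in parallel, the client first splits it into ranges. A minimal sketch of that split follows; `byte_ranges` is a hypothetical helper, and the strings it returns follow the standard HTTP Range header format, where the end offset is inclusive.

```python
def byte_ranges(object_size: int, chunk_size: int) -> list[str]:
    """Split [0, object_size) into HTTP Range header values.

    Range ends are inclusive, so the first 8 bytes are "bytes=0-7".
    """
    if object_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(10, 4))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each value can be passed as the `Range` argument of boto3's `get_object`, with one call per worker thread, and the chunks written back at their start offsets.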
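# ── Multipart Upload (AWS CLI does this automatically for large files) ──
# The CLI switches to multipart above multipart_threshold (default 8 MB)
aws s3 cp large-file.zip s3://my-bucket/
# --expected-size applies only to uploads streamed from stdin, where the CLI cannot see the size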
# ── Manual multipart with boto3 (for custom control) ──
# import boto3
# from boto3.s3.transfer import TransferConfig
#
# s3 = boto3.client("s3")
# config = TransferConfig(
# multipart_threshold=100 * 1024 * 1024, # 100 MB
# multipart_chunksize=25 * 1024 * 1024, # 25 MB per part
# max_concurrency=10 # 10 parallel uploads
# )
# s3.upload_file("large-file.zip", "my-bucket", "large-file.zip", Config=config)
# ── Enable Transfer Acceleration ──
aws s3api put-bucket-accelerate-configuration \
--bucket my-bucket \
--accelerate-configuration Status=Enabled
# Upload using the accelerated endpoint (the CLI adds the bucket name itself)
aws s3 cp large-file.zip s3://my-bucket/ \
--endpoint-url https://s3-accelerate.amazonaws.com
# ── S3 Select: Query CSV without downloading ──
# CSV fields are strings, so CAST before a numeric comparison; the trailing output file is required
aws s3api select-object-content \
--bucket my-bucket \
--key data/sales.csv \
--expression "SELECT s.product, s.amount FROM s3object s WHERE CAST(s.amount AS INT) > 1000" \
--expression-type SQL \
--input-serialization '{"CSV":{"FileHeaderInfo":"USE"}}' \
--output-serialization '{"JSON":{}}' \
results.json
# ── Python: S3 Select ──
# response = s3.select_object_content(
#     Bucket="my-bucket",
#     Key="data/sales.csv",
#     Expression="SELECT * FROM s3object s WHERE s.region = 'US'",
#     ExpressionType="SQL",
#     InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#     OutputSerialization={"JSON": {}}
# )
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())
# ── Prefix distribution for high request rates ──
# Bad: all objects under one prefix
# s3://bucket/images/img001.jpg → 1 prefix, limited to 5,500 GET/s
# s3://bucket/images/img002.jpg
#
# Good: distribute across prefixes (use hash or date)
# s3://bucket/a1b2/images/img001.jpg → prefix "a1b2"
# s3://bucket/c3d4/images/img002.jpg → prefix "c3d4"
# Or: s3://bucket/2026/05/30/img001.jpg
# ── Performance tuning checklist ──
# ✅ Objects > 100 MB: use multipart upload
# ✅ Cross-continent uploads: enable Transfer Acceleration
# ✅ Need only subset of data: use S3 Select
# ✅ > 5,500 GET/s: distribute across multiple prefixes
# ✅ Large downloads: use byte-range fetches (parallel)
# ✅ Analytics: use Parquet (columnar) instead of CSV/JSON
A media company ingested 10TB of video files daily from studios in Asia to an S3 bucket in us-east-1. Upload speeds: 50 Mbps (bottlenecked by internet path). After enabling Transfer Acceleration, speeds jumped to 300 Mbps (6x improvement) because data went to the nearest Edge Location in Tokyo and then traveled over the AWS backbone. Combined with multipart upload (25 MB chunks, 20 concurrent), total ingest time dropped from 18 hours to 3 hours.
What is S3 Express One Zone and how does it provide single-digit millisecond latency for S3 access?
DynamoDB performance depends on partition key design and access patterns:
Hot Partitions:
- Each partition handles up to 3,000 RCUs and 1,000 WCUs.
- If a single partition key gets disproportionate traffic, that partition becomes "hot" → throttling even with available capacity on other partitions.
- Common causes: using date, status, or country as PK (low cardinality), popular items (viral product, trending post).
Solutions for hot keys:
- Write sharding: append a random suffix to the partition key (e.g., "product#123#shard3"). Distributes writes across multiple partitions. Requires scatter-gather reads.
- Composite keys: use a high-cardinality PK (userId, orderId) instead of low-cardinality (status, date).
- Adaptive Capacity: DynamoDB automatically redistributes throughput to hot partitions (enabled by default). Helps but doesn't solve extreme cases.
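The shard count for write sharding follows from the per-partition write limit. A minimal sketch, assuming the 1,000 WCU per-partition figure quoted above; `min_shards` and `shard_key` are hypothetical helpers, and `headroom` is an illustrative safety margin rather than a DynamoDB setting.

```python
import math
import zlib

PARTITION_WCU_LIMIT = 1_000  # per-partition write limit quoted above


def min_shards(peak_wcu: float, headroom: float = 1.0) -> int:
    """Minimum shard count so no shard key exceeds one partition's write limit."""
    return max(1, math.ceil(peak_wcu * headroom / PARTITION_WCU_LIMIT))


def shard_key(base_key: str, item_id: str, shard_count: int) -> str:
    """Derive the shard suffix from the item id instead of a random number.

    A deterministic suffix lets a single item be read back from one shard
    without a scatter-gather query across all shards.
    """
    shard = zlib.crc32(item_id.encode("utf-8")) % shard_count
    return f"{base_key}#{shard}"
```

A key receiving 10,000 WCU/s needs at least `min_shards(10_000)` = 10 shards, or 15 with 50% headroom. Random suffixes spread writes slightly more evenly; the hashed suffix trades a little evenness for cheap point reads.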
DAX (DynamoDB Accelerator):
- Fully managed, in-memory write-through cache for DynamoDB.
- Microsecond latency for reads (vs milliseconds for DynamoDB).
- API-compatible with DynamoDB: swap in the DAX client and point it at the cluster endpoint; the read/write calls themselves stay the same.
- Best for read-heavy workloads with repeated access to the same items.
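How much a cache offloads DynamoDB depends almost entirely on the hit rate. A rough sketch of that relationship, under the simplifying assumptions that all reads are cacheable, eventually consistent, and of similar size; `rcu_reduction_factor` is a hypothetical helper.

```python
import math


def rcu_reduction_factor(cache_hit_rate: float) -> float:
    """Factor by which reads reaching DynamoDB shrink at a given hit rate.

    Only cache misses reach DynamoDB, so the factor is 1 / miss_rate.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    miss_rate = 1.0 - cache_hit_rate
    return math.inf if miss_rate == 0.0 else 1.0 / miss_rate
```

A 99.5% hit rate leaves 0.5% of reads for DynamoDB, roughly a 200x reduction, while a 50% hit rate only halves the load. This is why DAX pays off mainly when the same items are read repeatedly.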
Other optimizations:
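- Projection: return only the attributes you need (ProjectionExpression). This shrinks the response payload but not the RCUs consumed, which are calculated on the full item size.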
- BatchGetItem: read up to 100 items in a single API call (16 MB max).
- Parallel Scan: divide a full table scan across multiple threads/segments.
- TTL: automatically delete expired items — free, no WCU cost.
# ── Write Sharding: Distribute hot keys ──
# Problem: PK = "trending_products" gets 10,000 WCU/s
# Solution: shard across N keys
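# import random
# import boto3
# from boto3.dynamodb.conditions import Key
#
# SHARD_COUNT = 10
# table = boto3.resource("dynamodb").Table("Trending")  # hypothetical table name
#
# def write_trending(product_data):
#     shard = random.randint(0, SHARD_COUNT - 1)
#     table.put_item(Item={
#         "PK": f"trending_products#{shard}",  # Distributed!
#         "SK": product_data["productId"],
#         "data": product_data
#     })
#
# def read_all_trending():
#     """Scatter-gather across all shards, following pagination."""
#     items = []
#     for shard in range(SHARD_COUNT):
#         kwargs = {"KeyConditionExpression": Key("PK").eq(f"trending_products#{shard}")}
#         while True:
#             response = table.query(**kwargs)
#             items.extend(response["Items"])
#             if "LastEvaluatedKey" not in response:
#                 break
#             kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
#     return items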
# ── DAX: In-memory cache ──
aws dax create-cluster \
--cluster-name my-dax \
--node-type dax.r5.large \
--replication-factor 3 \
--iam-role-arn arn:aws:iam::123:role/DAXRole \
--subnet-group-name my-dax-subnets \
--security-group-ids sg-dax
# Python: Switch from DynamoDB to DAX (minimal code change)
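# import amazondax  # from the amazon-dax-client package
#
# # Before (DynamoDB direct):
# # dynamodb = boto3.resource("dynamodb")
#
# # After (DAX resource, same Table API):
# # dax:// on port 8111 is the unencrypted endpoint; TLS-enabled clusters use daxs:// on port 9111
# dax = amazondax.AmazonDaxClient.resource(
#     endpoint_url="dax://my-dax.xxx.dax-clusters.us-east-1.amazonaws.com:8111"
# )
# table = dax.Table("Products")
# response = table.get_item(Key={"productId": "P001"})
# # Cache hits return in microseconds; misses fall through to DynamoDB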
# ── Efficient queries ──
# Return only the attributes you need (smaller responses; RCUs are still charged on full item size)
aws dynamodb query \
--table-name Orders \
--key-condition-expression "customerId = :cid" \
--projection-expression "orderId, amount, #s" \
--expression-attribute-names '{"#s": "status"}' \
--expression-attribute-values '{":cid": {"S": "CUST-001"}}'
# ── BatchGetItem: Read up to 100 items at once ──
aws dynamodb batch-get-item --request-items '{
"Products": {
"Keys": [
{"productId": {"S": "P001"}},
{"productId": {"S": "P002"}},
{"productId": {"S": "P003"}}
],
"ProjectionExpression": "productId, productName, price"
}
}'
# ── Enable TTL (auto-delete expired items) ──
aws dynamodb update-time-to-live \
--table-name Sessions \
--time-to-live-specification Enabled=true,AttributeName=expiresAt
# ── Performance metrics to monitor ──
# ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits
# ThrottledRequests (should be 0!)
# SuccessfulRequestLatency (P99 should be < 10ms)
# AccountProvisionedReadCapacityUtilization
A social media app used "trending" as a partition key for popular posts. During viral events, this single key received 50,000 reads/sec — far exceeding the per-partition limit. Throttling caused the trending feed to fail. The fix: write sharding with 20 shards (trending#0 through trending#19) distributed the load evenly. They also added DAX for the trending feed — cache hit rate was 99.5%, reducing DynamoDB RCU consumption by 200x and dropping read latency from 5ms to 50μs.
What is DynamoDB Global Tables and how does it handle write conflicts in multi-region deployments?
Lambda performance optimization focuses on cold starts, memory/CPU allocation, and code efficiency:
Cold Start Deep Dive:
- What happens: download code → start runtime → run init code → execute handler.
- Duration by runtime: Python/Node (~200-500ms), .NET (~400-800ms), Java (~1-3 seconds).
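- Factors: package size (larger = slower), runtime, number of dependencies, and the amount of init code. VPC attachment used to add 1-2s of ENI creation per cold start; since AWS moved to shared Hyperplane ENIs in 2019, network interfaces are created when the function is configured, so the VPC penalty is now small.
- When: first invocation, after scale-up, after a new deployment, and after the environment has been idle long enough to be reclaimed (AWS publishes no fixed timeout; ~15 minutes is a rule of thumb, not a guarantee).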
Mitigation strategies:
- Provisioned Concurrency: pre-warm N execution environments. No cold starts. Pay even when idle. Best for APIs with consistent latency requirements.
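- SnapStart: takes a snapshot of the initialized runtime and restores from it on cold start. Reduces Java cold starts from ~3s to ~200ms at no extra charge. Launched for Java only; AWS later extended it to Python and .NET runtimes, where snapshot caching and restores are billed.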
- Smaller packages: remove unused dependencies. Use Layers for shared code. Use tree-shaking (Node/TypeScript).
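- Avoid VPC unless needed: a VPC-attached function needs a NAT Gateway or VPC endpoints to reach AWS services, which adds cost and configuration. When you do need a VPC, prefer VPC endpoints over NAT for AWS service access.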
Memory Tuning:
- Lambda allocates CPU proportional to memory. 1,769 MB = 1 full vCPU.
- CPU-bound functions benefit from more memory (even if they don't use the extra RAM).
- Use AWS Lambda Power Tuning (open-source tool) to find the optimal memory setting — often the cheapest AND fastest option is higher memory.
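The cost side of memory tuning is plain arithmetic: billed cost is memory in GB times duration in seconds times a per-GB-second rate. A minimal sketch, assuming an illustrative x86 on-demand rate (check current pricing for your Region) and hypothetical measured durations; `invocation_cost` and `best_settings` are made-up helper names, not part of the Power Tuning tool.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate; verify for your Region


def invocation_cost(memory_mb: int, duration_ms: float,
                    price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation (request charge excluded)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price


def best_settings(measurements: dict[int, float]) -> tuple[int, int]:
    """Return (cheapest_memory_mb, fastest_memory_mb) for measured durations in ms."""
    cheapest = min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))
    fastest = min(measurements, key=lambda mb: measurements[mb])
    return cheapest, fastest


# Hypothetical measurements: memory (MB) -> average duration (ms)
measured = {128: 3200, 512: 850, 1024: 420, 1769: 250, 3008: 240}
print(best_settings(measured))  # (128, 3008)
```

With these numbers the cheapest and fastest settings sit at opposite ends, and the costs from 128 MB to 1,769 MB differ by only a few percent, so the practical choice is the setting where extra memory stops buying meaningful speed.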
# ── Set Provisioned Concurrency ──
# First, publish a version (can't set PC on $LATEST)
aws lambda publish-version \
--function-name ProcessOrder \
--description "v1"
aws lambda put-provisioned-concurrency-config \
--function-name ProcessOrder \
--qualifier 1 \
--provisioned-concurrent-executions 50
# ── Enable SnapStart (Java only) ──
aws lambda update-function-configuration \
--function-name JavaOrderProcessor \
--snap-start ApplyOn=PublishedVersions
# Publish a version to trigger snapshot creation
aws lambda publish-version --function-name JavaOrderProcessor
# ── Auto Scaling for Provisioned Concurrency ──
aws application-autoscaling register-scalable-target \
--service-namespace lambda \
--resource-id function:ProcessOrder:prod \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--min-capacity 10 --max-capacity 200
aws application-autoscaling put-scaling-policy \
--service-namespace lambda \
--resource-id function:ProcessOrder:prod \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--policy-name LambdaPCAutoScaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 0.7,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
}
}'
# ── Lambda Power Tuning results (example) ──
# Memory (MB) | Duration (ms) | Cost ($) | Notes
# 128 | 3,200 | 0.000053 | CPU throttled, very slow
# 512 | 850 | 0.000056 | Better, still CPU-bound
# 1024 | 420 | 0.000055 | Sweet spot!
# 1769 | 250 | 0.000057 | Full vCPU, diminishing returns
# 3008 | 240 | 0.000093 | Minimal improvement, more expensive
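# Best balance: 1024 MB. About 7.6x faster than 128 MB for ~4% more cost; 128 MB is the cheapest and 3008 MB the fastest, but both are worse trade-offs.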
# ── Cold start reduction tips ──
# 1. Keep deployment package small
# zip -r function.zip handler.py # Not the entire virtualenv
# Use Layers for shared dependencies
#
# 2. Initialize outside the handler
# import boto3
# dynamodb = boto3.resource("dynamodb")  # Init once!
# table = dynamodb.Table("Orders")       # Reused on warm starts
#
# def handler(event, context):
#     return table.get_item(Key={"id": event["id"]})
#
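# 3. Avoid unnecessary imports
# # "from boto3 import client" does not help: Python still loads the whole boto3 package
# # Instead: import only the libraries the handler uses, and defer rarely used heavy imports to the code path that needs them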
#
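# 4. Use arm64 (Graviton2): about 20% lower price per GB-second; AWS cites up to 34% better price-performance
# Architecture is set when code is deployed, so rebuild the package for arm64 first
aws lambda update-function-code \
--function-name ProcessOrder \
--zip-file fileb://function.zip \
--architectures arm64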
A Java-based API had 3-second cold starts on Lambda — P99 latency was 3.5 seconds. The team applied three optimizations: (1) SnapStart reduced cold start to 200ms (free!), (2) Lambda Power Tuning found 1,536 MB as optimal — duration dropped from 800ms to 200ms at the same cost, (3) switched to arm64 (Graviton2) for 34% cost savings. Final P99: 250ms. For the payment endpoint (zero cold start tolerance), they added Provisioned Concurrency with auto-scaling — P99 dropped to 50ms.
How does Lambda Graviton2 (arm64) performance compare to x86? What compatibility issues exist?
Network performance optimization on AWS involves latency, throughput, and data transfer cost management:
VPC Endpoints (reduce latency + cost):
- Gateway Endpoint: for S3 and DynamoDB. Routes traffic through AWS backbone instead of the internet/NAT. Free. No bandwidth limit.
- Interface Endpoint (PrivateLink): for 100+ AWS services. Creates ENI in your VPC. $0.01/hr + $0.01/GB. Private, lower latency than going through NAT Gateway.
- Using VPC endpoints instead of NAT Gateway for AWS service access saves the NAT Gateway data processing fee ($0.045/GB).
AWS Global Accelerator:
- Provides 2 static anycast IPs that route traffic to the nearest AWS edge location → AWS backbone → your application.
- Reduces internet hops → lower latency, more consistent performance.
- Automatic failover between Regions in < 30 seconds.
- Works with ALB, NLB, EC2, and Elastic IP endpoints.
- $0.025/hr + $0.015-0.035/GB (premium data transfer).
- vs CloudFront: CloudFront caches content. Global Accelerator optimizes the network path (no caching). Use GA for non-HTTP (TCP/UDP) or when caching isn't useful (dynamic APIs).
Data Transfer Costs (often the surprise on AWS bills):
- Inbound: free (data into AWS).
- Same AZ: free (using private IP).
- Cross-AZ: $0.01/GB each way ($0.02/GB round trip).
- Cross-Region: $0.02/GB (varies by Region pair).
- Internet outbound: $0.09/GB (first 10 TB, then tiered). CloudFront is cheaper ($0.085/GB).
- NAT Gateway processing: $0.045/GB (on top of data transfer!).
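These per-GB rates are easier to reason about as monthly totals. A minimal sketch using the rates listed above (which vary by Region and change over time) and treating 1 TB as 1,000 GB; the function names are hypothetical.

```python
NAT_PROCESSING_PER_GB = 0.045     # NAT Gateway data processing (rate quoted above)
CROSS_AZ_PER_GB_EACH_WAY = 0.01   # charged on both the sending and receiving side
INTERNET_OUT_PER_GB = 0.09        # first pricing tier only


def nat_processing_cost(gb: float) -> float:
    """Monthly NAT Gateway processing fee for traffic routed through it."""
    return gb * NAT_PROCESSING_PER_GB


def cross_az_cost(gb: float) -> float:
    """Cross-AZ traffic is billed in each direction, so the rate is doubled."""
    return gb * CROSS_AZ_PER_GB_EACH_WAY * 2


def internet_egress_cost(gb: float) -> float:
    """Outbound internet transfer at the first-tier rate (no tiering applied)."""
    return gb * INTERNET_OUT_PER_GB
```

For 10 TB a month, `nat_processing_cost(10_000)` comes to about $450 and `cross_az_cost(10_000)` to about $200. Sending S3 and DynamoDB traffic through gateway endpoints removes the first figure entirely.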
# ── Create Gateway Endpoint for S3 (FREE!) ──
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-private1 rtb-private2 \
--vpc-endpoint-type Gateway
# ── Create Interface Endpoint for SQS ──
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--service-name com.amazonaws.us-east-1.sqs \
--vpc-endpoint-type Interface \
--subnet-ids subnet-private1 subnet-private2 \
--security-group-ids sg-endpoint \
--private-dns-enabled
# ── Global Accelerator ──
aws globalaccelerator create-accelerator \
--name my-app-accelerator \
--ip-address-type IPV4 \
--enabled
# Add listener
aws globalaccelerator create-listener \
--accelerator-arn arn:aws:globalaccelerator::123:accelerator/abc \
--port-ranges FromPort=443,ToPort=443 \
--protocol TCP
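# Add one endpoint group per Region (a group can only hold endpoints in its own Region)
aws globalaccelerator create-endpoint-group \
--listener-arn arn:aws:globalaccelerator::123:accelerator/abc/listener/def \
--endpoint-group-region us-east-1 \
--endpoint-configurations \
EndpointId=arn:aws:elasticloadbalancing:us-east-1:123:loadbalancer/app/my-alb/xxx,Weight=100
aws globalaccelerator create-endpoint-group \
--listener-arn arn:aws:globalaccelerator::123:accelerator/abc/listener/def \
--endpoint-group-region eu-west-1 \
--endpoint-configurations \
EndpointId=arn:aws:elasticloadbalancing:eu-west-1:123:loadbalancer/app/my-alb-eu/xxx,Weight=100
# Users reach the nearest healthy Region; --traffic-dial-percentage lowers the share a group accepts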
# ── Data transfer cost optimization ──
# Scenario: Lambda in private subnet calls S3 (same Region) via NAT Gateway
# Without VPC endpoint:
# Lambda → NAT Gateway → S3 public endpoint
# Cost: $0.045/GB NAT data processing (same-Region S3 traffic incurs no internet egress charge)
#
# With S3 Gateway Endpoint:
# Lambda → VPC Endpoint → S3 (AWS backbone)
# Cost: $0.00/GB (free!)
# Savings: the entire $0.045/GB NAT processing fee
# ── Monitor data transfer (the End date is exclusive, so use the first day of the next month) ──
aws ce get-cost-and-usage \
--time-period Start=2026-05-01,End=2026-06-01 \
--granularity MONTHLY \
--filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["DataTransfer-Out-Bytes"]}}' \
--metrics BlendedCost
# ── Architecture: Minimize cross-AZ data transfer ──
# Bad: Web server in AZ-a, Redis in AZ-b
# Every cache call: $0.01/GB cross-AZ each way
# 10TB/month cache traffic = $200/month in data transfer alone
#
# Good: Co-locate in same AZ, or accept cross-AZ for HA
# Use private IPs (not public IPs — public IPs go via IGW, cost more)
# ── Comparison: Global Accelerator vs CloudFront ──
# Feature | Global Accelerator | CloudFront
# Caching | No | Yes
# Static IPs | Yes (2 anycast) | No
# Protocol | TCP, UDP | HTTP, HTTPS, WebSocket
# Use case | Dynamic APIs, gaming | Static content, web apps
# Failover | < 30 seconds | Origin failover
# Cost | $0.025/hr + data | Data transfer only
A company's monthly AWS bill showed $15,000 in data transfer costs. Investigation found: (1) Lambda functions calling S3/DynamoDB through NAT Gateway — $8,000 in NAT processing fees. Adding S3 Gateway Endpoint (free) and DynamoDB Gateway Endpoint (free) eliminated the NAT fees. (2) Public IP communication between instances in the same VPC — $3,000. Switching to private IPs reduced this to $500 (cross-AZ only). (3) API traffic from Europe going through us-east-1 — added Global Accelerator for 30% latency improvement and better routing. Total savings: $10,500/month.
How does AWS PrivateLink work for exposing your own services to other VPCs and accounts?
Frequently Asked Questions
The most common AWS interview questions cover EC2 instance types and Auto Scaling, S3 storage classes and security, VPC networking (subnets, Security Groups, NACLs), IAM roles and policies, Lambda and serverless architecture, RDS Multi-AZ vs Read Replicas, DynamoDB partition keys and performance, CloudFormation infrastructure-as-code, and cost optimization with Reserved Instances vs Savings Plans vs Spot.
We cover 40 in-depth AWS interview questions spanning Basic to Performance levels. Each question includes 6 sections: theory, code/config examples (AWS CLI, CloudFormation, boto3), real-world scenario, key takeaway, common mistake, and follow-up question.
These questions cover topics from the AWS Solutions Architect Associate (SAA-C03), Developer Associate (DVA-C02), and SysOps Administrator Associate (SOA-C02) exams. Advanced and Experienced questions also overlap with the Solutions Architect Professional (SAP-C02) exam topics.
Senior and architect interviews focus on multi-region architecture, Well-Architected Framework (6 pillars), disaster recovery strategies (RPO/RTO), AWS Organizations with SCPs, Landing Zone design, cost optimization at scale, data lake architecture, migration strategies (7 Rs), and performance tuning across compute, storage, and networking.
Yes. We cover CloudFormation, CodePipeline/CodeBuild/CodeDeploy (CI/CD), ECS vs EKS container orchestration, CloudWatch monitoring and observability, X-Ray distributed tracing, Auto Scaling, and infrastructure automation — all key DevOps interview topics.
The AWS Well-Architected Framework provides 6 pillars for building reliable cloud architectures: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. AWS interviews frequently ask about trade-offs between these pillars and how to apply them in real designs.