Case Studies · Last updated June 2026
Case Study: Cutting AWS Bills 20–40% Across 10+ Client Accounts Without Downtime
I've run AWS cost optimization reviews on more than ten client accounts. Each client owns a separate AWS account, so there's no Organizations-level trick that fixes everything at once and no shared Reserved Instance pool to lean on. What I have instead is a method that repeats well from account to account. This page documents it.
The Numbers
- 20–40% monthly cost reduction per environment, depending on how much waste had accumulated
- One anonymized mid-size client: roughly $3,800/month down to roughly $2,300/month
- Zero downtime and zero performance regressions across every change
- Non-production compute runtime hours cut ~65% through scheduling
- Costs stayed flat afterwards because the fixes were codified into Terraform module defaults rather than applied by hand
How Ten Different Accounts Accumulate the Same Waste
Every account I reviewed had the same shape of problem: years of deferred cleanup. Instances were sized "to be safe" at launch and never revisited. Dev and staging ran 24/7 for teams that work eight hours a day. Unattached EBS volumes and stale snapshots lingered from servers that no longer existed. Volumes created years ago were still on gp2, paying the old per-GB rate because nobody had done the gp2 to gp3 migration.
Then the quieter stuff. NAT gateway data-processing charges that nobody had attributed to any workload. CloudWatch log groups with retention set to never expire, growing without limit. And no cost allocation tagging anywhere, which meant nobody could answer the basic question of which service cost what, so the bill was a single opaque number that kept going up.
The Rules I Worked Under
- These are production client workloads. No downtime windows, and no change that puts performance at risk.
- Every change must be explainable to a non-technical client who pays the bill.
- Each client has their own AWS account, so there is no org-wide RI or Savings Plan sharing. Every account gets optimized on its own.
- Fixes must be structural. If a saving depends on someone remembering to run a cleanup script every quarter, it decays.
Step 1: Visibility Before Cuts
The first instinct on a cost project is to start deleting things. I don't. If you cut before you can measure, you can't show the client what changed and you can't catch the next regression.
AWS cost allocation tags, enforced in CI
Every resource gets three tags: Project, Environment, and ManagedBy. The shared Terraform modules apply them automatically, and a tag-policy check in CI fails the plan when a resource is missing them — the same OPA/conftest gates I described in the DevSecOps pipeline case study. Tagging that relies on people remembering it stops working within weeks, which is why the check lives in the pipeline.
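One way the automatic tagging could be wired is the AWS provider's `default_tags` block. This is a minimal sketch, not the actual module code: the variable names `project`, `environment`, and `region` and the `ManagedBy` value are assumptions for illustration.

```hcl
# Hypothetical root-module inputs; names are illustrative only.
variable "project" {
  type = string
}

variable "environment" {
  type = string
}

variable "region" {
  type = string
}

provider "aws" {
  region = var.region

  # default_tags is merged into the tags of most taggable resources
  # created through this provider, so no resource block has to repeat them.
  default_tags {
    tags = {
      Project     = var.project
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}
```

Tags only become a grouping dimension in Cost Explorer once they have been activated as cost allocation tags in the Billing console, so the tagging and the activation belong together.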
Budgets, alerts, and Cost Explorer
Each account gets AWS Budgets with alert thresholds, so overspend notifies someone instead of appearing on next month's invoice. Once tags are active, Cost Explorer grouped by tag finally shows where the money goes. In practice, every account's top five line items turned out to be the same: compute, RDS, NAT, EBS and snapshots, and logs. I used that list as the working checklist for the steps below.
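A per-account budget with alert thresholds might look like the following sketch. The budget name, the 80% and 100% thresholds, and both variables are assumptions chosen for illustration, not values from the engagements described here.

```hcl
variable "monthly_budget_usd" {
  description = "Hypothetical monthly budget for the account, in USD."
  type        = number
}

variable "alert_emails" {
  description = "Hypothetical list of addresses to notify."
  type        = list(string)
}

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-account-budget"
  budget_type  = "COST"
  limit_amount = tostring(var.monthly_budget_usd) # the provider expects a string
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Notify when actual spend crosses 80% of the budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = var.alert_emails
  }

  # Notify when the forecast says the month will end over budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = var.alert_emails
  }
}
```

The forecast-based notification is what turns the budget into an early warning: it can fire mid-month, before the actual spend has crossed anything.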
Step 2: The Zero-Risk Cleanup
Before touching anything that serves traffic, there is a whole category of waste that can be removed with no risk to the workload at all:
- Delete unattached EBS volumes. They bill at full price whether or not anything reads them.
- Put snapshot and AMI lifecycle policies in place so old images age out automatically.
- Remove idle load balancers and release unused Elastic IPs.
- Migrate gp2 volumes to gp3.
- Set CloudWatch Logs retention to 14–90 days depending on environment, replacing the default of never expiring.
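For the snapshot lifecycle item, one option is Amazon Data Lifecycle Manager. The sketch below assumes DLM is the mechanism, and the `Backup = "daily"` target tag, the 14-snapshot retention, and the `dlm_role_arn` variable are all hypothetical. DLM only ages out snapshots it created itself, so snapshots that predate the policy still need a one-time cleanup.

```hcl
variable "dlm_role_arn" {
  description = "Hypothetical ARN of an IAM role that DLM can assume."
  type        = string
}

resource "aws_dlm_lifecycle_policy" "daily_snapshots" {
  description        = "Daily EBS snapshots, keep the most recent 14"
  execution_role_arn = var.dlm_role_arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    # Only volumes carrying this tag are snapshotted (tag is an assumption).
    target_tags = {
      Backup = "daily"
    }

    schedule {
      name = "daily-keep-14"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"] # UTC
      }

      # Older snapshots created by this policy are deleted automatically.
      retain_rule {
        count = 14
      }

      copy_tags = true
    }
  }
}
```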
The gp2 to gp3 migration deserves its own mention because people assume storage changes need downtime. They don't. gp3 is roughly 20% cheaper per GB than gp2, the change is a single volume modification, and it happens online while the volume stays attached and in use.
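In Terraform terms the migration is an in-place change to the volume's `type`. The sketch below uses placeholder values for the availability zone and size. The explicit `iops` and `throughput` lines show the gp3 baseline; they matter mainly for large volumes, because gp2 scales at 3 IOPS per GiB and a gp2 volume above roughly 1,000 GiB already exceeds the 3,000 IOPS that gp3 provides by default.

```hcl
resource "aws_ebs_volume" "data" {
  availability_zone = "eu-west-1a" # placeholder
  size              = 200          # GiB, placeholder

  type       = "gp3" # previously "gp2"; applied as an online modification
  iops       = 3000  # gp3 baseline; raise to match a large gp2 volume
  throughput = 125   # MiB/s, gp3 baseline
}
```

Checking the old volume's effective IOPS and throughput before the switch keeps the change inside the no-performance-regression rule.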
Step 3: Right-Sizing EC2 and RDS From Utilization Data
Right-sizing EC2 is where cost work gets its bad reputation, because aggressive downsizing causes the performance incidents that make clients distrust the whole exercise. I only downsized where CloudWatch metrics and AWS Compute Optimizer agreed that p95 utilization justified it. I sized against p95 rather than the average because averages hide the Monday-morning spike.
Fargate tasks got the same treatment from container-level metrics, with CPU and memory trimmed to what the containers actually use. The cadence was always one size down, then observe. If a downsize held up under a week or two of real traffic, I'd consider the next step; I never dropped an instance two sizes at once.
Step 4: Stop Non-Production Instances on a Schedule
Dev and staging environments were running around the clock for teams that work business hours. A small Lambda function on EventBridge schedules stops non-production EC2 and RDS instances in the evening, starts them before the workday, and leaves them off over the weekend. That alone cut non-production compute runtime hours by about 65%.
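The scheduling side of this can be sketched in Terraform as two EventBridge rules that invoke the same function with a different action. This is an illustration under assumptions: the Lambda itself is not shown, `scheduler_lambda_arn` is a hypothetical variable, the `action` payload shape is invented for the example, and the 07:00 and 19:00 UTC times are placeholders.

```hcl
variable "scheduler_lambda_arn" {
  description = "Hypothetical ARN of the start/stop Lambda function."
  type        = string
}

locals {
  # EventBridge cron fields: minute hour day-of-month month day-of-week year (UTC).
  schedules = {
    stop  = "cron(0 19 ? * MON-FRI *)" # weekday evenings
    start = "cron(0 7 ? * MON-FRI *)"  # weekday mornings
  }
}

resource "aws_cloudwatch_event_rule" "nonprod" {
  for_each            = local.schedules
  name                = "nonprod-${each.key}"
  schedule_expression = each.value
}

resource "aws_cloudwatch_event_target" "nonprod" {
  for_each = local.schedules
  rule     = aws_cloudwatch_event_rule.nonprod[each.key].name
  arn      = var.scheduler_lambda_arn

  # The function receives {"action": "stop"} or {"action": "start"}.
  input = jsonencode({ action = each.key })
}

resource "aws_lambda_permission" "nonprod" {
  for_each      = local.schedules
  statement_id  = "AllowEventBridge-${each.key}"
  action        = "lambda:InvokeFunction"
  function_name = var.scheduler_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.nonprod[each.key].arn
}
```

With this assumed schedule, instances run 12 hours on 5 days, which is 60 of the 168 hours in a week. That is a reduction of about 64%, in line with the figure above. Because nothing starts the instances after Friday's stop until Monday morning, the weekend is covered without a separate rule.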
Instance Scheduler on AWS, the solution AWS maintains for this, is the packaged alternative for teams that prefer not to maintain their own function. I went with the plain Lambda because the logic fits on one screen and every account already had a pipeline to deploy it through.
Step 5: Architecture Fixes — NAT Gateway Costs and Storage Tiers
Cutting NAT gateway data-processing charges
A NAT gateway bills an hourly charge plus a data-processing fee on every gigabyte that passes through it. In several accounts, ECR image pulls routed through the NAT gateway were a top-three line item: every container deploy was paying NAT rates to fetch images from a service that lives inside AWS. A VPC gateway endpoint for S3, which is free, plus interface endpoints for ECR and CloudWatch moved that traffic off the NAT path entirely.
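A sketch of the endpoints in Terraform follows. All five variables are hypothetical, and the choice of `ecr.api`, `ecr.dkr`, and `logs` as the interface services is an assumption about what "ECR and CloudWatch" covers in a given account. The S3 gateway endpoint matters for image pulls because ECR stores image layers in S3.

```hcl
variable "region" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "private_route_table_ids" {
  type = list(string)
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "endpoint_security_group_id" {
  description = "Security group allowing HTTPS (443) from inside the VPC."
  type        = string
}

# Gateway endpoint: no hourly or per-GB charge, attached via route tables.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoints: billed hourly and per GB, at a lower per-GB rate than NAT.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "logs"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true
}
```

Interface endpoints carry their own hourly cost per availability zone, so they are worth adding where the traffic they absorb is large enough to outweigh it.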
S3 lifecycle tiers for logs and backups
Logs and backups don't need S3 Standard forever. Lifecycle rules transition them to Infrequent Access and then Glacier on a timeline that matches how often anyone actually retrieves them, which for most backups is close to never.
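A lifecycle rule of this shape could look like the sketch below. The bucket variable, the `logs/` prefix, and the 30 and 90 day thresholds are assumptions for illustration; the real timeline should follow how often the data is actually retrieved.

```hcl
variable "log_bucket_id" {
  description = "Hypothetical name of the bucket holding logs and backups."
  type        = string
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = var.log_bucket_id

  rule {
    id     = "tier-down-logs"
    status = "Enabled"

    # Only objects under this prefix are affected.
    filter {
      prefix = "logs/"
    }

    # S3 requires objects to be at least 30 days old before moving to Standard-IA.
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
```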
Savings Plans came last, deliberately. A commitment priced against an oversized fleet locks in the waste. Only after right-sizing and scheduling had stabilized each account's baseline did I size a Savings Plan against it, one commitment per account.
Step 6: Codify Every Fix Into Terraform Module Defaults
This is the step that makes the savings hold. Every fix from the steps above became a default in the shared Terraform modules that all client environments are built from — the same module layer described in the multi-environment Terraform case study. New volumes are gp3 unless explicitly overridden. Log groups get finite retention. Plans fail without the required tags. Every existing environment inherits the fix on its next apply, and every future environment starts with it.
# modules/app-service/variables.tf
variable "volume_type" {
  description = "EBS volume type. gp3 is ~20% cheaper per GB than gp2."
  type        = string
  default     = "gp3"
}

variable "log_retention_days" {
  description = "CloudWatch Logs retention. Override per environment."
  type        = number
  default     = 30
}

variable "tags" {
  description = "Tags applied to every resource in the module."
  type        = map(string)

  # Fails the plan when any required cost allocation tag is missing.
  validation {
    condition = alltrue([
      for key in ["Project", "Environment", "ManagedBy"] :
      contains(keys(var.tags), key)
    ])
    error_message = "Tags must include Project, Environment, and ManagedBy."
  }
}

# Inputs referenced by main.tf
variable "availability_zone" {
  type = string
}

variable "volume_size" {
  type = number
}

variable "log_group_name" {
  type = string
}

# modules/app-service/main.tf
resource "aws_ebs_volume" "data" {
  availability_zone = var.availability_zone
  size              = var.volume_size
  type              = var.volume_type # gp3 unless overridden
  tags              = var.tags
}

resource "aws_cloudwatch_log_group" "app" {
  name              = var.log_group_name
  retention_in_days = var.log_retention_days # finite by default
  tags              = var.tags
}

Nothing in the list above requires a recurring cleanup sweep. The defaults do the work.
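For illustration, an environment consuming the module might look like the sketch below. The module path, the availability zone, the names, and the 14-day staging retention are assumptions; only the input names come from the module excerpt.

```hcl
module "app_service_staging" {
  source = "../../modules/app-service" # path is an assumption

  availability_zone = "eu-west-1a"
  volume_size       = 50
  log_group_name    = "/app/staging"

  # Shorter retention for non-production; volume_type is left at its gp3 default.
  log_retention_days = 14

  # Omitting any of these three keys fails the plan with the module's error message.
  tags = {
    Project     = "example-app"
    Environment = "staging"
    ManagedBy   = "terraform"
  }
}
```

CloudWatch Logs accepts only specific retention values (14, 30, 60, and 90 among them), so a per-environment override has to pick from that set.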
What the Next Invoices Showed
Across the fleet, the average reduction landed around 30%, with individual environments between 20% and 40% depending on their starting waste. The anonymized mid-size client from the summary went from roughly $3,800 to roughly $2,300 a month, a cut of about 39% that sits at the top of that range, with no downtime and no performance regressions along the way.
The visibility work kept paying off after the project ended. Budget alerts have since caught cost regressions — a misconfigured log level, an orphaned NAT gateway — within days instead of at invoice time. And because everything is tagged, clients now get a per-tag breakdown showing what each project and environment actually costs.
Things I Got Wrong the First Time
On the first account I enforced tagging only partway through, so untagged historical spend made the analysis slow and I was attributing costs from resource naming conventions and educated guesses. Budget alerts went in near the end of that engagement, when they would have been just as useful while the cleanup was still in progress. I also left Graviton as a future item on accounts where stateless workloads could have migrated in the first pass. Since then, cost has become a monthly review on every account, because waste keeps accumulating and a short look at Cost Explorer catches it while it's still small.