The $40K/Month Database Nobody Knew Existed (And Other Cloud Horror Stories)

I've spent years helping companies clean up their AWS bills. Along the way, I've collected horror stories.

These are all real. Names and some details changed to protect the guilty.

The $40K/Month Mystery Database

A mid-sized SaaS company hired me to audit their AWS spend. Their bill had grown 300% in two years with no clear explanation.

Digging into RDS costs, I found a db.r5.8xlarge instance: 32 vCPUs, 256 GB RAM. $40,000/month.

"What application uses this?" I asked.

Silence. Nobody knew.

We traced it back through CloudTrail. An engineer had created it two years prior for "testing migration performance." The migration was cancelled. The engineer had left the company. The database sat there, empty, burning $480,000 per year.

Lesson: Someone should review your running resources monthly. Not quarterly. Not yearly. Monthly.
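
If you want a concrete starting point for that review, here's a minimal sketch (assuming boto3 and configured AWS credentials; it looks at the current region only) that lists every RDS instance alongside its peak connection count over the last 30 days. A 256 GB box with zero connections tends to jump off the page.

```python
# Minimal sketch (assumes boto3 and configured AWS credentials): list every RDS
# instance with its peak connection count over the last 30 days, so idle
# databases like the $40K/month mystery stand out in a monthly review.
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

for db in rds.describe_db_instances()["DBInstances"]:
    name = db["DBInstanceIdentifier"]
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
        StartTime=start,
        EndTime=end,
        Period=86400,          # one datapoint per day
        Statistics=["Maximum"],
    )
    peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0)
    print(f"{name:40} {db['DBInstanceClass']:>16}  peak connections (30d): {peak:.0f}")
```

Anything that shows zero connections for a month deserves a conversation, a final snapshot, and a delete.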

The Accidental Data Exfiltration

A startup's AWS bill jumped from $15K/month to $180K/month overnight. They thought they'd been hacked.

They hadn't. An engineer had misconfigured a data pipeline that was supposed to copy data between S3 buckets in the same region. Instead, it was copying from us-east-1 to... the public internet... and back into us-east-1.

The data transfer charges alone: $165,000.

"But I used a private endpoint!" the engineer said. He had. For one half of the pipeline.

Lesson: Data transfer is the silent killer. Always verify your network paths.
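
Here's a minimal sketch of one check worth automating (assuming boto3 and configured credentials; current region only): flag any VPC without an S3 gateway endpoint, since S3 traffic from those VPCs may be leaving through NAT or internet gateways and picking up transfer charges along the way.

```python
# Minimal sketch (assumes boto3 and configured credentials): flag VPCs in the
# current region that have no S3 gateway endpoint, meaning S3 traffic from
# those VPCs may be routed out via NAT/internet gateways and racking up
# data transfer charges.
import boto3

ec2 = boto3.client("ec2")

endpoints = ec2.describe_vpc_endpoints()["VpcEndpoints"]
vpcs_with_s3_gateway = {
    ep["VpcId"]
    for ep in endpoints
    if ep["ServiceName"].endswith(".s3") and ep["VpcEndpointType"] == "Gateway"
}

for vpc in ec2.describe_vpcs()["Vpcs"]:
    status = "ok" if vpc["VpcId"] in vpcs_with_s3_gateway else "NO S3 GATEWAY ENDPOINT"
    print(f"{vpc['VpcId']}  {status}")
```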

The "Temporary" Load Balancer

During an incident review, I found a company running 47 Application Load Balancers. They had 8 production services.

"Why 47 ALBs?" I asked.

Turns out, over the years, engineers had spun up "temporary" ALBs for testing, debugging, and one-off deployments. Each ALB costs ~$20/month minimum, plus traffic charges.

47 forgotten ALBs = $11,000+/year in pure waste. And that was just the ALBs — they also had associated target groups, security groups, and certificates cluttering up the account.

Lesson: "Temporary" resources need expiration dates.

The Staging Environment That Never Slept

A company prided itself on having "production-equivalent" staging environments. Same instance sizes. Same database configurations. Same everything.

Including the same costs. 24/7/365.

Their staging environment cost $67,000/month. It was used, optimistically, 50 hours per week.

A week has 168 hours. They were paying for 118 of them, every week, with staging just... sitting there.

We implemented auto-shutdown for nights and weekends. Savings: $45,000/month.

Lesson: Dev/test environments should sleep when you do.
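
The shutdown half can be a dozen lines. A minimal sketch, assuming boto3 and an Environment=staging tag convention of your own choosing; schedule it nightly (cron, or a Lambda on an EventBridge schedule) and pair it with a matching start script for the morning. Databases and other managed services need their own stop calls.

```python
# Minimal sketch (assumes boto3; the Environment=staging tag is a made-up
# convention): stop every running EC2 instance tagged as staging. Run it on
# a nightly schedule and pair it with a matching start script.
import boto3

ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instances")
instance_ids = []
for page in paginator.paginate(
    Filters=[
        {"Name": "tag:Environment", "Values": ["staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
):
    for reservation in page["Reservations"]:
        instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopped {len(instance_ids)} staging instances")
else:
    print("Nothing to stop")
```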

The Log Explosion

CloudWatch Logs: one of the sneakiest cost centers in AWS.

A fintech company had log retention set to "never expire" on every log group. Seemed sensible — logs are useful!

Over three years, they'd accumulated 47 terabytes of logs. Cost: $1,200/month in storage alone. Plus ingestion costs of $2,300/month for new logs.

Nobody had ever looked at logs older than 30 days. Ever.

We set retention to 30 days on most log groups, 90 days on security-relevant ones.

Immediate savings: $800/month. More as old logs expired.

Lesson: Log retention isn't free. Set policies early.
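
Setting that policy is a one-time script. A minimal sketch, assuming boto3: it applies 30-day retention to every log group that currently never expires. Exclude or extend the security-relevant groups before running anything like this.

```python
# Minimal sketch (assumes boto3): put a 30-day retention policy on every
# CloudWatch log group that currently has no retention set ("never expire").
# Security-relevant groups should be excluded or given a longer period first.
import boto3

logs = boto3.client("logs")

for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:  # absent means "never expire"
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=30,
            )
            print(f"Set 30-day retention on {group['logGroupName']}")
```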

The Snapshot Cemetery

EBS snapshots seem harmless. They're "just backups."

One company had 23,000 EBS snapshots. Twenty-three thousand.

Total cost: $8,400/month in snapshot storage.

When we audited them:

  • 67% were from instances that no longer existed
  • 22% were from AMIs that were no longer used
  • 8% were automated daily snapshots with no expiration
  • 3% were actually useful

We deleted 21,500 snapshots. The remaining 1,500 were migrated to cheaper tiers where possible.

Savings: $7,200/month.

Lesson: Snapshots need lifecycle policies. Always.
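
Finding the first bucket, snapshots whose source volume is long gone, is scriptable. A minimal sketch, assuming boto3; it doesn't check whether a snapshot backs a registered AMI, so treat the output as a review list, not a delete list.

```python
# Minimal sketch (assumes boto3): find EBS snapshots you own whose source
# volume no longer exists -- the biggest bucket in the audit above. It does
# not check whether a snapshot backs a registered AMI, so review before
# deleting anything.
import boto3

ec2 = boto3.client("ec2")

existing_volumes = set()
for page in ec2.get_paginator("describe_volumes").paginate():
    existing_volumes.update(v["VolumeId"] for v in page["Volumes"])

orphaned = []
for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
    for snap in page["Snapshots"]:
        if snap.get("VolumeId") not in existing_volumes:
            orphaned.append(snap["SnapshotId"])

print(f"{len(orphaned)} snapshots reference volumes that no longer exist")
```

Going forward, Amazon Data Lifecycle Manager can expire new snapshots automatically so the cemetery never refills.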

The Graviton That Wasn't

"We're using Graviton everywhere!" the VP of Engineering told me proudly.

They were not.

A quick check showed 94% of their EC2 fleet was still running on Intel instances. The "Graviton migration" had been announced 18 months prior. A few services had moved. Most hadn't.

The reason? No deadline. No accountability. "When we get around to it."

Potential savings from actually completing the migration: 22% of compute costs.

Lesson: Intentions aren't implementations. Track migration progress like any other project.
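
Tracking it doesn't require a dashboard project. A minimal sketch, assuming boto3: count running instances by CPU architecture, and "we're using Graviton everywhere" becomes a single percentage you can report every week.

```python
# Minimal sketch (assumes boto3): count running EC2 instances by CPU
# architecture so Graviton migration progress is a number, not a feeling.
from collections import Counter

import boto3

ec2 = boto3.client("ec2")

types_in_use = Counter()
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            types_in_use[instance["InstanceType"]] += 1

arch_counts = Counter()
type_names = list(types_in_use)
for i in range(0, len(type_names), 100):  # describe_instance_types caps at 100 per call
    resp = ec2.describe_instance_types(InstanceTypes=type_names[i:i + 100])
    for t in resp["InstanceTypes"]:
        archs = t["ProcessorInfo"]["SupportedArchitectures"]
        arch = "arm64 (Graviton)" if "arm64" in archs else "x86_64"
        arch_counts[arch] += types_in_use[t["InstanceType"]]

total = sum(arch_counts.values()) or 1
for arch, count in arch_counts.items():
    print(f"{arch:18} {count:6}  ({count / total:.0%})")
```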

The Multi-Region Mistake

A startup was running multi-region for "high availability." Admirable!

Except:

  • They had 11 customers total
  • All 11 were in the US
  • Their SLA was 99.9% (achievable with single-region)
  • They'd never actually tested failover

Multi-region cost: 2.3x single region (instances + data transfer).

We consolidated to single-region with multi-AZ. Same effective reliability for their actual requirements.

Savings: 57% of infrastructure costs.

Lesson: Multi-region is expensive. Make sure you actually need it.

Common Themes

Notice what all these stories have in common?

1. Nobody was watching. Resources existed for months or years without review.

2. "Temporary" became permanent. Test resources, debugging tools, one-off experiments — none of them had expiration dates.

3. Defaults were expensive. Log retention, snapshot policies, instance sizes — default settings cost more.

4. Assumptions were wrong. "We need multi-region." "We're using Graviton." "Staging should match prod." All wrong.

5. The fix was simple. Delete, resize, set policies. None of these required complex engineering.

Don't Be a Horror Story

Every month you don't look at your cloud bill, you're probably creating your own horror story.

The good news: the fixes are usually simple.

The bad news: you actually have to look.


Prefer to email us directly? support@finfan.cloud

We typically respond within 24 hours during business days.