Your GPU Bill Is a Black Box (And That's By Design)

I talked to a VP of Engineering last month who told me his GPU spend had gone from "rounding error" to "the single largest line item on the bill" in under six months.

He couldn't tell me what it was being used for.

Not because he didn't care — he'd been trying to figure it out for weeks. The cost data was there, technically. Somewhere between SageMaker charges, EC2 p5 instance hours, EBS throughput for model checkpoints, data transfer between training nodes, and S3 storage for every iteration of every model his team had ever experimented with.

The bill was $340K/month. His best guess at how it split between "necessary" and "waste"? He didn't have one.

Welcome to the GPU cost management era. It's worse than you think.

The Old Rules Don't Apply

Traditional cloud cost management has a playbook. Identify idle resources. Right-size instances. Buy commitments for steady-state workloads. Schedule dev environments. It's not sexy, but it works.

GPU workloads break every single one of those assumptions.

Idle detection doesn't work. A p5.48xlarge sitting at 2% GPU utilization might be "idle" — or it might be loading a 70-billion parameter model into memory, which takes 20 minutes before training even starts. Kill it and your ML engineer loses a day of work. That $98/hour instance suddenly looks cheap compared to the productivity hit.

Right-sizing is a nightmare. With CPU instances, you can usually drop from an m5.4xlarge to an m5.2xlarge and call it a day. With GPU instances, you're choosing between fundamentally different hardware architectures. An A10G is not a small H100. The performance cliff between GPU tiers is brutal, and the wrong choice means your training job takes 4x longer — which means you end up paying more, not less.

Commitments are terrifying. A 1-year Reserved Instance for a p5.48xlarge is a $500K+ commitment. For workloads that might shift to a completely different instance family when the next GPU generation drops in six months. I've watched companies lock into p4d reservations right before p5 launched and effectively pay a premium to run on last-gen hardware.

Scheduling doesn't scale. "Just turn off dev overnight" works for web apps. It doesn't work when your training job takes 72 hours and you can't checkpoint it without losing 8 hours of progress.

Why The Bill Is Deliberately Opaque

Here's the part that makes me angry.

Cloud providers have zero incentive to make GPU costs transparent. None. AWS doesn't break out your SageMaker bill into "time spent actually training" vs. "time spent waiting for data" vs. "time your notebook was open but nobody was at the keyboard." They bill you by the hour for the instance, and what happens during that hour is your problem.

This isn't an oversight. It's a business model.

When GPU instances cost $30-$100/hour, opacity is extraordinarily profitable. Every hour your team can't distinguish between productive compute and waste is money in the provider's pocket. And the tools they give you to analyze it? Cost Explorer shows you spent $340K on SageMaker. Cool. What are you supposed to do with that information?

Compare this to how they handle CPU workloads. AWS Compute Optimizer will tell you "this m5.xlarge is averaging 12% CPU utilization, consider downsizing." Where's the equivalent for GPUs? Where's the tool that says "this training job used 85% GPU for 6 hours then sat idle for 18 hours before someone remembered to stop the instance"?

It doesn't exist. Not because it can't — because it shouldn't, from their perspective.

The Three Traps I See Everywhere

After working with a dozen companies running serious AI workloads, I keep seeing the same patterns:

1. The Experimentation Graveyard

Every ML team runs experiments. Most experiments fail. That's fine — it's how the work gets done. What's not fine is that the infrastructure from failed experiments lives forever.

Training clusters that finished three weeks ago but nobody terminated. Model artifacts consuming terabytes of S3 storage for experiments nobody will ever revisit. Endpoint deployments still running for models that were superseded two iterations ago and now serve zero traffic.

One company I worked with had $80K/month in "zombie AI infrastructure" — resources tied to experiments that ended weeks or months ago. Nobody owned the cleanup because the ML engineers had moved on to the next experiment and the platform team didn't know what was safe to delete.
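
A first pass at finding those zombies can be a short script rather than a product. Here is a minimal sketch, assuming boto3 credentials and SageMaker's default "AllTraffic" variant name (yours may differ), that flags endpoints with zero invocations over the last two weeks:

```python
# zombie_endpoints.py -- flag SageMaker endpoints with no recent invocations.
# Assumes boto3 credentials and the default "AllTraffic" variant name.
from datetime import datetime, timedelta, timezone

import boto3

LOOKBACK_DAYS = 14

sagemaker = boto3.client("sagemaker")
cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
start = now - timedelta(days=LOOKBACK_DAYS)

for page in sagemaker.get_paginator("list_endpoints").paginate():
    for endpoint in page["Endpoints"]:
        name = endpoint["EndpointName"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/SageMaker",
            MetricName="Invocations",
            Dimensions=[
                {"Name": "EndpointName", "Value": name},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            StartTime=start,
            EndTime=now,
            Period=86400,  # one datapoint per day
            Statistics=["Sum"],
        )
        total = sum(dp["Sum"] for dp in stats["Datapoints"])
        if total == 0:
            print(f"ZOMBIE? {name}: 0 invocations in {LOOKBACK_DAYS} days "
                  f"(created {endpoint['CreationTime']:%Y-%m-%d})")
```

The same pattern works for training clusters and notebook instances. The data to find this waste already exists; it just isn't surfaced anywhere.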

2. The GPU Hoarding Problem

GPU instances are scarce. Everyone knows this. So teams request them early and hold them longer than needed "just in case." An ML engineer who fought for two days to get a p5 allocation isn't going to release it during lunch.

This creates a tragedy of the commons. Internal utilization on GPU instances is typically 30-40% — not because the work doesn't need GPUs, but because the incentive structure rewards hoarding over efficiency. When the alternative is a 48-hour wait to get your instance back, you keep it running through the weekend.

3. The "We'll Optimize Later" Spiral

AI projects start as experiments. Experiments don't get optimized — they get funded. By the time the workload is "production," the architecture is three layers of duct tape built during a proof-of-concept that was never supposed to run for more than a month.

Now you've got a production inference endpoint running on p5 instances because that's what the model was trained on, even though inference could run on a g5 at 1/10th the cost. But migrating means re-validating the model, which means the ML team needs two weeks, which means it goes on the backlog, which means it never happens.

I've seen inference workloads running on training-grade hardware for over a year because nobody prioritized the migration.

What Actually Works

I'm not going to pretend this is easy. GPU cost management is genuinely harder than traditional cloud cost optimization. But it's not impossible.

Track GPU utilization, not just instance uptime. NVIDIA's DCGM and cloud provider metrics can show you actual GPU utilization. If your p5 instances average 20% GPU utilization over a week, you have a scheduling problem, a sizing problem, or both. You need this data before you can make any decisions. (A sampling sketch follows below.)

Set lifecycle policies on everything. Training jobs should have maximum durations. Endpoints should have traffic thresholds below which they auto-terminate. Experiment artifacts should have TTLs. The default state for AI infrastructure should be "off" — make teams actively justify keeping things running, not the other way around. (A lifecycle-rule sketch follows below.)

Separate training from inference. These are fundamentally different workload profiles with fundamentally different cost structures. If your inference is running on training hardware, you're probably overpaying by 5-10x. This one migration often pays for the entire cost optimization effort.

Build a GPU chargeback model. When GPU time is "free" to teams, it gets wasted. When teams see that their experiment cost $12K, they start asking whether they really need to run the full hyperparameter sweep or whether a targeted search would work. Visibility changes behavior — but only when the bill goes to the people making the decisions. (A chargeback sketch follows below.)

Don't commit yet. The GPU landscape is changing every 6 months. NVIDIA's next generation, AMD's MI300, AWS's Trainium — the hardware options are shifting too fast for long-term commitments. Use Spot for training (with good checkpointing), on-demand for the rest, and revisit commitments when the market stabilizes.
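
On the utilization point, you don't have to wait for the provider to build the tool. Here is a minimal sampling sketch using NVML through the nvidia-ml-py package (an assumption; in practice you would push these samples to CloudWatch or Prometheus rather than a local CSV):

```python
# gpu_sampler.py -- log per-GPU utilization and memory once a minute.
# Assumes the nvidia-ml-py (pynvml) package is installed on the instance.
import csv
import time
from datetime import datetime, timezone

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

with open("gpu_utilization.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        ts = datetime.now(timezone.utc).isoformat()
        for i, handle in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            # timestamp, GPU index, GPU utilization %, memory used (MiB)
            writer.writerow([ts, i, util.gpu, mem.used // 2**20])
        f.flush()
        time.sleep(60)
```

A week of this data, joined against instance uptime, is usually enough to show whether you have a scheduling problem, a sizing problem, or both.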
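
The lifecycle policies are similarly mundane to set up. A sketch that expires experiment artifacts after 30 days (the bucket name and "experiments/" prefix are placeholders; anything worth keeping gets promoted out of that prefix deliberately):

```python
# Expire experiment artifacts automatically instead of hoping someone cleans up.
# Bucket name and prefix are placeholders for illustration.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-experiments-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-experiment-artifacts",
                "Filter": {"Prefix": "experiments/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
                # Also clean up failed multipart uploads of large checkpoints.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```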
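
And the chargeback model can start as a monthly report, not a platform. A sketch using the Cost Explorer API, assuming a "team" cost-allocation tag has been activated in billing (the tag key and the service filter are illustrative):

```python
# Last month's AI-related spend per team, grouped by a cost-allocation tag.
# Assumes a "team" tag is activated as a cost-allocation tag in billing.
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")  # Cost Explorer

end = date.today().replace(day=1)                 # first of this month
start = (end - timedelta(days=1)).replace(day=1)  # first of last month

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon SageMaker",
                                      "Amazon Elastic Compute Cloud - Compute"]}},
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0].split("$", 1)[-1]  # "team$value" -> "value"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value or 'untagged'}: ${cost:,.0f}")
```

Send that to the teams every month and the questions about whether the full hyperparameter sweep was really necessary start on their own.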

The $340K Question

That VP of Engineering I mentioned? We got his GPU spend to $195K/month in six weeks. Not by deploying a new tool or building a dashboard. By doing the work: killing zombie infrastructure, separating training from inference, implementing lifecycle policies, and building a chargeback model that made teams accountable.

$145K/month in savings. $1.74M annualized. From workloads that had been running for less than a year.

The GPU cost management problem isn't a technology problem. It's an ownership problem wrapped in an opacity problem, running on hardware that's too expensive to ignore and too new for anyone to have best practices.

If you're running AI workloads and you don't know exactly where your GPU dollars are going, you're not alone. But "everyone's confused" isn't a strategy.

Start looking. The waste is there. It always is.

---

Sam Greene is a FinOps practitioner and founder of FinOps Fanatics. Book a free consult to find out where your cloud spend is hiding.

Prefer to email us directly? support@finfan.cloud

We typically respond within 24 hours during business days.