Every cloud provider just became an AI company. AWS has Bedrock. Google has Vertex AI. Azure has OpenAI baked into everything. They're all falling over themselves to give you managed AI services.
How generous.
Here's what nobody's saying out loud: the AI services gold rush is the most sophisticated lock-in play in cloud history. And most companies are walking into it eyes wide shut.
The old lock-in was clumsy. Proprietary databases. Non-standard APIs. Egress fees that made moving data feel like crossing a border with a cargo ship. You could see it coming, and most teams at least knew they were making a tradeoff.
The AI lock-in is different. It's subtle, and it works because it disguises itself as innovation.
Here's how it goes:
Step 1: "Just use our managed service." Your ML team needs GPU clusters. Instead of managing Kubernetes with GPU nodes, why not use SageMaker? Or Vertex AI? Or Azure ML? It's faster to get started. Less ops burden. Makes total sense. Step 2: Custom integrations everywhere. Now your training pipelines are wired into the provider's MLOps tooling. Your feature store is their feature store. Your model registry is their model registry. Your inference endpoints use their proprietary serving infrastructure with their custom autoscaling. None of this is portable. Step 3: Data gravity does the rest. Your training data — terabytes of it — lives in S3, or GCS, or Azure Blob. Moving it costs real money. The models trained on that data are stored in the provider's format. The inference logs that feed your next training cycle? Same bucket. You're not using a cloud anymore. You're living inside one.I watched this exact sequence play out at three different companies in the last six months. One of them estimated the cost to migrate their AI workloads at $2.4 million and 14 months of engineering time. They've been on their provider for two years.
Two years. $2.4 million in switching costs. That's not a vendor relationship. That's captivity.
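Even the data-gravity step alone is quantifiable. A back-of-the-envelope sketch, assuming a flat ~$0.09/GB internet egress rate (an illustrative ballpark, not any provider's current price sheet):

```python
# Rough egress cost for moving training data off a cloud.
# The rate is an assumed ballpark; real pricing is tiered and varies
# by provider, region, and destination.

EGRESS_RATE_PER_GB = 0.09  # USD, assumed flat for simplicity

def egress_cost(terabytes: float) -> float:
    """Approximate USD cost to move `terabytes` out over the internet."""
    return terabytes * 1024 * EGRESS_RATE_PER_GB

for tb in (50, 200, 500):
    print(f"{tb:>4} TB -> ~${egress_cost(tb):,.0f} in egress alone")
# 500 TB is ~$46,000 before you've touched a single pipeline or model.
```

And that's just the bytes. The pipelines and proprietary model formats are where the 14 months of engineering time come from.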
Let's talk about what this costs in practice, because vague warnings don't change behavior.
A mid-stage startup I worked with last quarter was spending $340K/month on AWS, with about $120K of that going to AI-related services.
Total AI waste: roughly $55K/month. $660K/year. And they couldn't move any of it without a rewrite.
Compare that to a team I know running equivalent workloads on bare GPU instances with Kubernetes, open-source MLflow, and direct API access to model providers. Their cost for similar inference volume and training cycles: about $65K/month. Same results. Half the spend. And they can move to any provider in a weekend.
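That weekend portability comes from keeping the orchestration seam open. A minimal sketch of what provider-neutral experiment tracking looks like with MLflow; the tracking URI and experiment details are placeholders:

```python
import mlflow

# Placeholder URI: point at an MLflow server you run anywhere --
# EC2, GKE, a colo box. The training code below never changes.
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("instance_type", "any-gpu")  # the code doesn't care
    # ... training loop elided ...
    mlflow.log_metric("val_loss", 0.214)
    # mlflow.log_artifact("model.pt")  # artifacts land in whichever
    # store you configured (S3, GCS, MinIO); the call is identical.
```

Swap the cloud underneath and this code doesn't know. That's the whole trick.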
Every cloud provider runs the same playbook, just with different branding.
SageMaker Studio notebooks default to ml.t3.medium instances. Seems cheap at $0.05/hr. But the moment your data scientist needs GPU access, they click "change instance" and land on a $5/hr machine. There's no cost estimate. No warning. No "this will cost $3,600 if left running for a month." The convenience is the trap.
Google's Vertex AI does the same thing. Azure ML Studio does the same thing. They've all optimized the path to spinning up expensive resources and complicated the path to understanding what you're paying.
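None of this is hard to guard against; the providers just don't ship the guardrail. A sketch of the missing pre-launch warning, with illustrative placeholder rates (real prices vary by region and change constantly):

```python
# The warning the consoles don't give you: project the monthly cost of
# an instance before anyone clicks "change instance type".
# Rates below are illustrative placeholders, not current list prices.

HOURLY_RATES = {
    "ml.t3.medium": 0.05,
    "ml.g5.xlarge": 1.41,   # placeholder GPU rate
    "ml.p3.2xlarge": 3.83,  # placeholder GPU rate
}

HOURS_PER_MONTH = 730
BUDGET_ALERT_USD = 1_000  # hypothetical per-notebook monthly budget

def monthly_cost(instance_type: str) -> float:
    return HOURLY_RATES[instance_type] * HOURS_PER_MONTH

for itype in HOURLY_RATES:
    cost = monthly_cost(itype)
    flag = "  <-- exceeds budget, require approval" if cost > BUDGET_ALERT_USD else ""
    print(f"{itype:<16} ~${cost:>8,.0f}/month if left running{flag}")
```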
Use Bedrock? Now you need IAM roles, VPC endpoints, CloudWatch logging, and AWS-specific SDKs. Use Vertex AI? Now you need Google Cloud IAM, VPC Service Controls, and GCP-specific client libraries. Every integration point is a lock-in point.
I've heard engineers say "but it's just an API call." No, it's not. It's an API call that requires provider-specific auth, runs through provider-specific networking, logs to provider-specific monitoring, and stores results in provider-specific storage. That's not an API call. That's an ecosystem dependency.
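The way out isn't avoiding the API call; it's putting a seam between your application and the SDK. A sketch of one such seam, assuming boto3 for a Bedrock backend and any OpenAI-compatible HTTP endpoint for the portable one; the model IDs, URL, and payload shapes are illustrative:

```python
import json
from typing import Protocol

class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class BedrockModel:
    """Provider-specific backend. All AWS-isms live here and only here."""
    def __init__(self, model_id: str = "anthropic.claude-3-haiku-20240307-v1:0"):
        import boto3  # AWS-specific dependency stays inside this class
        self.client = boto3.client("bedrock-runtime")
        self.model_id = model_id

    def complete(self, prompt: str) -> str:
        # Payload shape varies per model family; this one is illustrative.
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        })
        resp = self.client.invoke_model(modelId=self.model_id, body=body)
        return json.loads(resp["body"].read())["content"][0]["text"]

class OpenAICompatibleModel:
    """Portable backend: any OpenAI-compatible endpoint, any host."""
    def __init__(self, base_url: str, api_key: str, model: str):
        self.base_url, self.api_key, self.model = base_url, api_key, model

    def complete(self, prompt: str) -> str:
        import requests
        resp = requests.post(
            f"{self.base_url}/v1/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model,
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

# Application code depends on TextModel, never on boto3 or requests.
def summarize(model: TextModel, text: str) -> str:
    return model.complete(f"Summarize in one sentence: {text}")
```

The specific backends don't matter. What matters is that switching providers becomes a constructor argument instead of a migration project.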
"Save 40% with a 1-year commitment on SageMaker!" Sounds great until you realize you're committing to a specific instance type on a specific service that's changing every six months. The AI hardware cycle is 12-18 months. You commit to p4d instances today, NVIDIA drops the next generation in eight months, and you're locked into last year's hardware at a "discount" that's actually a premium compared to what's available.
I've seen teams with $200K in unused SageMaker commitments because they shifted to a different instance family mid-year. That's not savings. That's a pre-paid loss.
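Run the math before signing. A toy model of how a 40% "discount" turns into a premium when the hardware refreshes mid-commitment; every number here is an assumption, not a quote:

```python
# Toy model of a 1-year committed-use discount that becomes a premium.
# All numbers are illustrative assumptions, not quoted prices.

ON_DEMAND_OLD = 32.77      # $/hr, current-gen GPU instance at signing
COMMIT_DISCOUNT = 0.40     # "save 40% with a 1-year commitment"
ON_DEMAND_NEW = 24.00      # $/hr, hypothetical next-gen price 8 months in
NEW_GEN_SPEEDUP = 1.8      # hypothetical throughput gain of the new gen

committed_rate = ON_DEMAND_OLD * (1 - COMMIT_DISCOUNT)   # 19.66 $/hr
# Normalize to cost per unit of work: new gen does 1.8x the work per hour.
new_gen_effective = ON_DEMAND_NEW / NEW_GEN_SPEEDUP      # 13.33 $/hr-equiv

print(f"Committed rate:           ${committed_rate:.2f}/hr")
print(f"New gen, per old-gen hr:  ${new_gen_effective:.2f}/hr-equivalent")
premium = committed_rate / new_gen_effective - 1
print(f"Your 'discount' is a {premium:.0%} premium for the remaining term")
```

Plug in your own workload's speedup and prices; the shape of the result rarely changes.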
I'm not saying "avoid managed AI services." That's impractical advice from people who've never shipped anything. But there's a difference between using a service and surrendering to it.
1. Abstract your ML pipeline. Use open-source tooling — MLflow, Kubeflow, Ray — as your orchestration layer. Let the cloud provide compute, not workflow. If your training job can't run on a different provider with a config change, you've built a cage.

2. Negotiate AI pricing separately. Most companies negotiate their compute and storage commitments but let AI services ride at list price. Bedrock, SageMaker, and Vertex AI all have enterprise pricing that isn't published. Ask for it. If your AI spend is over $50K/month, you have leverage.

3. Audit the markup. For every managed AI service, calculate what the equivalent self-managed cost would be. Not necessarily to switch — but to know your "convenience premium." If it's 15-20%, that might be worth it. If it's 60%, you've got a problem.

4. Set GPU utilization alerts at 30%. Most teams don't monitor GPU utilization on managed instances because the provider doesn't surface it prominently. Pull the metrics yourself; a minimal sketch follows after this list. If your training instances are below 30% GPU utilization, you're burning money on silicon that's doing nothing.

5. Review AI commitments quarterly, not annually. The AI landscape moves too fast for annual commitment cycles. If your provider won't offer 3-month or 6-month terms, that tells you everything about who benefits from the commitment.

Cloud providers aren't stupid. They know AI workloads are the fastest-growing segment of cloud spend. They know teams are moving fast and not reading the pricing page. And they know that once your ML pipeline is embedded in their ecosystem, you're not leaving.
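And the promised sketch for point 4: a poll-and-alert loop, assuming shell access to the node and nvidia-smi on the PATH (the threshold and interval are arbitrary):

```python
import subprocess
import time

UTILIZATION_FLOOR = 30  # percent; the alert threshold from point 4
POLL_SECONDS = 60       # arbitrary sampling interval

def gpu_utilization() -> list[int]:
    """Read per-GPU utilization via nvidia-smi (one integer per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

while True:
    for gpu_id, util in enumerate(gpu_utilization()):
        if util < UTILIZATION_FLOOR:
            # Wire this to Slack, PagerDuty, or whatever you actually use.
            print(f"ALERT: GPU {gpu_id} at {util}%, below the floor")
    time.sleep(POLL_SECONDS)
```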
The irony is that AI was supposed to commoditize cloud. Open models, standard APIs, containerized workloads that run anywhere. Instead, every provider is racing to build proprietary wrappers around commodity technology.
Don't let the wrapper become the product. The compute is the product. The models are the product. Everything else is packaging — and you're paying a premium for the box.
---
Sam Greene is a FinOps practitioner and founder of FinOps Fanatics. She helps teams stop bleeding cloud cash — especially the teams that don't realize AI is the new leak. Running AI workloads and watching the bill climb? Connect with Sam on LinkedIn — let's compare notes.