Two Tales of Compute: The Battle for AI's Operational Future (Part 2)
Training compute sets the frontier, but inference compute determines profitability. In Part 2, I examine efficiency breakthroughs, vertical integrations, and the AI boom's fragile foundations.

In Part 1 of this compute series, I examined how Oracle’s historic $244 billion single-day market cap surge signaled a fundamental transformation in AI economics. The catalyst was a staggering $455 billion compute backlog anchored by OpenAI’s $300 billion commitment. This highlighted how AI has shattered the Moore’s Law paradigm that governed technology for six decades.
Where Moore’s Law promised exponentially cheaper compute every two years, AI’s transformer architecture demands exponentially more expensive compute for linear capability improvements. GPT-5’s estimated $500 million training cost and projections of 10x increases for next-generation models illustrate this inversion: we’ve moved from exponential improvement at declining costs to exponential costs for incremental gains.
This shift has created an oligopolistic training compute market in which competitive survival depends on access to massive, synchronized GPU clusters, the kind that require 100+ MW data centers and billions in capital. The interdependencies are stark: OpenAI relies on Microsoft’s Azure and NVIDIA’s chips, Anthropic depends on AWS’s custom Trainium processors, and even “independent” players like xAI must partner with Oracle while building their own infrastructure.
This creates a precarious equilibrium: if any part of this intricate web of partnerships buckles, the economic collapse could be massive. Yet no player can afford caution because waiting means falling irrecoverably behind in the capital-intensive race that AI has become.
Part 2: Inference Compute – The Economics of Usage
If training compute is the “capex” of AI, inference compute is its operational lifeblood.
Every query processed, every agentic action executed, every token generated represents an ongoing “opex” that compounds relentlessly. The economics here operate under different physics than training: while training happens once, inference happens billions of times daily, transforming the cost structure from a one-time capital investment to a perpetual operational burden that defines unit economics.
The mathematics of inference reveal why efficiency has become existential.
A single ChatGPT query reportedly costs approximately $0.003 in compute, seemingly negligible until multiplied by OpenAI’s 800 million weekly users generating multiple queries each. The full calculation becomes sobering: assuming five queries per user weekly, that’s 4 billion queries, or $12 million in weekly compute costs. That’s $624 million annually just for inference.
Over a model’s operational lifetime, inference costs routinely exceed training costs by factors of three to ten.
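To make the arithmetic concrete, here is a minimal back-of-envelope sketch using the figures above; the per-query cost, user count, and queries-per-user rate are the estimates cited in this piece, not measured values:

```python
# Back-of-envelope inference cost model using the estimates cited above.
# All inputs are assumptions from the article, not measured figures.
cost_per_query = 0.003          # USD of compute per ChatGPT query (estimate)
weekly_users = 800_000_000      # reported weekly active users
queries_per_user_week = 5       # assumed average

weekly_queries = weekly_users * queries_per_user_week   # 4.0 billion queries
weekly_cost = weekly_queries * cost_per_query           # ~$12 million
annual_cost = weekly_cost * 52                          # ~$624 million

training_cost = 500_000_000     # estimated GPT-5 training cost (from Part 1)
print(f"Weekly inference cost: ${weekly_cost:,.0f}")
print(f"Annual inference cost: ${annual_cost:,.0f}")
# Over a multi-year operating life, this ratio compounds into the 3-10x cited above.
print(f"Annual inference vs. training cost: {annual_cost / training_cost:.2f}x")
```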
This reality has spawned an entire ecosystem of optimization techniques, each attacking different aspects of the inference cost structure:
Quantization reduces precision from FP16 to INT8 or even INT4, cutting compute requirements by 75% with minimal accuracy loss (see the sketch after this list).
Knowledge distillation creates smaller “student” models that approximate larger “teacher” models at a fraction of the cost.
Dynamic batching groups queries to maximize GPU utilization from the typical 30% to over 80%.
Techniques like FlashAttention reduce memory bandwidth requirements by 10x through clever algorithmic optimizations.
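To illustrate the first of these techniques, here is a minimal NumPy sketch of symmetric INT8 weight quantization. Production stacks (PyTorch, TensorRT, and similar) add calibration and fused low-precision kernels, but the storage arithmetic is the same; the matrix size and scaling scheme below are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch of symmetric INT8 weight quantization. Real frameworks add
# calibration and fused kernels, but the memory arithmetic is identical:
# 1 byte per weight instead of 2 (FP16), and INT4 halves it again.
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)

scale = float(np.abs(weights_fp16).max()) / 127.0                 # map the FP16 range onto [-127, 127]
weights_int8 = np.clip(np.round(weights_fp16 / scale), -127, 127).astype(np.int8)
dequantized = weights_int8.astype(np.float16) * scale             # approximate reconstruction

print(f"FP16 size: {weights_fp16.nbytes / 1e6:.1f} MB")
print(f"INT8 size: {weights_int8.nbytes / 1e6:.1f} MB")
print(f"Mean abs reconstruction error: {np.abs(weights_fp16 - dequantized).mean():.5f}")
```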
Unlike training, where NVIDIA’s H100s hold a near-monopoly, the hardware landscape for inference has fragmented. A Cambrian explosion of specialized silicon is chasing the inference economics that will determine AI’s unit costs:
AWS Inferentia2 claims to deliver 40% better price-performance.
Google’s TPUv5e offers 2.5x efficiency gains.
Startups like Groq claim 10x latency improvements.
Hyperscalers desperately want to own inference because it represents the recurring revenue stream that justifies their massive infrastructure investments. Unlike training, which happens episodically, inference generates predictable, high-margin revenue streams that compound with user growth.
Microsoft’s Azure OpenAI Service, processing an estimated 100 billion tokens daily, generates projected annual revenues exceeding $4 billion at 70% gross margins after accounting for infrastructure costs. Google’s Vertex AI, AWS’s Bedrock, and Oracle’s new AI Database service all represent attempts to capture this inference value chain.
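Reversing those projections gives a rough sense of the implied economics. The per-token figure below is my own back-of-envelope derivation from the numbers above, not a disclosed price:

```python
# Rough reverse calculation from the projected figures above; the implied
# per-token price is derived from them, not a published rate card.
tokens_per_day = 100e9          # estimated Azure OpenAI throughput
annual_revenue = 4e9            # projected annual revenue (USD)
gross_margin = 0.70

tokens_per_year = tokens_per_day * 365                        # ~36.5 trillion tokens
revenue_per_1k_tokens = annual_revenue / tokens_per_year * 1_000
gross_profit = annual_revenue * gross_margin

print(f"Implied blended revenue: ${revenue_per_1k_tokens:.4f} per 1K tokens")
print(f"Implied annual gross profit: ${gross_profit / 1e9:.1f}B")
```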
The stickiness of inference creates powerful lock-in effects that explain why inference pricing has remained surprisingly stable despite improving hardware efficiency. Once applications integrate with an inference API, switching costs become prohibitive.
The agentic paradigm fundamentally transforms inference economics in ways we’re only beginning to understand.
As detailed in our analysis of agentic cloud requirements, agents don’t just query once. They chain dozens or hundreds of steps together, multiplying token consumption by 10x to 100x compared to traditional chat interfaces. An agent booking a flight might generate 50,000 tokens across research, comparison, booking, and confirmation steps, versus 500 tokens for a simple ChatGPT query.
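A quick sketch of what that multiplier does to unit costs, using the example figures above and a hypothetical blended price per thousand tokens (the price is an assumption for illustration only):

```python
# Illustrative chat vs. agent token consumption, using the example figures
# above; the blended price per 1K tokens is a hypothetical assumption.
price_per_1k_tokens = 0.01      # assumed blended inference price (USD)

chat_tokens = 500               # simple ChatGPT query
agent_tokens = 50_000           # flight-booking agent: research, compare, book, confirm

chat_cost = chat_tokens / 1_000 * price_per_1k_tokens
agent_cost = agent_tokens / 1_000 * price_per_1k_tokens

print(f"Chat query:     {chat_tokens:>6} tokens  ~${chat_cost:.4f}")
print(f"Agent workflow: {agent_tokens:>6} tokens  ~${agent_cost:.4f}")
print(f"Multiplier: {agent_tokens / chat_tokens:.0f}x")   # the top of the 10x-100x range
```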
More critically, agentic workloads exhibit different patterns: they require persistent memory states consuming gigabytes of high-speed storage, necessitate sub-100ms latency for responsive interaction, and demand guaranteed availability since agent failures cascade through entire workflows.
This explosion in inference demand explains why efficiency breakthroughs like Qwen3-Next represent not just incremental improvements but potential paradigm shifts. Its radical architecture could fundamentally reprice agentic economics. By activating only 3 billion of 80 billion parameters through an ultra-sparse mixture of experts, Qwen3-Next achieves 10x faster inference on long contexts. When inference workloads grow exponentially, linear efficiency gains become survival imperatives.
The company that reduces inference costs by 90% does more than just improve margins. It enables entirely new use cases that were previously economically impossible.
Part 3: The Race Ahead
The trajectory of AI compute points to three interlocking races that will determine the industry’s structure for the next decade. Each represents a different dimension of the same underlying struggle: controlling the means of intelligence production in a post-Moore’s Law world.
As I noted in the first edition of this compute mini-series yesterday, the big players have created an extraordinary web of interdependencies.
They are now trying to disentangle themselves from each other, or at least re-balance the power dynamics in these relationships. As such, the first race centers on vertical integration, as every player scrambles to unify training and inference under one roof.
OpenAI fired a massive salvo across the entire industry’s bow on Tuesday. First came the announcement that Nvidia will invest $100 billion in OpenAI, which will deploy 10 gigawatts of compute starting next year. Meanwhile, OpenAI, Oracle, and SoftBank disclosed that development of five new U.S. AI data center sites under their Stargate joint venture is running ahead of schedule, putting the partners on track to hit their target of 7 gigawatts of new compute capacity by the end of this year.
Sources suggest OpenAI seeks to build its own training clusters, recognizing that Microsoft’s infrastructure, while vast, must serve multiple customers with competing priorities.
Anthropic’s exploration of custom chip development with AWS follows the same logic, with reports of a ‘Rainier’ compute cluster specifically optimized for constitutional AI training methods.
The hyperscalers themselves are vertically integrating in reverse:
Microsoft’s Phi models, trained on synthetic data to significantly reduce compute requirements while posting gains of up to 50% on specific benchmarks, represent an attempt to commoditize the model layer.
Google’s Gemini family leverages TPU advantages to offer superior price-performance.
Amazon’s Titan series aims to provide “good enough” models that keep customers within the AWS ecosystem.
This consolidation leaves neocloud companies in an increasingly narrow position over the medium term. As enterprise inference migrates to hyperscalers with superior SLAs and model providers internalize their compute needs, the independent GPU cloud becomes a transitional phenomenon rather than a permanent fixture.
The second race revolves around efficiency breakthroughs that could reset the entire cost structure.
At the model level, Alibaba’s Qwen3-Next represents the vanguard of a new architectural paradigm. Its hybrid attention mechanism handles long contexts at a fraction of the usual cost without sacrificing accuracy, while its ultra-sparse Mixture-of-Experts design, with 512 experts but only 11 active per token (roughly 3.7% of parameters), pushes sparsity to theoretical limits while maintaining quality through sophisticated routing. Multi-token prediction, which generates several tokens in parallel rather than strictly one at a time, accelerates inference further, a technique now being adopted by Meta’s Llama 3.2 and Google’s latest Gemini models.
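For readers who want to see the sparsity mechanic itself, here is a minimal sketch of top-k expert routing using the 512-expert, 11-active figures above; the hidden size, gating function, and expert layers are deliberately simplified stand-ins, not Qwen3-Next’s actual implementation:

```python
import numpy as np

# Minimal sketch of ultra-sparse Mixture-of-Experts routing: 512 experts with
# only 11 active per token, the figures cited above. Dimensions and the gating
# function are simplified for illustration, not Qwen3-Next's actual design.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 512, 11          # small hidden size to keep the demo light

x = rng.standard_normal(d_model)                 # one token's hidden state
router = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02  # tiny FFN stand-ins

logits = x @ router                              # router score for every expert
chosen = np.argsort(logits)[-top_k:]             # keep only the top-k experts
weights = np.exp(logits[chosen] - logits[chosen].max())
weights /= weights.sum()                         # softmax over the selected experts only

# Only the 11 chosen experts run for this token; the other 501 stay idle.
output = sum(w * (experts[e] @ x) for w, e in zip(weights, chosen))

print(f"Experts active per token: {top_k}/{n_experts} ({top_k / n_experts:.1%})")
print(f"Output shape: {output.shape}")
# With shared components included, this is how the full model lands at roughly
# 3 billion of 80 billion parameters active per token.
```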
The open-source nature of these innovations accelerates diffusion and prevents any player from maintaining efficiency advantages for long. Consider that Qwen3-Next garnered rapid adoption, with millions of downloads reported shortly after release.
These aren’t marginal optimizations. They’re order-of-magnitude improvements that could fundamentally reprice AI consumption.
The third race—and perhaps most underappreciated—is Oracle Cloud Infrastructure’s positioning as the “Switzerland of AI.” Oracle’s success defies conventional wisdom about cloud market dynamics. While AWS, Azure, and GCP fight for dominance, Oracle has carved out a unique niche as the neutral platform where competitors collaborate.
Yet, even Oracle’s strategic positioning can’t escape physical constraints.
A 100-megawatt data center consumes roughly 2 million liters of water daily for cooling. That’s equivalent to 6,500 households. Microsoft has addressed this with closed-loop liquid cooling at Fairwater, achieving zero water waste for 90% of capacity, but most facilities lack such sophistication. Phoenix expects its data center power capacity to grow 500% in the coming years, enough to support 4.3 million households. Virginia, with 50 new data centers planned, will require Dominion Energy to triple the state’s grid capacity. These aren’t just engineering challenges; they’re hard physical limits on how fast the infrastructure can scale.
Conclusion: Compute Will Make or Break AI
This new reality bifurcates into two distinct but interrelated races.
Access to training compute remains rare, oligopolistic, and frontier-setting. It’s the domain of massive capital deployments and strategic partnerships that determine who can push the boundaries of intelligence. The players are few, the stakes enormous, and the barriers to entry grow higher daily.
When training a competitive foundation model requires $500 million in compute, the game is limited to those with extraordinary resources or partnerships.
In contrast, inference compute is becoming ubiquitous, efficiency-driven, and adoption-defining: a daily battle over how pennies per query compound into billions in profit or loss. Here, innovation in efficiency matters more than raw scale, and clever optimization can disrupt established players.
The agentic revolution amplifies these dynamics to the breaking point.
When agents chain hundreds of reasoning steps, when they maintain persistent memory states requiring gigabytes of high-speed storage, when they coordinate in multi-agent swarms exchanging millions of messages, the computational requirements explode exponentially. A single agentic workflow can consume more compute than a thousand traditional queries. The companies that survive this transition won’t necessarily be those with the best models or the most innovative algorithms. They will be those who master the brutal economics of compute in a post-Moore’s Law world.
The AI industry’s structure is crystallizing around compute control, but this crystallization creates systemic risks. The entire edifice rests on faith—and funding.
Pessimistic? Perhaps.
But it’s also clear-eyed recognition that today’s compute economy rests on the narrowest of foundations: the belief that scaling laws will continue to deliver, that AGI remains achievable, and that someone will eventually pay for all this compute.
The efficiency breakthroughs represented by Qwen3-Next and similar innovations don’t eliminate compute requirements—they expand the addressable market by making previously impossible use cases economically viable. When inference costs drop by 90%, applications that were fantasy become reality. This is the paradox of efficiency in AI: making compute cheaper doesn’t reduce demand, it explodes it by enabling new consumption patterns we haven’t yet imagined.
The race ahead isn’t about ideas or innovation in the abstract. It’s about securing capacity, driving efficiency, and achieving integration across the stack. Those who control training compute will determine what intelligence is possible. Those who optimize inference will determine what intelligence is practical. And those who integrate both will capture the value.
The fight for AI dominance continues at a relentless pace, and this week’s events have only underscored compute as its dominant theme.
Now listen to OpenAI President Greg Brockman speaking on CNBC about the Nvidia investment:
“You really want every person to be able to have their own dedicated GPU, right? So you’re talking on order of 10 billion GPUs. We’re going to need this deal we’re talking about, it’s for millions of GPUs. We’re still three orders of magnitude off of where we need to be. So we’re doing our best to provide compute availability, but we’re heading to this world where the economy is powered by compute, and it’s going to be a compute-scarce one.”
Subsequently, CEO Sam Altman published a short but sweeping manifesto called “Abundant Intelligence”:
“To be able to deliver what the world needs—for inference compute to run these models, and for training compute to keep making them better and better—we are putting the groundwork in place to be able to significantly expand our ambitions for building out AI infrastructure...If AI stays on the trajectory that we think it will, then amazing things will be possible. Maybe with 10 gigawatts of compute, AI can figure out how to cure cancer. Or with 10 gigawatts of compute, AI can figure out how to provide customized tutoring to every student on earth. If we are limited by compute, we’ll have to choose which one to prioritize; no one wants to make that choice, so let’s go build.”
In other words, there is no stopping this investment. As I wrote in Part 1 yesterday: “And yet, none of them can afford to be cautious. Waiting means falling irrecoverably behind.” The discontinuity is here. The old laws no longer apply. In this new world, compute isn’t just an input to AI.
It’s the moat, and the fight to build that moat is where AI’s future will be determined.
Is this a bubble? That continues to be the wrong question. For some companies, yes, it probably is. For the ultimate winners (assuming there are winners), no; in hindsight, they will look like they were undervalued.
However, the risks of this “Damn the torpedoes, full speed ahead” approach cannot be overstated. My goal, as always, is to step back and understand the larger view through the lens of discontinuity. In that framing, this week only sharpened the outlook for what comes next.
Read our complete Agentic Era series for deeper analysis of AI’s structural transformation with Agentic AI.