The Cerebras IPO Test, Part II: The Architectural Math Behind a $17-20 Billion Fair Value
Cerebras bet the future of AI on ultra-fast wafer-scale inference. But the agentic economy emerging in 2026 is heterogeneous, CPU-centric, and increasingly optimized for cheaper FP8 compute.

A follow-up to last week’s piece, now that the deal is live. A deep dive into whether the wafer-scale chip is built for the inference economy that has actually arrived.
Cerebras’s wafer-scale architecture was designed for monolithic, high-precision, single-model inference at extreme speed. The Inference Economy that has materialized in 2026 is heterogeneous, orchestrated, multi-model, context-heavy, and CPU-centric. This creates a structural architectural mismatch. Three misalignments limit Cerebras’s durable moat. At ~$66 billion, the stock trades at 3-4× our fair-value range of $17-20 billion. The Inference Economy is real. Whether Cerebras’s architecture is positioned to capture its economic surplus remains an open question.
Cerebras Systems went public in the hottest AI-infrastructure market of the modern era. The IPO priced at $185 per share, above an already twice-raised range, on a book more than twenty times oversubscribed. The stock opened at $350, touched $386 intraday, and closed its first session at $311, valuing the company at roughly $66 billion on a non-fully-diluted basis.
Last week, with the deal still pricing and reports that the range had already been raised twice, we published a report for institutional clients arguing that the market was mispricing Cerebras at a pre-IPO value reaching $49 billion.
Our disagreement is not with the existence of the Inference Economy. We have been bullish on that for nearly two years. Our disagreement is with the architecture that the market believes will dominate it.
The bull case assumes that as inference workloads explode, value will accrue to monolithic ultra-fast compute architectures capable of serving frontier models at extreme speed. The production evidence emerging in 2026 increasingly points in another direction: heterogeneous, orchestrated, multi-model systems in which routing, coordination, retrieval, memory hierarchy, and CPU-bound tool execution matter more than maximizing the throughput of a single forward pass.
That distinction between efficiency and model power is the entire debate.
Cerebras is not a bad company. The engineering achievement is real, the customer book is prestigious, and the revenue growth is genuine. But the Inference Economy that has emerged in 2026 is not the Inference Economy for which its wafer-scale architecture was built.
Now that the deal has been priced and its stock is trading, the institutional analysis we circulated to clients is no longer market-sensitive. This article lays out the architectural argument for our $17–20 billion fair value framework relative to the market's current pricing, which is closer to four times that number.
To see the basis of our structural analysis, we need to walk through three steps: first, understand the market conditions this IPO landed in; second, understand what the inference economy looks like in practice; and third, assess whether Cerebras is aligned with the regime that is taking shape.
Why the chip layer carries the market now
A handful of names matter more than ever to the S&P 500's earnings power. The index is up roughly 8% year-to-date, and technology - semiconductors above all - has driven more than half of that gain. Sandisk has multiplied several times over, Micron has more than doubled, and the chip-stock share of the S&P 500 has reached 18% - more than double its dot-com-era peak and a level not seen in over two decades. When the chip gauge fell sharply on Friday, the broader index fell 1.2% in sympathy: on a roughly 4% semiconductor drawdown, that is a one-day sensitivity of about 30%, which says plainly that a meaningful portion of US equity-market direction now rides on a small number of AI-infrastructure-exposed names.
The reason is straightforward. Amazon, Microsoft, Alphabet, and Meta are projected to spend nearly $700 billion on Capex this year, with some estimates putting cumulative big-tech AI Capex at $5 trillion over the next five years. That spending flows into the chip layer of the AI stack and into the earnings of the companies that supply it. Semiconductor-related S&P 500 companies are tracking an 84% earnings jump in Q1; Micron is on track for 670% earnings growth this year.
None of this is necessarily a bubble. The fundamentals are real. The demand is durable. The spending is committed.
However, the risk of this concentration is just as real. When the markets inevitably turn, names sitting at the intersection of cyclical chip risk, single-customer counterparty risk, and single-architecture obsolescence risk compress faster than the broad index. Cerebras sits at exactly that intersection. Friday’s post-IPO pullback already shows how quickly scarcity premiums can rotate. The discipline for a public-market investor is to separate demand reality from supply identity. The inference economy is creating durable infrastructure demand; the question is whose silicon and which architecture captures the durable surplus inside it.
The “Inference Economy”
The buildout is paying for what I defined nine months ago as the “Inference Economy.” This is a regime in which workflows consume compute as their primary variable input, in which software agents capture the labor budget rather than the IT budget, and in which the infrastructure underneath earns continuous, growing revenue as spending shifts from per-seat software to per-outcome inference.
In that context, the projection that inference will surpass training as the dominant AI workload by 2030 is likely directionally correct.

But the Inference Economy has its specific characteristics. The central claim of the AGNT Manifesto that I recently released was that orchestration is becoming economically larger than intelligence, and that the value in the AI stack would migrate from the raw smartness of the model to the operational layer that strings models, tools, retrieval, and policy together into reliable work.
That claim now has operational evidence to back it up.
Datadog’s State of AI Engineering 2026, based on telemetry from more than 1,000 production customers, offers a high-fidelity picture of the inference economy.
Two findings from that report anchor what follows.
First, context windows are exploding. The average number of tokens per request more than doubled at the median organization year-over-year, and quadrupled at the 90th percentile, as production teams stuff more conversation history, more retrieved documents, more tool outputs, and more policy guardrails into every model call. The model is being asked to reason over a working set growing faster than any silicon roadmap can comfortably absorb.
Second, agents are still largely monolithic at the service-call level: 59% of agentic application requests make only one service call, and only 18% make three or more. The report explicitly notes that teams “know that monoliths don’t scale well and are looking to change,” but the deployed reality today remains predominantly single-agent execution.
That transition matters.
Around those two observations, the report describes an inference stack that is increasingly multi-model, heavily scaffolded, context-saturated, and coordination-driven. More than 70% of organizations now run three or more models, building portfolios that route each workload to the optimal latency, cost, and risk profile. Adoption of agent frameworks (LangGraph, LangChain, Vercel AI SDK, Pydantic AI) has nearly doubled over the past 12 months, from 9% to 18%. 69% of all input tokens are system prompts, including tool guidance, policy definitions, and internal instructions, rather than user inputs. And the binding production failure mode is now capacity, not model intelligence: 60% of LLM call errors in February came from exceeded rate limits, with 8.4 million such errors in March alone.
The agentic workload of 2026 is therefore not simply “more inference.” It is orchestrated inference: multi-model, tool-driven, context-heavy, capacity-constrained, and increasingly heterogeneous.
This is the environment in which Cerebras has gone public.
And Datadog’s first-quarter print is the clearest demand-side signal yet that the Inference Economy is just getting started: 32% revenue growth, and roughly 4,550 customers above $100,000 of recurring revenue, up from 3,770 a year earlier - with about 20% of customers now using AI integrations.
But the existence of the Inference Economy does not automatically translate into a bull case for every inference architecture.
That distinction is the core of the Cerebras debate.
Is Cerebras fit for the Inference Economy? A framework assessment
When assessing the durable value of an AI infrastructure company, we evaluate it through an architectural-resilience framework designed to test how a system performs as the surrounding technological and economic regime evolves over time.
The dimension here is inference-economy fitness: is the architecture aligned to the inference economy that is actually emerging, or is it betting on a regime that is not arriving?
That ordering matters because, in AI infrastructure, architectural alignment precedes financial fundamentals. The architecture determines whether the financial fundamentals are durable in our conception of what constitutes a “durable growth” moat.
For Cerebras, the answer breaks down into three core arguments.
Argument one: the industry is migrating to lower-precision inference while software closes part of the remaining gap
Modern AI inference increasingly operates at lower numerical precision. The industry has progressively migrated from FP32 to FP16 and now toward FP8 because the efficiency gains are enormous while the accuracy penalties for inference workloads are often acceptable. FP8 effectively halves the memory bandwidth required per token relative to FP16.
This matters for Cerebras because the wafer-scale moat is fundamentally a memory-bandwidth moat.
The wafer-scale architecture keeps 44 gigabytes of ultra-fast on-chip memory directly adjacent to compute, eliminating the inter-chip communication overhead that constrains GPU clusters. That moat is most valuable when memory bandwidth is the binding constraint and the model is run at high precision.
FP8 partially relaxes that bottleneck.
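The bandwidth argument can be made concrete with a minimal sketch. It assumes the simplest possible decode model: at batch size 1, every generated token requires streaming all model weights from memory once, so the token rate is capped by bandwidth divided by weight bytes. The 70B parameter count and the 8 TB/s bandwidth figure are hypothetical inputs chosen for illustration, not measurements of any vendor's system.

```python
# Bandwidth-bound decode sketch (illustrative assumptions only).
# At batch size 1, producing one token means reading every weight once,
# so memory bandwidth sets a ceiling on tokens per second.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8": 1}


def weight_bytes_per_token(n_params: float, precision: str) -> float:
    """Bytes of weights read to produce one token at batch size 1."""
    return n_params * BYTES_PER_PARAM[precision]


def decode_ceiling_tokens_per_s(n_params: float, precision: str,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / weight_bytes_per_token(n_params, precision)


if __name__ == "__main__":
    N_PARAMS = 70e9    # hypothetical 70B-parameter dense model
    BANDWIDTH = 8e12   # hypothetical 8 TB/s memory system
    for precision in ("FP16", "FP8"):
        ceiling = decode_ceiling_tokens_per_s(N_PARAMS, precision, BANDWIDTH)
        print(f"{precision}: {ceiling:.1f} tokens/s ceiling")
```

Under these assumptions the ceiling moves from about 57.1 tokens per second at FP16 to about 114.3 at FP8 on the same memory system. The halving of bytes per parameter does the work, which is why a moat built on bandwidth is worth less as precision drops.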
The data is increasingly difficult to ignore. A March 2025 academic study by Kundu et al. that normalizes CS-3 and DGX B200 systems for the same physical space, power consumption, and cost found that Nvidia’s B200 already delivers 3.07× more FP8 throughput per watt per dollar than Cerebras’s WSE-3. At FP16, the gap narrows with Nvidia 1.54× ahead, but still on the wrong side of the moat.
Importantly, wafer-scale wins on raw FP16 throughput per rack by 2.3× if you ignore cost, but a CS-3 system costs roughly $2.5 million against roughly $0.5 million for a DGX B200 - a ~5× sticker gap. Once you normalize for performance per watt per dollar, the wafer works out roughly 3.3× more expensive than the GPU stack to deliver its ~2.3× throughput advantage. The premium consumes the efficiency.
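A quick sketch of the sticker-price part of that normalization, using only the prices and the throughput ratio quoted above. It deliberately leaves out power: the 3.3× figure in the text also normalizes per watt, and the article gives no wattage inputs, so this sketch does not try to reproduce it. The function name is illustrative.

```python
# Price-only normalization sketch, using the figures quoted in the text.
# Power draw is NOT modeled here; the per-watt step is out of scope.

def cost_per_unit_throughput_ratio(price_a: float, price_b: float,
                                   throughput_ratio_a_over_b: float) -> float:
    """How many times more system A costs than system B per unit of throughput."""
    price_ratio = price_a / price_b
    return price_ratio / throughput_ratio_a_over_b


CS3_PRICE = 2.5e6             # dollars, per the text
DGX_B200_PRICE = 0.5e6        # dollars, per the text
FP16_THROUGHPUT_RATIO = 2.3   # CS-3 over DGX B200, per the text

ratio = cost_per_unit_throughput_ratio(CS3_PRICE, DGX_B200_PRICE,
                                       FP16_THROUGHPUT_RATIO)
print(f"Price per unit of FP16 throughput, CS-3 vs DGX B200: {ratio:.2f}x")
```

On price alone the wafer costs a little over twice as much per unit of FP16 throughput (5 ÷ 2.3 ≈ 2.17). The per-watt adjustment would have to account for the rest of the gap to the 3.3× quoted above.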
At the same time, the industry is migrating to the precision tier where Cerebras is most exposed. Nvidia’s Vera Rubin, announced at GTC 2026, was positioned explicitly as the FP8 generation, claiming 5× inference performance over Blackwell. Groq 3 LPX, now inside CUDA following Nvidia’s $20 billion December acquisition of Groq’s leadership, is FP8-native. SambaNova SN50 delivers a claimed 895 tokens per second per user on Llama 3.3 70B at FP8. That’s roughly five times the speed of a B200 at one-third of the total cost of ownership.

A critical second dynamic compounds the pressure: architectural disadvantages can also be closed in software, not only in silicon.
Cursor is one of the world’s largest consumers of agentic AI through its Composer coding platform. In April, the company published warp-decode, a CUDA kernel optimization that delivered a 1.84× throughput improvement on small-batch decoding on the B200. Cursor wrote that kernel because it had hit the exact decode bottleneck that specialty silicon was supposed to solve. The key signal here was that one of the world’s largest AI-inference buyers chose to address the bottleneck within CUDA rather than migrate away from it. It is another datapoint in the same story.
This is the structural issue for Cerebras. The bear case is not that specialized inference disappears. The opposite is true. Specialized inference is real, growing, and economically important.
The question is narrower: how large the premium ultra-fast inference tier ultimately becomes relative to the broader inference economy, and whether the economic surplus inside that tier remains durable once the rest of the ecosystem adapts around it.
The cost-economics moat for Cerebras is being eroded from both sides at once. The silicon is moving to the precision tier where Cerebras is worst-positioned, and the software ecosystem is closing the residual gap with kernels. The wafer-scale architecture priced into a $66 billion market cap is being out-competed on the units customers buy.
This is where the strongest bull case emerges.
SemiAnalysis argued in Cerebras — Faster Tokens Please on the day Cerebras priced that the unit customers buy is itself shifting. Inference is no longer priced solely on $/million tokens; model providers are tiering by speed.
That argument deserves to be taken seriously.
But the same analysis supplies the hedges that make it a narrow niche rather than a $66 billion thesis. The speed-premium SKU works only for models small enough to fit on the wafer: GPT-5.3-Codex-Spark is a distilled 120B-parameter variant of OpenAI’s real Codex model, more than ten times smaller. SemiAnalysis concludes flatly that its tokens “likely aren’t worth $10 billion today.” Algorithmic compression is moving small-model intelligence forward fast enough to commoditize the niche from within: SemiAnalysis estimates GPT-5.5-level intelligence is less than a year away in a 120B form factor.
The implication is not that Cerebras lacks a real niche. It clearly has one. The speed-premium is a real segment of the inference economy.
The question is whether that niche scales into a durable architecture-wide economic moat large enough to justify a valuation measured in tens of billions of dollars. On present evidence, I do not think it justifies a wafer-scale market cap of $66 billion.
Argument two: the workload is heterogeneous and CPU-centric
The second argument is structural. It asks not whether wafer-scale is too expensive, but whether it is solving the right problem.
An agentic workload is not a single forward pass through a language model. It is a loop: the model reasons, calls a tool (a database query, a web search, a code execution), processes the result, reasons again, and calls another tool, until the task completes.
The Datadog data above tells us this is now production-standard. The question for any silicon vendor is: where in that loop is the binding latency? The conventional answer that justifies specialty AI silicon is that the LLM forward pass dominates.
That answer is increasingly wrong. The evidence suggests the bottleneck is migrating outward into orchestration and coordination layers surrounding the model.
A recent paper by Raj, Kundu, Vohra, Wang, and Krishna at Georgia Tech and Intel, Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective, characterized five representative agentic workloads (Toolformer, SWE-Agent, RAG, ChemCrow, and a LangChain web-augmented agent) operating across heterogeneous CPU-GPU systems.

The authors found that tool processing on CPUs (as shown in the figure above) can account for a significant share of end-to-end (E2E) latency, motivating a CPU-centric optimization strategy. They also argue that a system with a high-performance CPU paired with a low-performance GPU can match a system with a high-performance GPU on tool-dominated work.
Per the paper’s conclusion: “Agentic AI shifts the system bottleneck from monolithic LLM inference toward CPU-resident tool execution and orchestration.” As GPUs continue to get faster, the bottleneck shifts further toward the CPU, not away from it.
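The shift described here is Amdahl's law applied to an agent loop: if only the LLM share of end-to-end latency is accelerated, the untouched CPU share sets a hard ceiling on the overall gain. The sketch below uses a hypothetical 50/50 split between LLM time and CPU-side tool time. That split is an assumption for illustration, not a figure from the paper.

```python
# Amdahl's-law sketch for an agentic request (illustrative fractions only).

def end_to_end_speedup(llm_fraction: float, llm_speedup: float) -> float:
    """Overall speedup when only the LLM share of latency gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    return 1.0 / ((1.0 - llm_fraction) + llm_fraction / llm_speedup)


def cpu_share_after(llm_fraction: float, llm_speedup: float) -> float:
    """Share of the new, shorter latency spent in the unaccelerated CPU work."""
    cpu_time = 1.0 - llm_fraction
    llm_time = llm_fraction / llm_speedup
    return cpu_time / (cpu_time + llm_time)


if __name__ == "__main__":
    LLM_FRACTION = 0.5  # hypothetical: half of latency is the forward pass
    for speedup in (2, 10, 100):
        total = end_to_end_speedup(LLM_FRACTION, speedup)
        cpu = cpu_share_after(LLM_FRACTION, speedup)
        print(f"LLM {speedup:>3}x faster -> request {total:.2f}x faster, "
              f"CPU share of latency {cpu:.0%}")
```

With that split, a 10× faster forward pass makes the whole request only about 1.8× faster and leaves roughly 91% of the remaining latency on the CPU side. No accelerator speedup can push the request past 2×.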
This is the architectural inversion confronting wafer-scale systems.
Wafer-scale computing is optimized for the fastest possible execution of a single model on a single contiguous piece of silicon. That was the right answer to the 2023 question, when frontier LLM inference was the workload. It is the wrong answer to the 2026 question, because the inference economy that has emerged spends most of its time outside the LLM forward pass, in the CPU-bound coordination layer where wafer-scale offers nothing.
Two commercial architectures are converging on the same conclusion in real time. In April, Intel and SambaNova jointly announced a heterogeneous blueprint using GPUs for the compute-heavy prefill phase, RDUs for the memory-bandwidth-bound decode phase, and Xeon 6 CPUs for orchestration. Inside CUDA, Vera Rubin paired with Groq 3 LPX delivers the equivalent disaggregation. Andreessen Horowitz’s infrastructure team flagged the same trajectory in Big Ideas 2026, arguing that routing, locking, state management, and coordination are replacing raw forward-pass throughput as the binding constraint of the agentic era.
The honest bull-case complication is the Datadog finding above (Fact 7): 59% of agentic application requests still make only one service call. The bull case rests on this not moving.
But even a single-service-call agent runs a heterogeneous workload internally: an LLM call, tool processing, retrieval, and context preparation. And the Raj et al. paper precisely measures CPU dominance within that intra-call profile.
This is the deeper tension for Cerebras. The wafer-scale architecture is monolithic at the silicon level, unlike anything in the deployed production stack, even at its most monolithic. And the Datadog report itself flags that monolithic-agent design is what teams are explicitly “looking to change.”
Argument three: the memory bet
Cerebras’s third architectural exposure is to the memory regime its chip is “locked” into.
In an AI accelerator, the chip and its memory have to talk constantly. The faster the model wants to think, the faster that conversation needs to be. There are two practical options. HBM (high-bandwidth memory) sits next to the chip in a stack. It is large and reasonably fast, enough to hold most modern models with room to spare. Nvidia, AMD, and every hyperscaler ASIC program are built on HBM. SRAM sits directly on the chip itself and is even faster than HBM, but takes up so much physical area that you can never fit very much of it on a chip, unless you make the chip enormous.
That is what Cerebras did. It built a chip the size of an entire wafer, large enough to hold 44 gigabytes of SRAM on-chip. That’s far more than any GPU, and fast enough that the entire chip’s memory can be read at 21 petabytes per second.
The bet behind this design was bold and intuitive: if you can fit the whole model in the fastest memory, you never need HBM. No off-chip data movement. No bottleneck.

For the 2023 generation of frontier models, that was a powerful design advantage. It is becoming the wrong bet for three reasons that have all hardened over the past year.
First, models have outgrown the on-chip ceiling. Trillion-parameter dense models, mixture-of-experts architectures with rising active-parameter counts, and long-context attention caches that grow linearly with sequence length all push the working set well beyond 44 gigabytes. The Datadog finding above, with tokens per request quadrupling at the 90th percentile, is exactly this trend observed in production.
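To see how quickly a long-context attention cache alone approaches the 44 gigabyte figure, here is a rough sizing sketch. The model shape (80 layers, 8 key-value heads, head dimension 128, FP16 cache) is an assumed configuration resembling a 70B-class model with grouped-query attention. It is a stand-in for illustration, not a description of any specific deployment, and real systems may quantize the cache or shard it across devices.

```python
# KV-cache sizing sketch: the cache grows linearly with sequence length.
# Model shape below is an ASSUMED 70B-class configuration, for illustration.

ON_CHIP_SRAM_BYTES = 44e9  # 44 GB on-chip figure quoted in the text


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


if __name__ == "__main__":
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)
    for seq_len in (8_192, 32_768, 131_072):
        size = kv_cache_bytes(seq_len, **shape)
        share = size / ON_CHIP_SRAM_BYTES
        print(f"{seq_len:>7} tokens: {size / 1e9:5.1f} GB "
              f"({share:.0%} of 44 GB)")
```

Under these assumptions a single 131,072-token sequence needs about 42.9 GB of cache before any weights are stored, against roughly 2.7 GB at 8,192 tokens. That is the linear growth the paragraph above describes.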
Second, on-chip SRAM has effectively stopped getting denser at advanced process nodes; Cerebras cannot scale SRAM capacity by waiting for the next manufacturing generation.
Third, the rest of the industry is pouring investment into the alternative.
This is where Andreessen Horowitz’s Charts of the Week: Memory to the Moon is worth reading carefully. DRAM contract prices have more than tripled year-on-year as of March. The three leading memory makers - Samsung, SK Hynix, Micron - are expected to sextuple their operating income in 2026, with Micron’s quarterly earnings now exceeding any full pre-2025 year of earnings. Critically, hyperscalers are now signing five-year HBM supply agreements, up from the industry standard of one year, locking HBM in as a multi-year strategic asset. The Capex of the AI buildout is flowing through the HBM stack, and the architectural premium wafer-scale was supposed to extract is migrating instead to the memory triopoly hyperscalers are paying years in advance to secure.
The architectural problem for Cerebras is therefore double. The SRAM-only stance does insulate the company from the HBM supply crunch in the near term, creating a real tailwind. But it also excludes Cerebras from the memory supercycle profit pool and locks the architecture out of the HBM-plus-CXL-plus-flash roadmap the industry is investing trillions to extend. By 2028–2029, the very memory wall that justifies wafer-scale today will have been engineered down to a manageable constraint, while Cerebras’s architecture remains locked.
A brief note on the edge
A fourth, smaller pressure compounds the first three: the share of inference demand that can be served on consumer-grade hardware has risen from 23% to 71% of single-turn queries in two years, per the Stanford and Together AI Intelligence Per Watt study. Heavyweight agentic decode will not run on a Snapdragon. But the long tail is migrating away from any kind of frontier infrastructure, and Cerebras has no edge exposure. Just yesterday, AMD unveiled, per AMD CEO Lisa Su, the world’s smallest AI development PC capable of running 200B-parameter models locally.
One risk among others
The architectural argument above is the one most directly tied to the evolution of the inference economy itself, and I believe it is the one most under-priced by the post-IPO debate. The conversation has been dominated by the more visible counterparty story.
That counterparty story is real and serious.
86% of FY2025 revenue came from the UAE state-backed AI ecosystem of G42 and MBZUAI, related parties under accounting rules. The $20-billion-plus Master Relationship Agreement with OpenAI - the centerpiece of the IPO narrative - covers 750 megawatts through 2028 but contains exclusivity provisions that restrict Cerebras from supporting OpenAI’s named competitors. The very labs that would be the natural diversification path - Anthropic, Google DeepMind, Meta AI, xAI - are contractually closed. The contract that fixes one concentration problem creates another. While this is common practice and hence not surprising, it nonetheless ties a large part of the Company’s fate to that of OpenAI and was, without a doubt, a key factor in allowing the IPO to proceed.
We covered this dimension in detail in our pre-IPO piece and at greater institutional depth in our S-1 teardown.
Other risks compound: cost normalization could close or widen faster than the base case assumes; TSMC capacity could become binding as hyperscaler ASIC programs compete for the same wafers; the exclusivity provisions could be triggered; OpenAI’s own counterparty fragility, as seen in Broadcom’s Project Nexus financing arrangements, could ripple into the MRA. Our framework maps each of these as a separate failure surface and scores them individually.
What I want readers to take from this piece is that the architectural surface is not just one risk among many. It is a systemic risk, in the sense that it determines whether the financial fundamentals are durable or transient, regardless of how the other surfaces resolve. If the inference economy continues to evolve toward heterogeneous, multi-model, scaffold-heavy, HBM-dependent execution, then the architectural fitness of the wafer-scale design for that regime is the question on which the long-term value of the equity ultimately rests.
We treat this architectural fracture as a single point of failure, not because the wafer-scale chip will stop working, but because the inference economy in which it must compete is taking a shape the chip was not designed for. The architecture is being out-competed on cost-normalized economics at the unit customers actually buy, out-positioned by the heterogeneous CPU-plus-GPU architectures that match the actual shape of agentic workloads, and out-invested by the HBM supercycle the rest of the industry is five-year-contracting to extend.
That diagnosis sets the fair value. Anchored to the AI-compute-hardware peer set of Nvidia, AMD, Marvell, and Broadcom at a forward enterprise-value-to-revenue multiple of around 10.5×, with a premium applied for the inference-specialization narrative and then a discount applied for the architectural exposures above, our framework produces a fair value of $17-20 billion on base-case FY2027 revenue of $2.3 billion (yet to be proven).
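The arithmetic behind that range can be backed out from the three numbers given. The text does not break out the size of the premium or the discount, so the sketch below recovers only their net effect. It also treats the fair-value range as directly comparable to an enterprise-value multiple, which is a simplification.

```python
# Back-of-envelope check on the fair-value range, using figures from the text.
# Only the NET adjustment (premium and discount combined) can be inferred.

PEER_EV_TO_REVENUE = 10.5         # forward peer multiple, per the text
BASE_CASE_REVENUE = 2.3e9         # FY2027 base case, dollars
FAIR_VALUE_RANGE = (17e9, 20e9)   # dollars


def implied_multiple(value: float, revenue: float) -> float:
    """Value-to-revenue multiple implied by a valuation."""
    return value / revenue


peer_anchor = PEER_EV_TO_REVENUE * BASE_CASE_REVENUE
print(f"Peer-multiple anchor: ${peer_anchor / 1e9:.2f}B")

for value in FAIR_VALUE_RANGE:
    multiple = implied_multiple(value, BASE_CASE_REVENUE)
    net_adjustment = value / peer_anchor - 1.0
    print(f"${value / 1e9:.0f}B -> {multiple:.2f}x revenue, "
          f"net adjustment vs peers {net_adjustment:+.1%}")
```

The unadjusted peer multiple would give about $24.2 billion. The $17-20 billion range corresponds to roughly 7.4× to 8.7× base-case revenue, a net haircut of about 17% to 30% against the peer anchor.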
The base case assumes the existing customer book grows roughly in line with the cloud-and-services trajectory; that 75% of near-term Remaining Performance Obligations are recognized on a back-loaded ramp; and that the AWS Bedrock arrangement does not produce quantifiable revenue within the forecast horizon, given the absence of definitive agreements.
Against Friday’s close at ~$280, the market is pricing Cerebras at roughly $60 billion, or roughly $77-86 billion on a measure that includes restricted shares and warrants. That is three to four times the value that the architecture and the customer book justify, in our view. The gap will resolve in one direction or the other over the next six to eighteen months as quarterly prints arrive and the agentic workload matures.
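For readers who want to check the size of that gap, a short sketch using only the figures in this section. The helper name is illustrative. The low end of each band divides the smallest market value by the top of the fair-value range, and the high end divides the largest market value by the bottom of it.

```python
# Market value vs fair-value range, using figures quoted in the text.

FAIR_VALUE_LOW, FAIR_VALUE_HIGH = 17e9, 20e9


def premium_band(market_low: float, market_high: float) -> tuple[float, float]:
    """Smallest and largest ratio of market value to fair value."""
    return market_low / FAIR_VALUE_HIGH, market_high / FAIR_VALUE_LOW


basic = premium_band(60e9, 60e9)      # ~$60B on the basic share count
broader = premium_band(77e9, 86e9)    # incl. restricted shares and warrants

print(f"Basic measure:   {basic[0]:.2f}x to {basic[1]:.2f}x fair value")
print(f"Broader measure: {broader[0]:.2f}x to {broader[1]:.2f}x fair value")
```

On the roughly $60 billion figure the stock sits at about 3.0× to 3.5× the fair-value range. On the broader $77-86 billion measure the band widens to roughly 3.9× to 5.1×.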
We are not bearish on the inference economy. We are bullish on it.
But the question of which silicon, in which architecture, captures the durable surplus inside the regime we believe in is a structural one when assessing the Company’s durable value. The answer is not obvious and requires nuance, given the stakes.
The views and opinions expressed in this Website are those of the author alone and are based on publicly available information. The expressed views and opinions do not constitute investment advice, a solicitation, or a recommendation to buy or sell any security or financial instrument.
The author may hold positions in the securities of companies mentioned. Certain companies referenced may be current or former clients of, or counterparties to, the author or affiliated entities; such relationships will be disclosed where applicable.
Past performance is not indicative of future results. To the fullest extent permitted by applicable law, the author does not accept any liability for any loss or damage arising from reliance on this content. Readers should conduct their own independent due diligence and consult a qualified financial advisor before making any investment decision.


