TurboQuant and the Memory Stock Sell-Off: Why the Panic Outpaced the Paper
Why this efficiency gain is ultimately bullish for the memory chokepoint and the entire inference economy.

A Google blog post about compressing AI memory went viral. Within 48 hours, the market capitalization of memory semiconductors evaporated by over $100 billion. TurboQuant helps solve a core problem of the Agentic era: making long-context LLM inference efficient. But the algorithm compresses only the inference-time cache, not the model weights, training data, or storage.
The latest sign of the market’s struggle to price the Agentic Era arrived in the form of a Google blog post about AI memory compression. Within 48 hours, ~$100 billion in semiconductor value evaporated, not because the technology changed the economics of the stack, but because investors misidentified where those economics now sit.
The tremor began when Google Research published a blog post on March 24, titled “TurboQuant: Redefining AI Efficiency with Extreme Compression.” The Google post summarizes a series of papers actually published between mid-2024 and April 2025.
Nothing about the science was new. Only the packaging was new.
Google distilled three academic papers into a single, accessible narrative in a blog post and then a tweet: 6× less memory, 8× faster inference, zero accuracy loss. The tweet received close to 19 million views.
The hot-take reactions were predictable: Improved memory efficiency would reduce overall hardware demand. TechCrunch referenced the HBO satire “Silicon Valley,” calling it the “real-life Pied Piper.” Cloudflare CEO Matthew Prince called it “Google’s DeepSeek moment,” noting that there is “so much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization.”
The comparison primed investors for fear. Algorithmic selling began a scorched-earth campaign across the memory infrastructure sector.
Micron has fallen 30% since its March 18 earnings report. SK Hynix dropped 6.2% in a single session. SanDisk, a maker of NAND flash storage with zero connection to inference-time cache compression, shed 18% in five days. NVIDIA, which actively builds quantization tools and whose Blackwell architecture is optimized for exactly the kind of low-precision computation TurboQuant enables, fell 6.6%.
Here is the problem with those hot takes: TurboQuant is significant. But not for the reasons widely assumed.
TurboQuant is an important advance in one specific dimension, one that may address a critical bottleneck. The sell-off, however, conflated a narrow efficiency gain at one layer of the AI stack with a structural reduction in demand across the entire stack.
In that respect, the market reaction to TurboQuant is not just a misreading of a paper. It is a symptom of a deeper analytical gap in how investors are pricing the AI stack in the Agentic Era. The consensus still treats AI infrastructure as a monolithic trade. Everything rises together on bullish narratives; everything falls together on efficiency headlines.
But memory is not compute. The KV cache is not the hard drive. And SanDisk is not SK Hynix.
With the right lens, the TurboQuant tale delivers yet another twist: rather than a value destroyer, it is more likely a demand expander. Cheaper, more efficient inference unlocks far greater scale, concurrency, and adoption of AI systems.
Memory is the chokepoint
To understand why the TurboQuant sell-off was wrong, you first need to understand what memory has become in the AI economy, and why it is not the “picks and shovels” metaphor we keep reaching for.
Picks and shovels are commodity inputs. Abundant, interchangeable, priced at marginal cost. Memory is the opposite of that. Memory is the single most concentrated, supply-constrained, highest-pricing-power layer in the entire AI infrastructure stack.
South Korean companies SK Hynix and Samsung control about 80% of the global HBM (high-bandwidth memory) supply, the specialized memory chips physically stacked onto every AI GPU. Micron, a US company and the third-largest producer, acknowledged in December that it meets only about half of its backlog and warned that the crunch will persist beyond 2026. New fabrication facilities take years to qualify. Goldman Sachs projects a 4.9% DRAM undersupply in 2026. That’s the most severe shortfall in more than fifteen years.
Even if you manufacture enough memory, you cannot assemble it without TSMC’s CoWoS advanced packaging, which physically bonds HBM to the GPU. TSMC has indicated CoWoS capacity is fully utilized and will remain very tight into 2026. NVIDIA alone has secured 60% of total CoWoS output.
Both constraints are physical. Neither responds to software breakthroughs. Neither can be resolved by an algorithm.
This is why memory stocks ran up 200% to 1,200% over the prior year. These three companies represent structural scarcity at the binding constraint of a capex cycle the FT estimates at $4 trillion over five years. The hyperscalers funding that buildout are directing the bulk of their operating cash flow, increasingly supported by debt, to AI infrastructure.
Every dollar of that spending flows through the memory bottleneck at some point.
In any technology system, value concentrates at the layer that constrains throughput. Cloud compute, by contrast, is being built by five hyperscalers, plus CoreWeave, Lambda, sovereign clouds, and enterprise on-premises deployments.
While the cloud layer converges toward utility pricing and, in the worst case, leads to oversupply, memory does not. The scarcity is structural, the barriers are physics-based, and the timeline for new capacity is measured in years, not quarters.
Memory is not picks and shovels. It is closer to the new Magnificent Seven. Except that the concentration is even tighter. Two or three companies, not seven, sit at the single most constrained point in the most capital-intensive buildout in the history of technology.
That is the context the market forgot when the Google blog post went viral.
The compression gap
When a large language model generates text, it maintains two types of memory on the GPU.
The first is the model’s weights. These are the billions of parameters that encode everything it learned during training. These are static. They get loaded once and sit in memory for the duration of the session.
The second is the KV cache. This is the running record of every token the model has processed so far in the current conversation. Think of it as the model’s short-term working memory. Each time it generates the next word, it needs to look back at its representation of all the previous words. This cache grows with every token. It scales linearly with context length and with the number of users being served simultaneously.
For a short ChatGPT query that stretches for only a few thousand tokens, the weights dominate, and the cache is negligible. But context windows now stretch to 128,000 tokens and beyond. At those lengths, the KV cache can consume 80% or more of total GPU memory, dwarfing the weights themselves.
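The arithmetic behind that crossover is simple. Here is a rough sketch using hypothetical but representative numbers for a 70B-class model with grouped-query attention; the layer count, head count, and head dimension are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, n_users, bytes_per_val=2):
    # K and V each store one head_dim-vector per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * n_users * bytes_per_val

weights_gb = 70e9 * 2 / 1e9  # FP16 weights: ~140 GB, fixed per model

# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128,
# 128K-token contexts, 8 concurrent users
cache_gb = kv_cache_bytes(80, 8, 128, 131_072, n_users=8) / 1e9  # ~344 GB

# At long contexts and realistic concurrency, the cache dwarfs the weights
cache_share = cache_gb / (cache_gb + weights_gb)
```

At a few thousand tokens the same formula yields a cache of a gigabyte or two, which is why the weights dominated until context windows exploded.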
TurboQuant only compresses this second category.
The algorithm was developed by Amir Zandieh and Vahab Mirrokni at Google Research. It fuses two prior methods into a unified framework. The first, PolarQuant (to be presented at AISTATS 2026), applies a random orthogonal rotation to each KV vector.
After rotation, each coordinate follows a known statistical distribution, which means you can apply a single precomputed codebook to quantize the entire vector without needing the per-block normalization constants that waste 1-2 bits of overhead in prior methods. The second, QJL (published at AAAI 2025), applies a 1-bit correction using a Johnson-Lindenstrauss projection to eliminate the systematic bias left by the first stage.
The combined result, published as “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” and accepted at ICLR 2026, achieves approximately 3.5 bits per coordinate, provably near-optimal, within a factor of 2.7× of the theoretical information-theoretic limit.
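To make the mechanism concrete, here is a toy NumPy sketch of the rotate-then-quantize idea. It is emphatically not Google's implementation: there is no Johnson-Lindenstrauss bias correction, and the codebook is a naive uniform grid. It only illustrates why a random rotation lets one precomputed codebook serve vectors with wildly different per-coordinate scales:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a random orthogonal rotation
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantize(x, codebook):
    # Snap each coordinate to its nearest codebook value
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]

d = 128
R = random_rotation(d)
v = rng.normal(size=d) * np.linspace(0.1, 3.0, d)  # wildly varying coordinate scales

rotated = R @ v  # after rotation, energy spreads evenly; each coord ~ ||v||/sqrt(d)
scale = np.linalg.norm(v) / np.sqrt(d)
codebook = np.linspace(-4, 4, 16) * scale          # one shared 4-bit codebook

v_hat = R.T @ quantize(rotated, codebook)          # dequantize, rotate back
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Without the rotation, the large-scale coordinates would blow past the codebook's range; with it, a single grid works for every vector, which is the property that removes the per-block normalization overhead.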

The practical results are genuine.
At 3.5 bits per channel, TurboQuant incurs no measurable accuracy degradation compared to full FP16 precision. At 4-bit precision on H100 GPUs, it delivers up to 8× speedup on attention logit computation. Critically, this applies to one component of inference, not end-to-end throughput.
The scope is narrow. But the narrowness is the point.
TurboQuant does not compress model weights. A 70-billion-parameter model requires exactly the same HBM after TurboQuant as before. It does not affect training. It has zero impact on NAND flash storage.
It is not even the most aggressive KV cache method at ICLR this year. NVIDIA’s KVTC achieves 20× compression with less than 1% accuracy penalty using a different approach entirely.
And the headline figure likely overstates the marginal improvement. Most production inference already runs at 8-bit precision, not the FP32 baseline that Google benchmarked; the real-world gain is closer to 2.6×, according to Seoul Economic Daily analysts.
Walk through what this means for the companies that the market punished, and the irrelevance becomes stark.
SanDisk and Western Digital make NAND flash, the persistent storage that holds datasets, model checkpoints, and system logs. The KV cache never touches persistent storage. It exists only in volatile GPU memory and vanishes the moment a session ends. The sell-off confused a GPU's volatile working memory with the hard drive in your laptop, two things as different as a whiteboard and a filing cabinet.
SK Hynix and Micron manufacture HBM, the high-bandwidth memory chips stacked on every GPU. The demand for HBM is driven by model weights, which set the floor for how much memory each GPU requires. A 70-billion-parameter model loads approximately 140 GB of weights into HBM, regardless of the cache's state.
TurboQuant compresses the variable part of GPU memory: the cache that grows with usage. It does not compress the fixed part: the weights that determine how many HBM chips need to be manufactured, packaged, and shipped.
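The fixed floor is simple arithmetic. A minimal sketch, assuming an 80 GB H100-class GPU (the hardware size is an assumption for illustration):

```python
import math

PARAM_COUNT = 70e9    # 70B-parameter model
BYTES_PER_PARAM = 2   # FP16

weights_gb = PARAM_COUNT * BYTES_PER_PARAM / 1e9   # ~140 GB, fixed
hbm_per_gpu_gb = 80                                # H100-class GPU

# Minimum GPUs (and hence HBM stacks) needed just to hold the weights.
# This floor is identical before and after any KV cache compression.
min_gpus = math.ceil(weights_gb / hbm_per_gpu_gb)
```

No cache algorithm changes `weights_gb`, and `weights_gb` is what sets the unit volume of HBM that must be manufactured, packaged, and shipped.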
SK Hynix’s order book is set by model-architecture roadmaps, not by cache-compression algorithms. NVIDIA, meanwhile, is not just unaffected. It benefits. Quantization makes each GPU more productive per dollar, the Blackwell architecture is designed for low-precision compute, and the Groq acquisition underscores the inference pivot that TurboQuant validates.
The sell-off was wrong about who gets hurt. The more interesting question is who benefits, and the answer requires understanding what is happening to context length.
The inference economy gets cheaper and bigger
So, what does TurboQuant change?
To answer that, you need to understand what is happening to context length and why it is the dimension of AI that is scaling fastest.
A year ago, most production LLM workloads operated at 4,000 to 8,000 tokens. At that scale, model weights dominate GPU memory, and the KV cache is a rounding error. TurboQuant would have been a footnote.
What has changed is that the frontier has moved to 128,000 tokens and beyond. Claude Opus 4.6 expanded its context window to 1M tokens, and DeepSeek V4 is expected to match it. The systems being built on top of these models are pushing context consumption dramatically further.
An agent preparing a market analysis does not process a single prompt. It retrieves prior research, injects relevant case history, cross-references multiple documents, and maintains a running chain of reasoning across dozens of tool calls.
Each step inflates the context. A coding agent debugging a complex system pulls in file trees, error traces, documentation, and the history of its previous attempts. A legal agent reviewing a contract loads the full document, comparable precedents, and the client’s negotiation history. These are not hypothetical workloads. They are what Claude, GPT-4, and Gemini process millions of times daily.
The KV cache is the part of memory that TurboQuant compresses. In every case, it is the binding constraint on how many of these sessions a single GPU can serve simultaneously.
This is the intersection that matters. The inference economy is the economic shift I have been tracking, where AI systems capture value directly from the $60 trillion global labour market. And it runs on long-context inference.
Training is the arms race. Inference is the economy.
Inference already accounts for more than 60% of AI workloads and is growing toward 80%.

At GTC 2026, Jensen Huang built his keynote around it. Data centers are factories. Their product is the token. The governing metric has shifted from training FLOPS to tokens produced per watt.
TurboQuant is an inference economy technology. It operates exclusively at the inference layer, provides zero benefit to training, and its entire value proposition is to make long-context inference economically viable at scale.
Before TurboQuant, serving a 70-billion-parameter model to 512 concurrent users with 128K contexts required a multi-node GPU cluster, perhaps $50,000-100,000 per month. The KV cache alone consumed most of the available memory, limiting concurrency to a handful of sessions per GPU.
After TurboQuant, that same workload might fit on two H100s. The cache shrinks. The number of concurrent sessions per GPU expands. The cost per session collapses.
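A stylized sketch of that concurrency math, using the headline 6× compression ratio and assumed hardware sizes (an 8×80 GB node standing in for the cluster, ~40 GB of FP16 cache per 128K-token session):

```python
def max_sessions(hbm_per_gpu_gb, n_gpus, weights_gb, cache_per_session_gb):
    # Weights are sharded across the node; whatever HBM remains holds KV cache
    free_gb = hbm_per_gpu_gb * n_gpus - weights_gb
    return int(free_gb / cache_per_session_gb)

# 70B FP16 model (~140 GB of weights) on an 8-GPU, 80 GB-per-GPU node
before = max_sessions(80, 8, 140, 40.0)       # cache at full FP16 precision
after = max_sessions(80, 8, 140, 40.0 / 6)    # cache compressed ~6x
```

The weight footprint never moves; only the cache term shrinks, so sessions per node jump roughly sixfold and cost per session collapses accordingly.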
This does not reduce the size of the inference economy. It reduces the unit cost of participation. And participation is exploding. Agentic workloads multiply token consumption by 10× to 100× compared to traditional chat. Inference costs dropped nearly 1,000× over the past three years.
As the unit price of intelligence falls, firms redesign architectures to consume more compute, not less. Longer contexts. Deeper reasoning chains. More agents running in parallel. TurboQuant makes each of those sessions cheaper. It does not make fewer sessions get served.
Where value migrates when inference gets cheap
There is a deeper consequence to TurboQuant.
Consider what happens as agentic workloads grow more sophisticated. Today’s agents do not start each task from scratch. The most advanced ones learn from experience. A coding agent that has debugged a thousand React async errors does not approach the thousand-and-first the way it approached the first. It retrieves its record of what worked, such as which stack traces led to which root causes, which patches held and which introduced regressions, and which edge cases recurred. It then injects all of that prior experience into the context window before it writes a single line of new code.
This is the architecture I analyzed last summer in my piece on Memento, a research framework from UCL and Huawei that demonstrated something striking: agents can achieve state-of-the-art performance not by making the underlying model bigger or smarter, but by giving it access to a bank of its own past experiences at runtime.
The model’s weights stay the same. What changes is what gets fed into the context window: a curated library of prior successes and failures, retrieved by relevance and injected alongside the current task.
The investment implication flows from there:
These memory-augmented agents are the most context-hungry workloads in the AI economy.
Every experience retrieved is more tokens in the context window.
Every token in the context window is more KV cache on the GPU.
A standard ChatGPT query might consume a few thousand tokens. A memory-augmented legal agent reviewing a contract will load the document, comparable precedents, the client’s negotiation history, and its own record of what clauses triggered disputes in past reviews. This can consume hundreds of thousands.
The KV cache is what makes these workloads expensive.
TurboQuant compresses the KV cache.
This is where TurboQuant’s real significance lies. Not in reducing demand for memory chips, but in unlocking an entirely new tier of AI adoption. It reshapes where competitive advantage accumulates.
I have argued that we are witnessing the emergence of a new asset class that I call “memory capital.” This is the accumulated record of what worked and what failed across millions of agent interactions.
When inference was expensive, only the companies with the largest compute budgets could afford to run the workloads that generate this data. When inference gets cheap, the bottleneck shifts. The scarce resource is no longer silicon. It is the quality and depth of the execution data you feed into those longer, cheaper context windows.
TurboQuant compressed a cache. But more importantly, it also widened the aperture through which memory capital flows. The companies that compound execution data at the new, lower price point will build the moats of the next decade.
The context mispricing
The market reacted to TurboQuant as if it were a demand shock. In reality, it is a supply-side efficiency gain at a single, narrow layer of the inference stack.
If anything, TurboQuant tightens the bottleneck. When each GPU can serve more users, more organizations deploy AI, more applications get built, and total demand for GPUs increases as each requires HBM and CoWoS packaging.
The bottleneck gets more valuable, not less.
The real question that the market should be pricing is not whether inference memory can be compressed. It is whether the order book holds.
The bottleneck has value only because demand exceeds supply. Demand, in this market, is purchase orders from five hyperscalers spending $660 to $690 billion in 2026. If a major lab concludes its models are good enough and cuts procurement, the bottleneck does not gradually loosen. It flips.
Through 2027, and maybe even 2028, the dual physical constraints of HBM manufacturing concentration and CoWoS packaging scarcity hold. Beyond that, it depends on whether inference revenue materializes at scale.
That is the real risk. Not an algorithm that compresses scratch paper on a GPU.
TurboQuant did not shrink the AI economy. It lowered the price of admission. Training is the arms race. Inference is the economy. TurboQuant may make the economy cheaper to run.


