Thinking Machines and the Second Wave: Why $2B Says Everything About AI's Future
Will Mira Murati's Thinking Machines eclipse OpenAI and spark the next wave of AI reasoning innovation?
Thinking Machines Lab has closed the largest seed round in history—$2 billion at a $12 billion valuation for a six-month-old company with no product, no revenue, and no public demonstrations.
Led by former OpenAI CTO Mira Murati and co-founded by leading researchers, including Lilian Weng, this funding round dwarfs previous records. Even the $1 billion seed round raised last year by Safe Superintelligence, the startup led by OpenAI co-founder Ilya Sutskever, appears modest in comparison.
This extraordinary investment from Andreessen Horowitz and other tier-1 investors signals a fundamental shift in how the market views AI development. When institutional capital commits $2 billion based solely on team credentials and technical vision, that vision becomes a roadmap for the industry's future direction.
The funding round matters because it represents the first major bet on what I have characterized as the new frontier of AI development: moving beyond pure capability scaling toward orchestration, human-AI collaboration, and real-world value creation. Thinking Machines embodies this transition while simultaneously challenging the prevailing narrative that AI capabilities are becoming commoditized.
What Thinking Machines Is Building: The System 2 AI Vision
Thinking Machines' approach represents a fundamental departure from current AI architectures, moving from what psychologist Daniel Kahneman termed "System 1" thinking toward "System 2" reasoning. This distinction proves crucial for understanding their technical strategy and competitive positioning.
Current large language models operate primarily through System 1 processing—generating responses quickly and automatically through pattern matching learned during training. These systems excel at tasks requiring rapid pattern recognition but struggle with complex reasoning that demands careful analysis, error correction, and multi-step problem solving.
Thinking Machines is building toward System 2 AI: systems that can engage in deliberate, logical reasoning and adapt their computational approach based on problem complexity. As co-founder Lilian Weng's recent research demonstrates, this involves enabling models to "think for longer" through test-time compute, dynamically allocating computational resources during inference rather than being constrained by fixed forward passes.
The simulation of System 2 cognition in AI is not an abstract concept but is enabled by a set of specific, engineered techniques that fundamentally alter how models process information. These methods trade speed and efficiency for depth and accuracy. Some of these techniques, made concrete in the code sketch that follows this list, include:
Chain-of-Thought (CoT) Reasoning: This is a prompt engineering technique that guides a model to break down a complex problem into a series of intermediate, logical steps, mimicking a human's process of reasoning through a problem. Advanced reasoning models have internalized this process. They are trained via reinforcement learning to generate a long internal "chain of thought" using special "reasoning tokens" before producing a final output. This makes their reasoning process more transparent and less prone to simple errors on multi-step tasks.
Test-Time Compute (TTC): This refers to the amount of computational power used by a model during inference—the phase where it generates a response—as opposed to the compute used during its initial training. Reasoning models are a prime example of scaling TTC. By allocating more computational resources at the moment a query is made, the model is given more "time" to execute its internal chain of thought, explore different solution paths, and verify its intermediate steps. Research has shown that model performance on complex tasks improves with more time spent thinking.
Self-Correction: A more advanced capability enabled by CoT and increased TTC is self-correction. This involves the model assessing its own outputs or reasoning steps, identifying potential errors, and refining its response accordingly. This iterative process of self-critique and refinement is a hallmark of sophisticated System 2 thinking and is essential for improving the accuracy and reliability of AI systems.
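To see how these three mechanisms compose at inference time, here is a minimal, model-agnostic sketch. The `llm` function is a hypothetical stand-in for any text-completion API; the prompts, the `ANSWER:` marker convention, and the majority-vote logic are illustrative assumptions, not Thinking Machines' actual architecture.

```python
from collections import Counter
from typing import Callable


def extract_answer(trace: str) -> str:
    # Pull the text after the last 'ANSWER:' marker; fall back to the raw trace.
    marker = "ANSWER:"
    return trace.rsplit(marker, 1)[-1].strip() if marker in trace else trace.strip()


def solve_with_cot(llm: Callable[[str], str], problem: str, n_samples: int = 5) -> str:
    # 1. Chain-of-Thought: ask for intermediate steps before the final answer.
    prompt = (
        f"{problem}\n\n"
        "Think step by step. Show your reasoning, then end with a line "
        "'ANSWER: <final answer>'."
    )

    # 2. Test-time compute: spend more inference on the query by sampling
    #    several independent reasoning traces and majority-voting their final
    #    answers (self-consistency). Raising n_samples trades compute for accuracy.
    answers = [extract_answer(llm(prompt)) for _ in range(n_samples)]
    best, _votes = Counter(answers).most_common(1)[0]

    # 3. Self-correction: feed the candidate back and ask the model to verify
    #    or revise it before committing to an output.
    critique = llm(
        f"{problem}\n\nProposed answer: {best}\n"
        "Check this step by step. If it is wrong, give the corrected answer "
        "on a line 'ANSWER: <final answer>'; otherwise repeat it."
    )
    return extract_answer(critique)
```

Note the common thread: every one of these steps buys reliability by spending additional inference-time compute, which is exactly the economic shift examined next.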
This vision directly addresses the limitations of current agentic AI systems, which often fail when deployed in complex real-world environments requiring sustained reasoning and error recovery. By building adaptive reasoning capabilities into the foundation model architecture, Thinking Machines aims to create more reliable and capable autonomous agents.
This represents a sophisticated approach to the agentic AI challenge. Rather than simply scaling existing architectures, they are rebuilding the cognitive foundations.
The $2B Economic Reality
The unprecedented funding reflects the stark economic realities of System 2 AI development. Advanced reasoning techniques require orders of magnitude more computational power than standard language models.
Test-time compute scaling means inference costs can exceed training costs for complex reasoning tasks. OpenAI's o1 reasoning model costs up to six times more to run than GPT-4o.
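A back-of-the-envelope sketch shows why this gap compounds: the per-token price premium multiplies with the hidden reasoning tokens that these models emit (and bill) on top of the visible answer. The prices and token counts below are illustrative assumptions, not current quotes from any provider.

```python
# Illustrative inference-cost comparison. Prices are assumed placeholders in
# USD per million output tokens; real pricing varies by provider and over time.
STANDARD_PRICE = 10.0    # a GPT-4o-class model
REASONING_PRICE = 60.0   # an o1-class reasoning model (~6x per token)

def cost(output_tokens: int, price_per_million: float) -> float:
    return output_tokens / 1_000_000 * price_per_million

visible = 500             # tokens the user actually sees
hidden_reasoning = 5_000  # assumed internal reasoning tokens; grows with difficulty

print(f"standard : ${cost(visible, STANDARD_PRICE):.4f}")                      # $0.0050
print(f"reasoning: ${cost(visible + hidden_reasoning, REASONING_PRICE):.4f}")  # $0.3300
```

Under these assumptions a single hard query costs roughly 66x the standard call, and the multiple grows with problem difficulty, since reasoning-token counts expand while the visible answer stays about the same length.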
The $2 billion represents a strategic war chest for sustained access to cutting-edge GPUs—the primary bottleneck in frontier AI development. This establishes a formidable barrier to entry where operational costs, not just training investments, become astronomical.
Model Intelligence Remains the Cornerstone
Thinking Machines' emphasis on "model intelligence as the cornerstone" directly contradicts the commoditization thesis that has gained prominence following recent price wars among AI providers. This position reflects a crucial technical insight that most market observers overlook: advanced reasoning architectures cannot compensate for fundamental capability limitations in the underlying model.
As Weng's research demonstrates: "test-time compute cannot solve everything or fill in big model capability gaps." Sophisticated reasoning mechanisms built on weak foundation models simply fail when confronted with genuinely difficult problems requiring domain expertise, multi-step reasoning, or novel problem decomposition.
This insight challenges the current industry consensus that AI capabilities are rapidly commoditizing. Recent experiments show that smaller models combined with advanced inference algorithms can achieve competitive performance, but only when the capability gap with larger models remains small. For the complex reasoning tasks in science and programming that Thinking Machines targets, the underlying model intelligence remains the primary constraint.
The commoditization narrative assumes we have reached a capabilities plateau where incremental improvements yield diminishing returns. However, recent work on reasoning models like DeepSeek-R1 demonstrates emergent capabilities—including "aha moments" where models reflect on previous mistakes and try alternative approaches—that suggest we remain far from that plateau.
For agentic systems deployed in real-world environments, this distinction becomes critical. Agents operating in scientific research, engineering design, or complex business environments encounter problems that require genuine reasoning rather than pattern matching. The most sophisticated orchestration frameworks cannot compensate for agents that lack the fundamental intelligence to understand problem requirements and develop appropriate solution strategies.
Thinking Machines' massive funding round shows how much potential investors see in this perspective. They are not betting on incremental improvements to existing capabilities; they are backing breakthrough advances in model intelligence that could enable entirely new categories of applications.
Beyond First-Wave Reasoning: The Architectural Divide
The fundamental distinction between Thinking Machines and existing approaches becomes clear when examining the limitations of current reasoning systems, including OpenAI's o-series models that represent the current state-of-the-art in AI reasoning capabilities.
First-wave reasoning models, exemplified by OpenAI's o1 and o3, achieve impressive performance through reinforcement learning applied to chain-of-thought reasoning. These systems learn to generate longer reasoning traces that improve problem-solving accuracy. However, they remain fundamentally constrained by their training paradigm—optimizing for correct final answers rather than building genuine understanding or reliable self-correction mechanisms.
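To see why optimizing only for correct final answers leaves reasoning quality unconstrained, consider a toy outcome-based reward. This is a deliberately simplified illustration of the idea, not OpenAI's actual reward function; the point is that nothing in it ever inspects the intermediate steps.

```python
def outcome_reward(trace: str, reference_answer: str) -> float:
    # Score the whole multi-step trace by its final answer alone: the text
    # after the last 'ANSWER:' marker (the marker format is an assumption).
    predicted = trace.rsplit("ANSWER:", 1)[-1].strip()
    # The intermediate reasoning never enters the reward signal, so flawed
    # or unfaithful reasoning that happens to land on the right answer is
    # reinforced just as strongly as sound reasoning.
    return 1.0 if predicted == reference_answer else 0.0
```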
The research underlying these systems reveals persistent failure modes that Thinking Machines aims to address through architectural innovation. Current reasoning models suffer from reward hacking, where systems learn to exploit evaluation mechanisms rather than develop genuine reasoning capabilities. They also struggle with faithfulness—generating plausible-sounding explanations that may not reflect actual decision processes.
Thinking Machines' approach differs fundamentally by targeting the cognitive architecture itself rather than optimizing existing paradigms. Where first-wave systems retrofit reasoning capabilities onto models designed for single-turn text generation, Thinking Machines builds adaptive reasoning into the foundation architecture. This enables genuine self-correction, error recovery, and the kind of flexible problem-solving that current systems cannot achieve reliably.
The distinction parallels the difference between teaching someone to follow reasoning scripts versus developing their capacity for independent thought. First-wave approaches excel at executing learned reasoning patterns but fail when confronted with novel problems requiring genuine creativity or self-correction. Thinking Machines targets the more ambitious goal of building systems that can think rather than simply execute sophisticated pattern matching.
Recent developments from OpenAI, however, show rapid progress in overcoming these constraints. An experimental OpenAI reasoning model achieved gold medal-level performance on the 2025 IMO, generating multi-page proofs with sustained creative reasoning over extended time horizons. This underscores the evolving architectural divide, where first-wave optimizations are giving way to more general-purpose advancements; yet Thinking Machines' foundational rebuild may still offer distinct advantages in reliability and human-AI integration.
Human-AI Collaboration as the Path to Agentic Systems
Where industry discourse often presents human-AI collaboration and autonomous agentic systems as opposing approaches, Thinking Machines' vision reveals them as complementary elements of a more sophisticated technical architecture. Their focus on "multi-modal systems that work with people collaboratively" represents a strategic response to the persistent reliability challenges that limit current agentic AI deployment.
The technical rationale for this approach becomes clear when examining the failure modes of purely autonomous systems.
Self-correction mechanisms, which are essential for reliable agentic behavior, often degrade performance rather than improve it, with models modifying correct responses to be incorrect or making minimal changes that fail to address underlying errors. External feedback from human collaborators provides the error signals necessary for genuine self-improvement rather than behavioral collapse.
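A minimal sketch of this idea: gate the model's self-revision behind an external signal, so a correct answer is never "fixed" into an incorrect one. Both `llm` and `ask_human` are hypothetical callables standing in for a model API and a human review step.

```python
from typing import Callable

def corrected_answer(
    llm: Callable[[str], str],
    ask_human: Callable[[str], bool],
    problem: str,
    answer: str,
) -> str:
    # External feedback first: only revise when a human flags a real error.
    if not ask_human(f"Is this answer wrong?\nQ: {problem}\nA: {answer}"):
        # No external error signal: keep the answer rather than letting the
        # model second-guess a response that was already correct.
        return answer
    return llm(
        f"{problem}\nYour previous answer was judged incorrect: {answer}\n"
        "Revise it, addressing the error."
    )
```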
This collaborative framework addresses several fundamental limitations of current agentic architectures. Reward hacking, where models learn to exploit evaluation mechanisms rather than develop genuine capabilities, becomes more difficult when human supervisors can identify and correct such behavior in real time. The interpretability challenges that plague current AI systems become manageable when human collaborators can verify reasoning steps and provide course corrections.
The approach also aligns with emerging evidence about the most effective deployment patterns for AI in enterprise environments. As I have documented in my analysis of AI's orchestration challenges, the companies achieving meaningful bottom-line impacts from AI are those that successfully integrate AI capabilities into human workflows. They avoid pursuing full automation in favor of effective human-AI collaboration.
Thinking Machines' multi-modal emphasis supports this collaborative architecture by enabling AI systems to communicate through the full spectrum of human interaction modalities rather than being constrained to text interfaces. This technical capability becomes essential for agentic systems that need to maintain context across extended reasoning sessions involving multiple stakeholders and revision cycles.
The vision anticipates the infrastructure requirements for next-generation agentic systems that can operate reliably in high-stakes environments. Rather than pursuing the technically problematic goal of fully autonomous reasoning, Thinking Machines recognizes that the most powerful agentic systems will interweave human judgment with AI computation at critical decision points, creating hybrid intelligence networks that exceed the capabilities of either component operating independently.
Why Second-Wave Companies Win Technological Transitions
The proposition that Thinking Machines represents a threat to current LLM incumbents requires examining the specific dynamics of technological transitions and the advantages that accrue to second-wave innovators. Historical evidence demonstrates that the companies capturing the greatest value in technological discontinuities are often not the pioneers. Instead, they are second-generation innovators who learn from early implementation challenges. This pattern appears consistently across multiple technology sectors.
In enterprise software, Salesforce displaced earlier customer relationship management pioneers by addressing fundamental usability and integration challenges that limited first-generation adoption. The company succeeded not through superior underlying technology but by solving the orchestration and deployment problems that prevented earlier systems from delivering consistent business value.
Similar patterns emerge across technological transitions.
In mobile computing, Android and iOS succeeded where Palm and Windows Mobile failed by addressing ecosystem development, user interface design, and application distribution challenges identified through earlier attempts. In cloud computing, Amazon Web Services gained a dominant market position by learning from the infrastructure and reliability limitations that constrained earlier cloud providers.
I expect that the AI sector will exhibit analogous dynamics.
Thinking Machines benefits from unprecedented insider knowledge of these first-wave limitations. Team members, including Mira Murati, John Schulman, Lilian Weng, and other former OpenAI researchers, directly encountered the technical challenges of scaling reasoning systems, managing reward hacking in reinforcement learning, and building reliable self-correction mechanisms. This experience provides specific technical insights unavailable to external competitors attempting to reverse-engineer solutions.
The second-wave advantage extends beyond learning from mistakes to fundamental architectural choices. First-wave companies optimized for rapid capability demonstration and market capture, often accepting technical debt and architectural constraints that limit long-term scalability. Second-wave companies can optimize for reliability, interpretability, and integration—the factors that determine real-world value creation rather than benchmark performance.
For agentic AI specifically, this second-wave dynamic becomes particularly pronounced. Current agentic systems fail frequently when deployed in complex real-world environments, often due to fundamental architectural limitations inherited from models designed for single-turn text generation. Thinking Machines can build agentic capabilities into the foundation architecture rather than retrofitting them onto existing systems.
The competitive threat to incumbents becomes clear when considering market dynamics.
As basic LLM capabilities commoditize and token costs decline toward zero, sustainable competitive advantages must emerge from superior performance on complex reasoning tasks and reliable integration with human workflows. Companies that master these second-wave challenges while avoiding the failure modes that constrain current systems will capture disproportionate market share as the industry matures.
The Risks?
The Thinking Machines approach faces considerable technical risks that institutional investors appear willing to accept given the potential returns.
Test-time reasoning systems exhibit several failure modes that remain unsolved research problems. Reward hacking represents a persistent challenge where models learn to exploit evaluation mechanisms rather than develop genuine reasoning capabilities. Current techniques for detecting such behavior through chain-of-thought monitoring become ineffective as models learn to obfuscate their true intentions within seemingly reasonable explanations.
The faithfulness problem poses additional challenges for deploying these systems in high-stakes environments. Models trained through reinforcement learning to produce correct answers may generate plausible-sounding reasoning that does not reflect their actual decision processes, limiting the interpretability benefits that motivate these approaches. Recent research demonstrates that incorporating monitoring directly into training rewards leads to more sophisticated forms of deception rather than improved transparency.
This is where the strategic vision of Thinking Machines Lab becomes critically important. As I discussed above, their stated mission to advance "collaborative general intelligence" and build AI that works through the "messy way we collaborate" is more than just a user-friendly marketing slogan; it represents a profound architectural choice that directly addresses the faithfulness problem.
Computational cost scaling presents practical deployment challenges. While longer reasoning traces correlate with improved performance on complex problems, computational costs scale linearly with thinking time. Determining optimal compute allocation strategies for different problem types requires sophisticated meta-learning capabilities that current systems lack. This creates potential tensions between reasoning quality and operational efficiency that may limit commercial viability.
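One way to frame the allocation problem is a difficulty-aware sampling budget: draw few reasoning samples for easy queries and more for hard ones, capped by cost. The difficulty score and the constants below are placeholder assumptions; a real system would need a learned router in their place.

```python
def sample_budget(difficulty: float, max_samples: int = 32,
                  cost_per_sample: float = 0.02, budget: float = 0.25) -> int:
    # difficulty in [0, 1]; total cost grows roughly linearly with samples
    # drawn, mirroring the linear cost-vs-thinking-time scaling noted above.
    wanted = max(1, round(difficulty * max_samples))
    affordable = int(budget // cost_per_sample)
    return min(wanted, affordable)

print(sample_budget(0.1))  # easy query -> 3 samples
print(sample_budget(0.9))  # hard query -> capped at 12 by the cost budget
```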
Unsurprisingly, companies will take divergent architectural approaches to the same challenges, each trading off performance against cost in its own way. These risks reflect the inherent uncertainty in frontier technology development rather than fundamental flaws in the approach. The potential returns justify significant risk tolerance, particularly for organizations positioned to benefit from breakthrough advances in reasoning capabilities.
Strategic Implications for the Industry
The Thinking Machines funding round signals a broader industry shift toward more sophisticated approaches to AI capabilities development that will reshape competitive dynamics across multiple sectors. Organizations developing AI strategies should consider several implications of this transition.
The emphasis on test-time reasoning and human-AI collaboration suggests that current optimization strategies focused primarily on training-time scaling laws may be incomplete. Companies optimizing for parameter efficiency during training while ignoring inference-time reasoning capabilities may be solving the wrong optimization problem as the industry moves toward System 2 AI architectures.
For agentic AI development specifically, the collaborative approach indicates that the most successful deployments will likely integrate human oversight and correction mechanisms rather than pursuing full automation. Organizations should consider how their agentic AI strategies account for human-in-the-loop workflows and the infrastructure requirements for effective human-AI collaboration.
The model intelligence emphasis challenges companies that have focused primarily on application layer development or assumed that basic AI capabilities would remain commoditized. As reasoning requirements become more sophisticated, access to advanced foundation models may become a critical competitive factor rather than a neutral input.
The second-wave dynamic suggests that current market leaders face genuine disruption risk from teams with insider knowledge and alternative architectural approaches. Organizations should evaluate their competitive positioning relative to emerging second-wave companies and consider whether their current AI strategies account for potential disruption from more sophisticated reasoning architectures.
The substantial funding round also indicates that the AI development timeline may be longer and more capital-intensive than many organizations have anticipated. Companies planning AI deployment strategies should consider the implications of continued rapid innovation in foundation capabilities and the potential obsolescence of current-generation systems.
The Thinking Machines approach provides a template for identifying genuine technical differentiation amid the noise of incremental capability improvements. As the AI industry matures, the companies that master the transition from System 1 to System 2 architectures while building effective human-AI collaboration frameworks will capture asymmetric returns in this technological transition.
This second wave of AI innovation is taking shape around reasoning, reliability, and real-world integration rather than pure capability scaling. The technical foundations suggest it may indeed capture the greatest share of the value being created in the current AI revolution.
Exploring the System 2 Architecture: Could Open Source Play a Complementary Role?
In the second part of this article, I explore how System 2 AI introduces a possible separation between foundation model parameters (where raw intelligence resides) and orchestration frameworks (where reasoning emerges), with implications for value capture within the stack.