
From the Age of Scaling to the Age of Research


Summary

The trajectory of artificial intelligence stands at a definitive historical inflection point. For the better part of a decade, the field has been dominated by a singular, industrial logic: the “Age of Scaling.” This era, roughly spanning from the release of GPT-3 in 2020 to the saturation points observed in 2025, was characterized by the empirical triumph of power laws: the observation that intelligence, or at least the proxy of next-token prediction accuracy, scales predictably with the exponential increase of computational operations, dataset size, and parameter counts. It was an era in which engineering triumphed over scientific discovery, and capital expenditure on GPU clusters became the primary determinant of capability. However, this report argues, based on a rigorous reconstruction of Ilya Sutskever’s strategic thesis and corroborating empirical evidence, that the Age of Scaling is concluding. We are entering the “Age of Research,” a period defined not by the brute force of scale, but by the necessity of fundamental architectural breakthroughs in reasoning, generalization, and value alignment.

This comprehensive analysis deconstructs the claims made by Ilya Sutskever following his departure from OpenAI and the founding of Safe Superintelligence Inc. (SSI). It validates his central assertion: that pre-training on human-generated data is approaching a hard asymptote, a “data wall,” and that current models, despite their brilliance on static benchmarks, exhibit a “jagged frontier” of capabilities that renders them unreliable for high-stakes economic or safety-critical deployment. The report synthesizes data from 2024 and 2025, including the performance of reasoning models like OpenAI’s o1/o3 and DeepSeek’s R1, to demonstrate that the industry is pivoting toward “inference-time compute” and “System 2” reasoning as the new engines of progress. This shift fundamentally alters the risk landscape, moving the danger zone from training runs to test-time execution and internal deployments.

Furthermore, this report scrutinizes the institutional design of SSI. By adopting a “straight-shot” business model that eschews intermediate commercial products, SSI attempts to insulate safety research from the perverse incentives of the product cycle. While this aligns with long-term safety goals, it introduces profound transparency risks and creates a “shadow development” paradigm that existing governance frameworks, specifically the EU AI Act and US Executive Order 14110, are ill-equipped to regulate. The exemption for “scientific research” creates a regulatory blind spot where superintelligent capabilities could be developed in isolation, shielded from oversight until the moment of potential breakout.

The following analysis is structured to guide technical leaders, policymakers, and strategists through this volatile transition. It offers a detailed roadmap for an R&D portfolio that prioritizes robust value learning over raw scale, and a governance framework that shifts from market-based triggers to capability-based oversight of internal research environments.

Chapter 1: The Trajectory of Intelligence - From Scaling to Stagnation

To comprehend the magnitude of the shift Sutskever describes, one must first dissect the mechanics of the era we are leaving behind. The “Age of Scaling” was not merely a trend; it was a scientifically validated hypothesis that became an industrial dogma. However, like all exponential trends in physical systems, it has encountered the friction of finite resources: in this case, the finiteness of human thought recorded as data.

1.1 The Architecture of the Scaling Era (2020–2025)

The genesis of the Scaling Era lies in the seminal work on neural scaling laws, most notably by Kaplan et al. (2020) and later refined by the Chinchilla team at DeepMind. These papers established a quantifiable relationship between compute, dataset size, and model parameters, taking the general form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$, where $L$ is the loss, $N$ the number of parameters, and $D$ the number of training tokens. This observation transformed AI research from a cottage industry of algorithmic tinkering into a capital-intensive industrial process.1
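To make the functional form concrete, here is a minimal sketch in Python using the coefficient fits published in the Chinchilla paper (Hoffmann et al., 2022); the numbers are illustrative, not a claim about any particular production model.

```python
# Parametric scaling law: L(N, D) = E + A/N^alpha + B/D^beta
# Coefficients are the published Chinchilla fits (Hoffmann et al., 2022),
# used here purely for illustration.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Doubling parameters without more data quickly runs into the data-bound term:
print(loss(70e9, 1.4e12))   # Chinchilla-like: 70B params, 1.4T tokens
print(loss(140e9, 1.4e12))  # 2x the params, same data: only a marginal gain
```

The third term is the crux of the data wall: once $D$ stops growing, further gains must come from somewhere other than scale.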

During this period, the “recipe” for advancing the state of the art was deceptively simple: scrape more of the internet, grow the parameter count, and buy more compute.

This approach yielded spectacular results, moving from GPT-2’s incoherent babble to GPT-4’s professional-grade prose. However, Sutskever’s critique is that this success masked a fundamental hollowness: the models were learning to imitate the distribution of human text, not to understand the underlying causal structures of the world.4 They became “statistical mimics” par excellence, but their ability to generalize to novel situations remained brittle.

1.2 The Data Wall and the Limits of Pre-Training

The primary driver of the shift to the “Age of Research” is the phenomenon known as “Peak Data.” As analyzed by Epoch AI and corroborated by Sutskever, the stock of high-quality, human-generated public text data is finite. Projections indicate that at current scaling rates, models will have exhausted the entire high-quality public internet between 2026 and 2032.6
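The arithmetic behind such projections is easy to reproduce. The stock size and growth rate below are rough, illustrative assumptions in the spirit of the Epoch AI analysis, not its exact figures:

```python
import math

# Illustrative assumptions (not Epoch AI's exact numbers): an effective
# stock of ~300 trillion usable tokens, frontier runs consuming ~15
# trillion tokens in 2024, and training datasets growing ~2.5x per year.
STOCK = 300e12
USED_2024 = 15e12
GROWTH = 2.5

# Years until a single frontier training run would need the entire stock:
years = math.log(STOCK / USED_2024) / math.log(GROWTH)
print(f"Exhaustion in ~{years:.1f} years (~{2024 + years:.0f})")  # ~2027
```

Varying the assumed stock and growth rate within plausible bounds moves the answer around inside exactly the 2026-2032 window cited above.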

The Quality-Quantity Trade-off: as the stock of high-quality text dwindles, labs can substitute synthetic or lower-quality data to keep volumes growing, but only at the risk of degrading the very distribution the models learn from.

1.3 The “Jagged Frontier” of Capability

The most damning evidence against the “scaling is all you need” hypothesis is the persistence of the “jagged frontier.” This term describes the uneven, unpredictable nature of current AI capabilities. A frontier model in 2025 might score in the 99th percentile on the US Medical Licensing Exam (USMLE) yet fail to solve a simple visual logic puzzle from the ARC-AGI benchmark that a human child could solve in seconds.12

Table 1: The Paradox of Performance (2025 Benchmarks)

| Benchmark Domain | Top Model Score (approx.) | Human Baseline | Implication |
| --- | --- | --- | --- |
| Medical Knowledge (MedQA) | ~95% | ~60-85% (Expert) | Superhuman memorization and retrieval. |
| Competitive Coding (Codeforces) | ~89th Percentile | Varies | Expert-level pattern matching and syntax generation. |
| Visual Reasoning (ARC-AGI) | ~53-75% | >85% (Non-expert) | Sub-human ability to adapt to novel rules/concepts. |
| Simple Logic (Salesforce SIMPLE) | ~60-70% | ~90-100% | Failures on trivial “distractor” tasks. |

This jaggedness confirms Sutskever’s view that models are essentially “overfitting” to the distribution of human knowledge without acquiring the general-purpose reasoning machinery that characterizes biological intelligence.14 They are not “smart” in the human sense; they are vast, searchable archives of crystallized intelligence. When faced with a problem that requires fluid intelligence, the ability to reason through a novel situation from first principles, they often hallucinate or fail catastrophically.

1.4 The Shift to Inference-Time Compute

The industry’s response to the pre-training plateau has been the pivot to “Inference-Time Compute,” exemplified by OpenAI’s o1 and o3 models. This represents a new scaling law: Accuracy scales with the amount of time the model spends “thinking” before answering.16

This mechanism differs fundamentally from pre-training. Instead of embedding knowledge into the weights (learning), the model uses its existing weights to search through a tree of possibilities, evaluate partial solutions, and backtrack when it detects an error, a process known as “System 2” reasoning.18
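A minimal sketch of the simplest such strategy, best-of-N sampling against a verifier; `generate` and `score` here are hypothetical stand-ins for a model call and a learned verifier:

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Spend more inference compute (a larger n) searching over candidate
    reasoning traces instead of committing to a single greedy answer."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins; a real system would call an LLM and a learned verifier.
result = best_of_n(
    generate=lambda: random.choice(["41", "42", "43"]),
    score=lambda a: -abs(int(a) - 42),  # toy verifier: distance from truth
    n=16,
)
print(result)  # with enough samples, almost surely "42"
```

The new scaling knob is `n` (or the length of the reasoning trace): accuracy is bought at query time rather than baked in at training time.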

This transition marks the end of the Age of Scaling as we knew it. We are no longer limited by how many GPUs we can string together for a training run, but by our ability to design architectures that can utilize inference compute effectively to generalize and self-correct. This is the dawn of the Age of Research.

Chapter 2: The Age of Research - Value Functions, Emotions, and the Biological Turn

Ilya Sutskever’s thesis for the “Age of Research” is not merely a call for new algorithms; it is a philosophical pivot toward biology. He argues that to bridge the gap between the “jagged” savant-like capabilities of LLMs and the robust, general intelligence of humans, we must solve the problem of Value Learning.

2.1 The Value Function Hypothesis

In Reinforcement Learning (RL), a value function estimates the total expected future reward an agent can achieve from a given state. It is the agent’s internal compass, telling it whether a situation is “good” or “bad” long before the final outcome is realized.22

Sutskever posits a direct equivalence: Emotions are biological value functions.14
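The analogy can be made concrete with the simplest value-learning algorithm, tabular TD(0); the temporal-difference error that drives the update is the “better or worse than expected” signal that neuroscience has long linked to dopamine responses:

```python
# Tabular TD(0): V(s) <- V(s) + lr * (r + gamma * V(s') - V(s))
# The bracketed quantity, the TD error, is the signal Sutskever's analogy
# maps onto emotional valence: surprise relative to expectation.
from collections import defaultdict

V = defaultdict(float)   # value estimate per state
GAMMA, LR = 0.99, 0.1

def td_update(state, reward, next_state) -> float:
    td_error = reward + GAMMA * V[next_state] - V[state]
    V[state] += LR * td_error
    return td_error  # positive: pleasant surprise; negative: disappointment

# One transition: reaching a rewarding state raises the value of its precursor.
err = td_update("near_goal", reward=1.0, next_state="goal")
print(err, V["near_goal"])
```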

2.2 Hierarchical Reinforcement Learning (HRL) and Intrinsic Motivation

To operationalize this biological insight, the “Age of Research” will likely focus on Hierarchical Reinforcement Learning (HRL) and Intrinsic Motivation.25

The Architecture of Hierarchy:

Current LLMs largely operate as “flat” predictors. HRL proposes a tiered architecture: a high-level controller that selects abstract subgoals over long horizons, and low-level policies that execute primitive actions until each subgoal is met (a minimal version of this control loop is sketched after the next paragraph).

This mirrors the brain’s organization, where the prefrontal cortex (System 2) engages in planning and inhibition, while the basal ganglia and motor cortex (System 1) handle execution.25 Research in 2025 has begun to show that RL can induce emergent hierarchies in LLMs, where certain attention heads specialize in long-term planning while others handle syntax.27
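A minimal sketch of the two-tier control loop described above; all callables are hypothetical stand-ins for learned policies and an environment:

```python
def hierarchical_rollout(plan_subgoal, act, subgoal_reached, env_step,
                         state, max_steps=100):
    """Two-tier HRL loop: a high-level policy picks subgoals (System 2);
    a low-level policy handles moment-to-moment execution (System 1).
    All callables are hypothetical stand-ins for learned components."""
    steps = 0
    while steps < max_steps:
        steps += 1
        subgoal = plan_subgoal(state)            # slow, deliberate
        if subgoal is None:                      # planner signals "done"
            break
        while not subgoal_reached(state, subgoal) and steps < max_steps:
            action = act(state, subgoal)         # fast, reactive
            state = env_step(state, action)
            steps += 1
    return state
```

The design choice mirrors the biological claim: deliberation is expensive and infrequent, execution is cheap and constant.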

Intrinsic Motivation:

Sutskever’s thesis aligns with the work of researchers like Karl Friston (Active Inference), who argue that intelligence is driven by the minimization of “free energy” (surprise).28 Agents should be self-motivated to explore their environment to reduce uncertainty, rather than just chasing external rewards.
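One common operationalization is the curiosity bonus of Pathak et al. (2017), a narrower cousin of Friston’s formalism: the agent rewards itself in proportion to its own prediction error. A toy sketch, with an untrained stand-in for the learned forward model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1        # toy stand-in for a forward model

def intrinsic_reward(state: np.ndarray, next_state: np.ndarray) -> float:
    """Curiosity bonus: the agent's own prediction error. Transitions the
    forward model cannot predict are 'interesting' and get rewarded,
    driving exploration toward uncertainty reduction."""
    predicted = W @ state
    return float(np.sum((predicted - next_state) ** 2))

# The training signal mixes extrinsic reward with the curiosity bonus:
# r_total = r_extrinsic + beta * intrinsic_reward(s, s_next)
```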

2.3 The Neurosymbolic Renaissance

The reliability crisis has also revitalized Neurosymbolic AI. The “pure scaling” hypothesis assumed that neural networks would eventually “grok” logic and symbolic reasoning perfectly. The persistence of the jagged frontier suggests otherwise.32
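A toy illustration of the neurosymbolic division of labor, with `propose` as a hypothetical stand-in for an LLM call: the neural side generates candidate answers, and a symbolic checker verifies them exactly before anything is released.

```python
from fractions import Fraction

def verify(expr: str, claimed: str) -> bool:
    """Symbolic side: evaluate the arithmetic expression exactly and check
    the claim. The verdict is crisp; there is no statistical guessing."""
    value = eval(compile(expr, "<expr>", "eval"), {"__builtins__": {}})
    return Fraction(value).limit_denominator() == Fraction(claimed)

def neurosymbolic_answer(propose, expr: str, retries: int = 3) -> str:
    """Neural side proposes (hypothetical LLM call); symbolic side vetoes.
    The model's fluency is kept, but only verified answers get through."""
    for _ in range(retries):
        claimed = propose(expr)
        if verify(expr, claimed):
            return claimed
    raise ValueError("no verified answer within budget")

print(neurosymbolic_answer(lambda e: "4", "2 + 2"))  # "4"
```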

2.4 Unlearning and the Safety Imperative

A critical, often overlooked component of Sutskever’s safety thesis is Unlearning.36 If we build a superintelligence, we must be able to selectively excise hazardous capabilities (e.g., bioweapon design) without lobotomizing its general reasoning ability.
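A heavily simplified sketch of one family of unlearning methods, gradient ascent on a “forget” set balanced against gradient descent on a “retain” set; the batch attributes and weighting are illustrative assumptions, not a production recipe:

```python
import torch

def unlearning_step(model, forget_batch, retain_batch, loss_fn, optimizer,
                    retain_weight: float = 1.0):
    """One step of gradient-ascent unlearning (illustrative sketch): raise
    the loss on the forget set while holding the retain set's loss down,
    excising a capability without wholesale damage to the rest.
    `forget_batch`/`retain_batch` with .inputs/.targets are hypothetical."""
    optimizer.zero_grad()
    forget_loss = loss_fn(model(forget_batch.inputs), forget_batch.targets)
    retain_loss = loss_fn(model(retain_batch.inputs), retain_batch.targets)
    (-forget_loss + retain_weight * retain_loss).backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```

The hard open problem is exactly the tension the two loss terms encode: pushing the forget loss up tends to drag general capability down with it.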

Chapter 3: The Empirical Reality of 2025 - Reasoning, Reliability, and Benchmarks

The theoretical shift to the “Age of Research” is mirrored by the messy, complex empirical reality of AI in 2025. This chapter analyzes the performance of the latest “reasoning models” (o1, o3, R1) to validate the claims of jaggedness and the potential of inference scaling.

3.1 The Rise of Reasoning Models: o1 and o3

The release of OpenAI’s o1 and o3 series marked the first successful productization of “System 2” AI. These models do not just predict the next token; they generate a hidden “chain of thought,” often thousands of tokens long, before producing a user-facing response.
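Schematically, serving such a model looks like the sketch below; the `generate` callable and the `FINAL:` delimiter are hypothetical, since real reasoning traces stay server-side:

```python
def answer_with_hidden_cot(generate, prompt: str) -> str:
    """Elicit a long private reasoning trace, return only the final answer.
    `generate` is a hypothetical model call; real reasoning models keep
    the trace server-side (and bill for its tokens)."""
    raw = generate(prompt + "\nReason step by step, then write 'FINAL:' "
                            "followed by the answer alone.")
    return raw.rsplit("FINAL:", 1)[-1].strip()

# Toy stand-in for a model call:
demo = lambda p: "3*7 = 21, plus 1 is 22. FINAL: 22"
print(answer_with_hidden_cot(demo, "What is 3*7+1?"))  # "22"
```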

Benchmark Performance Analysis: o1 reached roughly the 89th percentile on Codeforces and ~83% on AIME 2024, and o3 pushed further still, posting ~96.7% on AIME 2024 and ~87.7% on GPQA Diamond. These gains came not from new pre-training but from scaling the length of the hidden reasoning trace.

3.2 The Persistence of Jaggedness: The ARC-AGI Failure

Despite these triumphs, the ARC-AGI benchmark remains a stubborn outlier. ARC (Abstraction and Reasoning Corpus), developed by François Chollet, tests a system’s ability to learn new logical rules from just 3-4 examples (few-shot learning) and apply them to a test case.43

3.3 Reliability and the “Agentic” Crisis

For AI to be economically transformative, it must be reliable. An autonomous agent that navigates a computer to book a flight or write software cannot fail 10% of the time.
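The arithmetic of compounding failure makes the point starkly: per-step reliability multiplies over the length of a task.

```python
# End-to-end success of a k-step agentic task with per-step reliability p.
for p in (0.90, 0.99, 0.999):
    for k in (10, 50):
        print(f"p={p:.3f}, steps={k:>2}: success = {p**k:6.1%}")
# 99% per-step reliability over 50 steps already falls to ~60.5%;
# at 90% per step, a 50-step task succeeds only ~0.5% of the time.
```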

3.4 The Economics of Inference Scaling

The shift to reasoning models introduces a new economic paradigm. “Intelligence” is no longer a fixed property of the model weights; it is a variable cost.
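A simple cost model makes the new economics tangible; the per-token prices below are illustrative, not any vendor’s actual rates:

```python
def query_cost(prompt_tokens: int, reasoning_tokens: int, answer_tokens: int,
               usd_per_m_in: float = 2.0, usd_per_m_out: float = 8.0) -> float:
    """Reasoning tokens are billed as output: 'thinking harder' becomes a
    variable cost, tunable per query. Prices here are illustrative."""
    out = reasoning_tokens + answer_tokens
    return prompt_tokens / 1e6 * usd_per_m_in + out / 1e6 * usd_per_m_out

print(query_cost(1_000, 200, 300))      # quick answer: fractions of a cent
print(query_cost(1_000, 50_000, 300))   # deep reasoning: ~70x the cost
```

The operator, not the model builder, now decides how much intelligence each query is worth.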

Chapter 4: The Institutional Architecture of AGI - SSI vs. The Field

The “Age of Research” demands not just new algorithms, but new institutional structures. The perverse incentives of the commercial “arms race” have arguably degraded the quality of safety research. Safe Superintelligence Inc. (SSI) represents a radical experiment in organizational design.

4.1 The SSI “Straight Shot” Thesis

SSI’s strategy is defined by its “straight shot” mission: “One goal and one product: a safe superintelligence.”49 This is a rejection of the dominant “iterative deployment” model practiced by OpenAI, Google, and Anthropic.

The Logic of Insulation: by refusing to ship intermediate products, SSI removes the commercial pressure to deploy under-tested systems; safety research can proceed on its own timeline rather than the product cycle’s.

Financial Structure and Valuation:

SSI raised approximately $1 billion at a $5 billion valuation (reportedly followed by later rounds at a $32 billion valuation) without a product.51 This relies on a venture-capital thesis that views AGI as a binary, winner-takes-all event. Investors are betting on the terminal value of owning a share of the first safe superintelligence rather than on discounted cash flows from SaaS subscriptions. This structure mimics the early days of DeepMind, but with significantly higher stakes.53

4.2 Comparative Analysis of Lab Strategies

| Strategic Dimension | SSI (Sutskever) | OpenAI (Altman) | Anthropic (Amodei) | Meta (LeCun) |
| --- | --- | --- | --- | --- |
| Primary Goal | Safe Superintelligence (Terminal) | AGI + Product Dominance | Reliable, Steerable AI Systems | Open Science / AGI via World Models |
| Scaling Philosophy | Scaling is plateauing; Research First | Scaling + Post-training + Product Flywheel | Scaling with strict Safety Cases (RSP) | LLMs are insufficient; JEPA/World Models |
| Commercial Strategy | Zero Revenue (Straight Shot) | Aggressive B2B/B2C (ChatGPT, API) | Enterprise Focus (Claude) | Open Weights (Commoditize the layer) |
| Safety Approach | Fundamental Alignment (Value Functions) | RLHF, Red Teaming, Post-hoc | Constitutional AI, RLAIF | Transparency, Architectural grounding |
| Transparency | Opaque (Stealth) | Moderate (System Cards, no weights) | Moderate (Research papers, no weights) | High (Open Weights) |

4.3 Critique of the SSI Model

While philosophically pure, the SSI model faces significant critiques: it forgoes the real-world feedback loops that iterative deployment provides, it operates in near-total opacity, and it depends on sustained investor patience in the absence of any revenue.

Chapter 5: Governance in the Shadow of Superintelligence

The rise of “research-first” labs like SSI and the shift to inference-time capabilities expose critical gaps in the current global AI governance architecture. Regulations designed for the “Age of Scaling,” focused on training compute thresholds and market placement, are becoming obsolete.

5.1 The Regulatory Blind Spot: Research Exemptions

Both the EU AI Act and US Executive Order 14110 contain loopholes that could allow a lab like SSI to develop superintelligence with minimal oversight.

5.2 The “MAIM” Framework: A Geopolitical Stability Model

The document “Superintelligence Strategy: Expert Version” introduces the concept of Mutual Assured AI Malfunction (MAIM) as a stability framework for the AGI era.60

5.3 Governance Recommendations: Closing the Gaps

To govern the “Age of Research” effectively, policymakers must pivot from market-based regulation to capability-based regulation.

Chapter 6: Strategic Roadmap - Navigating the Age of Research

Based on the synthesis of Sutskever’s thesis, empirical data, and the governance landscape, this report outlines a strategic agenda for the next 3-5 years (2025-2030).

6.1 R&D Portfolio Blueprint

For AI labs and research organizations, the “Age of Research” demands a reallocation of resources from pure scaling to three core tracks: robust value learning and alignment, reasoning architectures that exploit inference-time compute, and safety mechanisms such as unlearning.

6.2 Enterprise & Investment Strategy

6.3 Conclusion

Ilya Sutskever’s declaration is a siren call marking the end of the “easy money” era of AI. The physics of scaling are encountering the biology of intelligence. The path forward is no longer a straight line of exponential compute; it is a branching tree of difficult scientific questions about the nature of reasoning, values, and reliability. Whether SSI succeeds in its “straight shot” or not, the industry has irrevocably shifted. The winners of the next decade will not be those with the biggest clusters, but those who can engineer the “ghost in the machine”: the value functions and reasoning structures that turn a statistical predictor into a reliable, safe, and truly intelligent agent.

Works cited

