Why the AI Wall Keeps Moving

This post was inspired by one of the questions Dwarkesh Patel proposed for his blog prize on the big questions about AI:

A couple years ago, there was this idea that AI progress might slow down as we make further progress into the RL regime. 1. Because as horizon lengths increase, the AI needs to do many days' worth of work before we can even see if it did it right, so if we're still in a naive policy gradient world, the reward signal / FLOP goes down, and 2. We'd crossed through many OOMs of RL compute from GPT 4 to o1 to o3, and it would not be feasible to replicate that many OOMs increase in compute immediately again. But AI progress seems to have been fast nonetheless - even potentially speeding up if rumors about Spud or Mythos are to be believed. What gives? What did that previous intuition pump that motivated longer timelines miss? Feel free to deny premise of question.

There has always been doubt that AI will keep progressing. One argument held that as the tasks AI solves get longer, compute and training techniques would not be able to keep pace. Empirically, progress has continued. Three things explain this: each model bootstraps the next, the industry keeps growing, and AI is following the same engineering-driven path as the technologies before it.

Bootstrapping. Recursive self-improvement has only recently entered the discourse, but if you view it as each successive model unlocking the next, it has been happening for years. A useful way to think about it: a capable model plus human expertise yields a slightly more capable model. Even before models could meaningfully contribute to model implementations, kernel optimizations, or architecture research, each one was used to bootstrap the data for its successor. Take GPT-3 first, although the pattern had already begun with earlier transformer-based models: OpenAI detailed how they trained a classifier to sample higher-quality documents from Common Crawl (Brown et al. 2020, Appendix A). Using classifier-based methods to augment or filter existing data was common at the time, and the approach evolved into techniques like rejection sampling with grader models (Lambert, RLHF Book, ch. 9). With InstructGPT (Ouyang et al. 2022), which led to GPT-3.5 and ChatGPT, a reward model was trained to predict which outputs human graders would prefer. This reward model was trained starting from a supervised fine-tuned version of GPT-3. Without the capability advances of GPT-3 itself, InstructGPT would likely not have been possible as quickly. With GPT-4, the trend continued (GPT-4 Technical Report 2023): the report describes rule-based reward models, themselves zero-shot GPT-4 classifiers, that reward behavior "such as refusing to generate harmful content or not refusing innocuous requests." Later models like DeepSeek-R1 state that they built reward systems for "mathematical, coding, and logical reasoning domains" (DeepSeek-R1 2025).
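To make the grader idea concrete, here is a minimal sketch of rejection sampling in Python. Everything in it is an illustrative assumption rather than any lab's actual pipeline: `toy_generate` stands in for the previous model, `toy_grade` stands in for a grader or reward model, and the prompts are toy addition problems.

```python
import random
from typing import Callable, List, Tuple

def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, random.Random], str],
    grade: Callable[[str, str], float],
    n_candidates: int = 8,
    min_score: float = 0.5,
    seed: int = 0,
) -> List[Tuple[str, str, float]]:
    """Sample several completions per prompt and keep the best-scoring one,
    but only if the grader's score clears min_score."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, rng) for _ in range(n_candidates)]
        scored = [(grade(prompt, c), c) for c in candidates]
        best_score, best = max(scored)
        if best_score >= min_score:
            kept.append((prompt, best, best_score))
    return kept

# Hypothetical stand-in for "the previous model": answers "a+b",
# but is off by one about half of the time.
def toy_generate(prompt: str, rng: random.Random) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))

# Hypothetical stand-in for a grader model: 1.0 if the sum is right, else 0.0.
def toy_grade(prompt: str, completion: str) -> float:
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(completion) == a + b else 0.0

if __name__ == "__main__":
    for row in rejection_sample(["2+3", "10+7"], toy_generate, toy_grade):
        print(row)
```

Only completions the grader accepts survive, so a noisy generator plus a decent grader yields a cleaner dataset than the generator alone. That is the bootstrapping step in miniature.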

Continuing the trend, the next leap in capabilities was the "o" series of reasoning models and the subsequent unlock of verifiable rewards. Reasoning models built on the observation that "chain-of-thought" reasoning was highly effective but required users to literally write "think step by step" to elicit it. How was this baked into the model? Reasoning text was generated with the help of previous models. From the DeepSeek-R1 paper: "Specifically, we first engage human annotators to convert the reasoning trace into a more natural, human conversational style. The modified data pairs are then used as examples to prompt an LLM to rewrite additional data in a similar style." From that point, using previous models to curate and generate data became ubiquitous and kept increasing in scope and scale, particularly through verifiable rewards. As is now familiar, domains like software engineering and security proved highly amenable to fully verifiable environments, which scaled easily once models became good enough to reliably write tests or inject bugs into code. With each step up in capability, more tasks become solvable in this loop: today's model builds a reliable verifier, and the next model learns to solve the task against it. For example, diagrams with high-quality text or CAD drawings can now have effectively unlimited training data, programmatically generated and verified by current models. As an aside, asking which tasks can have data generated or verifiable environments built with current models is a good indicator of which tasks might become solved next, depending on what the labs focus on.
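The verifier loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any lab builds its environments: the `clamp` task, the string-replacement `inject_bug`, and the three test cases are all hypothetical, and a real environment would run candidate code in a sandbox rather than calling `exec` directly.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical reference solution for a tiny task.
REFERENCE_SRC = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"

# Hypothetical test suite; in the loop above, a current model would write it.
TESTS: List[Tuple[Tuple[int, int, int], int]] = [
    ((5, 0, 10), 5),
    ((-3, 0, 10), 0),
    ((42, 0, 10), 10),
]

def inject_bug(src: str) -> str:
    """Stand-in for a model injecting a bug: swap min and max."""
    return src.replace("max(lo, min(x, hi))", "min(lo, max(x, hi))")

def reward(candidate_src: str) -> float:
    """Verifiable reward: 1.0 if the candidate passes every test, else 0.0."""
    namespace: Dict[str, Any] = {}
    try:
        # Assumption: trusted toy input. Real systems would sandbox this.
        exec(candidate_src, namespace)
        fn = namespace["clamp"]
        passed = all(fn(*args) == expected for args, expected in TESTS)
    except Exception:
        return 0.0  # code that fails to compile or crashes earns nothing
    return 1.0 if passed else 0.0

if __name__ == "__main__":
    buggy = inject_bug(REFERENCE_SRC)
    print("buggy task reward:", reward(buggy))
    print("fixed task reward:", reward(REFERENCE_SRC))
```

The reward needs no human in the loop: a task is the buggy source, and a policy is rewarded only when its patch makes the tests pass. Once a model can write the tests and the bugs, tasks of this shape can be produced programmatically.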

Industry Growth. Bootstrapping was mainly about the growth of data, but expansion has happened everywhere else as well. The bottlenecks to better models are data, compute (often spent generating data, as the previous section argued), and the underappreciated work of reliability and bug fixing. Even if the supply of compute does not grow as quickly as before, other improvements effectively function as increases in capacity. Compute efficiency is improving at every layer: silicon, racks, data centers, kernels, and inference. So not only is supply still growing, we are also getting more out of every FLOP. Next, generating the data that creates the reward signal is an embarrassingly parallel task: you can run many task rollouts at once. Parallelism coupled with efficiency still yields progress, even with quality held fixed. Finally, there are several other sources to consider. More talent than ever can contribute, both inside and outside the frontier labs, by trying different ideas, architectures, and harnesses. Beyond the labs, a cottage industry supports frontier model training by creating environments and open source tools, and the millions upon millions of users feed a data flywheel.
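As a rough illustration of why rollouts parallelize so easily, here is a small Python sketch. The `rollout` function is a hypothetical stand-in (a seeded coin flip with an assumed 30% success rate), not a real environment. The point is only that each rollout depends on nothing but its own task, so they can be farmed out to as many workers as are available.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List

def rollout(task_id: int) -> float:
    """Hypothetical rollout: attempt one task and return a 0/1 reward.
    Seeding by task_id makes each rollout independent and reproducible."""
    rng = random.Random(task_id)
    return 1.0 if rng.random() < 0.3 else 0.0  # assumed 30% success rate

def collect_rewards(task_ids: List[int], max_workers: int = 8) -> List[float]:
    """Run rollouts concurrently; results come back in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rollout, task_ids))

if __name__ == "__main__":
    rewards = collect_rewards(list(range(100)))
    print(f"{len(rewards)} rollouts, mean reward {sum(rewards) / len(rewards):.2f}")
```

Threads fit the case where a rollout mostly waits on something else, such as an inference server; CPU-bound rollouts would use processes or separate machines instead. Either way, no rollout waits on another, so doubling the workers roughly doubles the reward signal collected per hour.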

AI Progress is on a Technical Trajectory. The idea that progress would slow down, in part because horizon lengths were increasing, missed the many other directions and new bottlenecks that could be improved. Previous models contributing to later ones means data keeps scaling along an ever-growing number of dimensions, and industry growth means that compute keeps climbing along axes other than raw FLOPs. Given all this, claiming that AI would slow down within a year or two of becoming mainstream is like thinking that silicon had hit a wall in the 90s, or that battery technology had hit a wall in the 2000s. People called the end of Moore's Law for decades, but engineers kept finding new axes: process nodes, multi-core, and specialized accelerators. Every major technology goes through this kind of march of improvement, and it takes running into the laws of physics to stop it. As more people think about the problem and more capital is invested, the pace usually picks up. A few years ago, we were only just beginning the technical trajectory. Now that we are on it, it will keep going.