What training compute does (and doesn't) tell you
Frontier training compute has grown ~4–5× a year and is the clearest driver of AI's recent leaps. It is a hard, auditable number — but it's an input, not a measure of intelligence.
Of all the numbers around frontier AI, training compute is among the most informative and least hyped. It counts the total floating-point operations (FLOP) used to train a model, and at the frontier it has grown roughly 4–5× per year for over a decade. In 2023, GPT-4 became the first model trained at the 1e25 FLOP scale; by 2025 frontier models had crossed 1e26 FLOP, a tenfold jump, and more than thirty models from a dozen developers had passed the GPT-4 threshold.
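To make the growth rate concrete, here is a minimal sketch of the arithmetic behind a 4–5× annual factor. It assumes growth is perfectly constant from year to year, which real training runs are not; the function names are illustrative only.

```python
import math


def doubling_time_months(annual_growth: float) -> float:
    """Months for compute to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(annual_growth)


def cumulative_growth(annual_growth: float, years: int) -> float:
    """Total multiplier after compounding the annual factor for `years` years."""
    return annual_growth ** years


# Assumption: the 4x and 5x bounds quoted in the text, held constant.
for factor in (4.0, 5.0):
    months = doubling_time_months(factor)
    decade = cumulative_growth(factor, 10)
    print(f"{factor:.0f}x per year: doubles every {months:.1f} months, "
          f"{decade:.2e}x over 10 years")
```

Under that assumption, compute doubles about every six months at 4× per year and a little over five months at 5×, which compounds to somewhere between a million-fold and ten-million-fold increase across a decade.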
Why track an input rather than an output? Because compute is the clearest, hardest-to-spin driver behind the capability gains everyone argues about. You can debate what a chatbot "understands," but the size of the training run is an auditable fact, and it is tightly linked to cost: a frontier run now costs hundreds of millions of dollars, which is why only a handful of well-capitalized labs can play.
2025 added a wrinkle: progress stopped coming only from bigger pre-training runs. Reasoning models such as DeepSeek-R1, OpenAI's o-series and Claude's extended thinking buy capability by spending more compute at answer time ("test-time compute") instead of, or on top of, training compute. So the single training-FLOP number now captures less of the story than it did; some of the frontier's recent gains live in how much a model is allowed to think per query, not just in how large its training run was.
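A rough sketch shows why test-time compute escapes the training-FLOP figure. It leans on a common rule of thumb that is not from this article: a dense transformer spends about 2 FLOP per parameter per generated token. The model size and token counts below are hypothetical, chosen only for illustration.

```python
# Rule-of-thumb assumption: ~2 FLOP per parameter per generated token
# for a dense transformer's forward pass. Not a measured value.
FORWARD_FLOP_PER_PARAM_PER_TOKEN = 2


def inference_flop(params: float, tokens_generated: float) -> float:
    """Approximate compute spent generating one answer."""
    return FORWARD_FLOP_PER_PARAM_PER_TOKEN * params * tokens_generated


PARAMS = 1e11            # hypothetical 100-billion-parameter model
SHORT_ANSWER = 500       # hypothetical tokens for a direct reply
LONG_REASONING = 20_000  # hypothetical tokens when the model "thinks" first
TRAINING_FLOP = 1e25     # the GPT-4-scale threshold quoted in the text

short_cost = inference_flop(PARAMS, SHORT_ANSWER)
long_cost = inference_flop(PARAMS, LONG_REASONING)

ratio = long_cost / short_cost
queries_to_match_training = TRAINING_FLOP / long_cost

print(f"Reasoning answer costs {ratio:.0f}x the direct answer")
print(f"{queries_to_match_training:.1e} reasoning queries equal one training run")
```

With these made-up inputs, the same trained model spends 40 times more compute on a long reasoning answer than on a direct one, and none of that difference appears in its training-FLOP total.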
The caveat matters as much as the number. Compute is an input, not intelligence: efficiency gains, better data and new algorithms mean two models with similar compute can differ widely, and a bigger run does not guarantee a better model. We track compute because it is honest and explanatory, then pair it with benchmark scores and investment so no single lens stands alone.