AI Is Running Out Of Things To Read

Llama 2 was trained on one point eight trillion tokens of text. Llama 3, released about a year later, was trained on fifteen trillion. A token is a fragment of a word, a few characters the model treats as a single unit, and a language model learns by predicting the next one across the entire pile. Eight times more text in a single generation. Every frontier model is built this same way, by reading a copy of the internet and learning to continue it. The models keep getting larger, and larger models need more to read. The appetite doubles and redoubles. The supply does not. There is one internet, and it was written once.

The Reservoir

In 2024 a research group called Epoch AI tried to measure the bottom of the well. They put the total stock of human-generated public text worth training on at around three hundred trillion tokens. Then they plotted that number against how fast the labs are consuming it. The two lines cross. Their central estimate for the year the industry reads everything worth reading is 2028, with a window from 2026 to 2032, and sooner if models keep being overtrained, which they are.

Three hundred trillion sounds like a number without an end. It is not. It is every public web page, every digitized book, every forum thread and review and comment that humanity has put online, counted once. That number does not grow at the speed of a data center. It grows at the speed of people writing things down, which has a hard ceiling, because there are only so many people and so many hours in a life. Compute can be manufactured. Power can be generated. Text has to be lived first and written second.

Ilya Sutskever, who helped build the field, named the limit at the end of 2024. He called data "the fossil fuel of AI", told a room of researchers that "we have but one internet," and said the industry had reached peak data. Pre-training as the labs know it will end. The fuel was laid down across the whole history of human writing, and the engines now burn through it faster than it forms.

The Fences

For two decades the web was open by default. Anyone could crawl it, and the labs did. The first generation of large models was trained on a commons nobody thought to charge for. That era is closing, because when a resource becomes scarce the people who hold it build fences.

Reddit made the fence visible first. In early 2024 it signed a deal letting Google train on its archive of human conversation, reported at sixty million dollars a year. Weeks later, in its filing to go public, Reddit disclosed two hundred and three million dollars in data-licensing contracts. OpenAI agreed to pay News Corp more than two hundred and fifty million dollars over five years for the Wall Street Journal and a dozen other mastheads. The New York Times took the other route and sued. License it or litigate it, the message is the same. The text is no longer free.

The deeper consequence is who ends up holding it. The reservoirs of high-value human writing are not spread evenly across the web. They pool inside a small number of platforms. Two decades of Reddit conversation. The news archives. The question-and-answer sites, the code repositories, the video transcripts. Whoever owns those owns the raw input to every model that comes after, and they have learned they can charge rent for it. The feedstock of artificial intelligence is consolidating into the same few hands that already hold the compute and the cloud.

The Closed Loop

If the well runs dry, the obvious move is to make more water. Let the models write the training data for the next models. Synthetic data is the industry's favored escape hatch, and for narrow, checkable tasks it works.

At scale it breaks down. In 2024 a team of researchers documented the effect in Nature and named it model collapse. Train a model on the output of another model, then train a third on the second, and the copies degrade. The rare cases go first. The tails of the distribution thin out, and each generation drifts toward a blander, narrower middle until the model, in the paper's words, begins to "mis-perceive reality." A photocopy of a photocopy. The detail that made the original worth learning from is the exact detail that disappears.

This is not a distant risk. The web is already filling with generated text, so the next crawl pulls in a mixture of human and machine writing that nobody can fully tell apart. The researchers describe a first-mover advantage that should unsettle everyone still building. The snapshot of the internet captured before the flood of AI content is the last clean one. Every copy after it is contaminated by the previous model's output. The most valuable text in the world is now the text a human wrote before 2023, and there will never be more of it.

The boom runs on three finite inputs. The electricity to power the data centers. The chips made in a single place. And now the data, which is the strangest of the three, because it cannot be poured in Arizona or added to the grid. The reservoir of human thought was filled once, by people, across the entire history of writing, and it is being read to the bottom inside a single decade. The companies racing to the frontier understand this. It is why they are signing the licenses, buying the archives, and fencing the commons that raised them. The output of every model traces back to something a person wrote down, and that was always the quiet dependency underneath the machine.