Already A.I. is running out of data. If the bigger the dataset, the smarter the model, what happens when A.I. platforms lack enough original content to train on?
The possibility that A.I. platforms could face a shortage of new content should be of greatest concern to search engines and automated content platforms.
One way to solve this possible shortfall is for A.I. providers to pay a fee to content providers for the materials used to train large language models (L.L.M.). We will explain how this might work and why it might be necessary, but first, some clarity of terms and intent.
This piece is not about whether artificial intelligence (A.I.) is going to replace humans or destroy the earth. The possibility of A.I. supremacy keeps many people awake at their desks, and although we do find the topic interesting, it is not what we are discussing here. To be clear, this is also not about the current or future capabilities of A.I., a subject on which I am not an expert.
Instead, this piece focuses on the impact of A.I. on the business model of digital publishers. By publishing, we mean the quality content that A.I. seeks to train its L.L.M. on; this content might be found in news reporting, academic papers, current or historical documentation, video, opinion pieces, blog posts, online discussions, and so on.
The solutions put forward here are not (only) to protect the livelihoods of humans who create these materials, but also to ensure the livelihood of A.I. itself. After all, if an A.I. service provides out-of-date or diluted content, it won’t be useful to the consumer.
Finally, the category of “A.I.” we are discussing here is the general public’s consumption through querying on an A.I.-powered search browser or provider such as ChatGPT, Claude, or Gemini, or by engaging with sites that are built and maintained by A.I. and supplied entirely by A.I.-generated content.
Model Collapse
The possibility of data shortage affects both A.I. providers and the general public, but left unsolved it’s an existential crisis for A.I.
Currently, A.I. platforms rely on available information, scraped from databases of digital or digitized content (books, newspapers, research papers, videos, postings, unsecured personal exchanges, and so forth).
After vacuuming up millennia of digitally documented, human-generated content, the major A.I. platforms are now able to furnish fully automated websites or provide an instant response to whatever a user prompts.
But what happens if people stop producing new content? Within a very short period of time, A.I. would be out of date.
Why Would Humans Stop Producing Content?
How can humans or publishing outlets afford to produce quality content in the quantities they currently do if they get no recognition or compensation in return?
When a user poses a question on one of the major A.I.-powered search engines, for instance, chances are that a text summary will appear at the top of the page, supplying enough information that the user need not search further. This instant response is the result of an A.I.-driven search tool having “read” through all the available information, summarized it, and possibly, in the future, even concluded or extended it beyond the initial content (a goal of artificial general intelligence, or A.G.I.). Professional researchers, journalists, and specialists will hopefully go deeper to visit sources, check for accuracy, and so on, but for the average user, there may be little reason or appetite to click through for more information. Those average users represent the high percentage of viewers who will never visit the sources where A.I. found its information.
This leaves the original content provider all but invisible. No one bought the book, read the newspaper, hired the expert, liked the opinion piece, watched the video, listened to the podcast, interacted with the blogger. It’s as if the source creators don’t exist. And if we continue on this path, they won’t.
We’ve faced intellectual property infringement issues with each new technology capable of reproducing original material. In the U.S., the current protection is copyright law, but that barrier was already blown through when L.L.M. were trained on historical data largely without authorization. It’s unrealistic to think that most content producers can afford to fight endless lawsuits against this massive assault on their businesses or livelihoods.
To get a high volume of quality reportage, research, expertise, or subject-specific documentation, A.I. platforms will need to negotiate some sort of payment system to keep their original sources alive or risk cutting off the source itself.
For instance, there’s a local digital news source for my community that raises revenue to pay journalists and production costs through advertising; there is no paid subscriber model. If A.I. scrapes the news site’s daily feed, there will be no need for readers to visit the site directly to find out what happened in their community. Even if some long-time readers still browse it, the site is unlikely to gain new followers when A.I. delivers enough of the news directly in response to a search query or lifts it entirely for an automated news site. If the original news source loses enough readership, it will struggle to get or keep advertisers, and by extension, revenue. Its journalists will not be paid, nor will they gain exposure through their bylines to translate into another job in the same industry or a tangential freelance career. When this journal and other for-profit digital news sources stop publishing because they are bankrupt, where will A.I. find its news?
Another example is experts who engage in content marketing or thought leadership. Let’s take a video creator or blogger who offers advice on a specific area, for instance, home repairs or finance. Many of these business people post content to generate exposure to their services. If no one clicks through to their videos or reads their blog, why would they continue to market their services in this manner? And in turn, if the experts don’t post the video or write the blog, where will A.I. find new content to train on?
Starving the Source
A.I. platforms have built their databases from existing resources, without payment, credit, or consent from the original sources.
Recently, search engine responses have started to credit one or two sources (which also adds credibility to the search engine’s summary). This practice is not consistent, comprehensive, or even always accurate, but even if it gets better, the average user is not likely to go deeper to read the original source material, nor even notice that there are sources listed at the bottom of the content they were seeking.
We’ve been fed an efficient system of search, receive, move on. If we can get enough content without engaging with the source directly (visiting the site, buying the book, talking to the expert), then the original provider will collect no income. Neither from the A.I. platforms, nor advertising, nor the consumer directly.
Most humans, however, rely on some form of ongoing income for their survival. Here “income” refers to any payment that motivates a person or institution to create content and make it available to the public. Payment forms can include advertising, paywalls, acknowledgment or credit, product or service sales, speaking or appearance fees, licensing, and so on.
With no source of income, only a fraction of quality content producers will still be able to offer digitally accessible output. Within a year or two of crippling established content providers, A.I.’s results will begin to look very outdated indeed. All the millennia of historical data will not save us from having to capture and document the data of the future.
How can A.I. providers keep the host body alive at least enough to get more quality content to feed their ravenous data appetites?
Can’t A.I. Just Replace Humans?
It could be that A.I. renders human content providers unnecessary and is still able to provide up-to-date output. Some scenarios could be:
- A.I. becomes sentient to the point where it updates its own source content. This may be possible someday, somehow, but not likely in the gap between when quality content providers stop producing content and A.I. data becomes out of date. Instead, it’s more likely that A.I. will scrape other A.I. as a source, leading to inaccurate, diluted results that become useless.
- Publicly available content would still be generated by humans who are sponsored by deep-pocketed individuals. This requires a lot of beneficent benefactors who agree to uphold rigorous standards of quality and accuracy, and who will employ enough humans to comprehensively cover all of humanity’s output. Count me among the skeptics that this would be enough to replace the old-fashioned system of humans, across the globe, creating original content for some form of payment.
- A portion of the world goes back to valuing analog content, where data cannot be so easily captured digitally. A.I. “goes away” as a credible source. This seems unlikely to happen at any material scale, given a large volume of consumers are already trained to expect immediate automated results.
- A.I. providers don’t care what our standards are and serve up slop. Left with no other choice, we not only accept whatever A.I. feeds us, but we also lose all perspective on what quality content could or should be. To some degree we see this already. MP3 is a lousy standard for audio fidelity, but initially it was convenient and cheap to store. Spotify creates human-free “music” using A.I. and serves it to a passive, uncritical audience. Over time, what might have seemed like a limitation of the technology actually starts to shape preference.
Barring these outcomes, there is another solution that could at least bridge the gap until or in case A.I. does indeed run out of new material.
A Digital Key for Micro and Macro Payments
The good news is that we have an army of humans creating quality content across a diverse range of subjects and formats already. Output is at human speeds, but it’s better than no new original content at all. These workers just need to earn a living to keep going.
Currently if content is paid for at all, it’s usually in the form of a paywall or the consumer being subjected to advertising. Now it is the A.I. providers that are “consuming,” so they should pay for the content.
Content providers could place a digital token within their content that sets the terms and fees of their usage rights. If fees and terms are accepted, the digital key unlocks the content for usage.
Content providers fall into at least two distinct categories according to how and when A.I. platforms might use their content: the quotidian and the one-offs.
Quotidian hits. Providers that constantly post new material (say, a newspaper or Wikipedia) will likely be scraped by A.I. continually. These visits would be so regular that the A.I. could pay a smaller per-visit fee in exchange for the high volume of usage. Repeat usage of content would work much the way licensing fees, royalties, or residuals do today.
One-off hits. In cases where a content provider publishes a one-time creation, say a book, a white paper, or a life’s work in a single publication, the A.I. platform would need to visit only once, since the content is largely static and not regularly updated. In these cases, the A.I. provider would need to pay more for its single use. How much to charge can be set by the provider. The A.I. platform can either agree to pay and abide by the terms (in which case the digital key would unlock) or the content simply stays locked against A.I. training.
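No such licensing standard exists today, but the token-and-key idea above can be sketched in a few lines of code. Everything here is an assumption for illustration: the token format, the field names, and the budget logic are hypothetical, not an existing protocol.

```python
# Hypothetical sketch of the content-licensing token described above.
# A provider embeds machine-readable terms in its content; an A.I. crawler
# "unlocks" the content only by agreeing to those terms and owing the fee.
import json


def make_token(provider: str, mode: str, fee_per_use: float) -> str:
    """Terms the provider embeds in their content (format is invented)."""
    return json.dumps({
        "provider": provider,
        "mode": mode,              # "quotidian" (many small fees) or "one-off"
        "fee_per_use": fee_per_use,
    })


def request_access(token_json: str, budget_per_use: float) -> tuple[bool, float]:
    """Crawler accepts the terms only if the fee fits its per-use budget.

    Returns (unlocked, fee_owed). If the terms are rejected, the content
    simply stays locked against A.I. training.
    """
    terms = json.loads(token_json)
    fee = terms["fee_per_use"]
    if budget_per_use >= fee:
        return True, fee   # key unlocks: crawler may ingest and owes the fee
    return False, 0.0      # terms rejected: no access, no payment


# A quotidian source charges a small per-visit fee for high-volume scraping;
# a one-off work (a book) demands a larger single-use fee it may not get.
daily_news = make_token("LocalNewsSite", "quotidian", 0.05)
book = make_token("AuthorX", "one-off", 3000.0)

print(request_access(daily_news, budget_per_use=1.0))    # (True, 0.05)
print(request_access(book, budget_per_use=500.0))        # (False, 0.0)
```

The design choice mirrors the two categories in the text: quotidian sources trade low fees for repeat visits (like royalties or residuals), while one-off works price a single ingestion high, since the crawler never needs to return.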
Anthropic recently settled a lawsuit over taking a library of 500,000 books, without payment or permission, to train its large language models. The settlement amounts to about $3,000 a book, which feels too small an amount to compensate authors going forward. After all, the author will be paid only once, at least in digitized form (presumably people will still buy printed books), since there is no reason for the A.I. to return and register another payment if it can simply take the entire contents of the book on the first go-round. On the other hand, $3,000 may be too high a price for one blog post or one recipe. So the size of, and effort involved in, the content will be relevant to the price set by the poster.
Given how capitalism works, the costs incurred by A.I. providers will likely be passed on to the consumer. In fact, ChatGPT, Claude, Gemini, and others already charge for usage; they just aren’t sharing the bounty with the people or institutions that created the parts required to build the product they are selling.
The settlements already being paid for content taken by outfits like OpenAI and Anthropic run into the billions of dollars. These lawsuits are launched by large publishers or mega-media producers with pockets deep enough to push back against the equally deep-pocketed A.I. providers. These entities are also negotiating deals for the future use of their data.
That may be good news for organizations the size of Bertelsmann, The New York Times, or the BBC. But who will pay the independent press or book publisher? The home-improvement video creator, the author of a researched white paper examining the long-term effects of subsidizing farmers, the fashion blogger? Why would an author make a book digitally available if she receives no payment or recognition for her efforts?
It’s not enough for the big entities to negotiate their own deals. We need a deal that covers all the humans or institutions providing worthy content for L.L.M. to train on.
Illustration by Tasha Walters.
Sign up for our free newsletter, find us on Instagram and LinkedIn, and tell us what you think.
Robin D Rusch




