Reddit to charge for API access over AI training concerns

Social news aggregation and discussion website Reddit will begin charging companies for access to its API.

Reddit says the decision stems from concerns about companies using the API to train large language models (LLMs).

The company says that its pricing will be divided into tiers to support companies of different sizes, with different usage limits and broader usage rights offered at each tier. However, the exact pricing details have not yet been disclosed.

Reddit’s data has long been recognised as a valuable resource for companies looking to train AI chatbots.

“The Reddit corpus of data is really valuable,” Steve Huffman, co-founder and CEO of Reddit, told The New York Times. “But we don’t need to give all of that value to some of the largest companies in the world for free.”

Reddit’s move comes at a time when AI has gone from niche to big business seemingly overnight, and there are rumours that Reddit is looking to go public later this year.

By introducing this new and potentially lucrative revenue stream, Reddit could strengthen its position ahead of an IPO.

Reddit is not the only online source of information used to train LLMs. Web-crawl datasets such as Common Crawl, which scrapes billions of web pages each month, are also used to train chatbots.

Common Crawl and related services trade in raw data, meaning large pools of information sitting online, whereas Reddit consists of conversations between humans. For an AI to be well-rounded, combining factual accuracy with human-like behaviour, it requires access to both types of data.

An independent analysis by Andy Baio and Simon Willison of 12 million of the 2.3 billion images used to train the text-to-image model Stable Diffusion found that it was trained using images sourced from Common Crawl.

“Unsurprisingly, a large number came from stock image sites. 123RF was the biggest with 497k, 171k images came from Adobe Stock’s CDN at ftcdn.net, 117k from PhotoShelter, 35k images from Dreamstime, 23k from iStockPhoto, 22k from Depositphotos, 22k from Unsplash, 15k from Getty Images, 10k from VectorStock, and 10k from Shutterstock, among many others,” wrote the researchers.
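A per-site tally of this kind can be approximated by grouping image URLs by domain. The Python sketch below is illustrative only and is not the researchers' actual method: the function name `count_by_domain` and the sample URLs are hypothetical, and treating the last two labels of the hostname as the site's domain is a simplifying assumption that would misgroup suffixes such as `.co.uk`.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_domain(urls):
    """Tally image URLs by a naive two-label domain (e.g. 't3.ftcdn.net' -> 'ftcdn.net')."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        # Assumption: the last two labels identify the site. This is a
        # simplification and is wrong for suffixes like 'co.uk'.
        domain = ".".join(host.split(".")[-2:])
        counts[domain] += 1
    return counts


# Hypothetical sample URLs, made up for illustration only.
sample_urls = [
    "https://www.123rf.com/photo_1.jpg",
    "https://t3.ftcdn.net/jpg/01/a.jpg",
    "https://www.123rf.com/photo_2.jpg",
    "https://images.unsplash.com/photo-3",
]

counts = count_by_domain(sample_urls)
for domain, n in counts.most_common():
    print(domain, n)
```

Run over a real list of training-image URLs, the same loop would produce a ranked list of source sites like the one quoted above.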

According to the analysis, many images scraped by Common Crawl are from sites with large amounts of user-generated content. Earlier this year, stock image service Getty Images sued Stable Diffusion creator Stability AI over alleged copyright infringement.

Aside from training AI chatbots, Reddit’s API is also used to create and maintain content moderation tools.

Rather than charging content moderators for API access, Reddit is creating dedicated moderation tools in the form of iOS and Android apps. The apps will feature a mod log, rules management tools, mod queue information and more.