The economy and ethics of AI training data

Matt Levin Jan 31, 2024

By publishing something on the internet without explicitly telling other computers to avoid it, you're consenting to its use by AI, says Common Crawl's Rich Skrenta. Outflow Designs/Getty Images

Maybe the only industry hotter than artificial intelligence right now? AI litigation.

Just a sampling: Writer Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both The New York Times and The Authors Guild have filed separate lawsuits against OpenAI and Microsoft.

At the heart of these cases is the allegation that tech companies illegally used copyrighted works as part of their AI training data.

For text-focused generative AI, there’s a good chance that some of that training data originated from one massive archive: Common Crawl.

“Common Crawl is the copy of the internet. It’s a 17-year archive of the internet. We make this freely available to researchers, academics and companies,” said Rich Skrenta, who heads the nonprofit Common Crawl Foundation.

Since 2007, Common Crawl has saved 250 billion webpages, all in downloadable data files. Until recently some of its biggest users were academics, exploring topics like online hate speech and government censorship.

But now there’s another power user.

“I’ve been told by researchers that LLMs would not exist if it were not for Common Crawl,” said Skrenta.

LLM stands for large language model, essentially the kind of algorithm behind AI products like ChatGPT.

LLMs need to ingest huge chunks of text to learn the rhythm and structure of language, so they can write a convincing term paper or convincingly human-sounding wedding vows.

OpenAI, Google and Meta all used versions of Common Crawl in their early AI research.

Unless your 2009 “Glee” fan fiction blog is paywalled, or has code telling Common Crawl to avert its eyes, it’s pretty likely to be in Common Crawl, although there’s no easy way to look that up.

After ChatGPT came out, Skrenta says the number of websites that have blocked Common Crawl from archiving their material has doubled. And there’s been a big jump in requests to be removed from the existing archive.
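The "code telling Common Crawl to avert its eyes" is typically a rule in a site's robots.txt file. The sketch below is a rough illustration rather than anything from Common Crawl itself: it uses Python's standard-library robots.txt parser to check a made-up rule set. "CCBot" is the user agent Common Crawl's crawler announces; the example.com URL, the "SomeOtherBot" name and the `is_allowed` helper are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. It asks Common Crawl's
# crawler (user agent "CCBot") to stay away and lets every other robot in.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative page URL; not a real blog.
page = "https://example.com/glee-fan-fiction"

ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)
print("CCBot allowed:", ccbot_allowed)
print("Other robot allowed:", other_allowed)
```

Under these sample rules, the check turns CCBot away and lets the other robot through. A robots.txt file is only a request, so it works when the crawler chooses to honor it, and it does nothing about pages already sitting in an existing archive.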

Skrenta says by publishing something on the internet without explicitly telling robots to avoid it, you’re consenting to its use by AI.

“You posted your information on the internet, intentionally so that people could come and see it. And robots are people too,” said Skrenta.

Common Crawl isn’t the only text used to train AI. Researcher Luca Soldaini at the nonprofit Allen Institute for AI says we used to know a lot more about what training data tech companies used.

But that was before OpenAI got a $100 billion valuation.

“It’s not in their interest to tell us what’s in there, both from a competitive advantage, a legal point of view,” said Soldaini.

Most of the major AI companies allow web publishers to opt out of future AI training data. But Soldaini says if companies were forced to retrain their current AI models without any material a user wants taken out, it would be incredibly costly and time-consuming.

And without all that copyrighted work to learn from, the AI might just stink.

Tech companies say taking copyrighted material to train AI is legally “fair use” — AI systems should be able to read and learn from the internet, just like humans do.

But beyond the legal debate, there’s also an ethical one.

“Every single creator among us has grown up with the full knowledge and the full acceptance that when we create, when we put that out in the world, people will learn from that,” said Ed Newton-Rex, the founder of the nonprofit startup Fairly Trained. “We did not come into the game expecting large corporations to scrape that, train on it, create these scalable systems. None of this is part of the social contract.”

Fairly Trained certifies AI systems that only use training data licensed or approved by its human creators.

Newton-Rex hopes the certification will allow consumers to decide which AI systems reflect their values, kinda like a fair trade sticker for robots.

“I don’t think people realize that when they use something like ChatGPT they are using a model trained in this way, trained on lots of people’s output without their consent, often without their knowledge and without compensation,” Newton-Rex said.
