For data-hungry tech companies, YouTube is a gold mine
Companies competing in the chatbot wars have consistently turned to something known in the industry as “the Pile” to train their large language models.
The Pile is a massive trove of open-source data — about 800 gigabytes worth — that’s made up of several smaller datasets. It includes text scraped from all around the internet: Wikipedia, the European Parliament’s website, even a collection of emails from employees at the now-defunct energy company Enron.
Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” As Gilbertson found in her reporting, text from more than 170,000 videos is now being used by Silicon Valley heavyweights like Apple, Nvidia and Anthropic to train their LLMs.
Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.
The following is an edited transcript of their conversation.
Annie Gilbertson: So, we looked at this data set called YouTube Subtitles, and we knew it had been used to train AI, but nobody had cracked it open and really reported on what was in there. And what we found was more than 170,000 YouTube videos had been swiped to train these artificial intelligence models. That includes channels labeled as education, Khan Academy, MIT, Harvard, and also news publishers, Wall Street Journal, NPR, BBC, entertainers, and then, of course, some of YouTube’s biggest stars, Marques Brownlee, Mr. Beast, PewDiePie, for example.
Lily Jamali: How did you and your team go about figuring out whether companies were using YouTube and how they were using it?
Gilbertson: So, tech companies are just very secretive about the type of training data that they’re using. Experts I’ve talked to said that this is for many reasons. One, it’s a competitive advantage. The better training data can produce better models. It’s an extremely competitive space right now, an arms race in AI model building. Another reason is that they could get in trouble. We’ve seen lawsuits against companies once the sources of their data have been exposed, and the creators who may own the copyright are not happy that their data has been taken. So, what we have here is often a black box, where the public does not know what data is being used to train artificial intelligence models.
And so, we wanted to connect the YouTube Subtitles dataset to specific tech companies. And we were able to do that because we started looking at any mentions of YouTube Subtitles, or the broader compilation of data called the Pile, which was put out by Eleuther AI, a nonprofit organization. And the Pile includes a bunch of stuff, and there are several references to the Pile or YouTube Subtitles in white papers or in posts on GitHub, for example, where companies mentioned using it, and that’s how we were able to connect it to the actual tech companies. And we found some really big names.
We found Apple is one, Anthropic, Bloomberg, Salesforce. These are all companies that we found that had been using either the Pile, the larger data set, or the YouTube Subtitles subset of data in there. We reached out to all these companies, of course, for their comment, and Salesforce told me that they had used the data for research and then ended up releasing the model open source. Anthropic also confirmed use of it, but they denied wrongdoing, and everyone else either declined to comment or did not respond to my requests.
Jamali: And you interviewed some YouTubers who had their content included in the data set. One of them is Abigail Thorn. She is an actor and the creator of the channel Philosophy Tube. Here’s what she said about how she feels:
Abigail Thorn: To have someone take my writing and use it for profit – these companies are trying to get people to invest in their product, trying to get people to pay to use their product – and they have taken something that I put a little bit of my heart and soul into and they are selling it. It makes me feel violated.
Jamali: Violated. That’s a really strong word, and I wonder what you make of that reaction?
Gilbertson: Abigail was not alone in feeling violated. A number of the individual YouTube creators that I spoke to said that it felt extremely disrespectful, that somehow they were small enough players that you could just take something and not expect any kind of retaliation or reaction or the resources to maybe challenge the actions by these big tech companies, that it just made them feel small and unimportant. And these YouTubers put their lives into their projects. These are labors of love. So many of them spend 80 hours a week making their show. They make it extremely personal. They put a lot of their heart and soul into it. And this idea that it would be gobbled up by AI without their consent or without any sort of notification was really disturbing. None of the YouTubers I spoke to said that they had been asked if it was okay that their content be used.
Jamali: Did anyone say they felt, I don’t know, positive or at least neutral about it?
Gilbertson: I think some people were maybe a little bit neutral in the sense that one of the YouTubers I talked to, who writes a history show, said, “Oh, well, this just isn’t surprising at all, this is capitalism up to its old tricks, they’re taking my labor and using it for profit, and I get nothing.” So, it was interesting talking to YouTubers with very specific backgrounds and lenses through which they see the world. And they kind of brought that to understanding what happened here. Some of them have used AI products to help their creative process, helping with research or organization. And they’re not opposed to integrating some of this technology. Some of them weren’t even opposed to sharing their data, they just felt like they should be told and given the option and maybe be compensated for their material.
The other thing I’ll say about this is that, for some of them, it was insult to injury, because these AI models may end up coming to compete for audiences with these YouTubers. We’ve seen what the video generating models can put out there now. I mean, what is that going to look like in five and 10 years? And how will that change the jobs of these creators? There’s already been some litigation, not with YouTube, but with authors who have had their work taken to train AI, and the language that the plaintiff’s attorneys use is something like, “you’re having them pave the way to their own destruction.” You’re using their creative work to train models that may then supplant their ability to work.
Jamali: This has obviously become a bit of a theme lately, big tech using online speech to train AI. Last fall, we learned that Meta is using Facebook and Instagram posts to train their AI model. OpenAI struck a deal with Reddit — some people might remember this from a few months ago — so posts on forums there can be scraped for ChatGPT. How do your findings fit into this pattern?
Gilbertson: Yeah, I think it’s really interesting that what’s clearly happening is some publishers are being paid for their work, and some are not. Certainly, when it comes to news where you and I both work in journalism, we’re seeing that some publishers are striking deals with AI companies and embracing it, and some are taking them to court, and some remain on the sidelines and undecided. And there kind of isn’t like a way to stay entirely neutral, because even if you’re not striking deals, chances are that it’s increasingly likely that your work is being vacuumed up by AI. For example, just before I got on the call with you, Lily, I looked up Marketplace in the YouTube Subtitles dataset and indeed, one of their videos is in the YouTube Subtitles training dataset.
Jamali: Oh. I’ll send my invoice over.
Gilbertson: So is NPR, so is BBC. You used to work at KQED, the Bay Area public radio station. They’re in there. So clearly a lot of news media is already being swept up and used to train AI. And the question is, should they be compensated?
Jamali: It sounds like there’s a lot at stake here.
Gilbertson: Yeah, I think it’s kind of changing the table stakes of the internet in some ways, right? Where I think some of us were used to being tracked and having our data used to sell us things online, right? Like that’s very much part of the internet, but to have our work and our personal content, whether it’s your post to Instagram or your reviews on Google Maps of certain restaurants, it’s kind of totally different terms. And nobody I’ve spoken to said it’s going to change their patterns on the internet, but certainly it’s something that even just the regular internet user needs to now be aware of, where their pictures and their texts may be being vacuumed up to train AI for these hundreds-of-billions-of-dollar-valued companies, or in some cases trillion-dollar-valued companies.