Apple Accused of Training AI From Stolen YouTube Videos — But Is It True?

Controversy erupted this week when an investigation revealed that several tech giants had used thousands of YouTube videos to train their AI models without permission from the creators. For its part, Apple has issued a statement denying the allegation, saying the videos were used only for an experimental research model that has nothing to do with Apple Intelligence.
The investigation by Proof News revealed that subtitles from 173,536 YouTube videos were “siphoned from more than 48,000 channels” and “used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.”
The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did “The Late Show With Stephen Colbert,” “Last Week Tonight With John Oliver,” and “Jimmy Kimmel Live.”
The YouTube data was harvested and made available to these companies through a third party, a non-profit AI research lab known as EleutherAI. This data was part of a dataset called “the Pile,” which also included data from the European Parliament, English Wikipedia, and even “a trove of Enron Corporation employees’ emails that was released as part of a federal investigation into the firm.”
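For context on what the Pile actually looks like under the hood, here's a minimal sketch of how a researcher might tally its components. It assumes a locally downloaded Pile shard in the dataset's standard JSONL format, where each record carries a `pile_set_name` label; the shard filename is illustrative, and the Pile's original download mirrors have since been taken down.

```python
import json
from collections import Counter

# Each line of a Pile shard (e.g. "00.jsonl") is a JSON record holding the
# document text plus a "meta" field naming the component it came from.
counts = Counter()
with open("00.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1

# Component labels include "YoutubeSubtitles", "Enron Emails", "EuroParl",
# and "Wikipedia (en)" -- the same sources named in the Proof News report.
for name, n in counts.most_common():
    print(f"{name}: {n}")
```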
Following the report, Proof News built a lookup tool that creators could use to find out if their data had been misused in this manner. Representatives from Anthropic and Salesforce confirmed their use of the Pile to Proof News, while Apple, Databricks, and Bloomberg didn’t respond to requests for comment.
The Proof News report went on to note how Apple and several other companies are listed in research papers that acknowledge their use of the Pile for training AI large language models (LLMs). However, Apple has now issued a statement to several media outlets clarifying that this research has nothing to do with Apple Intelligence or any other commercial AI features that Apple is working on.
The model in question that’s explicitly outlined in the research is OpenELM, which Apple describes in an April 2024 Machine Learning Research paper. However, as the paper notes in its abstract, OpenELM’s goal is to “empower and strengthen the open research community, paving the way for future open research endeavors.”
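OpenELM is genuinely a public research artifact: Apple published the checkpoints on Hugging Face. As a rough sketch of how a researcher might try the smallest variant (this assumes the `transformers` library and access to the gated Llama 2 tokenizer, which Apple's model card pairs with OpenELM):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Apple's OpenELM checkpoints ship custom modeling code, hence
# trust_remote_code=True. The Llama 2 tokenizer is gated behind a
# Hugging Face access request, so this step may require approval.
model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Once upon a time there was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```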

The ethics of training large language models are still murky, but the consensus is that it’s generally acceptable to use publicly available internet resources for purely non-profit research initiatives, and that’s even more true if they’re open research projects that benefit the wider community.
That’s precisely where OpenELM fits in. Apple emphasized in yesterday’s statement that OpenELM does not power Apple Intelligence or any of its other AI or machine learning features — nor was it ever intended to. In fact, Apple has added that it has no plans to build any new versions of the OpenELM model. What’s there already offers enough benefit to the open machine learning research community and will continue to do so for the foreseeable future.
Although some artists have expressed serious concerns about Apple’s lack of transparency in how it trains its AI models, the company maintains that it only does so ethically, using “licensed data” that it pays for, along with publicly available data collected by its web crawler.
“We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control,” Apple says.
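In practice, that “data usage control” is a robots.txt rule. Here's a sketch of what a publisher's opt-out might look like, using Apple's documented Applebot-Extended user agent, which lets a site block generative AI training use while still allowing Applebot to crawl for Siri and Spotlight search:

```
# robots.txt
# Let Applebot index the site for Apple's search features...
User-agent: Applebot
Allow: /

# ...but opt out of Apple's generative AI training. Applebot-Extended
# doesn't crawl separately; it's checked to decide how crawled data
# may be used. (Example rules only; adjust paths to taste.)
User-agent: Applebot-Extended
Disallow: /
```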
Unfortunately, that still leaves a lot of grey area, since even YouTube videos are technically “publicly available.” However, the Pile clearly wasn’t involved in training Apple Intelligence, and it’s safe to assume that Apple’s AI training teams know enough to steer clear of such a blatantly obvious library of copyrighted material.

Apple has reportedly signed licensing deals with stock image services such as Shutterstock and Photobucket, which are likely used to train its image generation AI and make up some of the “licensed data” it refers to. A report last year also revealed the company was spending up to $50 million to train its AI on news archives, in deals with high-profile publishers like Condé Nast and NBC News.
This certainly suggests that the multi-trillion-dollar company has no problem paying for its training sources, and it’s likely doing its best to avoid as many questionable areas as possible. Unfortunately, ownership and copyright on the public web can be tricky. With so many smaller content creators out there, Apple’s web crawler may still be hoovering up data that it shouldn’t. Apple provides a way for sites to opt out of this collection, but it’s unclear what happens to any data it may have collected before a site opts out.
I’m certainly willing to give Apple the benefit of the doubt, but I also understand the concerns of the creative community. Hopefully, as the debut of Apple Intelligence approaches this fall, Apple will find ways to be more transparent about its training models and quell any fears that it’s using unethically sourced materials to train its AI.