The AI race among tech giants is getting worse with each passing day, and Nvidia’s AI data scraping of YouTube is the latest example of the unfair practices that have come to light. A majority of the tech giants are battling to be the first to reach their vague concept of an AI-powered future, and they’re willing to do anything to get there, even if it means engaging in underhanded practices. Over the last few months, many companies have been crawling and scraping websites to gather data to train their AI, and there are currently no explicit laws stopping them from doing just that.
No one is surprised to hear that Nvidia is stealing video data, but the scale of the data scraping is staggering. According to 404 Media, the outlet that first broke the news, data obtained from YouTube is being used to power Nvidia’s AI video training efforts and to support AI products like the Cosmos deep learning model, the Omniverse 3D world generator, and its self-driving car systems.
Nvidia is better known for the chips that meet the demands of AI than for its AI services, but that does not appear to be enough for the company. Despite being at the forefront of the AI chip race with its GPUs, Nvidia is looking at selling AI foundation models to others who may pay big money to build on them.

Image: NVIDIA Omniverse
Nvidia Caught Data Scraping for AI Training, According to Leaked Documents
The open and free internet has been treated as a free-for-all buffet, and most tech giants have been using any data they can find to put their AI offerings a step above the competition. YouTube has one of the biggest reserves of user-created content, covering the entire spectrum of knowledge and human experience, and the best part? These videos are free.
Statista reports that over 500 hours of video are uploaded to YouTube every minute. Despite the company’s strategy to fight ad blockers and push users toward a Premium account, it remains the prime video-sharing service for creators of long-form content. It has kept its content easily accessible, and it appears tech companies can’t help but take advantage of that.
404 Media gained access to leaked documents and internal messages from within Nvidia, and they indicate that Nvidia was aware that YouTube would not have approved of what it was doing. According to the documents, Nvidia went to great lengths to conceal its activity from YouTube by using “virtual machines” to refresh its IP addresses and prevent YouTube from being alerted to any unusual activity.
No Content Is Off-Limits For Nvidia’s Video Training and Other AI Plans
The company has claimed its practices are in “full compliance with the letter and spirit of copyright law,” but it hasn’t asked YouTube or its users whether they agree with this assessment. Employees who questioned the decision were told to set their concerns aside on the grounds that these were executive decisions. The employees working on these projects were told they had clearance from their higher-ups to use the content.
The leaked material also revealed that Nvidia had knowingly used a YouTube dataset meant for academic research to train its commercial models. These academic datasets are designed to serve a specific purpose and help other researchers build on the work available. The HD-VG-130M dataset referenced in the 404 Media report has a usage license stating that it is meant for academic use only and cannot be copied or duplicated for commercial purposes, but that has not stopped it from being used for other goals.
Now that it’s leaked that Nvidia execs approved YouTube / Netflix scraping for AI training, here’s who makes training data decisions (and who knows about them) at AI companies:
– AI engineers suggest data sources, often steered by tech leadership
– Before the model is released,… https://t.co/MwyeDogXHs— Ed Newton-Rex (@ednewtonrex) August 5, 2024
Nvidia scraped videos from Youtube and academic sources to compile training data for an unannounced video LLM
Every time a new model drops, there are a few questions I keep coming back to:
1) What’s the best way to structure multi-modal data for model creation?
2) Where does… pic.twitter.com/orlIHWrQjb— Daniel Chesley (@DanielChesley) August 6, 2024
It is also evident that YouTube wasn’t the only video platform being scraped: Nvidia’s AI data scandal extends to platforms like Netflix as well. The leaked chats show earlier conversations about using movies because they were seen as a good source of data, along with gaming footage valued for its 3D consistency. There is also some talk of Nvidia capturing video data by having someone watch movies and screen-record what they watch. It is unclear from the chats obtained by 404 Media whether this idea was ever put into practice.
Reddit Takes a Stand Against Data Scraping
Recently, Reddit began actively blocking crawlers from scraping its website for data to train AI. This is a daring move for the company because it hurts the chances of Reddit content ranking well in search results. Still, its deal with Google keeps the company at the top of Google’s search results at least, so Reddit isn’t all that concerned about losing out on some free views.
The content on Reddit is still being used to train Google’s AI, so users don’t get a say in whether their content is monetized this way. Only the platform will be paid by the search engine company. This has reawakened the conversation around Google’s monopoly of the search industry, but that is a conversation for another day.
For now, most of the top websites are actively blocking crawlers from OpenAI and other AI companies from accessing their pages, but there are only so many checks they can set up against the mythical Hydra-like entity that is AI.
“Nvidia scraped YouTube en masse, had clearance from highest levels of the company.”
OpenAI, MidJourney, Stable Diffusion, Dall-E, Sudowrite… they’re all guilty of mass theft. May this be the card that brings the whole house all down. https://t.co/vm107NtjEH
— Rick the rabble-rousing Luddite, 🚫AI (@dreamosaurus) August 5, 2024
Nvidia Is Scraping YouTube Freely, But It Isn’t the Only One
The Nvidia AI data scraping report may be worrying, but the widespread normalization of companies freely using data from the internet to train their AI is a bigger cause for concern. AI laws worldwide have not caught up to the trends just yet, which means there are no real restrictions on Nvidia scraping YouTube or any other platform. Companies only have to step away and claim fair use after the fact. By the time more detailed laws come in, Nvidia and other such companies will have profited from the open internet so much that these regulations will only inhibit future competition.
Google getting scraped by Nvidia is a shitty corporate move against another shitty corporate (consumer) lock in platform.
Creators or consumers aren’t benefiting from Nvidia scraping YouTube. There’s no archiving or distribution happening, just data for training a product.
— Not a Data Scientist (@isthisdata) August 6, 2024
An earlier investigation by Proof News found that the likes of Apple, Nvidia, and Salesforce had been using subtitles taken from YouTube videos for their own purposes. Videos from creators like MKBHD, MrBeast, and even John Oliver were used this way, but they only made up the top of a list of more than 48,000 channels that had been misused.
cool cool cool cool cool cool now leaked NVIDIA slack messages discussing which YouTube channels to scrape videos from. MKBHD videos? Yeah grab those too. https://t.co/0XczvTNVBH
— Marques Brownlee (@MKBHD) August 5, 2024
Some media houses have signed deals to license their content to AI companies, but not every company or individual has been as willing to give in. The New York Times has been locked in a legal battle against OpenAI and Microsoft for a while now. The newspaper is leading the cause for many media houses, and creative professionals from writers to artists have also taken a stand against exploitative, unlicensed AI training.
More recently, YouTuber David Millette filed a class action lawsuit against OpenAI. The lawsuit claims that the AI leader transcribed his and other creators’ videos to train its models, profiting significantly from their work and violating copyright laws in the process. The YouTuber isn’t the first to have taken a stand against OpenAI, and he likely won’t be the last.
Still, considering the scale of Nvidia’s AI data scandal and the number of AI models that have flooded the internet over the last two years, it appears there is little creators can do to maintain the rights over their content.