The AI race among tech giants is only getting worse with each passing day, and Nvidia’s AI data scraping of YouTube is the latest example of the unfair practices that have come to light. A majority of the tech giants are battling to be the first to reach their vague concept of an AI-powered future, and they’re willing to do anything to get there, even if it means engaging in underhanded practices. Over the last few months, many companies have been crawling and scraping websites to gather data to train their AI, and there are currently no explicit laws stopping them from doing just that.

No one is surprised to hear that Nvidia is stealing video data, but the scale of the data scraping is quite staggering. According to 404 Media, the news website that first broke the story, data obtained from YouTube is being used to power Nvidia’s AI video training endeavors and to support AI products like the Cosmos deep learning model, the Omniverse 3D world generator, and its self-driving car systems.

Nvidia is known less for its AI services than for the chips that support the demands of AI, but that does not appear to be enough for the company. Despite being at the forefront of the AI chip race with its GPUs, it is looking to sell AI foundation models to others who may pay big money to build on them.

Image: NVIDIA Omniverse

Nvidia Caught Data Scraping for AI Training, According to Leaked Documents

The open and free internet has been treated as a free-for-all buffet, and most tech giants have been using any data they can find to put their AI offerings a step above the competition. YouTube has one of the biggest reserves of user-created content, covering the entire spectrum of knowledge and human experience, and the best part? These videos are free.

Statista reports that over 500 hours of video are uploaded to YouTube every minute. Despite the company’s efforts to fight ad blockers and push users toward a Premium account, it hasn’t lost its identity as the prime video-sharing service for creators of long-form content. It has kept its content easily accessible, and it appears tech companies can’t help but take advantage of that.

404 Media gained access to leaked documents and internal messages from within Nvidia, and they indicate that Nvidia was aware that YouTube would not have approved of what it was doing. According to the documents, Nvidia went to great lengths to conceal its activity from YouTube by using “virtual machines” to refresh its IP address and prevent YouTube from being alerted to any unusual activity.

No Content Is Off-Limits For Nvidia’s Video Training and Other AI Plans

The company has claimed its practices are in “full compliance with the letter and spirit of copyright law,” but it hasn’t asked YouTube or its users whether they agree with this assessment. Employees who questioned the decision were told to set their concerns aside on the grounds that these were executive decisions. Those working on the projects were told they had clearance from their higher-ups to use the content.

The data obtained on the company’s activities also revealed that Nvidia had knowingly used a YouTube dataset meant for academic research for its commercial models. These academic datasets are designed to serve a specific purpose and help other researchers build on the work available. The HD-VG-130M dataset referenced in the 404 Media report has a usage license that states it is meant for academic use only and cannot be copied or duplicated for commercial purposes, but that has not stopped it from being used for other goals.

It is also evident that YouTube wasn’t the only video platform being targeted: Nvidia’s AI data scandal extends to platforms like Netflix as well. The leaked chats show earlier conversations about using movies because they were a good source of data and offered examples of game-like 3D consistency. There is also some talk of Nvidia capturing video data by having someone watch movies and screen-record what they watch. It is unclear from the chat conversations obtained by 404 Media whether this idea was ever put into practice.

Reddit Takes a Stand Against Data Scraping

Recently, Reddit began actively blocking crawlers from scraping its website for AI training data. This is a daring move for the company because it hurts the chances of content uploaded to the website ranking well in search. Still, its deal with Google keeps the company at the top of results on the Google search engine at least, so the company isn’t all that concerned about losing out on some free views.

The content on Reddit is still being used to train Google’s AI, so users don’t get a say in whether their content is monetized this way. Only the platform will be paid by the search engine company. This has reawakened the conversation around Google’s monopoly of the search industry, but that is a conversation for another day.

For now, most of the top websites are actively blocking crawlers from OpenAI and other AI companies from accessing their pages, but there are only so many checks they can set up against the mythical Hydra-like entity that is AI.

Nvidia Is Scraping YouTube Freely, But It Isn’t the Only One

The Nvidia AI data scraping report may be worrying, but the widespread normalization of companies freely using data from the internet to train their AI is a bigger cause for concern. AI laws worldwide have not caught up to the trends just yet, which means there are no real restrictions on Nvidia scraping YouTube or any other platform. Companies only have to step away and claim fair use after the fact. By the time more detailed laws come in, Nvidia and other such companies will have profited from the open internet so thoroughly that these regulations will only inhibit future competition.

An earlier investigation by Proof News found that the likes of Apple, Nvidia, and Salesforce had been using subtitles taken from YouTube videos to train their AI. Videos from creators like MKBHD, MrBeast, and even John Oliver were used for this purpose, but they only made up the top of a list of more than 48,000 channels that had been misused this way.

Some media houses have signed deals to license their content to AI companies, but not every company or individual has been as willing to give in. The New York Times has been locked in a legal battle against OpenAI and Microsoft for a while now. The newspaper is leading the cause for many media houses, and creative professionals from writers to artists have also taken a stand against exploitative, unlicensed AI training.

More recently, YouTuber David Millette filed a class action lawsuit against OpenAI. The lawsuit claims that the AI leader transcribed his and other creators’ videos to train its models, profiting significantly from their work and violating copyright laws in the process. The YouTuber isn’t the first to have taken a stand against OpenAI, and he likely won’t be the last.

Still, considering the scale of Nvidia’s AI data scandal and the number of AI models that have plagued the internet over the last two years, it appears there is little creators can do to retain the rights to their content.