Tech Giants Push Boundaries to Access AI Training Data

April 7, 2024 • 2 min read

In the intense race to develop cutting-edge artificial intelligence, tech giants and research labs like OpenAI, Google, and Meta are grappling with a critical challenge: securing vast amounts of high-quality data to train their A.I. models. As reported by The New York Times, these companies have resorted to questionable practices, skirting corporate policies and even discussing ways to circumvent copyright law in their pursuit of training data.

Copyright law protects the intellectual property rights of creators, and the use of copyrighted material without permission is generally prohibited. However, the concept of "fair use" allows limited use of copyrighted material for purposes like criticism, commentary, news reporting, teaching, and research. AI companies often rely on the fair use doctrine to justify their data collection practices, arguing that their use of copyrighted material is transformative and does not infringe on the creator's rights.

Equally, the collection and use of personal data, such as facial images, voice recordings, and online behavior, have raised concerns about individual privacy and consent. While laws like the EU's General Data Protection Regulation (GDPR) and the US's Children's Online Privacy Protection Act (COPPA) aim to protect user data, loopholes and ambiguous language in these laws create opportunities for data exploitation.

When OpenAI, creators of the GPT-4 model and ChatGPT faced a data shortage in late 2021, researchers developed a speech recognition tool called Whisper, which transcribed audio from over one million hours of YouTube videos. This move potentially violated YouTube's rules prohibiting the use of its content for independent applications. According to three sources, some OpenAI employees, including president Greg Brockman, were aware of the legal gray area but proceeded with the data collection.

The changes Google made to its privacy policy last year for its free consumer apps. Source: Google. By the New York Times

Similarly, Google transcribed YouTube videos to harvest text for its AI models, potentially infringing on video creators' copyrights. The company also broadened its terms of service to allow tapping into publicly available Google Docs, restaurant reviews, and other online content for A.I. products. Meta, the parent company of Facebook and Instagram, discussed purchasing the publishing house Simon & Schuster to acquire long-form content and considered gathering copyrighted data from across the internet, even if it meant facing legal repercussions.

As the supply of high-quality online data dwindles, tech companies are exploring alternative solutions, such as creating "synthetic" data generated by A.I. models themselves. However, this approach has its limitations and challenges, as A.I.-generated data may reinforce the systems' biases and mistakes.

The landscape of AI and copyright law is indeed complex, with ongoing legal and ethical debates. For example, OpenAI and other companies face lawsuits for allegedly training AI models on copyrighted work without consent. Companies are exploring various approaches, such as partnerships and revenue-sharing models, to address these concerns. Additionally, there's uncertainty around the copyrightability of AI-generated works, prompting discussions on regulations and protections for creators