The New York Times Cracks Down on AI Scraping Its Content

The New York Times Cracks Down on AI Scraping Its Content
Image Credit: Maginative

The New York Times recently updated its terms of service to explicitly forbid the scraping of its content to train artificial intelligence models, reflecting growing concerns among publishers about AI and the future of journalism.

On August 3rd, the Times added language to its TOS barring the use of its text, images, videos and metadata to develop machine learning or AI systems. This includes content scraping by website crawlers, which index pages for search engines but can also provide data to train large language models like ChatGPT.

Earlier this year, the New York Times signed a three-year, $100 million deal that allows Google to feature content across its platforms. The deal covered collaborating on content distribution, subscriptions, marketing, advertising, and experimentation. So it's possible the Times' updated terms of service are aimed at companies like OpenAI or Microsoft rather than partners like Google.

Defying these new rules could result in penalties, although the Times did not specify what those might entail.

While AI firms have been tight-lipped about their training data, an investigation by the Washington Post and researchers at the Allen Institute for AI found that AI models like Meta's LLaMA have been trained on content from major publishers like the New York Times. Publishers worry this undermines subscription revenue by enabling AI to recreate their work and risks generating misinformation that erodes public trust.

Talks are currently underway on licensing deals where AI firms would compensate publishers. In July, the Associated Press announced a partnership to license part of the their text archive to OpenAI, and examine potential use cases for generative AI in news products and services. However, beyond payments, publishers want proper citation of work and guardrails against factual errors.

Just last week, The Arthur L. Carter Journalism Institute at New York University received a $395,000 grant from OpenAI to launch a new ethics initiative focused on privacy, disinformation, and the use of AI tools in reporting. Additionally, the company launched GPTBot to improve their models while giving publishers control to limit or restrict their content from being crawled.

As AI becomes further enmeshed with news, expect more negotiations around data usage and attribution. The Times' scraping crackdown is likely just the beginning.

Let’s stay in touch. Get the latest AI news from Maginative in your inbox.

Subscribe