OpenAI has released a new web crawler named GPTBot to gather publicly available data from the internet for training AI models. The launch comes amid recent controversies in which tech companies were accused of scraping websites without explicit consent to power large language models like GPT-4.
GPTBot aims to be more transparent, properly identifying itself so webmasters can allow or disallow access. The bot uses the user agent token "GPTBot" and a full user-agent string clearly stating it is from OpenAI.
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot
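As a minimal sketch of how a server might use these strings to identify GPTBot traffic, a handler could check the User-Agent header for the token. The helper function and the sample non-bot string below are illustrative, not part of OpenAI's documentation:

```python
# Minimal sketch: classify a request as GPTBot based on its User-Agent header.
# The is_gptbot helper is hypothetical; only the token "GPTBot" comes from OpenAI.
def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token."""
    return "GPTBot" in (user_agent or "")

full_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
           "compatible; GPTBot/1.0; +https://openai.com/gptbot")
print(is_gptbot(full_ua))                           # True
print(is_gptbot("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

Matching on the short token rather than the full string is more robust, since version numbers in the full string may change over time.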
OpenAI states that GPTBot filters out sources that sit behind paywalls, are known to gather personally identifiable information, or contain text that violates its policies. The company says that allowing the bot to crawl a site can help improve the accuracy and capabilities of future AI models.
Webmasters can fully block GPTBot by adding its user agent token to their robots.txt file. They can also selectively allow access to certain directories while restricting others. OpenAI has published the IP ranges GPTBot uses so websites can identify its traffic.
To block GPTBot entirely:

User-agent: GPTBot
Disallow: /

To allow access to some directories while restricting others:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
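One way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser. The directives below are the selective-access example from above; the example.com URLs are hypothetical test paths:

```python
import urllib.robotparser

# The selective-access robots.txt rules from the article.
rules = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Check which hypothetical URLs GPTBot would be permitted to fetch.
print(rp.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))  # False
```

This only verifies how the rules parse; it is up to the crawler itself to honor them, which is what OpenAI says GPTBot does.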
This launch reflects OpenAI's response to recent backlash over large language models like GPT-4 being trained on website data without explicit approval. Critics argue that even publicly accessible content should require opt-in consent before being used for AI training, and some also worry about content being taken out of context when fed into AI systems.
GPTBot highlights the gray areas around using publicly available data to develop AI models, which benefit from large training datasets, and it exemplifies the ethics debates emerging as AI capabilities advance. Going forward, clearer privacy guidelines and ethical frameworks will be needed to strike the right balance.