The digital landscape is becoming increasingly complex, with the advent of large language models (LLMs) like ChatGPT adding a new layer of intricacy. According to Antoine Vastel, PhD, Head of Research at DataDome, cybercriminals no longer require sophisticated coding skills to execute damaging attacks against online businesses and customers.
This is due to the rise of easily accessible tools such as bots-as-a-service, residential proxies, CAPTCHA farms, and now, ChatGPT. ChatGPT, developed by OpenAI, and other LLMs have not only raised ethical concerns due to their training on data scraped from across the internet, but also pose a significant threat to businesses.
According to the original article, these models can negatively impact web traffic, which can be detrimental to business operations. Three key risks posed by LLMs, particularly ChatGPT and its plugins, are:
- Content theft (editorial: ok, honestly, there’s zero additional risk of content theft, because everyone was already scraping and algorithmically rewriting everyone else’s content anyway),
- Reduced website traffic (sounds like a personal problem), and
- Data breaches (this goes mostly unexplained in the original article, other than to say “people accidentally put things on the internet that they shouldn’t, and somehow this is a problem with LLMs.” That isn’t a breach per se.).
The article argues that content theft can undermine the authority and perceived value of original content, while reduced traffic can result from users getting answers directly through ChatGPT and its plugins, bypassing the need to visit your pages.
According to Antoine, data breaches, or even unintentional broad distribution of sensitive data, are becoming increasingly likely. The industries most at risk of ChatGPT-driven damage are those where data privacy is paramount, where unique content and intellectual property are key differentiators, and where ads, eyes, and unique visitors are a significant source of revenue.
These include e-commerce, streaming, media, and publishing, and classified ads. ChatGPT gets its training data from several datasets, including Common Crawl, WebText2, Books1 and Books2, and Wikipedia.
The largest amount of training data comes from Common Crawl, an open repository of web crawl data. However, businesses wishing to allow or block CCBot, the Common Crawl crawler bot, should not rely solely on the user agent to identify it, as many malicious bots spoof their user agents to disguise themselves as legitimate bots. While blocking CCBot might be effective for blocking ChatGPT scrapers today, there is no telling what the future holds for LLM scrapers.
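For sites that do want to opt out of Common Crawl, the conventional mechanism is a robots.txt rule naming CCBot, a directive Common Crawl says its crawler honors. As the caveat above notes, only honest crawlers obey it, so this is a request, not an enforcement mechanism:

```text
# robots.txt -- asks Common Crawl's crawler not to fetch any page
User-agent: CCBot
Disallow: /
```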
If too many websites block OpenAI from accessing their content, the developers could decide to stop respecting robots.txt and could stop declaring their crawler identity in the user agent. Alternatively, OpenAI could use its partnership with Microsoft to access Microsoft Bing’s scraper data, making the situation more challenging for website owners.
One of the limitations of models like ChatGPT is the lack of access to live data. To overcome this, plugins are used to connect LLMs like ChatGPT to external tools and allow the LLMs to access external data available online, which can include private data and real-time news. However, allowing users to interact with your website through third-party ChatGPT plugins can result in fewer ads seen by your users, as well as lower traffic to your website.
Users may also be less willing to pay for your premium features once those features can be replicated through third-party ChatGPT plugins. Blocking requests whose user agent contains the “ChatGPT-User” substring is one way to block ChatGPT’s web scrapers. However, blocking on that user agent could also block ChatGPT users with the “browsing” mode activated. And, contrary to what OpenAI documentation might indicate, blocking requests from “ChatGPT-User” does not guarantee that ChatGPT and its plugins can’t reach your data under different user agent tokens.
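As a sketch of that user-agent approach and its limits, here is a minimal WSGI middleware (the token list and responses are illustrative, not DataDome's or OpenAI's recommended configuration). It rejects any request that declares the ChatGPT-User token, which, per the caveats above, also catches browsing-mode users and does nothing against scrapers that lie about their identity:

```python
# Tokens to reject when they appear in a declared User-Agent header.
# Illustrative only: spoofed or undeclared scrapers pass straight through.
BLOCKED_UA_TOKENS = ("ChatGPT-User",)


def block_declared_plugins(app):
    """Wrap a WSGI app; return 403 for requests whose User-Agent
    contains a blocked token. Note this blocks both plugin fetches
    and ChatGPT browsing-mode users, since both declare this token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(token in user_agent for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

A spoofing bot simply sends a browser-like User-Agent and sails through, which is why the article argues for fingerprinting and behavioral signals rather than header matching alone.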
In the long term, companies like OpenAI and Google may be tempted to use Bingbots and Googlebots to build datasets to train their LLMs. That would make it more difficult for websites to simply opt out of having their data collected, since most online businesses rely heavily on Bing and Google to index their content and drive traffic to their site.
Websites with valuable data will either want to look for ways to monetize the use of their data or opt out of AI model training to avoid losing web traffic and ad revenue to the ChatGPT app and its plugins. If you wish to opt out, you’ll need advanced bot detection techniques, such as fingerprinting, proxy detection, and behavioral analysis, to stop bots before they can access your data.
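Commercial detection products fuse many such signals; purely as an illustration of one behavioral input, here is a toy sliding-window rate check. The threshold and window are arbitrary placeholders, and a real system would combine this with fingerprinting and proxy reputation rather than use it alone:

```python
from collections import defaultdict, deque


class RateSignal:
    """Toy behavioral signal: flag clients that exceed a request-rate
    threshold within a sliding time window. One illustrative input to
    a bot-detection pipeline, not a complete defense."""

    def __init__(self, max_requests=30, window_s=10.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.hits = defaultdict(deque)  # client id -> recent timestamps

    def is_suspicious(self, client_id, now):
        """Record a hit at time `now` (seconds) and report whether the
        client has gone over the rate threshold."""
        q = self.hits[client_id]
        q.append(now)
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) > self.max_requests
```

In practice the caller keys this on something harder to rotate than an IP address, since the residential proxies mentioned above exist precisely to spread one client's traffic across many addresses.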
Advanced solutions for bot and fraud protection leverage AI and machine learning to detect and stop unfamiliar bots from the first request, keeping your content safe from LLM scrapers, unknown plugins, and other rapidly evolving AI technologies.
This article was rewritten by a bot based on Antoine’s original analysis.