According to a recent Cloudflare report, Perplexity's stealth crawlers evade no-crawl directives

Cloudflare opened an investigation after customers reported that Perplexity was still accessing their content despite being blocked, and concluded that the company is allegedly using stealth, undeclared crawlers to get around no-crawl directives. According to Cloudflare, Perplexity hides its activity by rotating its autonomous system numbers (ASNs) and changing its user agent. Cloudflare also found that the stealth crawlers do not retrieve robots.txt files at all, and therefore ignore the directives in them.
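For context, robots.txt is purely advisory: a compliant crawler downloads the file and checks its directives before requesting any page. Below is a minimal sketch using Python's standard urllib.robotparser; the rules are inlined and illustrative, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A compliant crawler consults robots.txt before fetching pages.
# The file's contents are inlined here so the sketch is self-contained;
# the rules are illustrative, not taken from any real site.
robots = RobotFileParser()
robots.parse([
    "User-agent: PerplexityBot",
    "Disallow: /",
])

# A well-behaved crawler checks before every request and skips
# disallowed paths entirely.
if robots.can_fetch("PerplexityBot", "https://example.com/article"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt -- a compliant bot stops here")
```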


Because of this behavior, Cloudflare has removed Perplexity from its list of verified bots. This will affect how Perplexity interacts with websites, particularly those that rely on Cloudflare's security services.

By default, Perplexity crawls with its declared PerplexityBot user agent, but whenever a website blocks that agent it switches to a generic browser string (Chrome/124.0.0.0 Safari/537.36). The stealth crawler also rotates through multiple ASNs and uses IP addresses outside Perplexity's published range. This behavior was not isolated: Cloudflare observed it across tens of thousands of domains, amounting to millions of requests per day, suggesting a systematic pattern on Perplexity's part rather than a one-off.
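A rough sketch of the mismatch Cloudflare describes: a request presents a browser user agent while originating from an IP outside the crawler's published ranges. The IP blocks below are documentation addresses standing in for real published ranges, and this single check would of course match ordinary browsers too; it is one simplified signal, not a complete detector.

```python
import ipaddress

# Placeholder blocks standing in for a crawler's published IP ranges;
# these are documentation addresses, not Perplexity's real ranges.
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24"),
                   ipaddress.ip_network("198.51.100.0/24")]

def from_declared_range(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DECLARED_RANGES)

# The stealth pattern: a generic browser UA from an undeclared IP,
# appearing right after the declared bot was blocked.
request = {"user_agent": "Mozilla/5.0 ... Chrome/124.0.0.0 Safari/537.36",
           "ip": "203.0.113.57"}

claims_browser = "Chrome/124.0.0.0" in request["user_agent"]
if claims_browser and not from_declared_range(request["ip"]):
    print("suspicious: browser-like UA from outside declared crawler ranges")
```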

By contrast, web crawling companies such as OpenAI document their crawlers in detail and respect both network-level blocks and robots.txt directives. When Cloudflare tested ChatGPT's crawlers, it found that they stopped crawling as soon as they encountered a block page or a disallow directive.
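At the HTTP level, compliant behavior looks roughly like this: when a server answers with a block page or 403 Forbidden, the crawler records the refusal and stops rather than retrying under a different identity. A sketch with Python's standard library; the URL and bot name are illustrative.

```python
import urllib.request
import urllib.error

# Illustrative only: a compliant crawler identifies itself honestly
# and backs off when the server refuses it.
req = urllib.request.Request(
    "https://example.com/article",
    headers={"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"},
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        print("fetched", len(body), "bytes")
except urllib.error.HTTPError as err:
    if err.code == 403:
        # Compliant behavior: note the refusal and stop crawling;
        # do not retry with a different user agent or IP.
        print("blocked (403) -- ceasing crawl of this site")
```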

Cloudflare has responded by adding heuristics to its managed rules that block the stealth crawling. These protections are available to all customers, including those on Cloudflare's free tier, and are already active for users who have enabled challenge rules or bot management.

Instead of hardcoding a list of crawlers to ban, Cloudflare uses heuristic blocking: it looks for specific behaviors and blocks any crawler that exhibits them. Because the rules target behavior rather than identity, they should keep working even as Perplexity's tactics evolve.
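A toy contrast between the two approaches; every signal and threshold here is invented for illustration, since Cloudflare has not published its actual heuristics.

```python
# Identity-based blocking: trivially evaded by renaming the crawler.
DENYLIST = {"PerplexityBot"}

def denylist_block(user_agent: str) -> bool:
    return user_agent in DENYLIST

def heuristic_block(req: dict) -> bool:
    """Behavior-based: flags the pattern, whatever the UA claims."""
    signals = [
        not req["fetched_robots_txt"],     # never consulted robots.txt
        not req["ip_in_declared_range"],   # origin outside published ranges
        req["requests_per_minute"] > 300,  # machine-speed request rate
    ]
    return sum(signals) >= 2               # block on multiple signals

req = {"user_agent": "Mozilla/5.0 ... Chrome/124.0.0.0 Safari/537.36",
       "fetched_robots_txt": False,
       "ip_in_declared_range": False,
       "requests_per_minute": 450}

print(denylist_block(req["user_agent"]))  # False: the rename slips through
print(heuristic_block(req))               # True: the behavior still matches
```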

Cloudflare also said it is working with technical and policy experts worldwide, including on the IETF's effort to standardize extensions to robots.txt. The goal is to establish clear, measurable guidelines that well-intentioned bot operators can follow.
