The News
New data from a Hostinger analysis of 66.7 billion bot requests reveals a critical pivot in AI crawler strategy: while OpenAI’s training crawler (GPTBot) is being blocked at record rates (coverage has dropped to 12%), its new “search” crawler (OAI-SearchBot) has penetrated 55.67% of the web.
The Situation
The market has bifurcated. Publishers aggressively blocked “training” bots to protect their IP but left the door open for “search” bots on the assumption of traffic referrals. OpenAI has exploited this loophole: by rebranding ingestion as “search discovery,” it has maintained access to over half the internet’s real-time data despite the backlash against model training. This is not a passive index; it is high-frequency extraction. Cloudflare data indicates OpenAI’s crawlers can hit 3,700 pages for every single user referral sent back to a site.
Why It Matters
This signals the death of the “crawling-for-traffic” contract that sustained the open web for two decades.
- The Zero-Click Economy: Unlike Google, which indexes your site to route users to you, LLM search engines index it to answer users for you. You are paying the bandwidth to feed a competitor that resells your content as its own answers.
- Infrastructure Inflation: AI crawlers are resource-heavy, often requesting pages repeatedly without caching logic, driving up bandwidth bills for content-heavy startups by thousands of dollars per month with zero ROI.
- The Trojan Horse: By splitting bots into “Training” (bad) and “Search” (good?), OpenAI forces a binary choice: disappear from the world’s fastest-growing search engine, or feed the machine that aims to replace you.
Founder Action
1. Audit Your robots.txt Immediately:
Distinguish between GPTBot (training) and OAI-SearchBot (search). Most legacy blocking rules only catch the former. If you hold unique proprietary data, block both unless you have a licensing deal; a minimal robots.txt sketch follows this list.
2. Measure the “Crawl-to-Referral” Ratio:
Check your server logs. If OpenAI is crawling 10,000 pages to send you 5 visitors, you are not a partner; you are a data mine. Block them, or implement aggressive rate limiting at the CDN level (Cloudflare, AWS). A rough log-parsing script follows this list.
3. Shift to “Fed” API Models:
Stop relying on being scraped. If your data is valuable to LLMs, build a structured API and force them to ingest it through a metered, authenticated pipe (sketched below). If they won’t pay, your data isn’t the product; your existence is the cost center.
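
For step 1, a minimal robots.txt sketch that covers both crawlers. GPTBot and OAI-SearchBot are the user-agent tokens OpenAI publishes for its training and search crawlers; keep in mind robots.txt is honor-system, so pair it with CDN-level enforcement if the data matters.

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block OpenAI's search crawler
User-agent: OAI-SearchBot
Disallow: /
```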
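
For step 2, a rough Python sketch of the ratio check. It assumes an nginx/Apache combined-format access log and that ChatGPT referrals arrive with a chatgpt.com Referer header; adjust the log path and patterns to your own stack.

```python
# Rough crawl-to-referral tally from an nginx/Apache combined-format log.
# Assumptions: log path, combined log format ("referer" "user-agent" at
# line end), and that ChatGPT referrals carry a chatgpt.com Referer.
import re
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/nginx/access.log"
BOT_AGENTS = ("GPTBot", "OAI-SearchBot")

crawls, referrals = 0, 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # Grab the last two quoted fields: referer, then user-agent.
        match = re.search(r'"([^"]*)" "([^"]*)"$', line.rstrip())
        if not match:
            continue
        referer, user_agent = match.groups()
        if any(bot in user_agent for bot in BOT_AGENTS):
            crawls += 1
        elif "chatgpt.com" in referer:
            referrals += 1

ratio = crawls / referrals if referrals else float("inf")
print(f"{crawls} crawls / {referrals} referrals = {ratio:.0f}:1")
```

If the ratio looks anything like Cloudflare’s 3,700:1 figure, that is your signal to block or rate-limit.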
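
For step 3, a minimal sketch of a metered, authenticated feed using only the Python standard library. The API keys, quotas, and /feed endpoint are hypothetical placeholders; a real deployment would back this with a database, billing, and TLS.

```python
# Sketch of a metered, authenticated data endpoint. Keys and quotas
# are illustrative placeholders, not a production auth scheme.
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

API_KEYS = {"acme-llm-key": 10_000}  # key -> monthly request quota
usage = Counter()                    # key -> requests consumed

class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/feed":
            self.send_error(404)
            return
        key = self.headers.get("X-Api-Key", "")
        if key not in API_KEYS:
            self.send_error(401, "Missing or unknown API key")
            return
        if usage[key] >= API_KEYS[key]:
            self.send_error(429, "Quota exhausted; contact sales")
            return
        usage[key] += 1  # every ingested record is metered, hence billable
        body = json.dumps({"records": [], "remaining": API_KEYS[key] - usage[key]})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FeedHandler).serve_forever()
```

The design point is the meter, not the server: once every record flows through an authenticated, counted pipe, access becomes a line item you can price instead of bandwidth you silently donate.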