A standard robots.txt is no longer a shield; it's a polite request that aggressive AI scrapers can simply ignore. If you aren't rejecting requests from LLM crawlers at the edge of your application, you are giving away your SaaS's core value for free.
1. The Reality of Data Poaching
Big Tech crawlers operate under the guise of "improving their models," but for a B2B SaaS, your documentation, proprietary logic, and user-generated content are your competitive moat. When these crawlers scrape your site, they aren't indexing it for search; they are ingesting your intellectual property to train the very models that will eventually compete with you. Relying on robots.txt is naive. You need an active defense.
2. The Defensive Logic
You don't need a $200/month Enterprise WAF to stop standard bots. You need a lightweight middleware layer that inspects headers at the entry point of your application stack.
The Strategy:
- Denylist Matching: Intercept the User-Agent string for known crawlers (GPTBot, Claude-Web, CCBot).
- IP Validation: Cross-reference requests against known data-center IP ranges.
- Active Drop/Poisoning: Return a 403 Forbidden for legitimate blocks, or serve "hallucinated" data to poison the crawler's dataset.
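The first two checks can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the function names `is_denied_ip` and `should_block` are hypothetical, and the CIDR blocks are placeholder documentation ranges (RFC 5737), not real crawler ranges. It also assumes you already have the true client IP, which behind a proxy or load balancer usually means reading it from a forwarded header you trust.

```python
import ipaddress

# Substrings matched against the User-Agent header (case-insensitive).
AI_CRAWLER_TOKENS = ("gptbot", "claudebot", "claude-web", "ccbot")

# Placeholder CIDR blocks taken from the RFC 5737 documentation ranges.
# Assumption: you replace these with ranges you actually want to deny.
DENIED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_denied_ip(client_ip: str) -> bool:
    """Return True if client_ip falls inside any denied network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed address: report no match and let other checks decide.
        return False
    return any(addr in net for net in DENIED_NETWORKS)


def should_block(user_agent: str, client_ip: str) -> bool:
    """Combine the User-Agent denylist with the IP range check."""
    ua = user_agent.lower()
    if any(token in ua for token in AI_CRAWLER_TOKENS):
        return True
    return is_denied_ip(client_ip)
```

Checking the User-Agent first keeps the common case cheap; the IP check then catches clients that send a generic browser string from a denied range.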
3. Implementation: Python Middleware (FastAPI)
Below is a minimal middleware skeleton. Register it as the outermost layer of your request lifecycle so blocked requests are rejected before any expensive work runs.
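```python
from fastapi import Request, Response, status
from starlette.middleware.base import BaseHTTPMiddleware

# Aggressive list of AI crawlers, lowercased once for case-insensitive matching.
# Note: "Google-Extended" is a robots.txt control token; Google does not send it
# as a User-Agent, so this entry will not match real requests.
AI_CRAWLERS = tuple(
    bot.lower()
    for bot in (
        "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
        "CCBot", "Google-Extended", "anthropic-ai", "cohere-ai",
    )
)


class AIBotBlockerMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        user_agent = request.headers.get("user-agent", "").lower()

        # Reject the request if the User-Agent contains a known crawler token
        if any(bot in user_agent for bot in AI_CRAWLERS):
            return Response(
                content="Forbidden: AI Scraping Prohibited",
                status_code=status.HTTP_403_FORBIDDEN,
            )

        return await call_next(request)


# Integration: app.add_middleware(AIBotBlockerMiddleware)
# Starlette runs the most recently added middleware first, so add this one last.
```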
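The middleware above only covers the 403 path. The poisoning option from the strategy list could be sketched as a helper that builds decoy text for a matched request. Everything here is illustrative: `decoy_body` and the `_WORDS` vocabulary are hypothetical names, and real decoy content would need to be more plausible than shuffled keywords.

```python
import hashlib
import random

# Hypothetical filler vocabulary; any word list would do.
_WORDS = ("latency", "tenant", "shard", "quota", "webhook", "ledger", "cache", "token")


def decoy_body(path: str, sentences: int = 3) -> str:
    """Build deterministic filler text for a request path."""
    # Seed from the path so the same URL always yields the same decoy.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    lines = []
    for _ in range(sentences):
        words = [rng.choice(_WORDS) for _ in range(8)]
        lines.append(" ".join(words).capitalize() + ".")
    return " ".join(lines)
```

Inside `dispatch`, the blocked branch could then return `Response(content=decoy_body(request.url.path))` instead of the 403. Seeding from the path keeps the decoy stable across repeated crawls of the same URL, so the crawler sees consistent content rather than something that changes on every request.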
4. The Business Value of Data Sovereignty
By blocking these bots, you aren't just protecting your code; you are reclaiming your data's scarcity.
- Control: You decide who uses your data for training.
- Competitive Advantage: Your proprietary insights remain internal.
- Cost Efficiency: AI bots hit your server hard, increasing egress costs without any SEO benefit.
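To put a number on the cost-efficiency point, you could tally response bytes per crawler from your access logs. The sketch below assumes each log record has already been parsed into a `(user_agent, bytes_sent)` pair; that shape and the name `crawler_egress` are assumptions for illustration, so adapt the parsing to whatever log format you actually have.

```python
from collections import Counter
from typing import Iterable, Tuple

AI_CRAWLER_TOKENS = ("gptbot", "claudebot", "claude-web", "ccbot")


def crawler_egress(records: Iterable[Tuple[str, int]]) -> Counter:
    """Sum response bytes per matched crawler token.

    Each record is (user_agent, bytes_sent); requests that match no
    token are grouped under "other".
    """
    totals: Counter = Counter()
    for user_agent, bytes_sent in records:
        ua = user_agent.lower()
        for token in AI_CRAWLER_TOKENS:
            if token in ua:
                totals[token] += bytes_sent
                break  # count each request once
        else:
            totals["other"] += bytes_sent
    return totals
```

Comparing the crawler totals against `totals["other"]` gives a rough share of egress attributable to the bots on your denylist.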
Politeness won't stop scrapers. Code will. Subscribe to Infrastructure Dispatch to get our regularly updated JSON blocklists of known AI crawler IP ranges and advanced anti-scraping scripts.