How Do I Follow robots.txt When Using AI Web Scrapers?
When building AI-powered web scrapers with GPT or other language models, it's crucial to respect the robots.txt file that websites use to communicate crawling policies. Even though AI scrapers work differently from traditional scrapers, they should still follow the same ethical guidelines and technical standards established by the Robots Exclusion Protocol.
Understanding robots.txt in the Context of AI Scraping
The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of their site should or shouldn't be accessed. While AI web scrapers often use different approaches than traditional scrapers—such as analyzing page content with language models or using browser automation—they still need to respect these rules.
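For reference, robots.txt is just a plain-text set of rules served at the site root. The snippet below is purely illustrative (not taken from any real site):
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 5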
AI web scrapers typically combine two components:
1. A traditional scraping layer (HTTP requests or browser automation) that fetches HTML content
2. An AI layer (GPT, Claude, or other LLMs) that processes and extracts structured data from the HTML
robots.txt compliance must happen at the first layer, before any content reaches the AI processing stage.
Fetching and Parsing robots.txt
Python Implementation
Here's a comprehensive Python example using the urllib.robotparser module along with OpenAI's GPT API:
import time
import urllib.robotparser
import requests
from urllib.parse import urljoin, urlparse
from openai import OpenAI

class RobotsCompliantAIScraper:
    def __init__(self, user_agent="MyAIBot/1.0"):
        self.user_agent = user_agent
        self.robots_parsers = {}
        self.openai_client = OpenAI()

    def get_robots_parser(self, base_url):
        """Get or create a robots.txt parser for a domain."""
        if base_url not in self.robots_parsers:
            robots_url = urljoin(base_url, '/robots.txt')
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(robots_url)
            try:
                rp.read()
            except Exception as e:
                print(f"Error reading robots.txt: {e}")
                # read() already treats a missing robots.txt (404) as "allow all",
                # so this branch only runs on network errors. Fall back to an empty
                # (permissive) ruleset; an unread parser would deny every URL.
                rp = urllib.robotparser.RobotFileParser()
                rp.parse([])
            self.robots_parsers[base_url] = rp
        return self.robots_parsers[base_url]

    def can_fetch(self, url):
        """Check if URL can be fetched according to robots.txt."""
        parsed = urlparse(url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        rp = self.get_robots_parser(base_url)
        return rp.can_fetch(self.user_agent, url)

    def get_crawl_delay(self, url):
        """Get the crawl delay specified in robots.txt."""
        parsed = urlparse(url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        rp = self.get_robots_parser(base_url)
        delay = rp.crawl_delay(self.user_agent)
        return delay if delay else 0

    def scrape_with_ai(self, url, prompt):
        """Scrape URL with AI, respecting robots.txt."""
        # Check robots.txt before fetching
        if not self.can_fetch(url):
            raise PermissionError(f"robots.txt disallows fetching {url}")

        # Respect crawl delay
        delay = self.get_crawl_delay(url)
        if delay > 0:
            print(f"Respecting crawl delay: {delay} seconds")
            time.sleep(delay)

        # Fetch the page
        headers = {'User-Agent': self.user_agent}
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        # Use GPT to extract data
        completion = self.openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a web scraping assistant. Extract structured data from HTML."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{response.text[:15000]}"}  # Limit context
            ]
        )
        return completion.choices[0].message.content
# Usage example
scraper = RobotsCompliantAIScraper(user_agent="MyCompany-AIBot/1.0")

try:
    url = "https://example.com/products/laptop"

    # Check if we can scrape
    if scraper.can_fetch(url):
        data = scraper.scrape_with_ai(
            url,
            "Extract the product name, price, and description as JSON"
        )
        print(data)
    else:
        print(f"Cannot scrape {url} - disallowed by robots.txt")
except Exception as e:
    print(f"Error: {e}")
JavaScript/Node.js Implementation
For JavaScript developers, here's an implementation using the robots-parser library with the OpenAI API:
const robotsParser = require('robots-parser');
const axios = require('axios');
const { OpenAI } = require('openai');
const { URL } = require('url');

class RobotsCompliantAIScraper {
  constructor(userAgent = 'MyAIBot/1.0') {
    this.userAgent = userAgent;
    this.robotsParsers = new Map();
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
  }

  async getRobotsParser(baseUrl) {
    if (!this.robotsParsers.has(baseUrl)) {
      const robotsUrl = new URL('/robots.txt', baseUrl).href;
      try {
        const response = await axios.get(robotsUrl);
        const parser = robotsParser(robotsUrl, response.data);
        this.robotsParsers.set(baseUrl, parser);
      } catch (error) {
        console.log(`Error fetching robots.txt: ${error.message}`);
        // Create permissive (empty) parser if robots.txt doesn't exist
        const parser = robotsParser(robotsUrl, '');
        this.robotsParsers.set(baseUrl, parser);
      }
    }
    return this.robotsParsers.get(baseUrl);
  }

  async canFetch(url) {
    const parsedUrl = new URL(url);
    const baseUrl = `${parsedUrl.protocol}//${parsedUrl.host}`;
    const parser = await this.getRobotsParser(baseUrl);
    return parser.isAllowed(url, this.userAgent);
  }

  async getCrawlDelay(url) {
    const parsedUrl = new URL(url);
    const baseUrl = `${parsedUrl.protocol}//${parsedUrl.host}`;
    const parser = await this.getRobotsParser(baseUrl);
    return parser.getCrawlDelay(this.userAgent) || 0;
  }

  async scrapeWithAI(url, prompt) {
    // Check robots.txt before fetching
    const allowed = await this.canFetch(url);
    if (!allowed) {
      throw new Error(`robots.txt disallows fetching ${url}`);
    }

    // Respect crawl delay
    const delay = await this.getCrawlDelay(url);
    if (delay > 0) {
      console.log(`Respecting crawl delay: ${delay} seconds`);
      await new Promise(resolve => setTimeout(resolve, delay * 1000));
    }

    // Fetch the page
    const response = await axios.get(url, {
      headers: { 'User-Agent': this.userAgent }
    });

    // Use GPT to extract data
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        {
          role: 'system',
          content: 'You are a web scraping assistant. Extract structured data from HTML.'
        },
        {
          role: 'user',
          content: `${prompt}\n\nHTML:\n${response.data.substring(0, 15000)}`
        }
      ]
    });

    return completion.choices[0].message.content;
  }
}
// Usage example
(async () => {
  const scraper = new RobotsCompliantAIScraper('MyCompany-AIBot/1.0');

  try {
    const url = 'https://example.com/products/laptop';

    if (await scraper.canFetch(url)) {
      const data = await scraper.scrapeWithAI(
        url,
        'Extract the product name, price, and description as JSON'
      );
      console.log(data);
    } else {
      console.log(`Cannot scrape ${url} - disallowed by robots.txt`);
    }
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
})();
Best Practices for AI Web Scraping with robots.txt
1. Use a Descriptive User Agent
Always identify your AI scraper with a clear, descriptive user agent string:
user_agent = "MyCompanyBot/1.0 (+https://mycompany.com/bot-info)"
This allows website administrators to:
- Identify your bot in their logs
- Contact you if there are issues
- Set specific rules for your bot in robots.txt (see the example below)
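For example, a site administrator could then target your bot by name. This snippet is illustrative and uses the hypothetical MyCompanyBot identifier from above:
User-agent: MyCompanyBot
Disallow: /internal/
Crawl-delay: 5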
2. Respect Crawl Delays
Many robots.txt files specify a Crawl-delay directive:
User-agent: *
Crawl-delay: 10
Even if you're using AI to process fewer pages, respect these delays to avoid overwhelming the server.
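The scraper classes above simply sleep for the full delay before every request. If you fetch several pages per domain, a slightly tighter approach is to remember when you last hit each domain and sleep only for the remaining time. Here is a minimal sketch; the CrawlDelayLimiter class is illustrative and not part of the implementations above:
import time

class CrawlDelayLimiter:
    """Tracks the last request time per domain and sleeps only as long as needed."""

    def __init__(self):
        self.last_request = {}  # domain -> timestamp of the most recent fetch

    def wait(self, domain, delay):
        """Block until `delay` seconds have passed since the last request to `domain`."""
        if delay > 0 and domain in self.last_request:
            elapsed = time.time() - self.last_request[domain]
            if elapsed < delay:
                time.sleep(delay - elapsed)
        self.last_request[domain] = time.time()

# Hypothetical usage, just before each requests.get call:
# limiter.wait(urlparse(url).netloc, scraper.get_crawl_delay(url))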
3. Cache robots.txt
Don't fetch robots.txt for every request. Cache it for a reasonable period (e.g., 24 hours):
import urllib.robotparser
from datetime import datetime, timedelta
from urllib.parse import urljoin

class CachedRobotsParser:
    def __init__(self):
        self.cache = {}  # base_url -> (parser, fetch timestamp)
        self.cache_duration = timedelta(hours=24)

    def get_parser(self, base_url):
        now = datetime.now()
        if base_url in self.cache:
            parser, timestamp = self.cache[base_url]
            if now - timestamp < self.cache_duration:
                return parser

        # Fetch and cache new parser
        parser = self._fetch_robots_parser(base_url)
        self.cache[base_url] = (parser, now)
        return parser

    def _fetch_robots_parser(self, base_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(base_url, '/robots.txt'))
        rp.read()
        return rp
4. Handle robots.txt Errors Gracefully
If robots.txt is unavailable (404, timeout, etc.), handle the failure deliberately instead of crashing. The snippet below assumes everything is allowed but logs a warning so you can proceed cautiously:
def get_robots_parser_safe(url):
    try:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(url, '/robots.txt'))
        rp.read()
        return rp
    except Exception:
        # If robots.txt is unavailable, assume everything is allowed
        # but log the issue and proceed cautiously
        print("Warning: Could not fetch robots.txt, proceeding cautiously")
        return None
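If you prefer a stricter fallback than the permissive snippet above, a common convention (and roughly what RFC 9309 suggests) is to treat a missing robots.txt (4xx) as "no restrictions" but a server error or timeout as "disallow everything until the file can be fetched". A minimal sketch of that policy, using only the standard parse() API; the function name and defaults are illustrative:
import requests
import urllib.robotparser
from urllib.parse import urljoin

DISALLOW_EVERYTHING = ['User-agent: *', 'Disallow: /']

def get_robots_parser_strict(base_url, user_agent="MyAIBot/1.0", timeout=10):
    """Fetch robots.txt; fail closed on server/network errors, open on 4xx."""
    robots_url = urljoin(base_url, '/robots.txt')
    rp = urllib.robotparser.RobotFileParser(robots_url)
    try:
        response = requests.get(robots_url,
                                headers={'User-Agent': user_agent},
                                timeout=timeout)
    except requests.RequestException:
        # Network error or timeout: treat the whole site as off-limits for now
        rp.parse(DISALLOW_EVERYTHING)
        return rp
    if response.status_code >= 500:
        # Server error: robots.txt is unreachable, so fail closed
        rp.parse(DISALLOW_EVERYTHING)
    elif response.status_code >= 400:
        # 4xx (typically 404): no robots.txt means no restrictions
        rp.parse([])
    else:
        rp.parse(response.text.splitlines())
    return rp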
AI-Specific Considerations
Pre-filtering Content Before AI Processing
When working with AI scrapers, you might want to fetch multiple URLs and batch process them. Ensure robots.txt compliance happens before batching:
def batch_scrape_with_ai(urls, prompt):
    # Filter URLs based on robots.txt
    allowed_urls = []
    for url in urls:
        if scraper.can_fetch(url):
            allowed_urls.append(url)
        else:
            print(f"Skipping {url} - disallowed by robots.txt")

    # Fetch allowed URLs
    html_contents = []
    for url in allowed_urls:
        response = requests.get(url, headers={'User-Agent': scraper.user_agent})
        html_contents.append(response.text)

    # Process with AI
    # ... AI processing logic
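The elided AI step could be as simple as one model call per fetched page. Here is one possible sketch, reusing the OpenAI client pattern from the class above; the extract_with_ai helper and its parameters are illustrative:
def extract_with_ai(html_contents, prompt, client, model="gpt-4-turbo-preview"):
    """Send each fetched page to the model and collect the extracted results."""
    results = []
    for html in html_contents:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a web scraping assistant. Extract structured data from HTML."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html[:15000]}"},  # limit context size
            ],
        )
        results.append(completion.choices[0].message.content)
    return results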
Browser Automation with AI
If you're using browser automation tools like Puppeteer or Playwright alongside AI processing, check robots.txt before navigating:
const puppeteer = require('puppeteer');

async function scrapeWithBrowser(url, aiPrompt) {
  // Check robots.txt first
  const scraper = new RobotsCompliantAIScraper();
  if (!await scraper.canFetch(url)) {
    throw new Error('URL disallowed by robots.txt');
  }

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set user agent
  await page.setUserAgent(scraper.userAgent);

  // Navigate and extract content
  await page.goto(url);
  const html = await page.content();
  await browser.close();

  // Process with AI (assumes a processWithAI(html, prompt) helper that sends the
  // HTML to the model, like scrapeWithAI above but without fetching the page itself)
  return await scraper.processWithAI(html, aiPrompt);
}
For more details on browser automation, see our guide on handling browser sessions in Puppeteer.
Testing Your robots.txt Compliance
Command-Line Testing
You can test robots.txt compliance from the command line:
# Python
python -c "import urllib.robotparser; rp = urllib.robotparser.RobotFileParser(); rp.set_url('https://example.com/robots.txt'); rp.read(); print(rp.can_fetch('MyBot/1.0', 'https://example.com/page'))"
# Using curl to view robots.txt
curl https://example.com/robots.txt
Unit Testing
Include robots.txt compliance in your test suite:
import unittest
from unittest.mock import patch, MagicMock

class TestRobotsCompliance(unittest.TestCase):
    def setUp(self):
        self.scraper = RobotsCompliantAIScraper()

    @patch('urllib.robotparser.RobotFileParser.read')  # avoid real network calls
    @patch('urllib.robotparser.RobotFileParser.can_fetch')
    def test_respects_robots_txt(self, mock_can_fetch, mock_read):
        mock_can_fetch.return_value = False
        with self.assertRaises(PermissionError):
            self.scraper.scrape_with_ai(
                'https://example.com/disallowed',
                'Extract data'
            )

    @patch('urllib.robotparser.RobotFileParser.read')
    @patch('urllib.robotparser.RobotFileParser.can_fetch')
    def test_allows_permitted_urls(self, mock_can_fetch, mock_read):
        mock_can_fetch.return_value = True
        # Should not raise an exception
        # ... rest of test
Legal and Ethical Considerations
Following robots.txt is not just a technical requirement—it's an ethical obligation and can have legal implications:
- Respect website owner wishes: robots.txt represents the website owner's preferences
- Avoid legal issues: While robots.txt compliance isn't legally binding in all jurisdictions, ignoring it can support claims of unauthorized access
- Maintain good relationships: If you're scraping for commercial purposes, respecting robots.txt helps maintain positive relationships with data sources
- API alternatives: Many sites that restrict scraping via robots.txt offer APIs—consider using those instead
When building AI-powered scrapers, consider the broader implications of data collection. For help understanding and optimizing your scraping behavior, see our guide on monitoring network requests in Puppeteer.
Conclusion
Respecting robots.txt when using AI web scrapers is essential for ethical and responsible data collection. By implementing proper robots.txt parsing, caching, and compliance checking before your AI processing layer, you ensure that your scraping activities align with website owners' preferences and industry standards.
Remember that AI scrapers should follow the same rules as traditional scrapers when it comes to respecting access controls. The AI component processes the data after it's collected—the collection itself must still be done responsibly and in compliance with robots.txt directives.
For more advanced scraping scenarios, explore our article on handling authentication in Puppeteer to learn about accessing protected content responsibly.