How Do I Follow robots.txt When Using AI Web Scrapers?
When building AI-powered web scrapers with GPT or other language models, it's crucial to respect the robots.txt file that websites use to communicate crawling policies. Even though AI scrapers work differently from traditional scrapers, they should still follow the same ethical guidelines and technical standards established by the Robots Exclusion Protocol.
Understanding robots.txt in the Context of AI Scraping
The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of their site should or shouldn't be accessed. While AI web scrapers often use different approaches than traditional scrapers—such as analyzing page content with language models or using browser automation—they still need to respect these rules.
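For reference, robots.txt is just a plain-text set of rules served at the site root. The snippet below is purely illustrative (not taken from any real site):
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 5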
AI web scrapers typically combine two components:
1. A traditional scraping layer (HTTP requests or browser automation) that fetches HTML content
2. An AI layer (GPT, Claude, or other LLMs) that processes and extracts structured data from the HTML
robots.txt compliance must happen at the first layer, before any content reaches the AI processing stage.
Fetching and Parsing robots.txt
Python Implementation
Here's a comprehensive Python example using the urllib.robotparser module along with OpenAI's GPT API:
import time
import urllib.robotparser
import requests
from urllib.parse import urljoin, urlparse
from openai import OpenAI

class RobotsCompliantAIScraper:
    def __init__(self, user_agent="MyAIBot/1.0"):
        self.user_agent = user_agent
        self.robots_parsers = {}
        self.openai_client = OpenAI()

    def get_robots_parser(self, base_url):
        """Get or create a robots.txt parser for a domain."""
        if base_url not in self.robots_parsers:
            robots_url = urljoin(base_url, '/robots.txt')
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(robots_url)
            try:
                rp.read()
            except Exception as e:
                print(f"Error reading robots.txt: {e}")
                # read() already treats a missing robots.txt (404) as "allow all",
                # so this branch only runs on network errors. Fall back to an empty
                # (permissive) ruleset; an unread parser would deny every URL.
                rp = urllib.robotparser.RobotFileParser()
                rp.parse([])
            self.robots_parsers[base_url] = rp
        return self.robots_parsers[base_url]

    def can_fetch(self, url):
        """Check if URL can be fetched according to robots.txt."""
        parsed = urlparse(url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        rp = self.get_robots_parser(base_url)
        return rp.can_fetch(self.user_agent, url)

    def get_crawl_delay(self, url):
        """Get the crawl delay specified in robots.txt."""
        parsed = urlparse(url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        rp = self.get_robots_parser(base_url)
        delay = rp.crawl_delay(self.user_agent)
        return delay if delay else 0

    def scrape_with_ai(self, url, prompt):
        """Scrape URL with AI, respecting robots.txt."""
        # Check robots.txt before fetching
        if not self.can_fetch(url):
            raise PermissionError(f"robots.txt disallows fetching {url}")

        # Respect crawl delay
        delay = self.get_crawl_delay(url)
        if delay > 0:
            print(f"Respecting crawl delay: {delay} seconds")
            time.sleep(delay)

        # Fetch the page
        headers = {'User-Agent': self.user_agent}
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        # Use GPT to extract data
        completion = self.openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a web scraping assistant. Extract structured data from HTML."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{response.text[:15000]}"}  # Limit context
            ]
        )
        return completion.choices[0].message.content
# Usage example
scraper = RobotsCompliantAIScraper(user_agent="MyCompany-AIBot/1.0")

try:
    url = "https://example.com/products/laptop"

    # Check if we can scrape
    if scraper.can_fetch(url):
        data = scraper.scrape_with_ai(
            url,
            "Extract the product name, price, and description as JSON"
        )
        print(data)
    else:
        print(f"Cannot scrape {url} - disallowed by robots.txt")
except Exception as e:
    print(f"Error: {e}")
JavaScript/Node.js Implementation
For JavaScript developers, here's an implementation using the robots-parser library with the OpenAI API:
const robotsParser = require('robots-parser');
const axios = require('axios');
const { OpenAI } = require('openai');
const { URL } = require('url');

class RobotsCompliantAIScraper {
  constructor(userAgent = 'MyAIBot/1.0') {
    this.userAgent = userAgent;
    this.robotsParsers = new Map();
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
  }

  async getRobotsParser(baseUrl) {
    if (!this.robotsParsers.has(baseUrl)) {
      const robotsUrl = new URL('/robots.txt', baseUrl).href;
      try {
        const response = await axios.get(robotsUrl);
        const parser = robotsParser(robotsUrl, response.data);
        this.robotsParsers.set(baseUrl, parser);
      } catch (error) {
        console.log(`Error fetching robots.txt: ${error.message}`);
        // Create permissive (empty) parser if robots.txt doesn't exist
        const parser = robotsParser(robotsUrl, '');
        this.robotsParsers.set(baseUrl, parser);
      }
    }
    return this.robotsParsers.get(baseUrl);
  }

  async canFetch(url) {
    const parsedUrl = new URL(url);
    const baseUrl = `${parsedUrl.protocol}//${parsedUrl.host}`;
    const parser = await this.getRobotsParser(baseUrl);
    return parser.isAllowed(url, this.userAgent);
  }

  async getCrawlDelay(url) {
    const parsedUrl = new URL(url);
    const baseUrl = `${parsedUrl.protocol}//${parsedUrl.host}`;
    const parser = await this.getRobotsParser(baseUrl);
    return parser.getCrawlDelay(this.userAgent) || 0;
  }

  async scrapeWithAI(url, prompt) {
    // Check robots.txt before fetching
    const allowed = await this.canFetch(url);
    if (!allowed) {
      throw new Error(`robots.txt disallows fetching ${url}`);
    }

    // Respect crawl delay
    const delay = await this.getCrawlDelay(url);
    if (delay > 0) {
      console.log(`Respecting crawl delay: ${delay} seconds`);
      await new Promise(resolve => setTimeout(resolve, delay * 1000));
    }

    // Fetch the page
    const response = await axios.get(url, {
      headers: { 'User-Agent': this.userAgent }
    });

    // Use GPT to extract data
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        {
          role: 'system',
          content: 'You are a web scraping assistant. Extract structured data from HTML.'
        },
        {
          role: 'user',
          content: `${prompt}\n\nHTML:\n${response.data.substring(0, 15000)}`
        }
      ]
    });

    return completion.choices[0].message.content;
  }
}
// Usage example
(async () => {
  const scraper = new RobotsCompliantAIScraper('MyCompany-AIBot/1.0');

  try {
    const url = 'https://example.com/products/laptop';

    if (await scraper.canFetch(url)) {
      const data = await scraper.scrapeWithAI(
        url,
        'Extract the product name, price, and description as JSON'
      );
      console.log(data);
    } else {
      console.log(`Cannot scrape ${url} - disallowed by robots.txt`);
    }
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
})();
Best Practices for AI Web Scraping with robots.txt
1. Use a Descriptive User Agent
Always identify your AI scraper with a clear, descriptive user agent string:
user_agent = "MyCompanyBot/1.0 (+https://mycompany.com/bot-info)"
This allows website administrators to:
- Identify your bot in their logs
- Contact you if there are issues
- Set specific rules for your bot in robots.txt (see the example below)
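For example, a site administrator could then target your bot by name. This snippet is illustrative and uses the hypothetical MyCompanyBot identifier from above:
User-agent: MyCompanyBot
Disallow: /internal/
Crawl-delay: 5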
2. Respect Crawl Delays
Many robots.txt files specify a Crawl-delay directive:
User-agent: *
Crawl-delay: 10
Even if you're using AI to process fewer pages, respect these delays to avoid overwhelming the server.
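The scraper classes above simply sleep for the full delay before every request. If you fetch several pages per domain, a slightly tighter approach is to remember when you last hit each domain and sleep only for the remaining time. Here is a minimal sketch; the CrawlDelayLimiter class is illustrative and not part of the implementations above:
import time

class CrawlDelayLimiter:
    """Tracks the last request time per domain and sleeps only as long as needed."""

    def __init__(self):
        self.last_request = {}  # domain -> timestamp of the most recent fetch

    def wait(self, domain, delay):
        """Block until `delay` seconds have passed since the last request to `domain`."""
        if delay > 0 and domain in self.last_request:
            elapsed = time.time() - self.last_request[domain]
            if elapsed < delay:
                time.sleep(delay - elapsed)
        self.last_request[domain] = time.time()

# Hypothetical usage, just before each requests.get call:
# limiter.wait(urlparse(url).netloc, scraper.get_crawl_delay(url))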
3. Cache robots.txt
Don't fetch robots.txt for every request. Cache it for a reasonable period (e.g., 24 hours):
import urllib.robotparser
from datetime import datetime, timedelta
from urllib.parse import urljoin

class CachedRobotsParser:
    def __init__(self):
        self.cache = {}  # base_url -> (parser, fetch timestamp)
        self.cache_duration = timedelta(hours=24)

    def get_parser(self, base_url):
        now = datetime.now()
        if base_url in self.cache:
            parser, timestamp = self.cache[base_url]
            if now - timestamp < self.cache_duration:
                return parser

        # Fetch and cache new parser
        parser = self._fetch_robots_parser(base_url)
        self.cache[base_url] = (parser, now)
        return parser

    def _fetch_robots_parser(self, base_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(base_url, '/robots.txt'))
        rp.read()
        return rp
4. Handle robots.txt Errors Gracefully
If robots.txt is unavailable (404, timeout, etc.), handle the failure deliberately instead of crashing. The snippet below assumes everything is allowed but logs a warning so you can proceed cautiously:
def get_robots_parser_safe(url):
    try:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(url, '/robots.txt'))
        rp.read()
        return rp
    except Exception:
        # If robots.txt is unavailable, assume everything is allowed
        # but log the issue and proceed cautiously
        print("Warning: Could not fetch robots.txt, proceeding cautiously")
        return None
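If you prefer a stricter fallback than the permissive snippet above, a common convention (and roughly what RFC 9309 suggests) is to treat a missing robots.txt (4xx) as "no restrictions" but a server error or timeout as "disallow everything until the file can be fetched". A minimal sketch of that policy, using only the standard parse() API; the function name and defaults are illustrative:
import requests
import urllib.robotparser
from urllib.parse import urljoin

DISALLOW_EVERYTHING = ['User-agent: *', 'Disallow: /']

def get_robots_parser_strict(base_url, user_agent="MyAIBot/1.0", timeout=10):
    """Fetch robots.txt; fail closed on server/network errors, open on 4xx."""
    robots_url = urljoin(base_url, '/robots.txt')
    rp = urllib.robotparser.RobotFileParser(robots_url)
    try:
        response = requests.get(robots_url,
                                headers={'User-Agent': user_agent},
                                timeout=timeout)
    except requests.RequestException:
        # Network error or timeout: treat the whole site as off-limits for now
        rp.parse(DISALLOW_EVERYTHING)
        return rp
    if response.status_code >= 500:
        # Server error: robots.txt is unreachable, so fail closed
        rp.parse(DISALLOW_EVERYTHING)
    elif response.status_code >= 400:
        # 4xx (typically 404): no robots.txt means no restrictions
        rp.parse([])
    else:
        rp.parse(response.text.splitlines())
    return rp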
AI-Specific Considerations
Pre-filtering Content Before AI Processing
When working with AI scrapers, you might want to fetch multiple URLs and batch process them. Ensure robots.txt compliance happens before batching:
def batch_scrape_with_ai(urls, prompt):
    # Filter URLs based on robots.txt
    allowed_urls = []
    for url in urls:
        if scraper.can_fetch(url):
            allowed_urls.append(url)
        else:
            print(f"Skipping {url} - disallowed by robots.txt")

    # Fetch allowed URLs
    html_contents = []
    for url in allowed_urls:
        response = requests.get(url, headers={'User-Agent': scraper.user_agent})
        html_contents.append(response.text)

    # Process with AI
    # ... AI processing logic
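The elided AI step could be as simple as one model call per fetched page. Here is one possible sketch, reusing the OpenAI client pattern from the class above; the extract_with_ai helper and its parameters are illustrative:
def extract_with_ai(html_contents, prompt, client, model="gpt-4-turbo-preview"):
    """Send each fetched page to the model and collect the extracted results."""
    results = []
    for html in html_contents:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a web scraping assistant. Extract structured data from HTML."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html[:15000]}"},  # limit context size
            ],
        )
        results.append(completion.choices[0].message.content)
    return results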
Browser Automation with AI
If you're using browser automation tools like Puppeteer or Playwright alongside AI processing, check robots.txt before navigating:
const puppeteer = require('puppeteer');

async function scrapeWithBrowser(url, aiPrompt) {
  // Check robots.txt first
  const scraper = new RobotsCompliantAIScraper();
  if (!await scraper.canFetch(url)) {
    throw new Error('URL disallowed by robots.txt');
  }

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set user agent
  await page.setUserAgent(scraper.userAgent);

  // Navigate and extract content
  await page.goto(url);
  const html = await page.content();
  await browser.close();

  // Process with AI (assumes a processWithAI(html, prompt) helper that sends the
  // HTML to the model, like scrapeWithAI above but without fetching the page itself)
  return await scraper.processWithAI(html, aiPrompt);
}
For more details on browser automation, see our guide on handling browser sessions in Puppeteer.
Testing Your robots.txt Compliance
Command-Line Testing
You can test robots.txt compliance from the command line:
# Python
python -c "import urllib.robotparser; rp = urllib.robotparser.RobotFileParser(); rp.set_url('https://example.com/robots.txt'); rp.read(); print(rp.can_fetch('MyBot/1.0', 'https://example.com/page'))"
# Using curl to view robots.txt
curl https://example.com/robots.txt
Unit Testing
Include robots.txt compliance in your test suite:
import unittest
from unittest.mock import patch, MagicMock

class TestRobotsCompliance(unittest.TestCase):
    def setUp(self):
        self.scraper = RobotsCompliantAIScraper()

    @patch('urllib.robotparser.RobotFileParser.read')  # avoid real network calls
    @patch('urllib.robotparser.RobotFileParser.can_fetch')
    def test_respects_robots_txt(self, mock_can_fetch, mock_read):
        mock_can_fetch.return_value = False
        with self.assertRaises(PermissionError):
            self.scraper.scrape_with_ai(
                'https://example.com/disallowed',
                'Extract data'
            )

    @patch('urllib.robotparser.RobotFileParser.read')
    @patch('urllib.robotparser.RobotFileParser.can_fetch')
    def test_allows_permitted_urls(self, mock_can_fetch, mock_read):
        mock_can_fetch.return_value = True
        # Should not raise an exception
        # ... rest of test
Legal and Ethical Considerations
Following robots.txt is not just a technical requirement—it's an ethical obligation and can have legal implications:
- Respect website owner wishes: robots.txt represents the website owner's preferences
- Avoid legal issues: While robots.txt compliance isn't legally binding in all jurisdictions, ignoring it can support claims of unauthorized access
- Maintain good relationships: If you're scraping for commercial purposes, respecting robots.txt helps maintain positive relationships with data sources
- API alternatives: Many sites that restrict scraping via robots.txt offer APIs—consider using those instead
When building AI-powered scrapers, consider the broader implications of data collection. For help understanding and optimizing your scraping behavior, see our guide on monitoring network requests in Puppeteer.
Conclusion
Respecting robots.txt when using AI web scrapers is essential for ethical and responsible data collection. By implementing proper robots.txt parsing, caching, and compliance checking before your AI processing layer, you ensure that your scraping activities align with website owners' preferences and industry standards.
Remember that AI scrapers should follow the same rules as traditional scrapers when it comes to respecting access controls. The AI component processes the data after it's collected—the collection itself must still be done responsibly and in compliance with robots.txt directives.
For more advanced scraping scenarios, explore our article on handling authentication in Puppeteer to learn about accessing protected content responsibly.