How does Firecrawl handle robots.txt files?
Firecrawl is designed to respect robots.txt files by default, following web scraping best practices and ethical guidelines. Understanding how Firecrawl handles these directives is crucial for developers building compliant web scraping solutions.
What is robots.txt?
The robots.txt file is a standard used by websites to communicate with web crawlers and automated agents. It specifies which parts of a website can be crawled and which should be avoided. This protocol, while not legally binding in most jurisdictions, represents a website's preferences for automated access.
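For example, a short (hypothetical) robots.txt for example.com might block an admin area, ask crawlers to wait five seconds between requests, and point to a sitemap:
User-agent: *
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml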
Firecrawl's Default Behavior
By default, Firecrawl respects robots.txt directives when crawling websites. This means:
- Automatic Detection: Firecrawl automatically checks for robots.txt files at the root of each domain (see the sketch after this list)
- Directive Compliance: The crawler follows disallow rules, crawl delays, and other directives
- User-Agent Respect: Firecrawl identifies itself properly and follows user-agent-specific rules
- Ethical Scraping: This approach ensures your scraping activities remain respectful of website owners' preferences
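As a concrete illustration of the first point above, the short sketch below (plain Python, independent of Firecrawl's internals) shows where a crawler looks for robots.txt: always at the root of the scheme and host, regardless of which page you start from.
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the domain serving page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url('https://example.com/blog/post-1'))
# https://example.com/robots.txt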
Using Firecrawl with robots.txt Compliance
When using Firecrawl's crawl endpoint, the service automatically handles robots.txt checking. Here's how to implement a basic crawl:
JavaScript/Node.js Example
import FirecrawlApp from '@firecrawl/firecrawl';

const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });

async function crawlWebsite() {
  try {
    const result = await app.crawlUrl('https://example.com', {
      limit: 100,
      scrapeOptions: {
        formats: ['markdown', 'html'],
      },
    });
    console.log('Crawl completed:', result);
  } catch (error) {
    console.error('Crawl failed:', error.message);
  }
}

crawlWebsite();
Python Example
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='YOUR_API_KEY')
def crawl_website():
    try:
        result = app.crawl_url(
            'https://example.com',
            params={
                'limit': 100,
                'scrapeOptions': {
                    'formats': ['markdown', 'html']
                }
            }
        )
        print('Crawl completed:', result)
    except Exception as e:
        print('Crawl failed:', str(e))

crawl_website()
In both examples, Firecrawl automatically checks and respects the robots.txt file at https://example.com/robots.txt before beginning the crawl operation.
Understanding robots.txt Directives
Firecrawl processes several key robots.txt directives:
User-Agent Directive
User-agent: *
Disallow: /admin/
Disallow: /private/
Firecrawl respects the User-agent directive and follows the rules that apply to its specific user-agent string or to the wildcard (*) pattern.
Crawl-Delay Directive
User-agent: *
Crawl-delay: 10
The crawl-delay directive specifies the minimum time (in seconds) between requests. Firecrawl honors this to avoid overwhelming servers, similar to how you might handle timeouts in Puppeteer when building custom scrapers.
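If you build your own request loop outside of Firecrawl (for example, scraping a list of URLs one at a time), you can read the delay yourself with Python's standard-library urllib.robotparser and sleep between requests. This is a minimal sketch under that assumption, not a description of Firecrawl's internal scheduler:
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()  # fetch and parse the file

# crawl_delay() returns the delay in seconds for the given user agent, or None
delay = rp.crawl_delay('*') or 1  # assume a 1-second fallback if none is set

for url in ['https://example.com/page1', 'https://example.com/page2']:
    # ... fetch or scrape the page here ...
    time.sleep(delay)  # wait between requests, as the site asked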
Allow Directive
User-agent: *
Disallow: /search/
Allow: /search/public/
The Allow directive creates exceptions to Disallow rules. When rules overlap, Firecrawl respects the most specific match, so in the example above /search/public/ remains crawlable even though the rest of /search/ is disallowed.
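That precedence rule is easy to reproduce yourself. The simplified sketch below illustrates the longest-match behavior (it is an illustration of the rule, not Firecrawl's actual matcher): the longest matching prefix wins, and Allow wins a tie.
def is_allowed(path, rules):
    """rules: list of (directive, prefix) pairs, e.g. ('Disallow', '/search/')."""
    # Collect every rule whose prefix matches the path
    matches = [(d, p) for d, p in rules if path.startswith(p)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    # The most specific (longest) prefix wins; Allow beats Disallow on a tie
    directive, _ = max(matches, key=lambda r: (len(r[1]), r[0] == 'Allow'))
    return directive == 'Allow'

rules = [('Disallow', '/search/'), ('Allow', '/search/public/')]
print(is_allowed('/search/results', rules))         # False
print(is_allowed('/search/public/results', rules))  # True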
Configuring Crawl Behavior
While Firecrawl respects robots.txt by default, you can configure various aspects of the crawling behavior:
Setting Maximum Pages
const result = await app.crawlUrl('https://example.com', {
  limit: 50, // Maximum number of pages to crawl
  scrapeOptions: {
    formats: ['markdown'],
    onlyMainContent: true
  }
});
Excluding Specific Paths
const result = await app.crawlUrl('https://example.com', {
  limit: 100,
  excludePaths: ['/login', '/admin/*'], // Additional paths to exclude
  scrapeOptions: {
    formats: ['markdown', 'html']
  }
});
Including Specific Paths
const result = await app.crawlUrl('https://example.com', {
  limit: 100,
  includePaths: ['/blog/*', '/articles/*'], // Only crawl these paths
  scrapeOptions: {
    formats: ['markdown']
  }
});
Handling robots.txt Restrictions
When a website's robots.txt file restricts crawling, Firecrawl will skip the disallowed URLs. Here's how to handle potential restrictions:
Error Handling
async function smartCrawl(url) {
  try {
    const result = await app.crawlUrl(url, {
      limit: 100,
      scrapeOptions: {
        formats: ['markdown']
      }
    });
    if (result.success) {
      console.log(`Successfully crawled ${result.data.length} pages`);
      return result.data;
    }
  } catch (error) {
    if (error.message.includes('robots.txt')) {
      console.log('Crawling restricted by robots.txt');
      // Fallback to single page scraping
      return await scrapeSinglePage(url);
    }
    throw error;
  }
}

async function scrapeSinglePage(url) {
  const result = await app.scrapeUrl(url, {
    formats: ['markdown', 'html']
  });
  return [result];
}
Python Error Handling
def smart_crawl(url):
    try:
        result = app.crawl_url(
            url,
            params={
                'limit': 100,
                'scrapeOptions': {
                    'formats': ['markdown']
                }
            }
        )
        if result.get('success'):
            print(f"Successfully crawled {len(result['data'])} pages")
            return result['data']
    except Exception as e:
        if 'robots.txt' in str(e):
            print('Crawling restricted by robots.txt')
            # Fallback to single page scraping
            return scrape_single_page(url)
        raise e

def scrape_single_page(url):
    result = app.scrape_url(
        url,
        params={
            'formats': ['markdown', 'html']
        }
    )
    return [result]
Best Practices for robots.txt Compliance
1. Always Respect robots.txt
Even when technically possible to bypass robots.txt, respecting these directives maintains good relationships with website owners and avoids potential legal issues.
2. Check robots.txt Manually
Before starting a large crawling project, manually review the target website's robots.txt file:
curl https://example.com/robots.txt
3. Implement Rate Limiting
Even when robots.txt doesn't specify a crawl delay, implement your own rate limiting to avoid overwhelming servers, similar to handling browser sessions in Puppeteer for controlled access.
const result = await app.crawlUrl('https://example.com', {
  limit: 100,
  maxConcurrency: 2, // Limit concurrent requests
  scrapeOptions: {
    formats: ['markdown']
  }
});
4. Use Appropriate User-Agent
Firecrawl automatically provides a proper user-agent, but ensure you're using the official SDK to maintain proper identification.
5. Monitor Crawl Results
Always check the results to ensure you're not hitting restricted areas:
async function auditCrawl(url) {
  const result = await app.crawlUrl(url, {
    limit: 100
  });
  // Check for any blocked or restricted URLs
  const blocked = result.data.filter(page =>
    page.statusCode === 403 || page.statusCode === 401
  );
  if (blocked.length > 0) {
    console.log('Some pages were blocked:', blocked);
  }
  return result;
}
Alternatives When Crawling is Restricted
If robots.txt prevents crawling but you still need data from a website:
1. Single Page Scraping
Use Firecrawl's scrape endpoint for individual pages:
const page = await app.scrapeUrl('https://example.com/specific-page', {
  formats: ['markdown', 'html']
});
2. Manual URL List
Provide specific URLs that are allowed by robots.txt:
const allowedUrls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

const results = await Promise.all(
  allowedUrls.map(url => app.scrapeUrl(url, {
    formats: ['markdown']
  }))
);
3. Contact Website Owner
For legitimate use cases, consider contacting the website owner to request access or permission for automated scraping.
Comparing with Other Tools
Unlike some web scraping tools that ignore robots.txt by default, Firecrawl's approach prioritizes ethical scraping. This is similar to how modern browser automation tools, such as those used for crawling single-page applications, can be configured to respect website policies.
Checking robots.txt Programmatically
You can also check robots.txt yourself before initiating a crawl:
const fetch = require('node-fetch');

async function checkRobotsTxt(domain) {
  try {
    const response = await fetch(`${domain}/robots.txt`);
    const content = await response.text();
    console.log('robots.txt content:');
    console.log(content);
    // Parse for specific directives
    const lines = content.split('\n');
    const disallowed = lines
      .filter(line => line.toLowerCase().startsWith('disallow:'))
      .map(line => line.split(':')[1].trim());
    return {
      exists: response.ok,
      disallowed: disallowed
    };
  } catch (error) {
    console.error('Error checking robots.txt:', error);
    return { exists: false, disallowed: [] };
  }
}

// Usage
checkRobotsTxt('https://example.com');
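If you prefer Python, the standard library's urllib.robotparser handles the fetching and parsing for you, so you can ask directly whether a given URL may be crawled. A brief sketch, using the wildcard user agent for illustration:
from urllib.robotparser import RobotFileParser

def check_robots(domain, path='/', user_agent='*'):
    rp = RobotFileParser(f'{domain}/robots.txt')
    rp.read()  # download and parse the file
    return rp.can_fetch(user_agent, f'{domain}{path}')

# Usage
print(check_robots('https://example.com', '/admin/'))  # False if /admin/ is disallowed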
Conclusion
Firecrawl's respect for robots.txt files demonstrates its commitment to ethical web scraping practices. By default, the tool automatically detects and follows robots.txt directives, including disallow rules, crawl delays, and user-agent specifications. Developers using Firecrawl benefit from built-in compliance while maintaining the flexibility to configure crawling behavior through parameters like limits, include/exclude paths, and concurrency settings.
When working with websites that have strict robots.txt restrictions, consider alternative approaches such as single-page scraping, manual URL lists, or reaching out to website owners for permission. Always implement proper error handling and rate limiting to ensure your scraping activities remain respectful and compliant with both technical and ethical standards.
By following these best practices and leveraging Firecrawl's built-in robots.txt handling, you can build robust, ethical web scraping solutions that respect website owners' preferences while still gathering the data you need for your applications.