How do I crawl an entire website with Firecrawl?

Firecrawl provides a powerful /crawl endpoint that allows you to crawl entire websites efficiently. Unlike single-page scraping, the crawl endpoint automatically discovers and follows links, respects robots.txt, and returns structured data from all pages on a domain. This guide covers everything you need to know about crawling entire websites with Firecrawl.

Understanding Firecrawl's Crawl Endpoint

The Firecrawl /crawl endpoint is designed specifically for crawling multiple pages from a website. When you submit a crawl request, Firecrawl:

  • Starts from the provided URL
  • Automatically discovers links on the page
  • Follows internal links within the same domain
  • Converts HTML to clean markdown format
  • Extracts metadata from each page
  • Returns structured data for all discovered pages

This makes it ideal for tasks like documentation scraping, content migration, SEO analysis, and building knowledge bases from entire websites.
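
If you prefer to call the REST API directly rather than going through an SDK, a crawl starts with a single POST to the /crawl endpoint. The following is a minimal sketch using Python's requests library, assuming the v0-style base URL, path, and payload shape that the SDK examples below wrap; check Firecrawl's API reference for the exact current values before relying on them:

import requests

# Assumed v0-style endpoint and payload; verify against the current
# Firecrawl API reference before using this in production.
response = requests.post(
    'https://api.firecrawl.dev/v0/crawl',
    headers={'Authorization': 'Bearer your_api_key_here'},
    json={
        'url': 'https://example.com',
        'crawlerOptions': {'limit': 100}
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # typically returns a job ID that you poll for results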

Basic Website Crawl with Python

Here's how to crawl an entire website using Firecrawl's Python SDK:

from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key_here')

# Start crawling a website
crawl_result = app.crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 100,  # Maximum number of pages to crawl
    }
})

# Process the results
if crawl_result['success']:
    for page in crawl_result['data']:
        print(f"URL: {page['url']}")
        print(f"Title: {page['metadata']['title']}")
        print(f"Content: {page['markdown'][:200]}...")
        print("---")

This basic example will crawl up to 100 pages starting from the specified URL, following internal links automatically.

Crawling Websites with JavaScript/Node.js

For JavaScript developers, Firecrawl offers a Node.js SDK with similar functionality:

const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function crawlWebsite() {
    const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

    try {
        const crawlResult = await app.crawlUrl('https://example.com', {
            crawlerOptions: {
                limit: 100,
                excludePaths: ['admin/*', 'api/*'],
                includePaths: ['docs/*', 'blog/*']
            }
        });

        console.log(`Crawled ${crawlResult.data.length} pages`);

        crawlResult.data.forEach(page => {
            console.log(`URL: ${page.url}`);
            console.log(`Title: ${page.metadata.title}`);
            console.log('---');
        });
    } catch (error) {
        console.error('Crawl failed:', error);
    }
}

crawlWebsite();

Advanced Crawl Configuration Options

Firecrawl provides extensive configuration options to control crawl behavior:

Limiting Crawl Scope

crawl_params = {
    'crawlerOptions': {
        'limit': 50,  # Maximum pages to crawl
        'maxDepth': 3,  # Maximum link depth from starting URL
        'excludePaths': [
            'admin/*',
            'login',
            '*/comments/*'
        ],
        'includePaths': [
            'blog/*',
            'docs/*'
        ]
    }
}

result = app.crawl_url('https://example.com', params=crawl_params)

Controlling Crawl Speed and Concurrency

const crawlOptions = {
    crawlerOptions: {
        limit: 200,
        maxConcurrency: 5,  // Number of concurrent requests
        delay: 1000  // Delay between requests in milliseconds
    }
};

const result = await app.crawlUrl('https://example.com', crawlOptions);

Asynchronous Crawling for Large Websites

For crawling large websites, Firecrawl supports asynchronous crawling where you start a crawl job and poll for results:

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key_here')

# Start an async crawl
crawl_job = app.async_crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 500
    }
})

job_id = crawl_job['jobId']
print(f"Crawl job started: {job_id}")

# Poll for completion
while True:
    status = app.check_crawl_status(job_id)

    if status['status'] == 'completed':
        print(f"Crawl completed! Found {len(status['data'])} pages")

        # Process results
        for page in status['data']:
            print(f"Processing: {page['url']}")
        break
    elif status['status'] == 'failed':
        print("Crawl failed:", status.get('error'))
        break
    else:
        print(f"Status: {status['status']}, Progress: {status.get('progress', 0)}%")
        time.sleep(10)  # Wait 10 seconds before checking again

As with handling browser sessions in Puppeteer, managing long-running crawl sessions requires careful state management and error handling.
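
One straightforward way to manage that state is to persist the job ID as soon as the crawl starts, so a restarted script can resume polling instead of launching a duplicate crawl. The sketch below reuses the async_crawl_url and check_crawl_status calls shown above; the job_state.json file and the terminal-status check are assumptions you may want to adapt:

import json
import os
import time
from firecrawl import FirecrawlApp

STATE_FILE = 'job_state.json'  # hypothetical local file used to persist the job ID

app = FirecrawlApp(api_key='your_api_key_here')

# Resume an existing job if a previous run already started one
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        job_id = json.load(f)['jobId']
else:
    job = app.async_crawl_url('https://example.com', params={
        'crawlerOptions': {'limit': 500}
    })
    job_id = job['jobId']
    with open(STATE_FILE, 'w') as f:
        json.dump({'jobId': job_id}, f)

# Poll until the job reaches a terminal state
while True:
    status = app.check_crawl_status(job_id)
    if status['status'] in ('completed', 'failed'):
        print(f"Job {job_id} finished with status: {status['status']}")
        os.remove(STATE_FILE)  # clear the state so the next run starts fresh
        break
    time.sleep(10)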

JavaScript Async Crawl Example

const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function asyncCrawl() {
    const app = new FirecrawlApp({ apiKey: 'your_api_key_here' });

    // Start the crawl
    const crawlJob = await app.asyncCrawlUrl('https://example.com', {
        crawlerOptions: {
            limit: 500,
            includePaths: ['docs/*']
        }
    });

    console.log(`Job ID: ${crawlJob.jobId}`);

    // Poll for results
    const checkStatus = async () => {
        const status = await app.checkCrawlStatus(crawlJob.jobId);

        if (status.status === 'completed') {
            console.log(`Crawl completed with ${status.data.length} pages`);
            return status.data;
        } else if (status.status === 'failed') {
            throw new Error(`Crawl failed: ${status.error}`);
        } else {
            console.log(`Status: ${status.status}`);
            await new Promise(resolve => setTimeout(resolve, 10000));
            return checkStatus();
        }
    };

    const results = await checkStatus();
    return results;
}

asyncCrawl().catch(console.error);

Filtering and Targeting Specific Content

You can use path patterns to crawl only specific sections of a website:

# Crawl only blog posts and documentation
params = {
    'crawlerOptions': {
        'includePaths': [
            'blog/*/posts/*',
            'docs/**'
        ],
        'excludePaths': [
            '*/draft/*',
            '*/preview/*'
        ],
        'limit': 300
    }
}

result = app.crawl_url('https://example.com', params=params)

Extracting Structured Data During Crawl

Firecrawl can extract structured data from pages during the crawl process:

const crawlOptions = {
    crawlerOptions: {
        limit: 100
    },
    pageOptions: {
        onlyMainContent: true,
        includeHtml: false,
        screenshot: false
    },
    extractorOptions: {
        mode: 'llm-extraction',
        extractionSchema: {
            type: 'object',
            properties: {
                title: { type: 'string' },
                author: { type: 'string' },
                publishDate: { type: 'string' },
                tags: {
                    type: 'array',
                    items: { type: 'string' }
                }
            }
        }
    }
};

const result = await app.crawlUrl('https://blog.example.com', crawlOptions);

result.data.forEach(page => {
    console.log('Extracted data:', page.extractedData);
});

Handling JavaScript-Rendered Content

Many modern websites rely heavily on JavaScript to render content. When crawling single-page applications, you need to ensure JavaScript executes before extracting content:

params = {
    'crawlerOptions': {
        'limit': 50
    },
    'pageOptions': {
        'waitFor': 2000,  # Wait 2 seconds for JavaScript to execute
        'screenshot': False
    }
}

result = app.crawl_url('https://spa-example.com', params=params)

Respecting Robots.txt and Rate Limiting

Firecrawl automatically respects robots.txt directives by default. You can also configure rate limiting to be a good web citizen:

const crawlOptions = {
    crawlerOptions: {
        limit: 200,
        respectRobotsTxt: true,
        delay: 2000,  // 2 second delay between requests
        maxConcurrency: 3  // Limit concurrent requests
    }
};

const result = await app.crawlUrl('https://example.com', crawlOptions);

Monitoring Crawl Progress

For large crawls, monitoring progress is essential:

import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Start crawl
job = app.async_crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 1000}
})

# Monitor progress
while True:
    status = app.check_crawl_status(job['jobId'])

    completed = status.get('completed', 0)
    total = status.get('total', 0)

    if total > 0:
        progress = (completed / total) * 100
        print(f"Progress: {completed}/{total} ({progress:.1f}%)")

    if status['status'] == 'completed':
        print("Crawl finished successfully!")
        break
    elif status['status'] == 'failed':
        print(f"Crawl failed: {status.get('error')}")
        break

    time.sleep(5)

Error Handling and Retry Logic

Implement robust error handling for production crawls:

async function robustCrawl(url, maxRetries = 3) {
    const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            const result = await app.crawlUrl(url, {
                crawlerOptions: {
                    limit: 200,
                    timeout: 30000
                }
            });

            console.log(`Successfully crawled ${result.data.length} pages`);
            return result;

        } catch (error) {
            console.error(`Attempt ${attempt} failed:`, error.message);

            if (attempt < maxRetries) {
                const delay = Math.pow(2, attempt) * 1000;
                console.log(`Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            } else {
                throw error;
            }
        }
    }
}

As with handling timeouts in Puppeteer, proper timeout and retry configuration is essential for reliable crawling operations.
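
For reference, here is a rough Python counterpart to the retry logic above, built around the same crawl_url call used in the earlier examples; retrying on any exception and the 2/4/8-second backoff schedule are simplifications you should tune for the error types you actually see:

import time
from firecrawl import FirecrawlApp

def robust_crawl(url, max_retries=3):
    app = FirecrawlApp(api_key='your_api_key_here')

    for attempt in range(1, max_retries + 1):
        try:
            # Same crawl_url call as in the earlier examples
            return app.crawl_url(url, params={'crawlerOptions': {'limit': 200}})
        except Exception as error:
            print(f"Attempt {attempt} failed: {error}")
            if attempt == max_retries:
                raise
            delay = 2 ** attempt  # exponential backoff: 2s, 4s, 8s, ...
            print(f"Retrying in {delay}s...")
            time.sleep(delay)

result = robust_crawl('https://example.com')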

Saving Crawl Results

Store crawl results for later processing:

import json
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

result = app.crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 100}
})

# Save to JSON file
with open('crawl_results.json', 'w', encoding='utf-8') as f:
    json.dump(result['data'], f, indent=2, ensure_ascii=False)

# Save individual markdown files
import os
os.makedirs('crawled_pages', exist_ok=True)

for i, page in enumerate(result['data']):
    filename = f"crawled_pages/page_{i:04d}.md"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"# {page['metadata'].get('title', 'Untitled')}\n\n")
        f.write(f"URL: {page['url']}\n\n")
        f.write(page['markdown'])

Best Practices for Website Crawling

  1. Start Small: Test with a small limit first, then scale up (see the sketch after this list)
  2. Use Path Filters: Exclude unnecessary sections like admin panels, login pages, and APIs
  3. Respect Rate Limits: Configure appropriate delays to avoid overwhelming servers
  4. Monitor Costs: Each page crawled consumes API credits
  5. Handle Errors Gracefully: Implement retry logic and proper error handling
  6. Store Results Efficiently: Save data in structured formats for easy processing
  7. Check Robots.txt: Ensure you're respecting website crawling policies
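
As a quick illustration of the "start small" advice, run a trial crawl with a tiny limit, check which URLs it actually picked up, and only then raise the limit. This sketch reuses the crawl_url call from the earlier examples and assumes the result exposes a 'data' list as shown above:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key_here')

# Trial run: a handful of pages is enough to verify scope and path filters
trial = app.crawl_url('https://example.com', params={
    'crawlerOptions': {'limit': 5, 'includePaths': ['docs/*']}
})

for page in trial.get('data', []):
    print(page['url'])  # confirm these are the pages you expect

# Once the trial output looks right, rerun with a production-sized limit,
# e.g. 'limit': 500, keeping the same path filters.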

Conclusion

Firecrawl's crawl endpoint provides a powerful, developer-friendly way to crawl entire websites without managing complex infrastructure. By leveraging its automatic link discovery, JavaScript rendering, and structured data extraction, you can build robust web scraping solutions that scale from small blogs to large enterprise websites.

Whether you're building a search engine, migrating content, or conducting SEO analysis, Firecrawl's crawling capabilities offer a reliable foundation for your web data extraction needs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
