
Can Firecrawl Extract Data from Multiple Pages Simultaneously?

Yes, Firecrawl can extract data from multiple pages simultaneously through its batch crawling capabilities and asynchronous job processing. This powerful feature allows developers to efficiently scrape large websites by crawling multiple URLs in parallel, significantly reducing the total time required for data extraction.

Understanding Firecrawl's Batch Crawling

Firecrawl provides a dedicated /crawl endpoint that automatically discovers and processes multiple pages from a website concurrently. Unlike single-page scraping, batch crawling handles the entire workflow of page discovery, parallel processing, and data aggregation.

How Concurrent Processing Works

When you initiate a crawl job, Firecrawl:

  1. Discovers URLs by following links from the starting URL
  2. Processes multiple pages in parallel using distributed workers
  3. Aggregates results into a single response
  4. Manages rate limiting to avoid overwhelming target servers

This architecture allows Firecrawl to handle dynamic content and SPAs efficiently across multiple pages without requiring you to manage the complexity of parallel browser instances.
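
Under the hood, this is a job-based REST workflow: start a crawl, poll its status, then read the aggregated results. The sketch below shows that lifecycle with Python's requests library; the endpoint path, API version, and response field names are assumptions here, so confirm them against the current Firecrawl API reference.

import time
import requests

API_KEY = 'your_api_key'
BASE_URL = 'https://api.firecrawl.dev/v1'  # assumed API base path
headers = {'Authorization': f'Bearer {API_KEY}'}

# 1. Start a crawl job; URL discovery and parallel scraping happen server-side
start = requests.post(
    f'{BASE_URL}/crawl',
    headers=headers,
    json={'url': 'https://example.com', 'limit': 10},
)
job_id = start.json()['id']  # assumed response field

# 2. Poll while Firecrawl's distributed workers process pages in parallel
while True:
    status = requests.get(f'{BASE_URL}/crawl/{job_id}', headers=headers).json()
    if status['status'] in ('completed', 'failed'):
        break
    time.sleep(2)

# 3. Results for every crawled page come back aggregated in one place
print(f"Status: {status['status']}, pages returned: {len(status.get('data', []))}")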

Batch Crawling with Python

Here's how to crawl multiple pages simultaneously using the Firecrawl Python SDK:

from firecrawl import FirecrawlApp

# Initialize the Firecrawl client
app = FirecrawlApp(api_key='your_api_key')

# Start a batch crawl job
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,  # Maximum number of pages to crawl
        'scrapeOptions': {
            'formats': ['markdown', 'html'],
            'onlyMainContent': True
        }
    },
    wait_until_done=True  # Wait for all pages to complete
)

# Access results from all crawled pages
for page in crawl_result['data']:
    print(f"URL: {page['metadata']['url']}")
    print(f"Content: {page['markdown'][:200]}...")
    print("---")

Asynchronous Crawling for Better Performance

For long-running crawls, start the job without blocking and poll its status instead of waiting synchronously:

import asyncio
from firecrawl import FirecrawlApp

async def crawl_website():
    app = FirecrawlApp(api_key='your_api_key')

    # Start crawl without waiting
    crawl_job = app.crawl_url(
        url='https://example.com',
        params={'limit': 50},
        wait_until_done=False
    )

    job_id = crawl_job['jobId']
    print(f"Crawl job started: {job_id}")

    # Poll for status
    while True:
        status = app.check_crawl_status(job_id)

        if status['status'] == 'completed':
            print(f"Crawled {status['total']} pages")
            return status['data']
        elif status['status'] == 'failed':
            raise Exception("Crawl failed")

        await asyncio.sleep(2)  # Check every 2 seconds

# Run the async function
results = asyncio.run(crawl_website())

Batch Crawling with JavaScript/Node.js

The JavaScript SDK provides similar functionality for concurrent page extraction:

const FirecrawlApp = require('@mendable/firecrawl-js').default;

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function crawlWebsite() {
    // Start batch crawl
    const crawlResult = await app.crawlUrl('https://example.com', {
        limit: 100,
        scrapeOptions: {
            formats: ['markdown', 'html'],
            onlyMainContent: true
        }
    }, true); // Wait for completion

    // Process all pages
    crawlResult.data.forEach(page => {
        console.log(`URL: ${page.metadata.url}`);
        console.log(`Title: ${page.metadata.title}`);
        console.log(`Content length: ${page.markdown.length}`);
        console.log('---');
    });

    return crawlResult;
}

crawlWebsite()
    .then(result => console.log(`Total pages crawled: ${result.data.length}`))
    .catch(error => console.error('Crawl error:', error));

Processing Results with Promise.all

For even more control, you can combine Firecrawl's crawl endpoint with JavaScript's parallel processing capabilities:

async function crawlAndProcess() {
    const app = new FirecrawlApp({ apiKey: 'your_api_key' });

    // Start crawl job
    const crawlJob = await app.crawlUrl('https://example.com', {
        limit: 50
    }, false); // Don't wait

    const jobId = crawlJob.jobId;
    console.log(`Started job: ${jobId}`);

    // Poll until complete
    let status;
    do {
        await new Promise(resolve => setTimeout(resolve, 2000));
        status = await app.checkCrawlStatus(jobId);
        console.log(`Status: ${status.status} - ${status.completed}/${status.total}`);
    } while (status.status === 'scraping');

    if (status.status === 'completed') {
        // Process all results in parallel
        const processedData = await Promise.all(
            status.data.map(async (page) => {
                // Custom processing for each page
                return {
                    url: page.metadata.url,
                    wordCount: page.markdown.split(/\s+/).length,
                    links: page.links || []
                };
            })
        );

        return processedData;
    }

    throw new Error(`Crawl ended with status: ${status.status}`);
}

Controlling Concurrency and Performance

Setting Crawl Limits

Control how many pages are crawled in total and how deep the crawler follows links:

# Limit total pages
crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 50,  # Stop after 50 pages
        'maxDepth': 3  # Only crawl 3 levels deep
    }
)

URL Filtering and Patterns

Focus on specific pages using include/exclude patterns:

const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 100,
    includePaths: ['/blog/*', '/products/*'],  // Only these paths
    excludePaths: ['/admin/*', '/private/*'],  // Skip these paths
    maxDepth: 2
});

Managing Rate Limits

Firecrawl automatically manages rate limiting, but you can control the pace:

crawl_result = app.crawl_url(
    url='https://example.com',
    params={
        'limit': 100,
        'scrapeOptions': {
            'waitFor': 1000  # Wait 1000 ms for each page to load before capturing content
        }
    }
)

Advanced: Custom Parallel Processing

For maximum control over parallel page processing, combine Firecrawl's single-page scrape endpoint with your own parallelization:

import asyncio
from firecrawl import FirecrawlApp
from concurrent.futures import ThreadPoolExecutor

def scrape_single_page(app, url):
    """Scrape a single page"""
    return app.scrape_url(url, params={'formats': ['markdown']})

async def scrape_multiple_urls(urls):
    """Scrape multiple URLs in parallel"""
    app = FirecrawlApp(api_key='your_api_key')

    # Use ThreadPoolExecutor for parallel requests
    with ThreadPoolExecutor(max_workers=5) as executor:
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(executor, scrape_single_page, app, url)
            for url in urls
        ]
        results = await asyncio.gather(*tasks)

    return results

# Scrape multiple specific URLs
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://example.com/page4',
    'https://example.com/page5'
]

results = asyncio.run(scrape_multiple_urls(urls))
print(f"Scraped {len(results)} pages successfully")

Monitoring Crawl Progress

Track the progress of large batch crawls:

async function monitorCrawl(jobId) {
    const app = new FirecrawlApp({ apiKey: 'your_api_key' });

    const checkInterval = setInterval(async () => {
        const status = await app.checkCrawlStatus(jobId);

        console.log(`Progress: ${status.completed}/${status.total} pages`);
        console.log(`Status: ${status.status}`);

        if (status.status === 'completed') {
            clearInterval(checkInterval);
            console.log('Crawl finished!');
            console.log(`Total pages: ${status.data.length}`);
        } else if (status.status === 'failed') {
            clearInterval(checkInterval);
            console.error('Crawl failed:', status.error);
        }
    }, 3000); // Check every 3 seconds
}

Best Practices for Multi-Page Extraction

  1. Use appropriate limits: Set realistic limit values to avoid unnecessary API usage
  2. Filter URLs strategically: Use includePaths and excludePaths to target relevant pages
  3. Monitor job status: For large crawls, poll the status endpoint rather than waiting synchronously
  4. Handle errors gracefully: Implement retry logic for failed pages (see the retry sketch after this list)
  5. Respect rate limits: Use Firecrawl's built-in rate limiting rather than implementing your own
  6. Process results incrementally: For very large crawls, process results as they become available
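
For point 4, the sketch below shows one way to wrap the single-page scrape_url call (used in the custom parallel processing example above) with retries; the retry count and backoff values are illustrative assumptions, not Firecrawl defaults.

import time
from firecrawl import FirecrawlApp

def scrape_with_retries(app, url, max_retries=3, backoff_seconds=2):
    """Retry a single-page scrape with linear backoff (illustrative values)."""
    for attempt in range(1, max_retries + 1):
        try:
            return app.scrape_url(url, params={'formats': ['markdown']})
        except Exception:
            if attempt == max_retries:
                raise
            # Back off before retrying to avoid hammering the API or target site
            time.sleep(backoff_seconds * attempt)

app = FirecrawlApp(api_key='your_api_key')
page = scrape_with_retries(app, 'https://example.com/page1')
print(page['markdown'][:200])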

Comparison with Traditional Approaches

Firecrawl's concurrent processing offers significant advantages over traditional sequential scraping:

| Feature | Firecrawl Batch | Sequential Scraping |
|---------|-----------------|---------------------|
| Speed | Parallel processing | One page at a time |
| Infrastructure | Managed by Firecrawl | Self-hosted browsers |
| Rate Limiting | Automatic | Manual implementation |
| Error Handling | Built-in retries | Custom logic required |
| Scalability | Scales with Firecrawl's managed infrastructure | Limited by local machine resources |

Conclusion

Firecrawl excels at extracting data from multiple pages simultaneously through its robust crawling infrastructure. Whether you're using the batch /crawl endpoint for automatic page discovery or implementing custom parallel processing with the /scrape endpoint, Firecrawl provides the tools needed for efficient, large-scale web data extraction.

The combination of asynchronous job processing, automatic rate limiting, and built-in error handling makes Firecrawl an excellent choice for projects requiring concurrent multi-page scraping without the complexity of managing browser instances and parallel workers yourself.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
