Can I use Firecrawl for batch web scraping tasks?

Yes, Firecrawl is excellent for batch web scraping tasks and provides multiple approaches to efficiently scrape large numbers of URLs. Whether you need to scrape hundreds of product pages, process multiple domains, or extract data from thousands of articles, Firecrawl offers flexible batch processing capabilities through its API and SDKs.

Batch web scraping refers to the process of extracting data from multiple web pages in an automated, systematic way. Unlike scraping individual pages one at a time, batch processing allows you to handle large volumes of URLs efficiently, making it ideal for data aggregation, competitive analysis, price monitoring, content migration, and market research.

Understanding Batch Scraping with Firecrawl

Firecrawl supports batch web scraping through two primary methods:

  1. Crawl Mode: Automatically discover and scrape multiple pages from a single domain
  2. Batch Scrape Mode: Submit multiple specific URLs for concurrent processing

Both approaches leverage Firecrawl's headless browser capabilities, JavaScript rendering, and AI-powered data extraction to handle modern websites at scale.
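
Recent versions of the Firecrawl API and SDKs also expose a dedicated batch scrape endpoint for the second mode. The call below is a minimal sketch that assumes a Python SDK method named batch_scrape_urls with the parameters shown; the exact method name, parameters, and response shape vary between SDK versions, so check the reference for the version you're using.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Assumed batch scrape call: submit several URLs as a single job
# (method name, parameters, and response shape may differ by SDK version)
batch_result = app.batch_scrape_urls(
    ['https://example.com/product/1', 'https://example.com/product/2'],
    params={'formats': ['markdown']}
)

print(batch_result)  # inspect the response shape for your SDK version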

Batch Scraping Multiple URLs with Python

The most straightforward approach to batch scraping is to loop through a list of URLs and scrape them individually. Here's a basic implementation:

from firecrawl import FirecrawlApp
import time

# Initialize Firecrawl
app = FirecrawlApp(api_key='your_api_key')

# List of URLs to scrape
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
    'https://example.com/product/4',
    'https://example.com/product/5'
]

results = []

for url in urls:
    try:
        result = app.scrape_url(url, params={
            'formats': ['markdown', 'html']
        })
        results.append({
            'url': url,
            'data': result
        })
        print(f"✓ Scraped: {url}")

        # Add delay to respect rate limits
        time.sleep(1)

    except Exception as e:
        print(f"✗ Failed to scrape {url}: {str(e)}")

print(f"\nCompleted scraping {len(results)} URLs")

Parallel Batch Scraping with JavaScript

For better performance, you can scrape multiple URLs concurrently using JavaScript's async/await pattern:

const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function batchScrape(urls, concurrency = 5) {
    const app = new FirecrawlApp({ apiKey: 'your_api_key' });
    const results = [];

    // Process URLs in batches
    for (let i = 0; i < urls.length; i += concurrency) {
        const batch = urls.slice(i, i + concurrency);

        const promises = batch.map(async (url) => {
            try {
                const result = await app.scrapeUrl(url, {
                    formats: ['markdown', 'html']
                });
                console.log(`✓ Scraped: ${url}`);
                return { url, success: true, data: result };
            } catch (error) {
                console.error(`✗ Failed: ${url} - ${error.message}`);
                return { url, success: false, error: error.message };
            }
        });

        const batchResults = await Promise.all(promises);
        results.push(...batchResults);

        // Add delay between batches
        if (i + concurrency < urls.length) {
            await new Promise(resolve => setTimeout(resolve, 2000));
        }
    }

    return results;
}

// Example usage
const productUrls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
    'https://example.com/product/4',
    'https://example.com/product/5'
];

batchScrape(productUrls, 3).then(results => {
    const successful = results.filter(r => r.success).length;
    console.log(`\nScraped ${successful}/${results.length} URLs successfully`);
});

This approach processes URLs in controlled batches, similar to running multiple pages in parallel with Puppeteer, which helps manage system resources and respect API rate limits.
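
If you prefer to stay in Python, the same controlled-concurrency pattern can be sketched with asyncio by wrapping the synchronous scrape_url call in asyncio.to_thread and gating it with a semaphore. This is a minimal sketch under the same assumptions as the loop example above (your API key, example URLs); adjust the concurrency to match your plan's rate limits.

import asyncio
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

async def scrape_with_limit(url, semaphore):
    # The semaphore caps how many scrapes run at the same time
    async with semaphore:
        # Run the blocking SDK call in a worker thread
        return await asyncio.to_thread(
            app.scrape_url, url, params={'formats': ['markdown']}
        )

async def batch_scrape(urls, concurrency=3):
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [scrape_with_limit(url, semaphore) for url in urls]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)

urls = [f'https://example.com/product/{i}' for i in range(1, 6)]
results = asyncio.run(batch_scrape(urls, concurrency=3))
print(f"Finished {len(results)} scrapes")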

Advanced Batch Scraping with Structured Data Extraction

One of Firecrawl's most powerful features for batch scraping is the ability to extract structured data from multiple pages using a consistent schema:

from firecrawl import FirecrawlApp
from concurrent.futures import ThreadPoolExecutor, as_completed

app = FirecrawlApp(api_key='your_api_key')

# Define extraction schema
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
        "rating": {"type": "number"},
        "availability": {"type": "string"},
        "images": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "price"]
}

def scrape_product(url):
    """Scrape a single product page with structured extraction"""
    try:
        result = app.scrape_url(url, params={
            'formats': ['extract'],
            'extract': {
                'schema': product_schema
            }
        })
        return {
            'url': url,
            'success': True,
            'product': result.get('extract')
        }
    except Exception as e:
        return {
            'url': url,
            'success': False,
            'error': str(e)
        }

# Batch scrape with threading
product_urls = [
    'https://shop.example.com/laptop-pro',
    'https://shop.example.com/wireless-mouse',
    'https://shop.example.com/mechanical-keyboard',
    # ... more URLs
]

products = []

with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all scraping tasks
    future_to_url = {executor.submit(scrape_product, url): url for url in product_urls}

    # Collect results as they complete
    for future in as_completed(future_to_url):
        result = future.result()
        products.append(result)

        if result['success']:
            print(f"✓ Extracted: {result['product'].get('name', 'Unknown')}")
        else:
            print(f"✗ Failed: {result['url']}")

# Save results
import json
with open('batch_products.json', 'w') as f:
    json.dump(products, f, indent=2)

print(f"\nExtracted {len([p for p in products if p['success']])} products")

Crawling for Automated Batch Discovery

Instead of manually specifying URLs, you can use Firecrawl's crawl feature to automatically discover and scrape pages in batches:

const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function batchCrawlAndExtract() {
    const app = new FirecrawlApp({ apiKey: 'your_api_key' });

    // Crawl multiple sections of a website
    const crawlResult = await app.crawlUrl('https://example.com', {
        crawlerOptions: {
            limit: 200,
            includePaths: [
                'products/*',
                'categories/*'
            ],
            excludePaths: [
                '*/reviews',
                '*/comments'
            ]
        },
        pageOptions: {
            onlyMainContent: true
        }
    });

    console.log(`Discovered and scraped ${crawlResult.data.length} pages`);

    // Process results in batches
    const batchSize = 10;
    for (let i = 0; i < crawlResult.data.length; i += batchSize) {
        const batch = crawlResult.data.slice(i, i + batchSize);

        batch.forEach(page => {
            console.log(`Page: ${page.metadata.title}`);
            console.log(`URL: ${page.url}`);
            console.log(`Content length: ${page.markdown.length} chars`);
            console.log('---');
        });
    }

    return crawlResult.data;
}

batchCrawlAndExtract();
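
For Python users, a rough equivalent of the crawl above might look like the sketch below. The params structure is assumed to mirror the JavaScript options, and crawl option names have changed across API versions, so verify the exact keys and return shape against your SDK's reference.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# The params shape below mirrors the JavaScript options above; option names
# have changed across API versions, so verify against your SDK's reference
crawl_result = app.crawl_url('https://example.com', params={
    'crawlerOptions': {
        'limit': 200,
        'includePaths': ['products/*', 'categories/*'],
        'excludePaths': ['*/reviews', '*/comments']
    },
    'pageOptions': {
        'onlyMainContent': True
    }
})

# Older SDK versions return a list of page dicts when the crawl completes;
# newer ones return a job object you poll for results
print(f"Crawl returned {len(crawl_result)} items")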

Chunked Batch Processing for Large-Scale Operations

For large-scale batch scraping involving hundreds or thousands of URLs, split the work into smaller chunks and persist intermediate results so a timeout or failure partway through doesn't cost you the whole run. (Firecrawl's crawl endpoint can also run jobs asynchronously on the server side; the example below keeps things client-side with the scrape endpoint.)

from firecrawl import FirecrawlApp
import time
import json

app = FirecrawlApp(api_key='your_api_key')

def chunked_batch_scrape(urls, batch_name="batch"):
    """
    Process a large URL list in fixed-size chunks, saving results to disk
    after each chunk. For fully asynchronous jobs, you can instead submit a
    crawl via Firecrawl's crawl endpoint and poll its status.
    """

    results = []
    batch_size = 50

    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        batch_num = i // batch_size + 1

        print(f"Processing batch {batch_num}/{(len(urls) + batch_size - 1) // batch_size}")

        for url in batch:
            try:
                result = app.scrape_url(url, params={
                    'formats': ['markdown'],
                    'onlyMainContent': True
                })

                results.append({
                    'url': url,
                    'title': result.get('metadata', {}).get('title', ''),
                    'content': result.get('markdown', '')
                })

            except Exception as e:
                print(f"Error scraping {url}: {str(e)}")
                results.append({
                    'url': url,
                    'error': str(e)
                })

        # Save intermediate results (cumulative, so the latest file contains everything so far)
        with open(f'{batch_name}_batch_{batch_num}.json', 'w') as f:
            json.dump(results, f, indent=2)

        print(f"Saved checkpoint after batch {batch_num} ({len(results)} items so far)")

        # Rate limiting
        time.sleep(2)

    return results

# Example with large URL list
urls_to_scrape = [
    f'https://example.com/article/{i}' for i in range(1, 501)
]

all_results = chunked_batch_scrape(urls_to_scrape, "articles")
print(f"Total scraped: {len(all_results)} URLs")

Batch Scraping with Custom Rate Limiting

To avoid overwhelming servers or hitting API rate limits, implement intelligent rate limiting:

const FirecrawlApp = require('@mendable/firecrawl-js').default;

class BatchScraper {
    constructor(apiKey, rateLimit = 10) {
        this.app = new FirecrawlApp({ apiKey });
        this.rateLimit = rateLimit; // requests per minute
        this.requestQueue = [];
        this.results = [];
        this.processing = false;
    }

    async scrapeWithRateLimit(url) {
        return new Promise((resolve) => {
            this.requestQueue.push({ url, resolve });
            this.processQueue();
        });
    }

    async processQueue() {
        if (this.processing || this.requestQueue.length === 0) {
            return;
        }

        this.processing = true;
        const delayBetweenRequests = (60 * 1000) / this.rateLimit;

        while (this.requestQueue.length > 0) {
            const { url, resolve } = this.requestQueue.shift();

            try {
                const result = await this.app.scrapeUrl(url, {
                    formats: ['markdown']
                });

                this.results.push({ url, success: true, data: result });
                console.log(`✓ Scraped (${this.results.length}): ${url}`);
                resolve(result);

            } catch (error) {
                this.results.push({ url, success: false, error: error.message });
                console.error(`✗ Failed: ${url}`);
                resolve(null);
            }

            // Wait before next request
            if (this.requestQueue.length > 0) {
                await new Promise(r => setTimeout(r, delayBetweenRequests));
            }
        }

        this.processing = false;
    }

    async batchScrape(urls) {
        const promises = urls.map(url => this.scrapeWithRateLimit(url));
        await Promise.all(promises);
        return this.results;
    }
}

// Usage
const scraper = new BatchScraper('your_api_key', 30); // 30 requests per minute

const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    // ... more URLs
];

scraper.batchScrape(urls).then(results => {
    console.log(`Batch complete: ${results.filter(r => r.success).length} successful`);
});

Batch Processing with Error Recovery

Implement robust error handling and retry logic for production batch scraping:

from firecrawl import FirecrawlApp
import time
import json
from typing import List, Dict

class BatchScraper:
    def __init__(self, api_key: str, max_retries: int = 3):
        self.app = FirecrawlApp(api_key=api_key)
        self.max_retries = max_retries
        self.failed_urls = []

    def scrape_with_retry(self, url: str, params: Dict = None) -> Dict:
        """Scrape a URL with automatic retry on failure"""
        params = params or {'formats': ['markdown']}

        for attempt in range(self.max_retries):
            try:
                result = self.app.scrape_url(url, params=params)
                return {
                    'url': url,
                    'success': True,
                    'data': result,
                    'attempts': attempt + 1
                }
            except Exception as e:
                if attempt < self.max_retries - 1:
                    wait_time = 2 ** attempt  # Exponential backoff
                    print(f"Retry {attempt + 1} for {url} after {wait_time}s")
                    time.sleep(wait_time)
                else:
                    self.failed_urls.append(url)
                    return {
                        'url': url,
                        'success': False,
                        'error': str(e),
                        'attempts': attempt + 1
                    }

    def batch_scrape(self, urls: List[str], checkpoint_interval: int = 50) -> List[Dict]:
        """Scrape multiple URLs with checkpointing"""
        results = []

        for i, url in enumerate(urls, 1):
            result = self.scrape_with_retry(url)
            results.append(result)

            if result['success']:
                print(f"[{i}/{len(urls)}] ✓ {url}")
            else:
                print(f"[{i}/{len(urls)}] ✗ {url}: {result.get('error', 'Unknown error')}")

            # Save checkpoint
            if i % checkpoint_interval == 0:
                checkpoint_file = f'checkpoint_{i}.json'
                with open(checkpoint_file, 'w') as f:
                    json.dump(results, f, indent=2)
                print(f"Checkpoint saved: {checkpoint_file}")

            time.sleep(0.5)  # Rate limiting

        return results

    def retry_failed(self) -> List[Dict]:
        """Retry all failed URLs"""
        print(f"\nRetrying {len(self.failed_urls)} failed URLs...")
        retry_results = []

        for url in self.failed_urls[:]:
            result = self.scrape_with_retry(url)
            retry_results.append(result)

            if result['success']:
                self.failed_urls.remove(url)

        return retry_results

# Example usage
scraper = BatchScraper(api_key='your_api_key', max_retries=3)

urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
results = scraper.batch_scrape(urls, checkpoint_interval=25)

# Retry failed URLs
if scraper.failed_urls:
    retry_results = scraper.retry_failed()
    results.extend(retry_results)

# Summary
successful = sum(1 for r in results if r['success'])
print(f"\nBatch scraping complete: {successful}/{len(urls)} successful")

Batch Scraping with AI-Powered Data Extraction

Combine batch processing with Firecrawl's AI extraction for intelligent data gathering, similar to how you might handle AJAX requests using Puppeteer but with built-in AI capabilities:

const FirecrawlApp = require('@mendable/firecrawl-js').default;
const fs = require('fs').promises;

async function intelligentBatchExtraction(urls) {
    const app = new FirecrawlApp({ apiKey: 'your_api_key' });

    const extractionPrompt = `
        Extract the following information from this page:
        - Main title or headline
        - Author name (if available)
        - Publication date
        - Main content summary (2-3 sentences)
        - Key topics or categories
        - Any prices or numerical data mentioned
    `;

    const results = [];

    for (const url of urls) {
        try {
            const result = await app.scrapeUrl(url, {
                formats: ['extract'],
                extract: {
                    prompt: extractionPrompt
                }
            });

            results.push({
                url,
                success: true,
                extracted: result.extract
            });

            console.log(`✓ Extracted from: ${url}`);

        } catch (error) {
            results.push({
                url,
                success: false,
                error: error.message
            });
            console.error(`✗ Failed: ${url}`);
        }

        // Rate limiting
        await new Promise(resolve => setTimeout(resolve, 1000));
    }

    // Save to file
    await fs.writeFile(
        'batch_extraction_results.json',
        JSON.stringify(results, null, 2)
    );

    return results;
}

// Example URLs
const articleUrls = [
    'https://blog.example.com/article-1',
    'https://blog.example.com/article-2',
    'https://blog.example.com/article-3'
];

intelligentBatchExtraction(articleUrls).then(results => {
    const successful = results.filter(r => r.success).length;
    console.log(`\nExtracted data from ${successful}/${results.length} URLs`);
});

Best Practices for Batch Web Scraping with Firecrawl

1. Implement Proper Rate Limiting

Respect both Firecrawl's API limits and target website servers:

import time
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def rate_limited_batch(urls, requests_per_minute=30):
    delay = 60 / requests_per_minute

    for url in urls:
        result = app.scrape_url(url, params={'formats': ['markdown']})
        process_result(result)  # process_result: your own handler for scraped data
        time.sleep(delay)

2. Use Checkpointing for Long-Running Jobs

Save intermediate results to prevent data loss:

const fs = require('fs').promises;

// scrapeUrl below stands in for your Firecrawl scrape call (e.g. app.scrapeUrl)
async function checkpointedBatch(urls, checkpointFile) {
    let completed = [];

    // Load previous checkpoint if exists
    try {
        const data = await fs.readFile(checkpointFile, 'utf8');
        completed = JSON.parse(data);
    } catch (error) {
        // No checkpoint exists, start fresh
    }

    const remaining = urls.filter(url =>
        !completed.some(c => c.url === url)
    );

    for (const url of remaining) {
        const result = await scrapeUrl(url);
        completed.push(result);

        // Save checkpoint after each URL
        await fs.writeFile(
            checkpointFile,
            JSON.stringify(completed, null, 2)
        );
    }

    return completed;
}

3. Monitor and Log Progress

Track your batch scraping progress for debugging and optimization:

import logging
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'batch_scrape_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'),
        logging.StreamHandler()
    ]
)

def batch_scrape_with_logging(urls):
    logging.info(f"Starting batch scrape of {len(urls)} URLs")

    for i, url in enumerate(urls, 1):
        try:
            result = scrape_url(url)  # scrape_url: e.g. app.scrape_url(url, params={'formats': ['markdown']})
            logging.info(f"[{i}/{len(urls)}] Success: {url}")
        except Exception as e:
            logging.error(f"[{i}/{len(urls)}] Failed: {url} - {str(e)}")

    logging.info("Batch scrape completed")

4. Handle Different Content Types

Adapt your scraping strategy based on page types:

def adaptive_batch_scrape(urls_with_types):
    """
    urls_with_types: [{'url': 'https://...', 'type': 'product'}, ...]
    """
    # Map each page type to a JSON schema you define (product_schema appears
    # earlier; article_schema and profile_schema follow the same pattern)
    schemas = {
        'product': product_schema,
        'article': article_schema,
        'profile': profile_schema
    }

    for item in urls_with_types:
        url = item['url']
        page_type = item['type']
        schema = schemas.get(page_type)

        result = app.scrape_url(url, params={
            'formats': ['extract'],
            'extract': {'schema': schema}
        })

        process_result(result, page_type)  # process_result: your own handler

5. Optimize for Cost Efficiency

Minimize API costs while maintaining quality:

// groupBySimilarity, generateSchemaForPattern and processResult are placeholders
// for your own logic; app is an initialized FirecrawlApp instance
async function costEffectiveBatch(urls) {
    // Group similar pages together
    const groupedUrls = groupBySimilarity(urls);

    for (const [pattern, urlGroup] of Object.entries(groupedUrls)) {
        // Use same schema for similar pages
        const schema = generateSchemaForPattern(pattern);

        for (const url of urlGroup) {
            const result = await app.scrapeUrl(url, {
                formats: ['extract'],
                extract: { schema },
                onlyMainContent: true  // Reduce token usage
            });

            processResult(result);
        }
    }
}

Common Batch Scraping Use Cases

Price Monitoring Across Multiple Stores

competitor_products = [
    'https://store1.com/product/widget',
    'https://store2.com/product/widget',
    'https://store3.com/product/widget'
]

price_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"}
    }
}

prices = batch_scrape_with_schema(competitor_products, price_schema)
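
batch_scrape_with_schema above is not an SDK method; a minimal sketch of such a helper, reusing the extract-with-schema pattern from the structured-extraction example earlier, might look like this:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def batch_scrape_with_schema(urls, schema):
    """Scrape each URL and extract data according to the given JSON schema."""
    extracted = []
    for url in urls:
        try:
            result = app.scrape_url(url, params={
                'formats': ['extract'],
                'extract': {'schema': schema}
            })
            extracted.append({'url': url, 'data': result.get('extract')})
        except Exception as e:
            extracted.append({'url': url, 'error': str(e)})
    return extracted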

Content Migration

// Migrate all blog posts from the old site
// (getAllBlogUrls and importToNewCMS are placeholders for your own URL
// discovery and CMS import logic; this batchScrape variant accepts scrape
// options rather than the concurrency argument used earlier)
const blogUrls = await getAllBlogUrls('https://old-blog.com');

const migratedContent = await batchScrape(blogUrls, {
    formats: ['markdown', 'html'],
    onlyMainContent: true
});

// Save to the new CMS
await importToNewCMS(migratedContent);

SEO Auditing

# get_urls_from_sitemap is a small helper that parses the sitemap XML
# (a sketch follows this example)
sitemap_urls = get_urls_from_sitemap('https://example.com/sitemap.xml')

seo_data = []
for url in sitemap_urls:
    result = app.scrape_url(url, params={
        'formats': ['html', 'markdown']
    })

    seo_data.append({
        'url': url,
        'title': result['metadata'].get('title'),
        'description': result['metadata'].get('description'),
        'word_count': len(result['markdown'].split())
    })
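
get_urls_from_sitemap is a helper you supply; a simple version using requests plus the standard library might look like the sketch below (it assumes a flat sitemap rather than a sitemap index):

import requests
import xml.etree.ElementTree as ET

def get_urls_from_sitemap(sitemap_url):
    """Fetch a sitemap.xml and return the list of <loc> URLs it contains."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Sitemap entries live in the sitemaps.org namespace
    namespace = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    return [loc.text.strip() for loc in root.findall('.//sm:loc', namespace)]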

Conclusion

Firecrawl provides robust capabilities for batch web scraping tasks, whether you need to process a small list of specific URLs or crawl entire websites. By leveraging its API with proper rate limiting, error handling, and structured data extraction, you can build efficient batch scraping solutions that scale from dozens to thousands of pages.

The key to successful batch scraping with Firecrawl is combining its powerful features—JavaScript rendering, AI extraction, and concurrent processing—with best practices like checkpointing, retry logic, and intelligent rate limiting. This approach ensures reliable, cost-effective data extraction at scale for any batch web scraping project.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
