How do I use Firecrawl with Puppeteer for web scraping?

Firecrawl is a modern web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. While Firecrawl handles JavaScript rendering internally, you can also integrate it with Puppeteer to create powerful hybrid scraping workflows that combine Firecrawl's data extraction capabilities with Puppeteer's browser automation features.

Understanding Firecrawl and Puppeteer Integration

Firecrawl and Puppeteer serve complementary purposes in web scraping:

  • Firecrawl: Provides managed infrastructure for scraping, automatic JavaScript rendering, and intelligent data extraction
  • Puppeteer: Offers fine-grained browser control for complex interactions, authentication flows, and custom JavaScript execution

Combining both tools allows you to leverage Firecrawl's simplicity for data extraction while using Puppeteer for scenarios requiring custom browser automation.

Installation and Setup

Installing Dependencies

First, install both Firecrawl and Puppeteer:

npm install @mendable/firecrawl-js puppeteer

For Python projects (Puppeteer itself is Node-only; pyppeteer is a community Python port):

pip install firecrawl-py pyppeteer

Basic Configuration

Set up your Firecrawl API key as an environment variable:

export FIRECRAWL_API_KEY='your_api_key_here'
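
In Node.js you can also load the key from a local .env file instead of exporting it in the shell. A minimal sketch, assuming the dotenv package is installed (npm install dotenv):

// Load variables from .env into process.env (assumes dotenv is installed)
import 'dotenv/config';

// Fail fast if the key is missing so the error surfaces at startup, not mid-scrape
if (!process.env.FIRECRAWL_API_KEY) {
  throw new Error('FIRECRAWL_API_KEY is not set');
}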

Approach 1: Using Firecrawl for Simple Scraping

For most web scraping tasks, Firecrawl alone is sufficient and simpler than using Puppeteer:

import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithFirecrawl(url) {
  try {
    // Scrape a single page
    const result = await firecrawl.scrapeUrl(url, {
      formats: ['markdown', 'html'],
      onlyMainContent: true,
      waitFor: 2000  // Wait for JavaScript to render
    });

    console.log('Markdown content:', result.markdown);
    console.log('Metadata:', result.metadata);

    return result;
  } catch (error) {
    console.error('Scraping failed:', error);
  }
}

// Usage
scrapeWithFirecrawl('https://example.com');

Python equivalent:

import os

from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

def scrape_with_firecrawl(url):
    result = firecrawl.scrape_url(
        url,
        params={
            'formats': ['markdown', 'html'],
            'onlyMainContent': True,
            'waitFor': 2000
        }
    )

    print('Markdown:', result['markdown'])
    print('Metadata:', result['metadata'])

    return result

# Usage
scrape_with_firecrawl('https://example.com')

Approach 2: Puppeteer Pre-processing with Firecrawl Extraction

Use Puppeteer to handle complex authentication or interactions, then pass the resulting page to Firecrawl for data extraction:

import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithAuthentication(loginUrl, targetUrl, credentials) {
  const browser = await puppeteer.launch({ headless: true });
  let cookies;

  try {
    const page = await browser.newPage();

    // Navigate to login page
    await page.goto(loginUrl, { waitUntil: 'networkidle2' });

    // Perform login using Puppeteer
    await page.type('#username', credentials.username);
    await page.type('#password', credentials.password);
    await page.click('button[type="submit"]');

    // Wait for navigation after login
    await page.waitForNavigation({ waitUntil: 'networkidle2' });

    // Get cookies for the authenticated session
    cookies = await page.cookies();
  } finally {
    // Close the browser exactly once, whether or not login succeeded
    await browser.close();
  }

  // Use Firecrawl with the authenticated session
  const result = await firecrawl.scrapeUrl(targetUrl, {
    formats: ['markdown'],
    headers: {
      'Cookie': cookies.map(c => `${c.name}=${c.value}`).join('; ')
    }
  });

  return result;
}

// Usage
scrapeWithAuthentication(
  'https://example.com/login',
  'https://example.com/protected-page',
  { username: 'user', password: 'pass' }
);

This approach is particularly useful when you need to handle authentication in Puppeteer before scraping protected content.
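
If you scrape the same site repeatedly, the captured cookies can also be persisted to disk so later runs skip the login step entirely. A minimal sketch; saveCookies, loadCookies, and the cookies.json path are illustrative names, not part of either SDK:

import fs from 'fs/promises';

// Persist cookies captured by Puppeteer for reuse in later runs
async function saveCookies(cookies, path = 'cookies.json') {
  await fs.writeFile(path, JSON.stringify(cookies, null, 2));
}

// Load previously saved cookies; returns null if none exist yet
async function loadCookies(path = 'cookies.json') {
  try {
    return JSON.parse(await fs.readFile(path, 'utf8'));
  } catch {
    return null; // no saved session yet: fall back to a fresh login
  }
}

Saved sessions eventually expire, so fall back to a fresh login whenever a scrape comes back as a login page.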

Approach 3: Parallel Processing with Both Tools

Leverage Puppeteer for browser-based tasks while using Firecrawl's crawling capabilities for site-wide data extraction:

import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function hybridCrawl(baseUrl) {
  // Use Firecrawl to discover all pages
  const crawlResult = await firecrawl.crawlUrl(baseUrl, {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  // Launch Puppeteer for screenshot generation
  const browser = await puppeteer.launch();

  const results = [];

  for (const page of crawlResult.data) {
    // Get content from Firecrawl
    const content = page.markdown;

    // Use Puppeteer for screenshot
    const puppeteerPage = await browser.newPage();
    await puppeteerPage.goto(page.metadata.sourceURL);
    const screenshot = await puppeteerPage.screenshot({
      fullPage: true
    });
    await puppeteerPage.close();

    results.push({
      url: page.metadata.sourceURL,
      content: content,
      screenshot: screenshot
    });
  }

  await browser.close();
  return results;
}

// Usage
hybridCrawl('https://example.com');

Approach 4: Using Firecrawl's Built-in Actions (Recommended)

Firecrawl now supports browser actions natively, eliminating the need for Puppeteer in many cases:

import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithActions(url) {
  const result = await firecrawl.scrapeUrl(url, {
    formats: ['markdown'],
    actions: [
      { type: 'wait', milliseconds: 2000 },
      { type: 'click', selector: 'button.load-more' },
      { type: 'wait', milliseconds: 1000 },
      { type: 'scroll', direction: 'down' },
      { type: 'screenshot' }
    ]
  });

  return result;
}

// Usage
scrapeWithActions('https://example.com/dynamic-content');

This approach handles many scenarios that previously required Puppeteer, such as waiting for elements, clicking buttons, and scrolling pages. Learn more about handling AJAX requests using Puppeteer to understand when browser automation is necessary.
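
For example, when content only appears after a specific AJAX call, Puppeteer can wait for that exact network response, which a fixed waitFor delay cannot target reliably. A minimal sketch; the /api/items endpoint is a placeholder for whatever request delivers your data:

import puppeteer from 'puppeteer';

async function scrapeAfterAjax(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Start listening for the AJAX response before triggering navigation
  const responsePromise = page.waitForResponse(
    res => res.url().includes('/api/items') && res.status() === 200
  );

  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await responsePromise; // the data has arrived; the DOM should populate shortly

  const html = await page.content();
  await browser.close();
  return html;
}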

Advanced Pattern: Custom JavaScript Execution

Combine Puppeteer's custom JavaScript injection with Firecrawl's extraction:

import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

async function scrapeWithCustomJS(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Execute custom JavaScript
  await page.evaluate(() => {
    // Remove ads and popups
    document.querySelectorAll('.ad, .popup').forEach(el => el.remove());

    // Trigger lazy-loaded content
    window.scrollTo(0, document.body.scrollHeight);
  });

  // page.waitForTimeout was removed in newer Puppeteer versions
  await new Promise(resolve => setTimeout(resolve, 2000));

  // Get the cleaned HTML
  const html = await page.content();
  await browser.close();

  // Use Firecrawl to parse the cleaned HTML
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  // Note: scrapeUrl re-fetches the page, so the DOM cleanup above is not
  // reflected in Firecrawl's output. To keep it, parse the `html` variable
  // locally, or replicate the cleanup with Firecrawl's built-in actions
  // (see Approach 4) so it happens inside Firecrawl's own browser
  const result = await firecrawl.scrapeUrl(url, {
    formats: ['markdown'],
    onlyMainContent: true
  });

  return result;
}

Handling Dynamic Content

For single-page applications and dynamic websites, you can use Firecrawl's waitFor option:

const result = await firecrawl.scrapeUrl('https://spa-example.com', {
  formats: ['markdown'],
  waitFor: 3000,  // Wait 3 seconds for JavaScript to render
  timeout: 30000  // Overall timeout
});

For more complex scenarios requiring specific element waits, you might want to understand how to crawl a single page application (SPA) using Puppeteer.
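
As a quick illustration of the difference, Puppeteer can block until a specific element actually exists instead of waiting a fixed number of milliseconds. A minimal sketch; the .product-list selector is a placeholder:

import puppeteer from 'puppeteer';

async function scrapeWhenReady(url, selector) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Block until the element exists instead of guessing a delay
    await page.waitForSelector(selector, { timeout: 10000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Usage
scrapeWhenReady('https://spa-example.com', '.product-list');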

Error Handling and Retries

Implement robust error handling for both tools, starting with Firecrawl:

async function robustScrape(url, maxRetries = 3) {
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await firecrawl.scrapeUrl(url, {
        formats: ['markdown'],
        timeout: 30000
      });

      return result;
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts`);
      }

      // Exponential backoff
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * Math.pow(2, attempt))
      );
    }
  }
}
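
The same retry pattern applies on the Puppeteer side; the main extra concern is guaranteeing the browser is closed on every attempt, which a finally block handles. A minimal sketch:

import puppeteer from 'puppeteer';

async function robustPuppeteerScrape(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const browser = await puppeteer.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      return await page.content();
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);
      if (attempt === maxRetries) throw error;

      // Exponential backoff before the next attempt
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * Math.pow(2, attempt))
      );
    } finally {
      await browser.close(); // always release the browser, success or failure
    }
  }
}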

Performance Optimization

When processing multiple URLs, use concurrent requests with rate limiting:

import pLimit from 'p-limit';

async function batchScrape(urls) {
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  const limit = pLimit(5); // Limit to 5 concurrent requests

  const promises = urls.map(url =>
    limit(async () => {
      try {
        return await firecrawl.scrapeUrl(url, {
          formats: ['markdown']
        });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
        return null;
      }
    })
  );

  return await Promise.all(promises);
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

batchScrape(urls).then(results => {
  console.log('Scraped pages:', results.filter(r => r !== null).length);
});

When to Use Which Tool

Use Firecrawl alone when:

  • You need clean, markdown-formatted content
  • The site doesn't require complex authentication
  • You want managed infrastructure and automatic retries
  • You need to crawl entire websites systematically

Use Puppeteer alone when:

  • You need complete control over browser behavior
  • You're performing complex UI testing
  • You need to interact with browser APIs directly

Use both together when:

  • You need custom authentication flows before scraping
  • You require pre-processing of pages before extraction
  • You want to combine screenshots with content extraction
  • You need to execute custom JavaScript before data extraction

Conclusion

While Firecrawl and Puppeteer can work together, Firecrawl's modern API often eliminates the need for Puppeteer in most web scraping scenarios. Firecrawl handles JavaScript rendering, provides clean markdown output, and includes built-in actions for common browser interactions. Reserve Puppeteer for cases requiring fine-grained browser control or complex authentication workflows.

For simpler scraping needs without the complexity of either tool, consider using a managed web scraping API that handles proxies, JavaScript rendering, and data extraction automatically.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
