How do I use Firecrawl with Playwright for Browser Automation?

While Firecrawl is a powerful web scraping and crawling API that handles JavaScript-rendered content out of the box, there are scenarios where you might want to combine it with Playwright for advanced browser automation tasks. This guide explores different integration patterns, use cases, and practical implementations.

Understanding Firecrawl and Playwright

Firecrawl is a web scraping API that converts websites into clean, LLM-ready markdown or structured data. It handles JavaScript rendering, dynamic content, and provides features like sitemap crawling and automatic content extraction.
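
For instance, a single call is enough for a plain scrape. A minimal sketch, using the same SDK calls as the examples below:

from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key='your_api_key_here')

# One request: fetch, render, and convert the page to markdown
result = firecrawl.scrape_url('https://example.com', {'formats': ['markdown']})
print(result['markdown'])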

Playwright is a browser automation framework that provides low-level control over Chromium, Firefox, and WebKit browsers. It excels at complex interactions, custom JavaScript execution, and scenarios requiring fine-grained browser control.
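
By contrast, Playwright hands you the live page itself. A minimal sketch:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())  # full programmatic access to the rendered page
    browser.close()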

When to Combine Firecrawl with Playwright

Before diving into integration, consider these use cases where combining both tools makes sense:

  1. Pre-authentication with Playwright: Use Playwright to handle complex OAuth flows or multi-step authentication, then pass authenticated session data to Firecrawl
  2. Complex User Interactions: Perform intricate user interactions with Playwright before scraping with Firecrawl
  3. Hybrid Workflows: Use Playwright for dynamic navigation and Firecrawl for efficient bulk content extraction
  4. Custom JavaScript Injection: Execute custom scripts with Playwright, then leverage Firecrawl's parsing capabilities

Basic Integration Pattern

Here's a fundamental pattern for using Firecrawl alongside Playwright:

Python Implementation

from playwright.sync_api import sync_playwright
from firecrawl import FirecrawlApp

# Initialize Firecrawl
firecrawl = FirecrawlApp(api_key='your_api_key_here')

def scrape_with_playwright_and_firecrawl(url):
    with sync_playwright() as p:
        # Launch browser with Playwright
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()

        # Navigate and perform complex interactions
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Example: Handle authentication or complex navigation
        page.click('button#login')
        page.fill('input[name="username"]', 'your_username')
        page.fill('input[name="password"]', 'your_password')
        page.click('button[type="submit"]')
        page.wait_for_selector('.dashboard')

        # Get cookies from authenticated session
        cookies = context.cookies()

        # Close Playwright
        browser.close()

        # Now use Firecrawl with authenticated context
        result = firecrawl.scrape_url(url, {
            'formats': ['markdown', 'html'],
            'headers': {
                'Cookie': '; '.join([f"{c['name']}={c['value']}" for c in cookies])
            }
        })

        return result

# Usage
data = scrape_with_playwright_and_firecrawl('https://example.com')
print(data['markdown'])

JavaScript/Node.js Implementation

const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function scrapeWithPlaywrightAndFirecrawl(url) {
    // Launch Playwright browser
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    try {
        // Navigate and perform interactions
        await page.goto(url, { waitUntil: 'networkidle' });

        // Example: Complex interaction
        await page.click('button#load-more');
        await page.waitForSelector('.dynamic-content');

        // Get the final URL after redirects/navigation
        const finalUrl = page.url();

        // Extract cookies for authenticated requests
        const cookies = await context.cookies();
        const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join('; ');

        await browser.close();

        // Use Firecrawl to scrape the content
        const result = await firecrawl.scrapeUrl(finalUrl, {
            formats: ['markdown', 'html'],
            headers: {
                'Cookie': cookieHeader
            }
        });

        return result;
    } catch (error) {
        await browser.close();
        throw error;
    }
}

// Usage
scrapeWithPlaywrightAndFirecrawl('https://example.com')
    .then(data => console.log(data.markdown))
    .catch(error => console.error(error));

Advanced Use Cases

1. Session Management and Authentication

For sites requiring complex authentication flows (similar to handling authentication in Puppeteer), use Playwright to complete the login, then hand the session to Firecrawl:

from playwright.sync_api import sync_playwright
from firecrawl import FirecrawlApp

def authenticate_and_scrape(login_url, target_url, credentials):
    firecrawl = FirecrawlApp(api_key='your_api_key')

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Navigate to login page
        page.goto(login_url)

        # Handle multi-step authentication
        page.fill('#username', credentials['username'])
        page.click('#next-button')
        page.wait_for_selector('#password')
        page.fill('#password', credentials['password'])
        page.click('#login-button')

        # Wait for authentication to complete
        page.wait_for_url('**/dashboard**')

        # Extract cookies and a full session snapshot (cookies + local storage)
        cookies = context.cookies()
        storage_state = context.storage_state()  # reusable later; see the sketch after this example

        browser.close()

        # Convert cookies to header format
        cookie_header = '; '.join([f"{c['name']}={c['value']}" for c in cookies])

        # Use Firecrawl with authenticated session
        results = firecrawl.crawl_url(target_url, {
            'limit': 100,
            'headers': {
                'Cookie': cookie_header
            }
        })

        return results

# Usage
data = authenticate_and_scrape(
    'https://example.com/login',
    'https://example.com/protected-area',
    {'username': 'user', 'password': 'pass'}
)
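
The storage_state captured above is worth keeping: Playwright can persist it to disk and restore it into a fresh context, so later runs skip the login entirely. A minimal sketch (the state.json path and login steps are placeholders):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
    # ... perform the login steps shown above ...
    context.storage_state(path='state.json')  # persist cookies + local storage

    # Later runs: restore the session instead of logging in again
    restored = browser.new_context(storage_state='state.json')
    page = restored.new_page()
    page.goto('https://example.com/protected-area')
    browser.close()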

2. Dynamic Content Loading

When dealing with infinite scroll or dynamic content loading, you can use Playwright to trigger content loading before letting Firecrawl extract the data:

const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function scrapeInfiniteScroll(url) {
    const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
    const browser = await chromium.launch();
    const page = await browser.newPage();

    await page.goto(url, { waitUntil: 'networkidle' });

    // Scroll to load all content
    let previousHeight = 0;
    while (true) {
        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        if (currentHeight === previousHeight) break;

        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(2000);
        previousHeight = currentHeight;
    }

    // Get the fully loaded HTML
    const html = await page.content();
    await browser.close();

    // Now use Firecrawl to parse and structure the data. Note that the
    // scrolled-in content exists only in the local browser session above;
    // Firecrawl fetches the URL fresh, so for scroll-loaded content either
    // parse the captured `html` yourself or use Firecrawl's wait options.
    const result = await firecrawl.scrapeUrl(url, {
        formats: ['markdown'],
        waitFor: 0 // skip extra waiting; adjust if the page needs time to render
    });

    return result;
}
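
The same scroll loop translates directly to Python, for consistency with the other Python examples here. A sketch using Playwright's sync API:

from playwright.sync_api import sync_playwright

def load_full_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Keep scrolling until the page height stops growing
        previous_height = 0
        while True:
            current_height = page.evaluate('document.body.scrollHeight')
            if current_height == previous_height:
                break
            page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            page.wait_for_timeout(2000)
            previous_height = current_height

        html = page.content()
        browser.close()
        return html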

3. Handling Pop-ups and Modals

For sites with complex modal interactions, you can handle them with Playwright before scraping, similar to handling pop-ups and modals in Puppeteer:

from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
from firecrawl import FirecrawlApp

def handle_modals_and_scrape(url):
    firecrawl = FirecrawlApp(api_key='your_api_key')

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Setup dialog handler
        page.on('dialog', lambda dialog: dialog.accept())

        # Navigate to page
        page.goto(url)

        # Wait for and dismiss a cookie consent modal, if one appears
        try:
            page.wait_for_selector('#cookie-consent', timeout=5000)
            page.click('#accept-cookies')
        except PlaywrightTimeoutError:
            pass  # no consent modal appeared

        # Dismiss a newsletter popup, if one appears
        try:
            page.wait_for_selector('.newsletter-modal', timeout=3000)
            page.click('.close-modal')
        except PlaywrightTimeoutError:
            pass  # no newsletter popup appeared

        # Get cookies after handling modals
        cookies = context.cookies()
        browser.close()

        # Scrape with Firecrawl
        cookie_header = '; '.join([f"{c['name']}={c['value']}" for c in cookies])
        result = firecrawl.scrape_url(url, {
            'formats': ['markdown'],
            'headers': {'Cookie': cookie_header}
        })

        return result
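
Usage mirrors the earlier Python examples:

# Usage
result = handle_modals_and_scrape('https://example.com')
print(result['markdown'])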

4. Monitoring Network Requests

You can combine Playwright's network monitoring with Firecrawl's scraping capabilities, much as you would monitor network requests in Puppeteer:

const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function monitorAndScrape(url) {
    const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
    const browser = await chromium.launch();
    const context = await browser.newContext();
    const page = await context.newPage();

    const apiCalls = [];

    // Monitor API calls
    page.on('response', async (response) => {
        if (response.url().includes('/api/')) {
            try {
                const data = await response.json();
                apiCalls.push({
                    url: response.url(),
                    status: response.status(),
                    data: data
                });
            } catch (e) {
                // Not JSON response
            }
        }
    });

    await page.goto(url, { waitUntil: 'networkidle' });
    await page.waitForTimeout(2000);

    await browser.close();

    // Scrape the page with Firecrawl
    const scrapedData = await firecrawl.scrapeUrl(url, {
        formats: ['markdown']
    });

    return {
        scrapedContent: scrapedData,
        apiCalls: apiCalls
    };
}

Best Practices

1. Choose the Right Tool for the Job

  • Use Firecrawl alone when you need clean markdown output, bulk crawling, or simple JavaScript-rendered content
  • Use Playwright alone when you need complex browser interactions without content extraction
  • Combine both when you need complex interactions followed by efficient content extraction

2. Optimize Performance

# Bad: Using Playwright for everything
def slow_scrape(urls):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for url in urls:
            page = browser.new_page()
            page.goto(url)
            content = page.content()
            # ... manual parsing of raw HTML for every page ...
            page.close()
        browser.close()

# Good: Use Playwright for auth, Firecrawl for bulk scraping
def fast_scrape(urls):
    # Authenticate once with Playwright (see the sketch below)
    cookies = authenticate_with_playwright()

    # Bulk scrape with Firecrawl
    firecrawl = FirecrawlApp(api_key='key')
    results = firecrawl.batch_scrape_urls(urls, {
        'headers': {'Cookie': cookies}
    })
    return results
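
The authenticate_with_playwright() helper above is left abstract; one possible sketch, reusing the login pattern from earlier (the URL and selectors are placeholders), returns a ready-to-use Cookie header:

from playwright.sync_api import sync_playwright

def authenticate_with_playwright():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto('https://example.com/login')
        page.fill('#username', 'your_username')
        page.fill('#password', 'your_password')
        page.click('#login-button')
        page.wait_for_url('**/dashboard**')
        cookies = context.cookies()
        browser.close()
        # Serialize cookies into a single header value
        return '; '.join(f"{c['name']}={c['value']}" for c in cookies)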

3. Error Handling

const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function robustScrape(url) {
    const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
    let browser;

    try {
        browser = await chromium.launch();
        const page = await browser.newPage();

        // Set timeout
        page.setDefaultTimeout(30000);

        await page.goto(url, { waitUntil: 'networkidle' });

        // Perform interactions with error handling
        try {
            await page.click('#load-more', { timeout: 5000 });
        } catch (e) {
            console.log('Load more button not found, continuing...');
        }

        const cookies = await page.context().cookies();
        await browser.close();

        // Scrape with Firecrawl
        const result = await firecrawl.scrapeUrl(url, {
            formats: ['markdown'],
            headers: {
                'Cookie': cookies.map(c => `${c.name}=${c.value}`).join('; ')
            }
        });

        return result;

    } catch (error) {
        if (browser) await browser.close();

        // Fallback: Try Firecrawl alone
        console.log('Playwright failed, trying Firecrawl alone...');
        return await firecrawl.scrapeUrl(url, {
            formats: ['markdown']
        });
    }
}

Comparison: Firecrawl vs Playwright

| Feature | Firecrawl | Playwright | Combined |
|---------|-----------|------------|----------|
| JavaScript Rendering | ✅ Built-in | ✅ Full control | ✅ Best of both |
| Bulk Crawling | ✅ Optimized | ❌ Manual | ✅ Efficient |
| Complex Interactions | ⚠️ Limited | ✅ Complete | ✅ Complete |
| Markdown Output | ✅ Clean | ❌ Manual | ✅ Clean |
| Authentication | ⚠️ Basic | ✅ Advanced | ✅ Advanced |
| Cost | API-based | Server cost | Both |

Conclusion

Combining Firecrawl with Playwright gives you the best of both worlds: Playwright's powerful browser automation capabilities for complex interactions and authentication, paired with Firecrawl's efficient content extraction and clean markdown output. Use Playwright for the heavy lifting of browser automation, then let Firecrawl handle the content extraction and structuring.

For most use cases, Firecrawl's built-in JavaScript rendering is sufficient. Reserve the Playwright + Firecrawl combination for scenarios requiring complex authentication, multi-step interactions, or custom browser behavior that Firecrawl's waitFor parameter cannot handle.

Remember to always respect website terms of service, implement rate limiting, and handle errors gracefully in production environments.
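
For example, a simple client-side rate limit between Firecrawl calls might look like this (the one-second delay is an arbitrary placeholder; tune it to the target site's and the API's limits):

import time
from firecrawl import FirecrawlApp

def polite_scrape(urls, delay_seconds=1.0):
    firecrawl = FirecrawlApp(api_key='your_api_key')
    results = []
    for url in urls:
        results.append(firecrawl.scrape_url(url, {'formats': ['markdown']}))
        time.sleep(delay_seconds)  # crude rate limit between requests
    return results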
