How do I use Firecrawl with Playwright for Browser Automation?
While Firecrawl is a powerful web scraping and crawling API that handles JavaScript-rendered content out of the box, there are scenarios where you might want to combine it with Playwright for advanced browser automation tasks. This guide explores different integration patterns, use cases, and practical implementations.
Understanding Firecrawl and Playwright
Firecrawl is a web scraping API that converts websites into clean, LLM-ready markdown or structured data. It handles JavaScript rendering, dynamic content, and provides features like sitemap crawling and automatic content extraction.
Playwright is a browser automation framework that provides low-level control over Chromium, Firefox, and WebKit browsers. It excels at complex interactions, custom JavaScript execution, and scenarios requiring fine-grained browser control.
When to Combine Firecrawl with Playwright
Before diving into integration, consider these use cases where combining both tools makes sense:
- Pre-authentication with Playwright: Use Playwright to handle complex OAuth flows or multi-step authentication, then pass authenticated session data to Firecrawl
- Complex User Interactions: Perform intricate user interactions with Playwright before scraping with Firecrawl
- Hybrid Workflows: Use Playwright for dynamic navigation and Firecrawl for efficient bulk content extraction
- Custom JavaScript Injection: Execute custom scripts with Playwright, then leverage Firecrawl's parsing capabilities
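One piece of glue recurs throughout these patterns: Playwright returns cookies as a list of dicts, while Firecrawl expects a single Cookie request header. A small helper (the `cookies_to_header` name is ours, not part of either library) keeps that conversion in one place:

```python
def cookies_to_header(cookies):
    """Convert Playwright cookie dicts into a single Cookie header string."""
    # context.cookies() returns dicts with 'name' and 'value' keys (among others);
    # HTTP expects 'name=value' pairs joined by '; '
    return '; '.join(f"{c['name']}={c['value']}" for c in cookies)
```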
Basic Integration Pattern
Here's a fundamental pattern for using Firecrawl alongside Playwright:
Python Implementation
```python
from playwright.sync_api import sync_playwright
from firecrawl import FirecrawlApp

# Initialize Firecrawl
firecrawl = FirecrawlApp(api_key='your_api_key_here')

def scrape_with_playwright_and_firecrawl(url):
    with sync_playwright() as p:
        # Launch browser with Playwright
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()

        # Navigate and perform complex interactions
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Example: handle authentication or complex navigation
        page.click('button#login')
        page.fill('input[name="username"]', 'your_username')
        page.fill('input[name="password"]', 'your_password')
        page.click('button[type="submit"]')
        page.wait_for_selector('.dashboard')

        # Get cookies from the authenticated session
        cookies = context.cookies()

        # Close Playwright
        browser.close()

    # Now use Firecrawl with the authenticated session cookies
    result = firecrawl.scrape_url(url, {
        'formats': ['markdown', 'html'],
        'headers': {
            'Cookie': '; '.join(f"{c['name']}={c['value']}" for c in cookies)
        }
    })
    return result

# Usage
data = scrape_with_playwright_and_firecrawl('https://example.com')
print(data['markdown'])
```
JavaScript/Node.js Implementation
```javascript
const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key_here' });

async function scrapeWithPlaywrightAndFirecrawl(url) {
  // Launch Playwright browser
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  try {
    // Navigate and perform interactions
    await page.goto(url, { waitUntil: 'networkidle' });

    // Example: complex interaction
    await page.click('button#load-more');
    await page.waitForSelector('.dynamic-content');

    // Get the final URL after redirects/navigation
    const finalUrl = page.url();

    // Extract cookies for authenticated requests
    const cookies = await context.cookies();
    const cookieHeader = cookies.map(c => `${c.name}=${c.value}`).join('; ');

    await browser.close();

    // Use Firecrawl to scrape the content
    const result = await firecrawl.scrapeUrl(finalUrl, {
      formats: ['markdown', 'html'],
      headers: {
        'Cookie': cookieHeader
      }
    });
    return result;
  } catch (error) {
    await browser.close();
    throw error;
  }
}

// Usage
scrapeWithPlaywrightAndFirecrawl('https://example.com')
  .then(data => console.log(data.markdown))
  .catch(error => console.error(error));
```
Advanced Use Cases
1. Session Management and Authentication
For sites requiring complex authentication flows, similar to handling authentication in Puppeteer:
```python
from playwright.sync_api import sync_playwright
from firecrawl import FirecrawlApp

def authenticate_and_scrape(login_url, target_url, credentials):
    firecrawl = FirecrawlApp(api_key='your_api_key')

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Navigate to login page
        page.goto(login_url)

        # Handle multi-step authentication
        page.fill('#username', credentials['username'])
        page.click('#next-button')
        page.wait_for_selector('#password')
        page.fill('#password', credentials['password'])
        page.click('#login-button')

        # Wait for authentication to complete
        page.wait_for_url('**/dashboard**')

        # Extract cookies (and storage state, if you want to reuse the session later)
        cookies = context.cookies()
        storage_state = context.storage_state()

        browser.close()

    # Convert cookies to header format
    cookie_header = '; '.join(f"{c['name']}={c['value']}" for c in cookies)

    # Use Firecrawl with the authenticated session
    results = firecrawl.crawl_url(target_url, {
        'limit': 100,
        'headers': {
            'Cookie': cookie_header
        }
    })
    return results

# Usage
data = authenticate_and_scrape(
    'https://example.com/login',
    'https://example.com/protected-area',
    {'username': 'user', 'password': 'pass'}
)
```
2. Dynamic Content Loading
When dealing with infinite scroll or dynamic content loading, you can use Playwright to trigger content loading before letting Firecrawl extract the data:
```javascript
const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function scrapeInfiniteScroll(url) {
  const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle' });

  // Scroll until the page height stops growing
  let previousHeight = 0;
  while (true) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000);
    previousHeight = currentHeight;
  }

  // Get the fully loaded HTML
  const html = await page.content();
  await browser.close();

  // Now use Firecrawl to parse and structure the data.
  // You can use Firecrawl's scrape endpoint with the HTML content
  // or scrape the URL directly if the content is now cached.
  const result = await firecrawl.scrapeUrl(url, {
    formats: ['markdown'],
    waitFor: 0 // Content already loaded
  });
  return result;
}
```
3. Handling Pop-ups and Modals
For sites with complex modal interactions, you can handle them with Playwright before scraping, similar to handling pop-ups and modals in Puppeteer:
```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError
from firecrawl import FirecrawlApp

def handle_modals_and_scrape(url):
    firecrawl = FirecrawlApp(api_key='your_api_key')

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Set up dialog handler
        page.on('dialog', lambda dialog: dialog.accept())

        # Navigate to page
        page.goto(url)

        # Wait for and handle cookie consent modal
        try:
            page.wait_for_selector('#cookie-consent', timeout=5000)
            page.click('#accept-cookies')
        except PlaywrightTimeoutError:
            pass  # No cookie banner appeared

        # Handle newsletter popup
        try:
            page.wait_for_selector('.newsletter-modal', timeout=3000)
            page.click('.close-modal')
        except PlaywrightTimeoutError:
            pass  # No newsletter popup appeared

        # Get cookies after handling modals
        cookies = context.cookies()
        browser.close()

    # Scrape with Firecrawl
    cookie_header = '; '.join(f"{c['name']}={c['value']}" for c in cookies)
    result = firecrawl.scrape_url(url, {
        'formats': ['markdown'],
        'headers': {'Cookie': cookie_header}
    })
    return result
```
4. Monitoring Network Requests
You can combine Playwright's network monitoring with Firecrawl's scraping capabilities, similar to monitoring network requests in Puppeteer:
```javascript
const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function monitorAndScrape(url) {
  const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  const apiCalls = [];

  // Monitor API calls
  page.on('response', async (response) => {
    if (response.url().includes('/api/')) {
      try {
        const data = await response.json();
        apiCalls.push({
          url: response.url(),
          status: response.status(),
          data: data
        });
      } catch (e) {
        // Not a JSON response
      }
    }
  });

  await page.goto(url, { waitUntil: 'networkidle' });
  await page.waitForTimeout(2000);
  await browser.close();

  // Scrape the page with Firecrawl
  const scrapedData = await firecrawl.scrapeUrl(url, {
    formats: ['markdown']
  });

  return {
    scrapedContent: scrapedData,
    apiCalls: apiCalls
  };
}
```
Best Practices
1. Choose the Right Tool for the Job
- Use Firecrawl alone when you need clean markdown output, bulk crawling, or simple JavaScript-rendered content
- Use Playwright alone when you need complex browser interactions without content extraction
- Combine both when you need complex interactions followed by efficient content extraction
2. Optimize Performance
```python
# Bad: using Playwright for everything
def slow_scrape(urls):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for url in urls:
            page = browser.new_page()
            page.goto(url)
            content = page.content()
            # Manual parsing...
        browser.close()

# Good: use Playwright for auth, Firecrawl for bulk scraping
def fast_scrape(urls):
    # Authenticate once with Playwright
    cookies = authenticate_with_playwright()

    # Bulk scrape with Firecrawl
    firecrawl = FirecrawlApp(api_key='key')
    results = firecrawl.batch_scrape_urls(urls, {
        'headers': {'Cookie': cookies}
    })
    return results
```
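Bulk scraping also benefits from client-side rate limiting so you do not hammer the target site or your own API quota. A minimal sketch that enforces a minimum interval between successive calls (the class name and interval are illustrative, not part of either library):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep calls min_interval apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Usage: create one `RateLimiter` per target site and call `limiter.wait()` before each scrape call.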
3. Error Handling
```javascript
const { chromium } = require('playwright');
const FirecrawlApp = require('@mendable/firecrawl-js').default;

async function robustScrape(url) {
  const firecrawl = new FirecrawlApp({ apiKey: 'your_api_key' });
  let browser;

  try {
    browser = await chromium.launch();
    const page = await browser.newPage();

    // Set a default timeout for all page operations
    page.setDefaultTimeout(30000);
    await page.goto(url, { waitUntil: 'networkidle' });

    // Perform interactions with error handling
    try {
      await page.click('#load-more', { timeout: 5000 });
    } catch (e) {
      console.log('Load more button not found, continuing...');
    }

    const cookies = await page.context().cookies();
    await browser.close();

    // Scrape with Firecrawl
    const result = await firecrawl.scrapeUrl(url, {
      formats: ['markdown'],
      headers: {
        'Cookie': cookies.map(c => `${c.name}=${c.value}`).join('; ')
      }
    });
    return result;
  } catch (error) {
    if (browser) await browser.close();

    // Fallback: try Firecrawl alone
    console.log('Playwright failed, trying Firecrawl alone...');
    return await firecrawl.scrapeUrl(url, {
      formats: ['markdown']
    });
  }
}
```
Comparison: Firecrawl vs Playwright
| Feature | Firecrawl | Playwright | Combined |
|---------|-----------|------------|----------|
| JavaScript Rendering | ✅ Built-in | ✅ Full control | ✅ Best of both |
| Bulk Crawling | ✅ Optimized | ❌ Manual | ✅ Efficient |
| Complex Interactions | ⚠️ Limited | ✅ Complete | ✅ Complete |
| Markdown Output | ✅ Clean | ❌ Manual | ✅ Clean |
| Authentication | ⚠️ Basic | ✅ Advanced | ✅ Advanced |
| Cost | API-based | Server cost | Both |
Conclusion
Combining Firecrawl with Playwright gives you the best of both worlds: Playwright's powerful browser automation capabilities for complex interactions and authentication, paired with Firecrawl's efficient content extraction and clean markdown output. Use Playwright for the heavy lifting of browser automation, then let Firecrawl handle the content extraction and structuring.
For most use cases, Firecrawl's built-in JavaScript rendering is sufficient. Reserve the Playwright + Firecrawl combination for scenarios requiring complex authentication, multi-step interactions, or custom browser behavior that Firecrawl's waitFor parameter cannot handle.
Remember to always respect website terms of service, implement rate limiting, and handle errors gracefully in production environments.