How do I use Firecrawl with Puppeteer for web scraping?
Firecrawl is a modern web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. While Firecrawl handles JavaScript rendering internally, you can also integrate it with Puppeteer to create powerful hybrid scraping workflows that combine Firecrawl's data extraction capabilities with Puppeteer's browser automation features.
Understanding Firecrawl and Puppeteer Integration
Firecrawl and Puppeteer serve complementary purposes in web scraping:
- Firecrawl: Provides managed infrastructure for scraping, automatic JavaScript rendering, and intelligent data extraction
- Puppeteer: Offers fine-grained browser control for complex interactions, authentication flows, and custom JavaScript execution
Combining both tools allows you to leverage Firecrawl's simplicity for data extraction while using Puppeteer for scenarios requiring custom browser automation.
Installation and Setup
Installing Dependencies
First, install both Firecrawl and Puppeteer:
npm install @mendable/firecrawl-js puppeteer
For Python projects:
pip install firecrawl-py
Note that Puppeteer itself is a Node.js library; there is no official Python package. For browser automation from Python, use a port such as pyppeteer or an alternative like Playwright.
Basic Configuration
Set up your Firecrawl API key as an environment variable:
export FIRECRAWL_API_KEY='your_api_key_here'
Approach 1: Using Firecrawl for Simple Scraping
For most web scraping tasks, Firecrawl alone is sufficient and simpler than using Puppeteer:
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithFirecrawl(url) {
  try {
    // Scrape a single page
    const result = await firecrawl.scrapeUrl(url, {
      formats: ['markdown', 'html'],
      onlyMainContent: true,
      waitFor: 2000 // Wait for JavaScript to render
    });

    console.log('Markdown content:', result.markdown);
    console.log('Metadata:', result.metadata);
    return result;
  } catch (error) {
    console.error('Scraping failed:', error);
  }
}

// Usage
scrapeWithFirecrawl('https://example.com');
Python equivalent:
from firecrawl import FirecrawlApp
import os

# Read the API key from the environment variable set earlier
firecrawl = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])

def scrape_with_firecrawl(url):
    result = firecrawl.scrape_url(
        url,
        params={
            'formats': ['markdown', 'html'],
            'onlyMainContent': True,
            'waitFor': 2000
        }
    )

    print('Markdown:', result['markdown'])
    print('Metadata:', result['metadata'])
    return result

# Usage
scrape_with_firecrawl('https://example.com')
Approach 2: Puppeteer Pre-processing with Firecrawl Extraction
Use Puppeteer to handle complex authentication or interactions, then pass the resulting page to Firecrawl for data extraction:
import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithAuthentication(loginUrl, targetUrl, credentials) {
  const browser = await puppeteer.launch({ headless: true });

  try {
    const page = await browser.newPage();

    // Navigate to the login page
    await page.goto(loginUrl, { waitUntil: 'networkidle2' });

    // Perform the login using Puppeteer
    await page.type('#username', credentials.username);
    await page.type('#password', credentials.password);
    await page.click('button[type="submit"]');

    // Wait for navigation after login
    await page.waitForNavigation({ waitUntil: 'networkidle2' });

    // Get cookies for the authenticated session
    const cookies = await page.cookies();

    // Use Firecrawl with the authenticated session cookies
    const result = await firecrawl.scrapeUrl(targetUrl, {
      formats: ['markdown'],
      headers: {
        'Cookie': cookies.map(c => `${c.name}=${c.value}`).join('; ')
      }
    });

    return result;
  } catch (error) {
    console.error('Error:', error);
  } finally {
    // Close the browser exactly once, whether or not an error occurred
    await browser.close();
  }
}

// Usage
scrapeWithAuthentication(
  'https://example.com/login',
  'https://example.com/protected-page',
  { username: 'user', password: 'pass' }
);
This approach is particularly useful when you need to handle authentication in Puppeteer before scraping protected content.
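Since logging in on every run is slow, you can also persist the cookies Puppeteer captures and rebuild the Cookie header on later runs. A minimal sketch using Node's built-in fs module; the cookies.json path is an assumption, not part of either library:

import fs from 'node:fs';

// Save cookies captured by Puppeteer (e.g. the array returned by page.cookies())
function saveCookies(cookies, path = 'cookies.json') {
  fs.writeFileSync(path, JSON.stringify(cookies, null, 2));
}

// Rebuild the "name=value; ..." header string expected by Firecrawl's headers option
function loadCookieHeader(path = 'cookies.json') {
  const cookies = JSON.parse(fs.readFileSync(path, 'utf8'));
  return cookies.map(c => `${c.name}=${c.value}`).join('; ');
}

Keep in mind that sessions expire, so fall back to the Puppeteer login flow when a scrape starts returning the login page instead of the protected content.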
Approach 3: Parallel Processing with Both Tools
Leverage Puppeteer for browser-based tasks while using Firecrawl's crawling capabilities for site-wide data extraction:
import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function hybridCrawl(baseUrl) {
  // Use Firecrawl to discover and scrape all pages
  const crawlResult = await firecrawl.crawlUrl(baseUrl, {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  // Launch Puppeteer for screenshot generation
  const browser = await puppeteer.launch();
  const results = [];

  for (const page of crawlResult.data) {
    // Get content from Firecrawl
    const content = page.markdown;

    // Use Puppeteer for the screenshot
    const puppeteerPage = await browser.newPage();
    await puppeteerPage.goto(page.metadata.sourceURL);
    const screenshot = await puppeteerPage.screenshot({
      fullPage: true
    });
    await puppeteerPage.close();

    results.push({
      url: page.metadata.sourceURL,
      content: content,
      screenshot: screenshot
    });
  }

  await browser.close();
  return results;
}

// Usage
hybridCrawl('https://example.com');
Approach 4: Using Firecrawl's Built-in Actions (Recommended)
Firecrawl now supports browser actions natively, eliminating the need for Puppeteer in many cases:
import FirecrawlApp from '@mendable/firecrawl-js';

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

async function scrapeWithActions(url) {
  const result = await firecrawl.scrapeUrl(url, {
    formats: ['markdown'],
    actions: [
      { type: 'wait', milliseconds: 2000 },
      { type: 'click', selector: 'button.load-more' },
      { type: 'wait', milliseconds: 1000 },
      { type: 'scroll', direction: 'down' },
      { type: 'screenshot' }
    ]
  });

  return result;
}

// Usage
scrapeWithActions('https://example.com/dynamic-content');
This approach handles many scenarios that previously required Puppeteer, such as waiting for elements, clicking buttons, and scrolling pages. Learn more about handling AJAX requests using Puppeteer to understand when browser automation is necessary.
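One case where dropping down to Puppeteer still helps is when a page only reveals its data after a specific AJAX call, where fixed delays are flaky. A minimal Puppeteer sketch that waits on the network response itself, assuming a hypothetical data endpoint whose URL contains /api/:

import puppeteer from 'puppeteer';

async function scrapeAfterAjax(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Start listening for the AJAX response before navigation triggers it
  const responsePromise = page.waitForResponse(
    res => res.url().includes('/api/') && res.status() === 200
  );

  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await responsePromise;

  const html = await page.content();
  await browser.close();
  return html;
}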
Advanced Pattern: Custom JavaScript Execution
Combine Puppeteer's custom JavaScript injection with Firecrawl's extraction:
import puppeteer from 'puppeteer';
import FirecrawlApp from '@mendable/firecrawl-js';

async function scrapeWithCustomJS(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Execute custom JavaScript in the page context
  await page.evaluate(() => {
    // Remove ads and popups
    document.querySelectorAll('.ad, .popup').forEach(el => el.remove());

    // Trigger lazy-loaded content
    window.scrollTo(0, document.body.scrollHeight);
  });

  // Give lazy-loaded content time to appear
  // (page.waitForTimeout was removed in recent Puppeteer versions)
  await new Promise(resolve => setTimeout(resolve, 2000));

  // Get the cleaned HTML
  const html = await page.content();
  await browser.close();

  // Use Firecrawl to extract markdown from the same URL.
  // Note: Firecrawl scrapes by URL, so the cleaned HTML above isn't passed
  // to it directly; keep the Puppeteer output if you need the modified DOM.
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  const result = await firecrawl.scrapeUrl(url, {
    formats: ['markdown'],
    onlyMainContent: true
  });

  return result;
}
Handling Dynamic Content
For single-page applications and dynamic websites, you can use Firecrawl's waitFor option:
const result = await firecrawl.scrapeUrl('https://spa-example.com', {
  formats: ['markdown'],
  waitFor: 3000,  // Wait 3 seconds for JavaScript to render
  timeout: 30000  // Overall request timeout
});
For more complex scenarios requiring specific element waits, you might want to understand how to crawl a single page application (SPA) using Puppeteer.
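If you do go the Puppeteer route for an SPA, waiting for a specific element is usually more robust than a fixed delay. A minimal sketch, assuming a hypothetical .results-list selector that appears once the app has finished rendering:

import puppeteer from 'puppeteer';

async function waitForSpaContent(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Block until the element rendered after the data loads is actually visible
  await page.waitForSelector('.results-list', { visible: true, timeout: 15000 });

  const html = await page.content();
  await browser.close();
  return html;
}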
Error Handling and Retries
Implement robust error handling for both tools:
import FirecrawlApp from '@mendable/firecrawl-js';

async function robustScrape(url, maxRetries = 3) {
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await firecrawl.scrapeUrl(url, {
        formats: ['markdown'],
        timeout: 30000
      });
      return result;
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts`);
      }

      // Exponential backoff before the next attempt
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * Math.pow(2, attempt))
      );
    }
  }
}
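The same retry-with-backoff pattern can wrap Puppeteer work. A minimal sketch, assuming a caller-supplied, hypothetical scrapePage callback that performs the actual extraction:

import puppeteer from 'puppeteer';

async function robustPuppeteerScrape(url, scrapePage, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const browser = await puppeteer.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      return await scrapePage(page); // caller-supplied extraction logic
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);
      if (attempt === maxRetries) throw error;

      // Exponential backoff before the next attempt
      await new Promise(resolve =>
        setTimeout(resolve, 1000 * Math.pow(2, attempt))
      );
    } finally {
      await browser.close();
    }
  }
}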
Performance Optimization
When processing multiple URLs, use concurrent requests with rate limiting:
import pLimit from 'p-limit';
import FirecrawlApp from '@mendable/firecrawl-js';

async function batchScrape(urls) {
  const firecrawl = new FirecrawlApp({
    apiKey: process.env.FIRECRAWL_API_KEY
  });

  const limit = pLimit(5); // Limit to 5 concurrent requests

  const promises = urls.map(url =>
    limit(async () => {
      try {
        return await firecrawl.scrapeUrl(url, {
          formats: ['markdown']
        });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
        return null;
      }
    })
  );

  return await Promise.all(promises);
}

// Usage
const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

batchScrape(urls).then(results => {
  console.log('Scraped pages:', results.filter(r => r !== null).length);
});
When to Use Which Tool
Use Firecrawl alone when:
- You need clean, markdown-formatted content
- The site doesn't require complex authentication
- You want managed infrastructure and automatic retries
- You need to crawl entire websites systematically

Use Puppeteer alone when:
- You need complete control over browser behavior
- You're performing complex UI testing
- You need to interact with browser APIs directly

Use both together when:
- You need custom authentication flows before scraping
- You require pre-processing of pages before extraction
- You want to combine screenshots with content extraction
- You need to execute custom JavaScript before data extraction
Conclusion
While Firecrawl and Puppeteer can work together, Firecrawl's modern API eliminates the need for Puppeteer in most web scraping scenarios: it handles JavaScript rendering, provides clean markdown output, and includes built-in actions for common browser interactions. Reserve Puppeteer for cases requiring fine-grained browser control or complex authentication workflows.
For simpler scraping needs without the complexity of either tool, consider using a managed web scraping API that handles proxies, JavaScript rendering, and data extraction automatically.