How do I use Firecrawl with Node.js?
Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. When working with Node.js, Firecrawl provides an official SDK that makes it easy to scrape single pages, crawl entire websites, and extract structured data with minimal configuration.
Installing Firecrawl for Node.js
To get started with Firecrawl in your Node.js project, install the official SDK using npm or yarn:
npm install @mendable/firecrawl-js
Or with yarn:
yarn add @mendable/firecrawl-js
Setting Up Firecrawl
Before using Firecrawl, you'll need to obtain an API key from the Firecrawl dashboard. Once you have your API key, initialize the Firecrawl client:
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });
For better security, store your API key in environment variables:
import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';
dotenv.config();
const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
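For reference, the .env file loaded above only needs a single entry; the value shown here is a placeholder, not a real key:
FIRECRAWL_API_KEY=your-firecrawl-api-key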
Scraping a Single Page
The most basic operation in Firecrawl is scraping a single page. The scrapeUrl
method fetches a URL and returns its content in various formats:
async function scrapePage() {
  const scrapeResult = await app.scrapeUrl('https://example.com', {
    formats: ['markdown', 'html']
  });

  console.log(scrapeResult.markdown);
  console.log(scrapeResult.html);
}

scrapePage();
Scraping Options
Firecrawl supports various options to customize the scraping behavior:
const scrapeResult = await app.scrapeUrl('https://example.com', {
  formats: ['markdown', 'html', 'rawHtml', 'links', 'screenshot'],
  onlyMainContent: true,
  includeTags: ['article', 'main'],
  excludeTags: ['nav', 'footer'],
  waitFor: 2000, // Wait 2 seconds for JavaScript to load
  timeout: 30000 // 30 second timeout
});
This approach is particularly useful when you need to extract content from JavaScript-heavy pages, similar to how Puppeteer handles AJAX requests.
Crawling Multiple Pages
Firecrawl excels at crawling entire websites, automatically discovering and scraping linked pages. Use the crawlUrl
method to start a crawl job:
async function crawlWebsite() {
  const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  console.log(`Crawled ${crawlResult.data.length} pages`);

  crawlResult.data.forEach((page, index) => {
    console.log(`Page ${index + 1}: ${page.metadata.sourceURL}`);
    console.log(page.markdown.substring(0, 200) + '...\n');
  });
}

crawlWebsite();
Advanced Crawling Options
Control the crawling behavior with additional options:
const crawlResult = await app.crawlUrl('https://example.com', {
  limit: 100,
  maxDepth: 3,
  allowBackwardLinks: false,
  allowExternalLinks: false,
  ignoreSitemap: false,
  scrapeOptions: {
    formats: ['markdown', 'html'],
    onlyMainContent: true
  }
});
Asynchronous Crawling for Large Sites
For large websites, use asynchronous crawling to avoid timeouts:
async function asyncCrawl() {
  const crawlResponse = await app.asyncCrawlUrl('https://example.com', {
    limit: 1000,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  console.log(`Crawl started with ID: ${crawlResponse.id}`);

  // Poll the crawl status until the job finishes
  let statusResponse = await app.checkCrawlStatus(crawlResponse.id);
  while (statusResponse.status === 'scraping') {
    console.log(`Status: ${statusResponse.status}, Completed: ${statusResponse.completed}/${statusResponse.total}`);
    await new Promise(resolve => setTimeout(resolve, 5000)); // Wait 5 seconds between checks
    statusResponse = await app.checkCrawlStatus(crawlResponse.id);
  }

  // Get results
  if (statusResponse.status === 'completed') {
    console.log(`Crawled ${statusResponse.data.length} pages`);
  }
}

asyncCrawl();
Extracting Structured Data
Firecrawl can extract structured data using LLM-based extraction. Define a schema and let Firecrawl extract the data:
async function extractStructuredData() {
  const extractResult = await app.scrapeUrl('https://example.com/product', {
    formats: ['extract'],
    extract: {
      schema: {
        type: 'object',
        properties: {
          productName: { type: 'string' },
          price: { type: 'number' },
          description: { type: 'string' },
          availability: { type: 'string' },
          rating: { type: 'number' }
        },
        required: ['productName', 'price']
      }
    }
  });

  console.log(extractResult.extract);
}

extractStructuredData();
Batch Extraction
Extract data from multiple pages:
async function batchExtract() {
  const urls = [
    'https://example.com/product1',
    'https://example.com/product2',
    'https://example.com/product3'
  ];

  const schema = {
    type: 'object',
    properties: {
      productName: { type: 'string' },
      price: { type: 'number' }
    }
  };

  const results = await Promise.all(
    urls.map(url =>
      app.scrapeUrl(url, {
        formats: ['extract'],
        extract: { schema }
      })
    )
  );

  const products = results.map(r => r.extract);
  console.log(products);
}

batchExtract();
Handling Authentication and Headers
Firecrawl supports custom headers for authenticated scraping:
const scrapeResult = await app.scrapeUrl('https://example.com/private', {
  formats: ['markdown'],
  headers: {
    'Authorization': 'Bearer YOUR_TOKEN',
    'Custom-Header': 'value'
  }
});
Using Firecrawl with TypeScript
Firecrawl provides TypeScript type definitions out of the box:
import FirecrawlApp, { ScrapeResponse, CrawlResponse } from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });

interface ProductData {
  productName: string;
  price: number;
  description: string;
}

async function scrapeProduct(url: string): Promise<ProductData> {
  const result: ScrapeResponse = await app.scrapeUrl(url, {
    formats: ['extract'],
    extract: {
      schema: {
        type: 'object',
        properties: {
          productName: { type: 'string' },
          price: { type: 'number' },
          description: { type: 'string' }
        }
      }
    }
  });

  return result.extract as ProductData;
}
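A call site for this helper might look like the following (the product URL is a placeholder):
scrapeProduct('https://example.com/product/123')
  .then(product => console.log(`${product.productName}: $${product.price}`))
  .catch(error => console.error('Extraction failed:', error));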
Error Handling
Always implement proper error handling when using Firecrawl:
async function safeScrape(url) {
  try {
    const result = await app.scrapeUrl(url, {
      formats: ['markdown'],
      timeout: 30000
    });
    return result;
  } catch (error) {
    if (error.response) {
      // API error
      console.error(`API Error: ${error.response.status} - ${error.response.data.error}`);
    } else if (error.request) {
      // Network error
      console.error('Network error:', error.message);
    } else {
      // Other errors
      console.error('Error:', error.message);
    }
    return null;
  }
}
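Transient failures such as timeouts or rate-limit responses are often worth retrying. The following is a minimal sketch of a retry wrapper with exponential backoff around scrapeUrl; the retry count and delay values are arbitrary choices for illustration, not Firecrawl recommendations:
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await app.scrapeUrl(url, { formats: ['markdown'], timeout: 30000 });
    } catch (error) {
      if (attempt === maxRetries) throw error; // Give up after the last attempt
      const delay = 1000 * 2 ** attempt; // Back off: 2s, 4s, 8s...
      console.warn(`Attempt ${attempt} failed for ${url}, retrying in ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}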
Similar to how you handle errors in Puppeteer, proper error handling ensures your scraping application is robust and resilient.
Monitoring and Rate Limiting
Implement rate limiting to avoid overwhelming the API:
import pLimit from 'p-limit';

const limit = pLimit(5); // Max 5 concurrent requests

async function scrapeMultipleUrls(urls) {
  const promises = urls.map(url =>
    limit(() => app.scrapeUrl(url, { formats: ['markdown'] }))
  );

  const results = await Promise.all(promises);
  return results;
}

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  // ... more URLs
];

scrapeMultipleUrls(urls).then(results => {
  console.log(`Scraped ${results.length} pages`);
});
Working with Maps and Sitemaps
Firecrawl can generate a map of all URLs on a website without scraping content:
async function mapWebsite() {
  const mapResult = await app.mapUrl('https://example.com', {
    search: 'product',
    limit: 500
  });

  console.log(`Found ${mapResult.links.length} URLs`);
  mapResult.links.forEach(link => console.log(link));
}

mapWebsite();
Complete Example: E-commerce Product Scraper
Here's a complete example that combines multiple Firecrawl features:
import FirecrawlApp from '@mendable/firecrawl-js';
import fs from 'fs';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

async function scrapeEcommerceProducts() {
  // Step 1: Map the website to find product URLs
  const mapResult = await app.mapUrl('https://example-shop.com', {
    search: '/product/',
    limit: 100
  });

  console.log(`Found ${mapResult.links.length} product URLs`);

  // Step 2: Extract structured data from each product
  const productSchema = {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      description: { type: 'string' },
      imageUrl: { type: 'string' },
      inStock: { type: 'boolean' }
    }
  };

  const products = [];

  for (const url of mapResult.links.slice(0, 10)) {
    try {
      const result = await app.scrapeUrl(url, {
        formats: ['extract'],
        extract: { schema: productSchema }
      });

      products.push({
        url,
        ...result.extract
      });

      console.log(`Scraped: ${result.extract.name}`);

      // Rate limiting
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (error) {
      console.error(`Failed to scrape ${url}:`, error.message);
    }
  }

  // Step 3: Save results
  fs.writeFileSync('products.json', JSON.stringify(products, null, 2));
  console.log(`Saved ${products.length} products to products.json`);
}

scrapeEcommerceProducts();
Best Practices
- Use Environment Variables: Always store API keys in environment variables, never hardcode them
- Implement Rate Limiting: Respect API rate limits to avoid being throttled
- Handle Errors Gracefully: Implement comprehensive error handling for network issues and API errors
- Use Appropriate Formats: Choose the right output format for your use case (markdown for content, extract for structured data)
- Monitor Costs: Track your API usage to manage costs, especially for large crawling operations
- Cache Results: Store scraped data to avoid redundant API calls (see the sketch after this list)
- Set Timeouts: Use appropriate timeout values based on the complexity of pages you're scraping
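One simple way to follow the caching advice above is a small file-based cache keyed by URL. This is a minimal sketch, assuming a local JSON file is an acceptable store; the file name and the lack of an expiry policy are simplifying assumptions:
import fs from 'fs';

const CACHE_FILE = 'scrape-cache.json'; // Hypothetical cache location

const cache = fs.existsSync(CACHE_FILE)
  ? JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'))
  : {};

async function cachedScrape(url) {
  if (cache[url]) {
    return cache[url]; // Serve the stored result without spending an API call
  }

  const result = await app.scrapeUrl(url, { formats: ['markdown'] });
  cache[url] = result;
  fs.writeFileSync(CACHE_FILE, JSON.stringify(cache, null, 2));
  return result;
}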
Conclusion
Firecrawl provides a powerful and developer-friendly way to scrape web content with Node.js. Whether you need to scrape single pages, crawl entire websites, or extract structured data, Firecrawl's SDK makes it straightforward with its clean API and comprehensive features. By following the examples and best practices outlined in this guide, you can build robust web scraping applications that handle modern websites effectively.
For more advanced scenarios like handling dynamic content and JavaScript-heavy pages, consider exploring how Puppeteer handles single page applications, which can complement your Firecrawl implementation when you need even more control over browser automation.