How do I use Firecrawl with Node.js?

Firecrawl is a powerful web scraping and crawling API that converts websites into clean, LLM-ready markdown or structured data. For Node.js, Firecrawl provides an official SDK that makes it easy to scrape single pages, crawl entire websites, and extract structured data with minimal configuration.

Installing Firecrawl for Node.js

To get started with Firecrawl in your Node.js project, install the official SDK using npm or yarn:

npm install @mendable/firecrawl-js

Or with yarn:

yarn add @mendable/firecrawl-js

Setting Up Firecrawl

Before using Firecrawl, you'll need to obtain an API key from the Firecrawl dashboard. Once you have your API key, initialize the Firecrawl client:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });

For better security, store your API key in environment variables:

import FirecrawlApp from '@mendable/firecrawl-js';
import dotenv from 'dotenv';

dotenv.config();

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
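A minimal .env file for this setup could look like the following (the key value is a placeholder; keep this file out of version control):

# .env
FIRECRAWL_API_KEY=your-firecrawl-api-key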

Scraping a Single Page

The most basic operation in Firecrawl is scraping a single page. The scrapeUrl method fetches a URL and returns its content in various formats:

async function scrapePage() {
  const scrapeResult = await app.scrapeUrl('https://example.com', {
    formats: ['markdown', 'html']
  });

  console.log(scrapeResult.markdown);
  console.log(scrapeResult.html);
}

scrapePage();

Scraping Options

Firecrawl supports various options to customize the scraping behavior:

const scrapeResult = await app.scrapeUrl('https://example.com', {
  formats: ['markdown', 'html', 'rawHtml', 'links', 'screenshot'],
  onlyMainContent: true,             // strip navigation, footers, and other page chrome
  includeTags: ['article', 'main'],  // only keep content inside these tags
  excludeTags: ['nav', 'footer'],    // drop content inside these tags
  waitFor: 2000,                     // wait 2 seconds for JavaScript to load
  timeout: 30000                     // 30 second timeout
});

This approach is particularly useful when you need to extract content from JavaScript-heavy pages, similar to how Puppeteer handles AJAX requests.
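When waitFor alone is not enough, recent versions of the API also accept page actions that run before the content is captured. The sketch below assumes the actions option is available on your plan and SDK version; the selector is a hypothetical placeholder:

const actionResult = await app.scrapeUrl('https://example.com/dashboard', {
  formats: ['markdown'],
  actions: [
    { type: 'wait', milliseconds: 2000 },       // let the initial scripts finish
    { type: 'click', selector: '#load-more' },  // hypothetical "load more" button
    { type: 'wait', milliseconds: 1000 }        // wait for the newly loaded content
  ]
});

console.log(actionResult.markdown);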

Crawling Multiple Pages

Firecrawl excels at crawling entire websites, automatically discovering and scraping linked pages. Use the crawlUrl method to start a crawl job:

async function crawlWebsite() {
  const crawlResult = await app.crawlUrl('https://example.com', {
    limit: 100,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  console.log(`Crawled ${crawlResult.data.length} pages`);

  crawlResult.data.forEach((page, index) => {
    console.log(`Page ${index + 1}: ${page.metadata.sourceURL}`);
    console.log(page.markdown.substring(0, 200) + '...\n');
  });
}

crawlWebsite();

Advanced Crawling Options

Control the crawling behavior with additional options:

const crawlResult = await app.crawlUrl('https://example.com', {
  limit: 100,                  // maximum number of pages to crawl
  maxDepth: 3,                 // maximum link depth to follow from the starting URL
  allowBackwardLinks: false,   // whether the crawler may navigate back to previously linked pages
  allowExternalLinks: false,   // whether to follow links to other domains
  ignoreSitemap: false,        // whether to skip the sitemap when discovering URLs
  scrapeOptions: {
    formats: ['markdown', 'html'],
    onlyMainContent: true
  }
});

Asynchronous Crawling for Large Sites

For large websites, use asynchronous crawling to avoid timeouts:

async function asyncCrawl() {
  const crawlResponse = await app.asyncCrawlUrl('https://example.com', {
    limit: 1000,
    scrapeOptions: {
      formats: ['markdown']
    }
  });

  // asyncCrawlUrl returns immediately with a job ID instead of waiting for results
  const crawlId = crawlResponse.id;
  console.log(`Crawl started with ID: ${crawlId}`);

  // Check crawl status
  let status = 'scraping';
  while (status === 'scraping') {
    const statusResponse = await app.checkCrawlStatus(crawlId);
    status = statusResponse.status;
    console.log(`Status: ${status}, Completed: ${statusResponse.completed}/${statusResponse.total}`);

    if (status === 'scraping') {
      await new Promise(resolve => setTimeout(resolve, 5000)); // Wait 5 seconds
    }
  }

  // Get results
  if (status === 'completed') {
    const results = await app.checkCrawlStatus(crawlId);
    console.log(`Crawled ${results.data.length} pages`);
  }
}

asyncCrawl();

Extracting Structured Data

Firecrawl can extract structured data using LLM-based extraction. Define a schema and let Firecrawl extract the data:

async function extractStructuredData() {
  const extractResult = await app.scrapeUrl('https://example.com/product', {
    formats: ['extract'],
    extract: {
      schema: {
        type: 'object',
        properties: {
          productName: { type: 'string' },
          price: { type: 'number' },
          description: { type: 'string' },
          availability: { type: 'string' },
          rating: { type: 'number' }
        },
        required: ['productName', 'price']
      }
    }
  });

  console.log(extractResult.extract);
}

extractStructuredData();
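Schema-based extraction gives the most predictable output. The extract format can also work from a natural-language prompt instead of a schema; the sketch below assumes prompt-based extraction is supported by your SDK version:

async function extractWithPrompt() {
  const result = await app.scrapeUrl('https://example.com/product', {
    formats: ['extract'],
    extract: {
      prompt: 'Extract the product name, price, and availability from this page.'
    }
  });

  console.log(result.extract);
}

extractWithPrompt();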

Batch Extraction

Extract data from multiple pages:

async function batchExtract() {
  const urls = [
    'https://example.com/product1',
    'https://example.com/product2',
    'https://example.com/product3'
  ];

  const schema = {
    type: 'object',
    properties: {
      productName: { type: 'string' },
      price: { type: 'number' }
    }
  };

  const results = await Promise.all(
    urls.map(url =>
      app.scrapeUrl(url, {
        formats: ['extract'],
        extract: { schema }
      })
    )
  );

  const products = results.map(r => r.extract);
  console.log(products);
}

batchExtract();
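Promise.all rejects as soon as any single request fails. If you want the batch to keep going when individual pages error out, a Promise.allSettled variant (plain JavaScript, not a Firecrawl-specific API) works well:

async function batchExtractSettled(urls, schema) {
  const settled = await Promise.allSettled(
    urls.map(url =>
      app.scrapeUrl(url, {
        formats: ['extract'],
        extract: { schema }
      })
    )
  );

  // Keep successful extractions, log the failures
  const products = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      products.push({ url: urls[i], ...outcome.value.extract });
    } else {
      console.error(`Failed to extract ${urls[i]}:`, outcome.reason.message);
    }
  });

  return products;
}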

Handling Authentication and Headers

Firecrawl supports custom headers for authenticated scraping:

const scrapeResult = await app.scrapeUrl('https://example.com/private', {
  formats: ['markdown'],
  headers: {
    'Authorization': 'Bearer YOUR_TOKEN',
    'Custom-Header': 'value'
  }
});

Using Firecrawl with TypeScript

Firecrawl provides TypeScript type definitions out of the box:

import FirecrawlApp, { ScrapeResponse } from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });

interface ProductData {
  productName: string;
  price: number;
  description: string;
}

async function scrapeProduct(url: string): Promise<ProductData> {
  const result: ScrapeResponse = await app.scrapeUrl(url, {
    formats: ['extract'],
    extract: {
      schema: {
        type: 'object',
        properties: {
          productName: { type: 'string' },
          price: { type: 'number' },
          description: { type: 'string' }
        }
      }
    }
  });

  return result.extract as ProductData;
}
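Calling the typed helper then looks like this (the product URL is a placeholder):

scrapeProduct('https://example.com/product/123')
  .then(product => console.log(`${product.productName}: ${product.price}`))
  .catch(error => console.error('Extraction failed:', error.message));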

Error Handling

Always implement proper error handling when using Firecrawl:

async function safeScrape(url) {
  try {
    const result = await app.scrapeUrl(url, {
      formats: ['markdown'],
      timeout: 30000
    });

    return result;
  } catch (error) {
    if (error.response) {
      // API error
      console.error(`API Error: ${error.response.status} - ${error.response.data.error}`);
    } else if (error.request) {
      // Network error
      console.error('Network error:', error.message);
    } else {
      // Other errors
      console.error('Error:', error.message);
    }

    return null;
  }
}

Similar to how you handle errors in Puppeteer, proper error handling ensures your scraping application is robust and resilient.
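For transient failures such as timeouts or rate-limit responses, it also helps to retry with exponential backoff. The wrapper below is a generic sketch around scrapeUrl, not part of the Firecrawl SDK:

async function scrapeWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await app.scrapeUrl(url, options);
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Back off 1s, 2s, 4s, ... before the next attempt
      const delay = 1000 * 2 ** (attempt - 1);
      console.warn(`Attempt ${attempt} failed (${error.message}), retrying in ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}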

Monitoring and Rate Limiting

Implement rate limiting to avoid overwhelming the API:

import pLimit from 'p-limit';

const limit = pLimit(5); // Max 5 concurrent requests

async function scrapeMultipleUrls(urls) {
  const promises = urls.map(url =>
    limit(() => app.scrapeUrl(url, { formats: ['markdown'] }))
  );

  const results = await Promise.all(promises);
  return results;
}

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  // ... more URLs
];

scrapeMultipleUrls(urls).then(results => {
  console.log(`Scraped ${results.length} pages`);
});

Working with Maps and Sitemaps

Firecrawl can generate a map of all URLs on a website without scraping content:

async function mapWebsite() {
  const mapResult = await app.mapUrl('https://example.com', {
    search: 'product',
    limit: 500
  });

  console.log(`Found ${mapResult.links.length} URLs`);
  mapResult.links.forEach(link => console.log(link));
}

mapWebsite();

Complete Example: E-commerce Product Scraper

Here's a complete example that combines multiple Firecrawl features:

import FirecrawlApp from '@mendable/firecrawl-js';
import fs from 'fs';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

async function scrapeEcommerceProducts() {
  // Step 1: Map the website to find product URLs
  const mapResult = await app.mapUrl('https://example-shop.com', {
    search: '/product/',
    limit: 100
  });

  console.log(`Found ${mapResult.links.length} product URLs`);

  // Step 2: Extract structured data from each product
  const productSchema = {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      description: { type: 'string' },
      imageUrl: { type: 'string' },
      inStock: { type: 'boolean' }
    }
  };

  const products = [];

  for (const url of mapResult.links.slice(0, 10)) {
    try {
      const result = await app.scrapeUrl(url, {
        formats: ['extract'],
        extract: { schema: productSchema }
      });

      products.push({
        url,
        ...result.extract
      });

      console.log(`Scraped: ${result.extract.name}`);

      // Rate limiting
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (error) {
      console.error(`Failed to scrape ${url}:`, error.message);
    }
  }

  // Step 3: Save results
  fs.writeFileSync('products.json', JSON.stringify(products, null, 2));
  console.log(`Saved ${products.length} products to products.json`);
}

scrapeEcommerceProducts();

Best Practices

  1. Use Environment Variables: Always store API keys in environment variables, never hardcode them
  2. Implement Rate Limiting: Respect API rate limits to avoid being throttled
  3. Handle Errors Gracefully: Implement comprehensive error handling for network issues and API errors
  4. Use Appropriate Formats: Choose the right output format for your use case (markdown for content, extract for structured data)
  5. Monitor Costs: Track your API usage to manage costs, especially for large crawling operations
  6. Cache Results: Store scraped data to avoid redundant API calls (see the sketch after this list)
  7. Set Timeouts: Use appropriate timeout values based on the complexity of pages you're scraping
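As an example of point 6, a small file-based cache keyed by a hash of the URL avoids paying for the same scrape twice. This is a generic sketch, not a Firecrawl feature:

import crypto from 'crypto';
import fs from 'fs';

const CACHE_DIR = './scrape-cache';

async function cachedScrape(url, options) {
  if (!fs.existsSync(CACHE_DIR)) fs.mkdirSync(CACHE_DIR);

  // One JSON file per URL, named by a hash of the URL
  const key = crypto.createHash('sha256').update(url).digest('hex');
  const cachePath = `${CACHE_DIR}/${key}.json`;

  if (fs.existsSync(cachePath)) {
    return JSON.parse(fs.readFileSync(cachePath, 'utf8'));
  }

  const result = await app.scrapeUrl(url, options);
  fs.writeFileSync(cachePath, JSON.stringify(result));
  return result;
}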

Conclusion

Firecrawl provides a powerful and developer-friendly way to scrape web content with Node.js. Whether you need to scrape single pages, crawl entire websites, or extract structured data, Firecrawl's SDK makes it straightforward with its clean API and comprehensive features. By following the examples and best practices outlined in this guide, you can build robust web scraping applications that handle modern websites effectively.

For more advanced scenarios like handling dynamic content and JavaScript-heavy pages, consider exploring how Puppeteer handles single page applications, which can complement your Firecrawl implementation when you need even more control over browser automation.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
