What are the differences between Firecrawl's crawl and scrape endpoints?

Firecrawl offers two primary API endpoints for web data extraction: the scrape endpoint for single-page extraction and the crawl endpoint for multi-page website crawling. Understanding the differences between these endpoints is crucial for choosing the right tool for your web scraping project.

Overview of Firecrawl's Endpoints

The Scrape Endpoint

The scrape endpoint is designed for single-page data extraction. It fetches and processes one URL at a time, making it ideal when you need to extract data from specific pages without following links to other pages.

Key characteristics:

  • Processes a single URL per request
  • Returns data immediately (synchronous operation)
  • Lower API credit consumption
  • Perfect for targeted data extraction
  • Supports JavaScript rendering
  • Converts HTML to clean Markdown format

The Crawl Endpoint

The crawl endpoint is built for multi-page website crawling. It automatically discovers and extracts data from multiple pages on a website by following links, respecting crawl depth limits, and managing the entire crawling process.

Key characteristics:

  • Processes multiple URLs automatically
  • Asynchronous operation (returns a job ID)
  • Higher API credit consumption
  • Ideal for extracting data from entire websites or sections
  • Supports sitemap-based crawling
  • Includes link discovery and URL filtering

When to Use Each Endpoint

Use the Scrape Endpoint When:

  1. Extracting data from a known URL: You have a specific page URL and need its content
  2. Real-time data needs: You require immediate results for a single page
  3. Low-volume scraping: Processing individual pages on demand
  4. API integration testing: Testing your integration with a single endpoint
  5. Monitoring specific pages: Tracking changes on particular URLs (a simple change-detection sketch follows this list)
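
For the page-monitoring case, a lightweight approach is to scrape the URL on a schedule and compare a hash of the returned Markdown. The sketch below uses the v0 scrape endpoint shown later in this article; the target URL, the hourly interval, and the hashing approach are illustrative choices rather than anything prescribed by Firecrawl.

import hashlib
import time

import requests

API_KEY = 'your_api_key_here'
SCRAPE_URL = 'https://api.firecrawl.dev/v0/scrape'
TARGET = 'https://example.com/pricing'  # hypothetical page to watch
CHECK_INTERVAL = 3600                   # check once an hour (illustrative)

def fetch_markdown():
    """Scrape TARGET once and return its Markdown content."""
    response = requests.post(
        SCRAPE_URL,
        json={'url': TARGET, 'formats': ['markdown'], 'onlyMainContent': True},
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()['data']['markdown']

last_hash = None
while True:  # run until interrupted
    current_hash = hashlib.sha256(fetch_markdown().encode('utf-8')).hexdigest()
    if last_hash is not None and current_hash != last_hash:
        print(f"{TARGET} changed at {time.ctime()}")
    last_hash = current_hash
    time.sleep(CHECK_INTERVAL)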

Use the Crawl Endpoint When:

  1. Scraping entire websites: Extracting data from all product pages, blog posts, or articles
  2. Discovering content: You don't know all URLs in advance
  3. Batch processing: Processing large volumes of related pages
  4. Site archiving: Creating snapshots of entire website sections
  5. Competitive analysis: Gathering data across competitor websites

Code Examples

Using the Scrape Endpoint

Python Example

import requests

API_KEY = 'your_api_key_here'
url = 'https://api.firecrawl.dev/v0/scrape'

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
}

data = {
    'url': 'https://example.com/product/123',
    'formats': ['markdown', 'html'],
    'onlyMainContent': True
}

response = requests.post(url, json=data, headers=headers)
result = response.json()

print("Extracted content:")
print(result['data']['markdown'])

JavaScript Example

const axios = require('axios');

const API_KEY = 'your_api_key_here';
const url = 'https://api.firecrawl.dev/v0/scrape';

const scrapeData = async () => {
  try {
    const response = await axios.post(url, {
      url: 'https://example.com/product/123',
      formats: ['markdown', 'html'],
      onlyMainContent: true
    }, {
      headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'Content-Type': 'application/json'
      }
    });

    console.log('Extracted content:');
    console.log(response.data.data.markdown);
  } catch (error) {
    console.error('Scraping error:', error.message);
  }
};

scrapeData();

Using the Crawl Endpoint

Python Example

import requests
import time

API_KEY = 'your_api_key_here'
crawl_url = 'https://api.firecrawl.dev/v0/crawl'
status_url = 'https://api.firecrawl.dev/v0/crawl/status'

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
}

# Start crawl job
data = {
    'url': 'https://example.com',
    'crawlerOptions': {
        'maxDepth': 3,
        'limit': 100,
        'includePaths': ['/products/*']
    },
    'pageOptions': {
        'onlyMainContent': True
    }
}

response = requests.post(crawl_url, json=data, headers=headers)
job_id = response.json()['jobId']

print(f"Crawl job started: {job_id}")

# Poll for results
while True:
    status_response = requests.get(
        f"{status_url}/{job_id}",
        headers=headers
    )
    status_data = status_response.json()

    if status_data['status'] == 'completed':
        print(f"Crawl completed! Pages found: {len(status_data['data'])}")
        for page in status_data['data']:
            print(f"URL: {page['url']}")
            print(f"Content: {page['markdown'][:100]}...")
        break
    elif status_data['status'] == 'failed':
        print("Crawl failed!")
        break

    print(f"Status: {status_data['status']}")
    time.sleep(5)

JavaScript Example

const axios = require('axios');

const API_KEY = 'your_api_key_here';
const crawlUrl = 'https://api.firecrawl.dev/v0/crawl';
const statusUrl = 'https://api.firecrawl.dev/v0/crawl/status';

const headers = {
  'Authorization': `Bearer ${API_KEY}`,
  'Content-Type': 'application/json'
};

const crawlWebsite = async () => {
  try {
    // Start crawl job
    const crawlResponse = await axios.post(crawlUrl, {
      url: 'https://example.com',
      crawlerOptions: {
        maxDepth: 3,
        limit: 100,
        includePaths: ['/products/*']
      },
      pageOptions: {
        onlyMainContent: true
      }
    }, { headers });

    const jobId = crawlResponse.data.jobId;
    console.log(`Crawl job started: ${jobId}`);

    // Poll for results
    while (true) {
      const statusResponse = await axios.get(
        `${statusUrl}/${jobId}`,
        { headers }
      );

      const status = statusResponse.data;

      if (status.status === 'completed') {
        console.log(`Crawl completed! Pages found: ${status.data.length}`);
        status.data.forEach(page => {
          console.log(`URL: ${page.url}`);
          console.log(`Content: ${page.markdown.substring(0, 100)}...`);
        });
        break;
      } else if (status.status === 'failed') {
        console.log('Crawl failed!');
        break;
      }

      console.log(`Status: ${status.status}`);
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  } catch (error) {
    console.error('Crawl error:', error.message);
  }
};

crawlWebsite();

Technical Differences

Response Format

Scrape Endpoint:

  • Returns a synchronous response
  • Immediate data availability
  • Single page object in the response

{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nContent...",
    "html": "<html>...</html>",
    "metadata": {
      "title": "Page Title",
      "description": "Page description"
    }
  }
}

Crawl Endpoint:

  • Returns an asynchronous job ID
  • Requires polling for completion
  • Array of page objects when complete

{
  "success": true,
  "jobId": "crawl-job-123",
  "status": "processing"
}

Performance Considerations

Scrape Endpoint:

  • Response time: typically 2-10 seconds per page
  • No queuing delays
  • Suitable for real-time applications
  • Can be combined with your own browser automation (such as Puppeteer) when you need custom page interactions

Crawl Endpoint:

  • Total time varies with website size
  • Queue-based processing
  • Better suited for batch operations
  • Automatically handles page navigation across the entire site

Credit Consumption

Scrape Endpoint:

  • 1 credit per page request
  • Predictable cost per operation
  • No overhead for link discovery

Crawl Endpoint:

  • Credits based on the number of pages crawled
  • Additional overhead for link discovery
  • Often more cost-effective for large-scale scraping (a rough back-of-the-envelope comparison follows)
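
As a rough back-of-the-envelope comparison (assuming the 1-credit-per-page figure above and that crawl credits scale with pages actually fetched; check your Firecrawl plan for exact pricing):

# Illustrative estimate only; real billing depends on your Firecrawl plan.
known_urls = 200     # pages you already have exact URLs for
crawl_limit = 250    # 'limit' you would give the crawler so it can discover extra pages

scrape_credits = known_urls * 1      # roughly 1 credit per scrape request (see above)
crawl_credits_max = crawl_limit * 1  # worst case: every crawled page consumes a credit

print(f"Scrape endpoint: ~{scrape_credits} credits across {known_urls} separate requests")
print(f"Crawl endpoint:  up to ~{crawl_credits_max} credits for a single job (plus polling calls)")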

Advanced Configuration Options

Scrape Endpoint Options

scrape_options = {
    'url': 'https://example.com/page',
    'formats': ['markdown', 'html', 'links'],  # Output formats
    'onlyMainContent': True,  # Extract main content only
    'includeTags': ['article', 'main'],  # Specific tags to include
    'excludeTags': ['nav', 'footer'],  # Tags to exclude
    'waitFor': 3000  # Wait time in milliseconds
}
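
These options form the JSON body of the scrape request. A minimal sketch of sending them, reusing the v0 endpoint and API key from the earlier examples:

import requests

API_KEY = 'your_api_key_here'

response = requests.post(
    'https://api.firecrawl.dev/v0/scrape',
    json=scrape_options,   # the options dict defined above
    headers={'Authorization': f'Bearer {API_KEY}'},
    timeout=60,
)
response.raise_for_status()
page = response.json()['data']
print(page['metadata']['title'])
print(page['markdown'][:200])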

Crawl Endpoint Options

crawl_options = {
    'url': 'https://example.com',
    'crawlerOptions': {
        'maxDepth': 3,  # Maximum crawl depth
        'limit': 500,  # Maximum pages to crawl
        'includePaths': ['/blog/*', '/products/*'],  # URL patterns to include
        'excludePaths': ['/admin/*'],  # URL patterns to exclude
        'allowBackwardLinks': False,  # Follow links to parent pages
        'allowExternalLinks': False  # Follow external links
    },
    'pageOptions': {
        'onlyMainContent': True,
        'formats': ['markdown']
    }
}
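
The includePaths and excludePaths entries are glob-style URL patterns. If you want a quick local preview of which known URLs a pattern set would keep, a sketch like the one below can help; it only approximates the filtering with Python's fnmatch, and Firecrawl's own matching rules remain authoritative:

from fnmatch import fnmatch
from urllib.parse import urlparse

include = crawl_options['crawlerOptions']['includePaths']   # ['/blog/*', '/products/*']
exclude = crawl_options['crawlerOptions']['excludePaths']   # ['/admin/*']

def would_keep(url):
    """Rough local preview of include/exclude filtering (assumes simple glob-style matching)."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    return any(fnmatch(path, pattern) for pattern in include)

for candidate in [
    'https://example.com/blog/launch-post',
    'https://example.com/products/widget-42',
    'https://example.com/admin/settings',
    'https://example.com/about',
]:
    print(candidate, '->', 'crawl' if would_keep(candidate) else 'skip')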

Error Handling

Scrape Endpoint Error Handling

try:
    response = requests.post(scrape_url, json=data, headers=headers)
    response.raise_for_status()
    result = response.json()

    if not result.get('success'):
        print(f"Scraping failed: {result.get('error')}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")

Crawl Endpoint Error Handling

# Check crawl status for errors
status = requests.get(f"{status_url}/{job_id}", headers=headers).json()

if status['status'] == 'failed':
    print(f"Crawl failed: {status.get('error')}")
    print(f"Failed pages: {status.get('failedPages', [])}")
elif status['status'] == 'completed':
    # Check for partial failures
    if 'failedPages' in status and len(status['failedPages']) > 0:
        print(f"Completed with {len(status['failedPages'])} failed pages")

Best Practices

For Scrape Endpoint:

  1. Implement retry logic for transient failures (a minimal sketch follows this list)
  2. Cache results to minimize repeated requests
  3. Use appropriate wait times for JavaScript-heavy pages
  4. Handle rate limits by spacing requests appropriately
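
A minimal retry-with-backoff wrapper around the scrape call might look like the sketch below; the attempt count, delays, and the choice of which status codes to retry are illustrative and should be tuned to your rate limits:

import time

import requests

API_KEY = 'your_api_key_here'
SCRAPE_URL = 'https://api.firecrawl.dev/v0/scrape'

def scrape_with_retries(page_url, max_attempts=4, base_delay=2):
    """POST to the scrape endpoint, retrying rate limits, server errors, and network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                SCRAPE_URL,
                json={'url': page_url, 'formats': ['markdown'], 'onlyMainContent': True},
                headers={'Authorization': f'Bearer {API_KEY}'},
                timeout=60,
            )
            if response.status_code != 429 and response.status_code < 500:
                response.raise_for_status()   # other 4xx errors are permanent and raise here
                return response.json()['data']
            reason = f"HTTP {response.status_code}"   # 429 / 5xx: treat as transient
        except requests.exceptions.RequestException as exc:
            if isinstance(exc, requests.exceptions.HTTPError):
                raise                                  # permanent client error, do not retry
            reason = str(exc)                          # timeout or connection error: retry
        if attempt == max_attempts:
            raise RuntimeError(f"Scrape of {page_url} failed after {max_attempts} attempts ({reason})")
        delay = base_delay * 2 ** (attempt - 1)        # exponential backoff: 2s, 4s, 8s, ...
        print(f"Attempt {attempt} failed ({reason}); retrying in {delay}s")
        time.sleep(delay)

page = scrape_with_retries('https://example.com/product/123')
print(page['markdown'][:200])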

For Crawl Endpoint:

  1. Set appropriate depth limits to control scope
  2. Use URL patterns to filter relevant pages
  3. Monitor job status regularly for large crawls (see the polling sketch after this list)
  4. Handle partial failures gracefully
  5. Consider using sitemaps for more efficient crawling
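
Putting several of these practices together, here is a more defensive polling loop; it assumes the v0 status endpoint used earlier and an optional failedPages field (as in the error-handling section above), and the timeout and interval values are illustrative:

import time

import requests

API_KEY = 'your_api_key_here'
STATUS_URL = 'https://api.firecrawl.dev/v0/crawl/status'
HEADERS = {'Authorization': f'Bearer {API_KEY}'}

def wait_for_crawl(job_id, poll_interval=10, timeout=1800):
    """Poll a crawl job until it completes, fails, or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{STATUS_URL}/{job_id}", headers=HEADERS, timeout=30).json()

        if status['status'] == 'completed':
            failed = status.get('failedPages', [])
            if failed:
                print(f"Completed with {len(failed)} failed pages; consider re-scraping them individually")
            return status.get('data', [])
        if status['status'] == 'failed':
            raise RuntimeError(f"Crawl {job_id} failed: {status.get('error')}")

        print(f"Status: {status['status']}, pages so far: {len(status.get('data') or [])}")
        time.sleep(poll_interval)

    raise TimeoutError(f"Crawl {job_id} did not finish within {timeout} seconds")

pages = wait_for_crawl('crawl-job-123')   # hypothetical job ID from a previous POST to /v0/crawl
print(f"Retrieved {len(pages)} pages")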

Conclusion

The choice between Firecrawl's crawl and scrape endpoints depends on your specific needs. Use the scrape endpoint for targeted, single-page extraction with immediate results, and the crawl endpoint for comprehensive, multi-page data collection across entire websites or sections. Understanding these differences helps optimize both performance and cost for your web scraping projects.

For dynamic websites that require handling AJAX requests or complex JavaScript rendering, both endpoints support JavaScript execution, ensuring you can extract data from modern single-page applications effectively.
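
For example, giving the renderer extra time with the waitFor option shown earlier is often enough for AJAX-heavy pages; a minimal sketch (the target URL and wait time are illustrative):

import requests

API_KEY = 'your_api_key_here'

response = requests.post(
    'https://api.firecrawl.dev/v0/scrape',
    json={
        'url': 'https://example.com/spa-dashboard',   # hypothetical JavaScript-heavy page
        'formats': ['markdown'],
        'onlyMainContent': True,
        'waitFor': 5000,   # allow ~5 seconds of client-side rendering before capture
    },
    headers={'Authorization': f'Bearer {API_KEY}'},
    timeout=90,
)
response.raise_for_status()
print(response.json()['data']['markdown'][:300])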

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
