How Do I Handle Errors and Exceptions with the Firecrawl API?

Error handling is a critical aspect of building robust web scraping applications with Firecrawl. Whether you're scraping a single page or crawling an entire website, implementing comprehensive error handling ensures your application can gracefully recover from failures, retry operations when appropriate, and provide meaningful feedback when issues occur.

This guide covers everything you need to know about handling errors and exceptions when working with the Firecrawl API, including common error types, retry strategies, and production-ready error handling patterns.

Understanding Firecrawl Error Types

Firecrawl can encounter various types of errors during scraping operations. Understanding these error categories helps you implement appropriate handling strategies:

1. API Authentication Errors

These occur when your API key is invalid, expired, or missing:

from firecrawl import FirecrawlApp

# Example: Invalid API key
try:
    app = FirecrawlApp(api_key='invalid_key')
    result = app.scrape_url('https://example.com')
except Exception as e:
    if '401' in str(e) or 'Unauthorized' in str(e):
        print("Authentication failed. Check your API key.")
    else:
        print(f"Unexpected error: {e}")

2. Rate Limit Errors

When you exceed your API quota or request rate limits:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

try {
  const result = await app.scrapeUrl('https://example.com');
} catch (error) {
  if (error.message.includes('429') || error.message.includes('rate limit')) {
    console.error('Rate limit exceeded. Wait before retrying.');
  } else {
    console.error('Error:', error.message);
  }
}

3. Timeout Errors

These occur when a page takes too long to load or respond. As with handling timeouts in Puppeteer, Firecrawl lets you configure a per-request timeout:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

try:
    result = app.scrape_url(
        'https://slow-loading-site.com',
        params={'timeout': 30000}  # 30 seconds
    )
except TimeoutError as e:
    print(f"Page load timeout: {e}")
except Exception as e:
    if 'timeout' in str(e).lower():
        print(f"Timeout occurred: {e}")
    else:
        raise  # re-raise errors that are not timeouts

4. Network Errors

Connection failures, DNS resolution issues, or network interruptions:

import FirecrawlApp from '@mendable/firecrawl-js';

async function scrapeWithNetworkHandling(url) {
  const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

  try {
    const result = await app.scrapeUrl(url);
    return result;
  } catch (error) {
    if (error.code === 'ENOTFOUND' || error.code === 'ECONNREFUSED') {
      console.error(`Network error accessing ${url}: ${error.message}`);
      return null; // treat DNS/connection failures as a soft failure
    } else if (error.message.includes('network')) {
      console.error('Network connectivity issue:', error.message);
      return null;
    } else {
      throw error;
    }
  }
}

5. Target Website Errors

HTTP status errors from the target website (404, 500, etc.):

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def scrape_with_status_handling(url):
    try:
        result = app.scrape_url(url)

        # Check metadata for HTTP status
        if 'statusCode' in result.get('metadata', {}):
            status_code = result['metadata']['statusCode']

            if status_code == 404:
                print(f"Page not found: {url}")
                return None
            elif status_code >= 500:
                print(f"Server error on target site: {status_code}")
                return None
            elif status_code >= 400:
                print(f"Client error: {status_code}")
                return None

        return result

    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
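
As a quick usage sketch (assuming the helper above and a configured API key), call the function and check for a None result before using the data:

page = scrape_with_status_handling('https://example.com/some-page')
if page is not None:
    print(page.get('markdown', '')[:200])
else:
    print("Scrape skipped or failed; see the messages above.")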

Implementing Retry Logic

Retry logic is essential for handling transient errors. Here's how to implement robust retry mechanisms:

Basic Retry Pattern with Exponential Backoff

from firecrawl import FirecrawlApp
import time

def scrape_with_retry(url, max_retries=3, initial_delay=1):
    """
    Scrape URL with exponential backoff retry logic

    Args:
        url: URL to scrape
        max_retries: Maximum number of retry attempts
        initial_delay: Initial delay in seconds (doubles on each retry)
    """
    app = FirecrawlApp(api_key='your_api_key')

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'timeout': 30000,
                    'formats': ['markdown']
                }
            )
            print(f"Successfully scraped {url} on attempt {attempt + 1}")
            return result

        except Exception as e:
            error_message = str(e)

            # Don't retry on authentication errors
            if '401' in error_message or 'Unauthorized' in error_message:
                print(f"Authentication error - not retrying: {e}")
                raise

            # Don't retry on 404 errors
            if '404' in error_message:
                print(f"Page not found - not retrying: {url}")
                return None

            # Retry on other errors
            if attempt < max_retries - 1:
                delay = initial_delay * (2 ** attempt)  # Exponential backoff
                print(f"Attempt {attempt + 1} failed: {e}")
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                print(f"Failed after {max_retries} attempts: {e}")
                raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    if data:
        print(data['markdown'][:200])
except Exception as e:
    print(f"Scraping failed permanently: {e}")

Advanced Retry with Custom Logic

import FirecrawlApp from '@mendable/firecrawl-js';

class RetryableFirecrawl {
  constructor(apiKey, options = {}) {
    this.app = new FirecrawlApp({ apiKey });
    this.maxRetries = options.maxRetries || 3;
    this.initialDelay = options.initialDelay || 1000;
    this.maxDelay = options.maxDelay || 30000;
  }

  async scrapeWithRetry(url, params = {}) {
    let lastError;

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        const result = await this.app.scrapeUrl(url, params);

        if (attempt > 0) {
          console.log(`Success on attempt ${attempt + 1}`);
        }

        return result;

      } catch (error) {
        lastError = error;

        // Check if error is retryable
        if (!this.isRetryableError(error)) {
          console.error(`Non-retryable error: ${error.message}`);
          throw error;
        }

        if (attempt < this.maxRetries - 1) {
          const delay = this.calculateDelay(attempt);
          console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
          console.log(`Retrying in ${delay}ms...`);
          await this.sleep(delay);
        }
      }
    }

    throw new Error(`Failed after ${this.maxRetries} attempts: ${lastError.message}`);
  }

  isRetryableError(error) {
    const message = error.message.toLowerCase();

    // Don't retry authentication errors
    if (message.includes('401') || message.includes('unauthorized')) {
      return false;
    }

    // Don't retry 404 errors
    if (message.includes('404') || message.includes('not found')) {
      return false;
    }

    // Don't retry 400 bad request errors
    if (message.includes('400') || message.includes('bad request')) {
      return false;
    }

    // Retry timeouts, rate limits, and server errors
    return (
      message.includes('timeout') ||
      message.includes('429') ||
      message.includes('rate limit') ||
      message.includes('500') ||
      message.includes('502') ||
      message.includes('503') ||
      message.includes('network')
    );
  }

  calculateDelay(attempt) {
    const exponentialDelay = this.initialDelay * Math.pow(2, attempt);
    const jitter = Math.random() * 1000; // Add random jitter
    return Math.min(exponentialDelay + jitter, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new RetryableFirecrawl(process.env.FIRECRAWL_API_KEY, {
  maxRetries: 5,
  initialDelay: 2000,
  maxDelay: 60000
});

try {
  const result = await scraper.scrapeWithRetry('https://example.com', {
    formats: ['markdown'],
    timeout: 30000
  });
  console.log('Success:', result.markdown.substring(0, 200));
} catch (error) {
  console.error('Scraping failed:', error.message);
}

Handling Crawl Job Errors

When crawling multiple pages, error handling becomes more complex. You need to handle both individual page errors and overall crawl job failures:

from firecrawl import FirecrawlApp

class CrawlErrorHandler:
    def __init__(self, api_key):
        self.app = FirecrawlApp(api_key=api_key)
        self.failed_urls = []
        self.successful_urls = []

    def crawl_with_error_tracking(self, base_url, params=None):
        """
        Crawl with comprehensive error tracking
        """
        if params is None:
            params = {}

        params.setdefault('limit', 100)
        params.setdefault('timeout', 30000)

        try:
            # Start the crawl
            print(f"Starting crawl of {base_url}")
            crawl_result = self.app.crawl_url(
                base_url,
                params=params,
                poll_interval=5
            )

            # Process results
            if 'data' in crawl_result:
                for page in crawl_result['data']:
                    url = page.get('metadata', {}).get('sourceURL', 'unknown')

                    # Check for errors in individual pages
                    if 'error' in page:
                        self.failed_urls.append({
                            'url': url,
                            'error': page['error']
                        })
                        print(f"Failed to scrape {url}: {page['error']}")
                    else:
                        self.successful_urls.append(url)
                        print(f"Successfully scraped {url}")

            return crawl_result

        except Exception as e:
            print(f"Crawl job failed: {e}")
            return None

    def get_statistics(self):
        """Return crawl statistics"""
        total = len(self.successful_urls) + len(self.failed_urls)
        success_rate = (len(self.successful_urls) / total * 100) if total > 0 else 0

        return {
            'total_pages': total,
            'successful': len(self.successful_urls),
            'failed': len(self.failed_urls),
            'success_rate': f"{success_rate:.2f}%",
            'failed_urls': self.failed_urls
        }

# Usage
handler = CrawlErrorHandler(api_key='your_api_key')

result = handler.crawl_with_error_tracking(
    'https://example.com',
    params={
        'limit': 50,
        'includePaths': ['/blog/*'],
        'timeout': 30000
    }
)

# Print statistics
stats = handler.get_statistics()
print(f"\nCrawl Statistics:")
print(f"Total pages: {stats['total_pages']}")
print(f"Successful: {stats['successful']}")
print(f"Failed: {stats['failed']}")
print(f"Success rate: {stats['success_rate']}")

if stats['failed_urls']:
    print("\nFailed URLs:")
    for failure in stats['failed_urls']:
        print(f"  - {failure['url']}: {failure['error']}")

Rate Limit Handling

Rate limiting is a common challenge when working with APIs. Here's how to handle it effectively, using approaches similar to those for handling errors in Puppeteer:

import FirecrawlApp from '@mendable/firecrawl-js';

class RateLimitHandler {
  constructor(apiKey, options = {}) {
    this.app = new FirecrawlApp({ apiKey });
    this.requestsPerMinute = options.requestsPerMinute || 60;
    this.minRequestInterval = 60000 / this.requestsPerMinute;
    this.lastRequestTime = 0;
  }

  async scrapeWithRateLimit(url, params = {}) {
    // Wait if necessary to respect rate limit
    await this.waitForRateLimit();

    try {
      const result = await this.app.scrapeUrl(url, params);
      this.lastRequestTime = Date.now();
      return result;

    } catch (error) {
      if (this.isRateLimitError(error)) {
        console.log('Rate limit hit. Waiting before retry...');

        // Extract retry-after header if available
        const retryAfter = this.extractRetryAfter(error);
        const waitTime = retryAfter || 60000; // Default to 60 seconds

        console.log(`Waiting ${waitTime}ms before retry`);
        await this.sleep(waitTime);

        // Retry once after waiting
        return await this.app.scrapeUrl(url, params);
      }

      throw error;
    }
  }

  async waitForRateLimit() {
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequestTime;

    if (timeSinceLastRequest < this.minRequestInterval) {
      const waitTime = this.minRequestInterval - timeSinceLastRequest;
      console.log(`Rate limiting: waiting ${waitTime}ms`);
      await this.sleep(waitTime);
    }
  }

  isRateLimitError(error) {
    const message = error.message.toLowerCase();
    return (
      message.includes('429') ||
      message.includes('rate limit') ||
      message.includes('too many requests')
    );
  }

  extractRetryAfter(error) {
    // Try to extract Retry-After header value
    if (error.response && error.response.headers) {
      const retryAfter = error.response.headers['retry-after'];
      if (retryAfter) {
        return parseInt(retryAfter, 10) * 1000; // Convert seconds to milliseconds
      }
    }
    return null;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new RateLimitHandler(process.env.FIRECRAWL_API_KEY, {
  requestsPerMinute: 30 // Conservative rate limit
});

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

for (const url of urls) {
  try {
    const result = await scraper.scrapeWithRateLimit(url, {
      formats: ['markdown']
    });
    console.log(`Scraped ${url}:`, result.markdown.substring(0, 100));
  } catch (error) {
    console.error(`Failed to scrape ${url}:`, error.message);
  }
}

Production-Ready Error Handling Pattern

Here's a comprehensive, production-ready error handling implementation:

from firecrawl import FirecrawlApp
import logging
import time
from typing import Optional, Dict, Any
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class ProductionFirecrawlClient:
    """Production-ready Firecrawl client with comprehensive error handling"""

    def __init__(self, api_key: str, max_retries: int = 3):
        self.app = FirecrawlApp(api_key=api_key)
        self.max_retries = max_retries
        self.stats = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'retried_requests': 0
        }

    def scrape(
        self,
        url: str,
        params: Optional[Dict[str, Any]] = None,
        retry: bool = True
    ) -> Optional[Dict[str, Any]]:
        """
        Scrape a URL with comprehensive error handling

        Args:
            url: URL to scrape
            params: Scraping parameters
            retry: Whether to retry on failure

        Returns:
            Scraped data or None on failure
        """
        self.stats['total_requests'] += 1

        if params is None:
            params = {}

        # Set default timeout if not specified
        params.setdefault('timeout', 30000)

        retries = self.max_retries if retry else 1

        for attempt in range(retries):
            try:
                logger.info(f"Scraping {url} (attempt {attempt + 1}/{retries})")

                result = self.app.scrape_url(url, params=params)

                # Validate result
                if not result or 'markdown' not in result:
                    logger.warning(f"Invalid result structure for {url}")
                    if attempt < retries - 1:
                        continue
                    return None

                self.stats['successful_requests'] += 1
                if attempt > 0:
                    self.stats['retried_requests'] += 1

                logger.info(f"Successfully scraped {url}")
                return result

            except Exception as e:
                error_type = self._classify_error(e)
                logger.error(f"Error scraping {url}: {error_type} - {str(e)}")

                # Handle specific error types
                if error_type == 'auth_error':
                    logger.critical("Authentication error - check API key")
                    self.stats['failed_requests'] += 1
                    return None

                elif error_type == 'not_found':
                    logger.warning(f"Page not found: {url}")
                    self.stats['failed_requests'] += 1
                    return None

                elif error_type == 'rate_limit':
                    if attempt < retries - 1:
                        wait_time = 60  # Wait 60 seconds for rate limit
                        logger.info(f"Rate limit hit. Waiting {wait_time}s...")
                        time.sleep(wait_time)
                        continue

                elif error_type == 'timeout':
                    if attempt < retries - 1:
                        # Increase timeout on retry
                        params['timeout'] = int(min(params['timeout'] * 1.5, 90000))
                        delay = 2 ** attempt
                        logger.info(f"Timeout. Retrying with {params['timeout']}ms timeout in {delay}s")
                        time.sleep(delay)
                        continue

                elif error_type == 'server_error':
                    if attempt < retries - 1:
                        delay = 2 ** attempt  # Exponential backoff
                        logger.info(f"Server error. Retrying in {delay}s...")
                        time.sleep(delay)
                        continue

                # Last attempt failed
                if attempt == retries - 1:
                    logger.error(f"Failed to scrape {url} after {retries} attempts")
                    self.stats['failed_requests'] += 1
                    return None

        return None

    def _classify_error(self, error: Exception) -> str:
        """Classify error type for appropriate handling"""
        error_str = str(error).lower()

        if '401' in error_str or 'unauthorized' in error_str:
            return 'auth_error'
        elif '404' in error_str or 'not found' in error_str:
            return 'not_found'
        elif '429' in error_str or 'rate limit' in error_str:
            return 'rate_limit'
        elif 'timeout' in error_str:
            return 'timeout'
        elif '500' in error_str or '502' in error_str or '503' in error_str:
            return 'server_error'
        elif 'network' in error_str or 'connection' in error_str:
            return 'network_error'
        else:
            return 'unknown_error'

    def get_stats(self) -> Dict[str, Any]:
        """Get scraping statistics"""
        success_rate = (
            self.stats['successful_requests'] / self.stats['total_requests'] * 100
            if self.stats['total_requests'] > 0
            else 0
        )

        return {
            **self.stats,
            'success_rate': f"{success_rate:.2f}%"
        }

# Usage example
client = ProductionFirecrawlClient(api_key='your_api_key', max_retries=3)

urls = [
    'https://example.com',
    'https://example.com/about',
    'https://example.com/contact'
]

results = []
for url in urls:
    result = client.scrape(
        url,
        params={
            'formats': ['markdown'],
            'onlyMainContent': True
        }
    )

    if result:
        results.append({
            'url': url,
            'markdown': result['markdown'],
            'timestamp': datetime.now().isoformat()
        })

# Print statistics
print("\nScraping Statistics:")
stats = client.get_stats()
for key, value in stats.items():
    print(f"{key}: {value}")

Best Practices for Error Handling

1. Log Everything

Maintain comprehensive logs for debugging and monitoring:

import logging

logging.basicConfig(
    filename='firecrawl_scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)

# Assumes `app = FirecrawlApp(api_key=...)` and `url` are defined as in the earlier examples
try:
    result = app.scrape_url(url)
    logger.info(f"Successfully scraped {url}")
except Exception as e:
    logger.error(f"Failed to scrape {url}: {e}", exc_info=True)

2. Use Circuit Breaker Pattern

Prevent cascading failures when a service is consistently failing:

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
      console.log(`Circuit breaker opened. Will retry after ${this.timeout}ms`);
    }
  }
}

3. Implement Graceful Degradation

Provide fallback behavior when scraping fails:

# Assumes `app` and `logger` are configured as in the earlier examples
def scrape_with_fallback(url, fallback_data=None):
    """Scrape with fallback to cached or default data"""
    try:
        result = app.scrape_url(url)
        return result
    except Exception as e:
        logger.warning(f"Scraping failed, using fallback: {e}")
        return fallback_data or {'markdown': 'Content unavailable', 'error': str(e)}
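
A brief usage sketch, assuming a hypothetical in-memory cache (cached_pages) populated from an earlier successful run; the cache and URL here are illustrative and not part of the Firecrawl API:

# Hypothetical cache from a previous successful run (illustrative only)
cached_pages = {
    'https://example.com/pricing': {'markdown': '# Pricing\n...'},
}

url = 'https://example.com/pricing'
data = scrape_with_fallback(url, fallback_data=cached_pages.get(url))
print(data['markdown'][:100])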

4. Monitor and Alert

Set up monitoring for critical errors:

# Assumes `app` is configured as in the earlier examples
def scrape_with_alerts(url, alert_callback=None):
    """Scrape with alerting on critical errors"""
    try:
        result = app.scrape_url(url)
        return result
    except Exception as e:
        if '401' in str(e) or '500' in str(e):
            # Critical error - send alert
            if alert_callback:
                alert_callback(f"Critical error scraping {url}: {e}")
        raise
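
A minimal way to wire this up is to pass a callback that logs at critical level or forwards the message to your alerting integration; the callback below is only a sketch:

import logging

logger = logging.getLogger(__name__)

def alert_on_critical(message):
    # Swap in your real alerting integration (email, Slack webhook, PagerDuty, etc.)
    logger.critical(message)

try:
    scrape_with_alerts('https://example.com', alert_callback=alert_on_critical)
except Exception:
    pass  # the error was already reported via the alert callback before being re-raised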

Handling Browser Automation Errors

Because Firecrawl uses browser automation internally, you might encounter errors similar to those seen when handling browser events in Puppeteer, such as JavaScript execution errors or navigation failures:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def scrape_dynamic_page_with_error_handling(url):
    """Handle errors specific to JavaScript-heavy pages"""
    try:
        result = app.scrape_url(
            url,
            params={
                'formats': ['markdown'],
                'waitFor': 5000,  # Wait for JavaScript
                'timeout': 45000  # Longer timeout for dynamic content
            }
        )
        return result

    except Exception as e:
        error_str = str(e).lower()

        if 'javascript' in error_str or 'script error' in error_str:
            print(f"JavaScript execution error on {url}: {e}")
            # Try again with longer wait time
            try:
                return app.scrape_url(
                    url,
                    params={
                        'formats': ['markdown'],
                        'waitFor': 10000,  # Longer wait
                        'timeout': 60000
                    }
                )
            except Exception:
                print("Failed even with extended wait time")
                return None

        elif 'navigation' in error_str:
            print(f"Navigation error on {url}: {e}")
            return None

        else:
            raise
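
A short usage sketch for the helper above (the URL is a placeholder):

content = scrape_dynamic_page_with_error_handling('https://spa-example.com/dashboard')
if content:
    print(content['markdown'][:200])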

Conclusion

Effective error handling is essential for building reliable web scraping applications with Firecrawl. By implementing proper retry logic, classifying errors appropriately, handling rate limits gracefully, and maintaining comprehensive logs, you can create robust scrapers that handle failures elegantly and recover automatically when possible.

Remember to:

  • Classify errors and handle them appropriately
  • Implement exponential backoff for retries
  • Respect rate limits and use appropriate delays
  • Log all errors for debugging and monitoring
  • Use circuit breakers to prevent cascading failures
  • Provide graceful degradation when scraping fails

With these patterns and best practices, your Firecrawl-based scraping applications will be production-ready and resilient to the various errors and exceptions that can occur during web scraping operations.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
