# How do I Handle Errors and Exceptions with the Firecrawl API?
Error handling is a critical aspect of building robust web scraping applications with Firecrawl. Whether you're scraping a single page or crawling an entire website, implementing comprehensive error handling ensures your application can gracefully recover from failures, retry operations when appropriate, and provide meaningful feedback when issues occur.
This guide covers everything you need to know about handling errors and exceptions when working with the Firecrawl API, including common error types, retry strategies, and production-ready error handling patterns.
## Understanding Firecrawl Error Types
Firecrawl can encounter various types of errors during scraping operations. Understanding these error categories helps you implement appropriate handling strategies:
### 1. API Authentication Errors
These occur when your API key is invalid, expired, or missing:
```python
from firecrawl import FirecrawlApp

# Example: Invalid API key
try:
    app = FirecrawlApp(api_key='invalid_key')
    result = app.scrape_url('https://example.com')
except Exception as e:
    if '401' in str(e) or 'Unauthorized' in str(e):
        print("Authentication failed. Check your API key.")
    else:
        print(f"Unexpected error: {e}")
```
### 2. Rate Limit Errors
When you exceed your API quota or request rate limits:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

try {
  const result = await app.scrapeUrl('https://example.com');
} catch (error) {
  if (error.message.includes('429') || error.message.includes('rate limit')) {
    console.error('Rate limit exceeded. Wait before retrying.');
  } else {
    console.error('Error:', error.message);
  }
}
```
### 3. Timeout Errors
These occur when a page takes too long to load or respond. Similar to handling timeouts in Puppeteer, Firecrawl provides timeout configuration:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

try:
    result = app.scrape_url(
        'https://slow-loading-site.com',
        params={'timeout': 30000}  # 30 seconds
    )
except TimeoutError as e:
    print(f"Page load timeout: {e}")
except Exception as e:
    if 'timeout' in str(e).lower():
        print(f"Timeout occurred: {e}")
    else:
        raise  # Re-raise errors that are not timeouts
```
### 4. Network Errors
Connection failures, DNS resolution issues, or network interruptions:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

async function scrapeWithNetworkHandling(url) {
  const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

  try {
    const result = await app.scrapeUrl(url);
    return result;
  } catch (error) {
    if (error.code === 'ENOTFOUND' || error.code === 'ECONNREFUSED') {
      console.error(`Network error accessing ${url}: ${error.message}`);
    } else if (error.message.includes('network')) {
      console.error('Network connectivity issue:', error.message);
    } else {
      throw error;
    }
  }
}
```
### 5. Target Website Errors
HTTP status errors from the target website (404, 500, etc.):
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def scrape_with_status_handling(url):
    try:
        result = app.scrape_url(url)

        # Check metadata for HTTP status
        if 'statusCode' in result.get('metadata', {}):
            status_code = result['metadata']['statusCode']

            if status_code == 404:
                print(f"Page not found: {url}")
                return None
            elif status_code >= 500:
                print(f"Server error on target site: {status_code}")
                return None
            elif status_code >= 400:
                print(f"Client error: {status_code}")
                return None

        return result
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
```
## Implementing Retry Logic
Retry logic is essential for handling transient errors. Here's how to implement robust retry mechanisms:
### Basic Retry Pattern with Exponential Backoff
```python
from firecrawl import FirecrawlApp
import time

def scrape_with_retry(url, max_retries=3, initial_delay=1):
    """
    Scrape URL with exponential backoff retry logic.

    Args:
        url: URL to scrape
        max_retries: Maximum number of retry attempts
        initial_delay: Initial delay in seconds (doubles on each retry)
    """
    app = FirecrawlApp(api_key='your_api_key')

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'timeout': 30000,
                    'formats': ['markdown']
                }
            )
            print(f"Successfully scraped {url} on attempt {attempt + 1}")
            return result
        except Exception as e:
            error_message = str(e)

            # Don't retry on authentication errors
            if '401' in error_message or 'Unauthorized' in error_message:
                print(f"Authentication error - not retrying: {e}")
                raise

            # Don't retry on 404 errors
            if '404' in error_message:
                print(f"Page not found - not retrying: {url}")
                return None

            # Retry on other errors
            if attempt < max_retries - 1:
                delay = initial_delay * (2 ** attempt)  # Exponential backoff
                print(f"Attempt {attempt + 1} failed: {e}")
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                print(f"Failed after {max_retries} attempts: {e}")
                raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    if data:
        print(data['markdown'][:200])
except Exception as e:
    print(f"Scraping failed permanently: {e}")
```
### Advanced Retry with Custom Logic
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

class RetryableFirecrawl {
  constructor(apiKey, options = {}) {
    this.app = new FirecrawlApp({ apiKey });
    this.maxRetries = options.maxRetries || 3;
    this.initialDelay = options.initialDelay || 1000;
    this.maxDelay = options.maxDelay || 30000;
  }

  async scrapeWithRetry(url, params = {}) {
    let lastError;

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        const result = await this.app.scrapeUrl(url, params);
        if (attempt > 0) {
          console.log(`Success on attempt ${attempt + 1}`);
        }
        return result;
      } catch (error) {
        lastError = error;

        // Check if error is retryable
        if (!this.isRetryableError(error)) {
          console.error(`Non-retryable error: ${error.message}`);
          throw error;
        }

        if (attempt < this.maxRetries - 1) {
          const delay = this.calculateDelay(attempt);
          console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
          console.log(`Retrying in ${delay}ms...`);
          await this.sleep(delay);
        }
      }
    }

    throw new Error(`Failed after ${this.maxRetries} attempts: ${lastError.message}`);
  }

  isRetryableError(error) {
    const message = error.message.toLowerCase();

    // Don't retry authentication errors
    if (message.includes('401') || message.includes('unauthorized')) {
      return false;
    }
    // Don't retry 404 errors
    if (message.includes('404') || message.includes('not found')) {
      return false;
    }
    // Don't retry 400 bad request errors
    if (message.includes('400') || message.includes('bad request')) {
      return false;
    }

    // Retry timeouts, rate limits, and server errors
    return (
      message.includes('timeout') ||
      message.includes('429') ||
      message.includes('rate limit') ||
      message.includes('500') ||
      message.includes('502') ||
      message.includes('503') ||
      message.includes('network')
    );
  }

  calculateDelay(attempt) {
    const exponentialDelay = this.initialDelay * Math.pow(2, attempt);
    const jitter = Math.random() * 1000; // Add random jitter
    return Math.min(exponentialDelay + jitter, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new RetryableFirecrawl(process.env.FIRECRAWL_API_KEY, {
  maxRetries: 5,
  initialDelay: 2000,
  maxDelay: 60000
});

try {
  const result = await scraper.scrapeWithRetry('https://example.com', {
    formats: ['markdown'],
    timeout: 30000
  });
  console.log('Success:', result.markdown.substring(0, 200));
} catch (error) {
  console.error('Scraping failed:', error.message);
}
```
## Handling Crawl Job Errors
When crawling multiple pages, error handling becomes more complex. You need to handle both individual page errors and overall crawl job failures:
```python
from firecrawl import FirecrawlApp

class CrawlErrorHandler:
    def __init__(self, api_key):
        self.app = FirecrawlApp(api_key=api_key)
        self.failed_urls = []
        self.successful_urls = []

    def crawl_with_error_tracking(self, base_url, params=None):
        """Crawl with comprehensive error tracking."""
        if params is None:
            params = {}
        params.setdefault('limit', 100)
        params.setdefault('timeout', 30000)

        try:
            # Start the crawl
            print(f"Starting crawl of {base_url}")
            crawl_result = self.app.crawl_url(
                base_url,
                params=params,
                poll_interval=5
            )

            # Process results
            if 'data' in crawl_result:
                for page in crawl_result['data']:
                    url = page.get('metadata', {}).get('sourceURL', 'unknown')

                    # Check for errors in individual pages
                    if 'error' in page:
                        self.failed_urls.append({
                            'url': url,
                            'error': page['error']
                        })
                        print(f"Failed to scrape {url}: {page['error']}")
                    else:
                        self.successful_urls.append(url)
                        print(f"Successfully scraped {url}")

            return crawl_result
        except Exception as e:
            print(f"Crawl job failed: {e}")
            return None

    def get_statistics(self):
        """Return crawl statistics."""
        total = len(self.successful_urls) + len(self.failed_urls)
        success_rate = (len(self.successful_urls) / total * 100) if total > 0 else 0
        return {
            'total_pages': total,
            'successful': len(self.successful_urls),
            'failed': len(self.failed_urls),
            'success_rate': f"{success_rate:.2f}%",
            'failed_urls': self.failed_urls
        }

# Usage
handler = CrawlErrorHandler(api_key='your_api_key')
result = handler.crawl_with_error_tracking(
    'https://example.com',
    params={
        'limit': 50,
        'includePaths': ['/blog/*'],
        'timeout': 30000
    }
)

# Print statistics
stats = handler.get_statistics()
print("\nCrawl Statistics:")
print(f"Total pages: {stats['total_pages']}")
print(f"Successful: {stats['successful']}")
print(f"Failed: {stats['failed']}")
print(f"Success rate: {stats['success_rate']}")

if stats['failed_urls']:
    print("\nFailed URLs:")
    for failure in stats['failed_urls']:
        print(f"  - {failure['url']}: {failure['error']}")
```
## Rate Limit Handling
Rate limiting is a common challenge when working with APIs. Here's how to handle it effectively, similar to approaches used when handling errors in Puppeteer:
```javascript
import FirecrawlApp from '@mendable/firecrawl-js';

class RateLimitHandler {
  constructor(apiKey, options = {}) {
    this.app = new FirecrawlApp({ apiKey });
    this.requestQueue = [];
    this.requestsPerMinute = options.requestsPerMinute || 60;
    this.minRequestInterval = 60000 / this.requestsPerMinute;
    this.lastRequestTime = 0;
  }

  async scrapeWithRateLimit(url, params = {}) {
    // Wait if necessary to respect the rate limit
    await this.waitForRateLimit();

    try {
      const result = await this.app.scrapeUrl(url, params);
      this.lastRequestTime = Date.now();
      return result;
    } catch (error) {
      if (this.isRateLimitError(error)) {
        console.log('Rate limit hit. Waiting before retry...');

        // Extract Retry-After header if available
        const retryAfter = this.extractRetryAfter(error);
        const waitTime = retryAfter || 60000; // Default to 60 seconds

        console.log(`Waiting ${waitTime}ms before retry`);
        await this.sleep(waitTime);

        // Retry once after waiting
        return await this.app.scrapeUrl(url, params);
      }
      throw error;
    }
  }

  async waitForRateLimit() {
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequestTime;

    if (timeSinceLastRequest < this.minRequestInterval) {
      const waitTime = this.minRequestInterval - timeSinceLastRequest;
      console.log(`Rate limiting: waiting ${waitTime}ms`);
      await this.sleep(waitTime);
    }
  }

  isRateLimitError(error) {
    const message = error.message.toLowerCase();
    return (
      message.includes('429') ||
      message.includes('rate limit') ||
      message.includes('too many requests')
    );
  }

  extractRetryAfter(error) {
    // Try to extract the Retry-After header value
    if (error.response && error.response.headers) {
      const retryAfter = error.response.headers['retry-after'];
      if (retryAfter) {
        return parseInt(retryAfter, 10) * 1000; // Convert seconds to milliseconds
      }
    }
    return null;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new RateLimitHandler(process.env.FIRECRAWL_API_KEY, {
  requestsPerMinute: 30 // Conservative rate limit
});

const urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
];

for (const url of urls) {
  try {
    const result = await scraper.scrapeWithRateLimit(url, {
      formats: ['markdown']
    });
    console.log(`Scraped ${url}:`, result.markdown.substring(0, 100));
  } catch (error) {
    console.error(`Failed to scrape ${url}:`, error.message);
  }
}
```
## Production-Ready Error Handling Pattern
Here's a comprehensive, production-ready error handling implementation:
```python
from firecrawl import FirecrawlApp
import logging
import time
from typing import Optional, Dict, Any
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class ProductionFirecrawlClient:
    """Production-ready Firecrawl client with comprehensive error handling."""

    def __init__(self, api_key: str, max_retries: int = 3):
        self.app = FirecrawlApp(api_key=api_key)
        self.max_retries = max_retries
        self.stats = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'retried_requests': 0
        }

    def scrape(
        self,
        url: str,
        params: Optional[Dict[str, Any]] = None,
        retry: bool = True
    ) -> Optional[Dict[str, Any]]:
        """
        Scrape a URL with comprehensive error handling.

        Args:
            url: URL to scrape
            params: Scraping parameters
            retry: Whether to retry on failure

        Returns:
            Scraped data or None on failure
        """
        self.stats['total_requests'] += 1

        if params is None:
            params = {}

        # Set a default timeout if not specified
        params.setdefault('timeout', 30000)

        retries = self.max_retries if retry else 1

        for attempt in range(retries):
            try:
                logger.info(f"Scraping {url} (attempt {attempt + 1}/{retries})")
                result = self.app.scrape_url(url, params=params)

                # Validate result
                if not result or 'markdown' not in result:
                    logger.warning(f"Invalid result structure for {url}")
                    if attempt < retries - 1:
                        continue
                    return None

                self.stats['successful_requests'] += 1
                if attempt > 0:
                    self.stats['retried_requests'] += 1

                logger.info(f"Successfully scraped {url}")
                return result

            except Exception as e:
                error_type = self._classify_error(e)
                logger.error(f"Error scraping {url}: {error_type} - {str(e)}")

                # Handle specific error types
                if error_type == 'auth_error':
                    logger.critical("Authentication error - check API key")
                    self.stats['failed_requests'] += 1
                    return None

                elif error_type == 'not_found':
                    logger.warning(f"Page not found: {url}")
                    self.stats['failed_requests'] += 1
                    return None

                elif error_type == 'rate_limit':
                    if attempt < retries - 1:
                        wait_time = 60  # Wait 60 seconds for rate limit
                        logger.info(f"Rate limit hit. Waiting {wait_time}s...")
                        time.sleep(wait_time)
                        continue

                elif error_type == 'timeout':
                    if attempt < retries - 1:
                        # Increase timeout on retry
                        params['timeout'] = min(params['timeout'] * 1.5, 90000)
                        delay = 2 ** attempt
                        logger.info(f"Timeout. Retrying with {params['timeout']}ms timeout in {delay}s")
                        time.sleep(delay)
                        continue

                elif error_type == 'server_error':
                    if attempt < retries - 1:
                        delay = 2 ** attempt  # Exponential backoff
                        logger.info(f"Server error. Retrying in {delay}s...")
                        time.sleep(delay)
                        continue

                # Last attempt failed
                if attempt == retries - 1:
                    logger.error(f"Failed to scrape {url} after {retries} attempts")
                    self.stats['failed_requests'] += 1
                    return None

        return None

    def _classify_error(self, error: Exception) -> str:
        """Classify error type for appropriate handling."""
        error_str = str(error).lower()

        if '401' in error_str or 'unauthorized' in error_str:
            return 'auth_error'
        elif '404' in error_str or 'not found' in error_str:
            return 'not_found'
        elif '429' in error_str or 'rate limit' in error_str:
            return 'rate_limit'
        elif 'timeout' in error_str:
            return 'timeout'
        elif '500' in error_str or '502' in error_str or '503' in error_str:
            return 'server_error'
        elif 'network' in error_str or 'connection' in error_str:
            return 'network_error'
        else:
            return 'unknown_error'

    def get_stats(self) -> Dict[str, Any]:
        """Get scraping statistics."""
        success_rate = (
            self.stats['successful_requests'] / self.stats['total_requests'] * 100
            if self.stats['total_requests'] > 0
            else 0
        )
        return {
            **self.stats,
            'success_rate': f"{success_rate:.2f}%"
        }

# Usage example
client = ProductionFirecrawlClient(api_key='your_api_key', max_retries=3)

urls = [
    'https://example.com',
    'https://example.com/about',
    'https://example.com/contact'
]

results = []
for url in urls:
    result = client.scrape(
        url,
        params={
            'formats': ['markdown'],
            'onlyMainContent': True
        }
    )
    if result:
        results.append({
            'url': url,
            'markdown': result['markdown'],
            'timestamp': datetime.now().isoformat()
        })

# Print statistics
print("\nScraping Statistics:")
stats = client.get_stats()
for key, value in stats.items():
    print(f"{key}: {value}")
```
## Best Practices for Error Handling
### 1. Log Everything
Maintain comprehensive logs for debugging and monitoring:
```python
import logging

logging.basicConfig(
    filename='firecrawl_scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

try:
    result = app.scrape_url(url)
    logger.info(f"Successfully scraped {url}")
except Exception as e:
    logger.error(f"Failed to scrape {url}: {e}", exc_info=True)
```
### 2. Use the Circuit Breaker Pattern
Prevent cascading failures when a service is consistently failing:
```javascript
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
      console.log(`Circuit breaker opened. Will retry after ${this.timeout}ms`);
    }
  }
}
```
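The class above is JavaScript, but the pattern translates directly. Here is a minimal Python sketch of the same idea wrapped around `scrape_url`; the thresholds and state names are illustrative, not part of the Firecrawl SDK:

```python
import time
from firecrawl import FirecrawlApp

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive failures."""

    def __init__(self, threshold=5, reset_timeout=60.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        self.opened_at = 0.0

    def call(self, operation, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit breaker is OPEN - skipping call")
            self.state = 'HALF_OPEN'  # Allow one probe request through

        try:
            result = operation(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.threshold:
                self.state = 'OPEN'
                self.opened_at = time.time()
            raise
        else:
            self.failure_count = 0
            self.state = 'CLOSED'
            return result

# Usage: wrap Firecrawl calls so repeated failures stop hitting the API
app = FirecrawlApp(api_key='your_api_key')
breaker = CircuitBreaker(threshold=3, reset_timeout=30.0)

for url in ['https://example.com/a', 'https://example.com/b']:
    try:
        page = breaker.call(app.scrape_url, url, params={'formats': ['markdown']})
    except Exception as e:
        print(f"Skipped or failed {url}: {e}")
```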
### 3. Implement Graceful Degradation
Provide fallback behavior when scraping fails:
```python
def scrape_with_fallback(url, fallback_data=None):
    """Scrape with fallback to cached or default data."""
    try:
        result = app.scrape_url(url)
        return result
    except Exception as e:
        logger.warning(f"Scraping failed, using fallback: {e}")
        return fallback_data or {'markdown': 'Content unavailable', 'error': str(e)}
```
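A natural source of fallback data is the last successful scrape. Below is a minimal sketch of that idea using an on-disk JSON cache keyed by URL; the cache directory and helper names are illustrative, and `app` and `logger` are assumed to be configured as in the earlier snippets:

```python
import json
import hashlib
from pathlib import Path

CACHE_DIR = Path('.scrape_cache')  # illustrative location
CACHE_DIR.mkdir(exist_ok=True)

def _cache_path(url: str) -> Path:
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + '.json')

def scrape_with_cache_fallback(url):
    """Try a live scrape; on failure, fall back to the last cached result."""
    try:
        result = app.scrape_url(url)
        # Assumes the result is a JSON-serializable dict, as in the examples above
        _cache_path(url).write_text(json.dumps(result))
        return result
    except Exception as e:
        cached = _cache_path(url)
        if cached.exists():
            logger.warning(f"Live scrape failed for {url}, serving cached copy: {e}")
            return json.loads(cached.read_text())
        logger.error(f"Live scrape failed for {url} and no cache exists: {e}")
        return None
```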
### 4. Monitor and Alert
Set up monitoring for critical errors:
```python
def scrape_with_alerts(url, alert_callback=None):
    """Scrape with alerting on critical errors."""
    try:
        result = app.scrape_url(url)
        return result
    except Exception as e:
        if '401' in str(e) or '500' in str(e):
            # Critical error - send alert
            if alert_callback:
                alert_callback(f"Critical error scraping {url}: {e}")
        raise
```
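What the alert callback actually does depends on your stack. As one illustration, here is a minimal callback that posts to a chat or incident webhook; the URL is a placeholder, the `requests` library is assumed to be installed, and `logger` comes from the logging setup above:

```python
import requests

WEBHOOK_URL = 'https://hooks.example.com/scraper-alerts'  # placeholder endpoint

def send_alert(message: str) -> None:
    """Forward a critical scraper error to a chat/incident webhook."""
    try:
        requests.post(WEBHOOK_URL, json={'text': message}, timeout=10)
    except requests.RequestException as e:
        # Never let alerting failures mask the original scraping error
        logger.error(f"Failed to deliver alert: {e}")

# Usage
try:
    scrape_with_alerts('https://example.com', alert_callback=send_alert)
except Exception:
    pass  # The error was already alerted and logged above
```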
## Handling Browser Automation Errors
Because Firecrawl uses browser automation internally, you might encounter errors similar to those seen when handling browser events in Puppeteer, such as JavaScript execution errors or navigation failures:
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def scrape_dynamic_page_with_error_handling(url):
    """Handle errors specific to JavaScript-heavy pages."""
    try:
        result = app.scrape_url(
            url,
            params={
                'formats': ['markdown'],
                'waitFor': 5000,   # Wait for JavaScript
                'timeout': 45000   # Longer timeout for dynamic content
            }
        )
        return result
    except Exception as e:
        error_str = str(e).lower()

        if 'javascript' in error_str or 'script error' in error_str:
            print(f"JavaScript execution error on {url}: {e}")
            # Try again with a longer wait time
            try:
                return app.scrape_url(
                    url,
                    params={
                        'formats': ['markdown'],
                        'waitFor': 10000,  # Longer wait
                        'timeout': 60000
                    }
                )
            except Exception:
                print("Failed even with extended wait time")
                return None
        elif 'navigation' in error_str:
            print(f"Navigation error on {url}: {e}")
            return None
        else:
            raise
```
## Conclusion
Effective error handling is essential for building reliable web scraping applications with Firecrawl. By implementing proper retry logic, classifying errors appropriately, handling rate limits gracefully, and maintaining comprehensive logs, you can create robust scrapers that handle failures elegantly and recover automatically when possible.
Remember to:

- Classify errors and handle them appropriately
- Implement exponential backoff for retries
- Respect rate limits and use appropriate delays
- Log all errors for debugging and monitoring
- Use circuit breakers to prevent cascading failures
- Provide graceful degradation when scraping fails
With these patterns and best practices, your Firecrawl-based scraping applications will be production-ready and resilient to the various errors and exceptions that can occur during web scraping operations.