How do I Handle Timeouts When Using Firecrawl?
Timeout handling is crucial when working with Firecrawl, especially when scraping slow-loading websites or dealing with dynamic content. Properly configured timeouts ensure your scraping operations don't hang indefinitely while giving pages enough time to load completely.
Understanding Firecrawl Timeout Parameters
Firecrawl provides several timeout-related parameters that control how long the scraper waits for various operations. The main parameter is timeout, which specifies the maximum time, in milliseconds, to wait for a page to load.
Basic Timeout Configuration
Here's how to set a basic timeout in Firecrawl using Python:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Set a 30-second timeout
result = app.scrape_url(
    'https://example.com',
    params={
        'timeout': 30000  # 30 seconds in milliseconds
    }
)

print(result)
In JavaScript/Node.js:
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

// Set a 30-second timeout
const result = await app.scrapeUrl('https://example.com', {
  timeout: 30000 // 30 seconds in milliseconds
});

console.log(result);
Timeout Configuration Options
Firecrawl supports multiple timeout-related settings depending on the operation you're performing:
1. Page Load Timeout
This is the primary timeout that controls how long Firecrawl waits for a page to load completely:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

result = app.scrape_url(
    'https://slow-loading-site.com',
    params={
        'timeout': 60000,  # 60 seconds for slow sites
        'waitFor': 5000    # Additional wait after page load
    }
)
2. Wait For Specific Elements
When scraping dynamic content, you might need to wait for specific elements to appear. Similar to handling AJAX requests using Puppeteer, Firecrawl allows you to specify selectors to wait for:
const result = await app.scrapeUrl('https://dynamic-site.com', {
  timeout: 30000,
  waitFor: 'selector:.data-loaded', // Wait for specific element
});
3. Crawl Job Timeouts
When using Firecrawl's crawling feature to scrape multiple pages, you can set timeouts for the entire crawl job:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'timeout': 45000,       # Per-page timeout
        'crawlTimeout': 300000  # Total crawl timeout (5 minutes)
    }
)
Implementing Robust Timeout Error Handling
Timeouts can occur for various reasons, and proper error handling ensures your application handles them gracefully:
Python Error Handling
from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key')

def scrape_with_retry(url, max_retries=3):
    """Scrape with automatic retry on timeout"""
    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'timeout': 30000,
                    'waitFor': 3000
                }
            )
            return result
        except TimeoutError as e:
            print(f"Timeout on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                # Exponential backoff
                wait_time = 2 ** attempt
                print(f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print("Max retries reached. Scraping failed.")
                raise
        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    print("Scraping successful:", data)
except Exception as e:
    print(f"Failed to scrape: {e}")
JavaScript/Node.js Error Handling
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await app.scrapeUrl(url, {
        timeout: 30000,
        waitFor: 3000
      });
      return result;
    } catch (error) {
      if (error.message.includes('timeout')) {
        console.log(`Timeout on attempt ${attempt + 1}: ${error.message}`);
        if (attempt < maxRetries - 1) {
          const waitTime = Math.pow(2, attempt) * 1000;
          console.log(`Retrying in ${waitTime / 1000} seconds...`);
          await new Promise(resolve => setTimeout(resolve, waitTime));
        } else {
          console.log('Max retries reached. Scraping failed.');
          throw error;
        }
      } else {
        console.log(`Unexpected error: ${error.message}`);
        throw error;
      }
    }
  }
}

// Usage
try {
  const data = await scrapeWithRetry('https://example.com');
  console.log('Scraping successful:', data);
} catch (error) {
  console.error('Failed to scrape:', error);
}
Best Practices for Timeout Configuration
1. Choose Appropriate Timeout Values
Different types of websites require different timeout values:
# Fast static sites
static_site_params = {
    'timeout': 15000  # 15 seconds
}

# Dynamic sites with JavaScript
dynamic_site_params = {
    'timeout': 30000,  # 30 seconds
    'waitFor': 5000    # Wait 5 seconds after load
}

# Very slow or heavy sites
heavy_site_params = {
    'timeout': 60000,  # 60 seconds
    'waitFor': 10000   # Wait 10 seconds after load
}
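If you classify sites ahead of time, a small helper can apply the right preset for each URL. The sketch below is illustrative only; scrape_with_profile is a hypothetical name, and it assumes the app instance from the earlier examples:

def scrape_with_profile(url, profile_params):
    """Scrape a URL using one of the timeout presets defined above."""
    return app.scrape_url(url, params=profile_params)

# Example: a JavaScript-heavy page gets the dynamic-site preset
data = scrape_with_profile('https://dynamic-site.com', dynamic_site_params)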
2. Implement Progressive Timeout Strategy
Start with shorter timeouts and increase progressively on retry:
def progressive_timeout_scrape(url):
    timeouts = [20000, 40000, 60000]  # Progressive timeouts
    for timeout in timeouts:
        try:
            result = app.scrape_url(
                url,
                params={'timeout': timeout}
            )
            return result
        except TimeoutError:
            if timeout == timeouts[-1]:
                raise
            print(f"Timeout at {timeout}ms, trying longer timeout...")
            continue
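Calling the helper is then a one-liner; this usage sketch assumes app is the FirecrawlApp instance created earlier:

# Tries 20s, then 40s, then 60s before giving up
data = progressive_timeout_scrape('https://slow-loading-site.com')
print(data)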
3. Use Timeout Monitoring and Logging
Track timeout occurrences to optimize your configuration:
class TimeoutMonitor {
  constructor() {
    this.timeoutStats = {
      total: 0,
      timeouts: 0,
      avgResponseTime: 0
    };
  }

  async scrapeWithMonitoring(url, timeout = 30000) {
    const startTime = Date.now();
    try {
      const result = await app.scrapeUrl(url, { timeout });
      const responseTime = Date.now() - startTime;
      this.updateStats(responseTime, false);
      console.log(`Success: ${url} (${responseTime}ms)`);
      return result;
    } catch (error) {
      const responseTime = Date.now() - startTime;
      if (error.message.includes('timeout')) {
        this.updateStats(responseTime, true);
        console.log(`Timeout: ${url} (${responseTime}ms)`);
      }
      throw error;
    }
  }

  updateStats(responseTime, isTimeout) {
    this.timeoutStats.total++;
    if (isTimeout) this.timeoutStats.timeouts++;
    this.timeoutStats.avgResponseTime =
      (this.timeoutStats.avgResponseTime * (this.timeoutStats.total - 1) + responseTime) /
      this.timeoutStats.total;
  }

  getStats() {
    return {
      ...this.timeoutStats,
      timeoutRate: (this.timeoutStats.timeouts / this.timeoutStats.total) * 100
    };
  }
}

// Usage
const monitor = new TimeoutMonitor();
await monitor.scrapeWithMonitoring('https://example.com');
console.log('Stats:', monitor.getStats());
Handling Crawl Timeouts
When crawling multiple pages, implementing proper timeout handling becomes even more critical, much like handling timeouts in Puppeteer:
from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key')

def crawl_with_timeout_handling(base_url, max_pages=100):
    """Crawl with comprehensive timeout handling"""
    try:
        # Start the crawl
        crawl_id = app.crawl_url(
            base_url,
            params={
                'limit': max_pages,
                'timeout': 30000,       # Per-page timeout
                'crawlTimeout': 600000, # 10-minute total timeout
                'waitFor': 3000
            },
            wait_until_done=False  # Don't wait, poll instead
        )
        print(f"Crawl started with ID: {crawl_id}")

        # Poll for results with timeout
        max_poll_time = 700  # 700 seconds (slightly more than crawlTimeout)
        poll_interval = 5    # Check every 5 seconds
        elapsed_time = 0

        while elapsed_time < max_poll_time:
            status = app.check_crawl_status(crawl_id)

            if status['status'] == 'completed':
                print(f"Crawl completed successfully after {elapsed_time}s")
                return status['data']
            elif status['status'] == 'failed':
                print(f"Crawl failed: {status.get('error', 'Unknown error')}")
                return None

            print(f"Crawl in progress... ({elapsed_time}s elapsed)")
            time.sleep(poll_interval)
            elapsed_time += poll_interval

        print("Crawl polling timeout reached")
        return None

    except Exception as e:
        print(f"Crawl error: {e}")
        return None

# Usage
results = crawl_with_timeout_handling('https://example.com', max_pages=50)
if results:
    print(f"Successfully crawled {len(results)} pages")
Advanced Timeout Strategies
Adaptive Timeout Adjustment
Automatically adjust timeouts based on website performance:
import FirecrawlApp from '@mendable/firecrawl-js';

class AdaptiveTimeoutScraper {
  constructor(apiKey) {
    this.app = new FirecrawlApp({ apiKey });
    this.baseTimeout = 30000;
    this.performanceHistory = [];
  }

  async scrape(url) {
    const adaptiveTimeout = this.calculateTimeout();
    const startTime = Date.now();
    try {
      const result = await this.app.scrapeUrl(url, {
        timeout: adaptiveTimeout
      });
      const responseTime = Date.now() - startTime;
      this.recordPerformance(responseTime, true);
      return result;
    } catch (error) {
      const responseTime = Date.now() - startTime;
      this.recordPerformance(responseTime, false);
      throw error;
    }
  }

  calculateTimeout() {
    if (this.performanceHistory.length === 0) {
      return this.baseTimeout;
    }

    // Calculate 95th percentile of successful response times
    const successfulTimes = this.performanceHistory
      .filter(h => h.success)
      .map(h => h.time)
      .sort((a, b) => a - b);

    if (successfulTimes.length === 0) {
      return this.baseTimeout * 1.5; // Increase if no successes
    }

    const p95Index = Math.floor(successfulTimes.length * 0.95);
    const p95Time = successfulTimes[p95Index];

    // Set timeout to 1.5x the 95th percentile, capped at 120 seconds
    return Math.max(this.baseTimeout, Math.min(p95Time * 1.5, 120000));
  }

  recordPerformance(time, success) {
    this.performanceHistory.push({ time, success });
    // Keep only the last 100 records
    if (this.performanceHistory.length > 100) {
      this.performanceHistory.shift();
    }
  }
}

// Usage
const scraper = new AdaptiveTimeoutScraper('your_api_key');
const result = await scraper.scrape('https://example.com');
Common Timeout Issues and Solutions
Issue 1: Timeouts on JavaScript-Heavy Sites
Solution: Increase the waitFor parameter to give JavaScript time to execute:
result = app.scrape_url(
    'https://spa-website.com',
    params={
        'timeout': 45000,
        'waitFor': 10000  # Wait 10 seconds for JS to execute
    }
)
Issue 2: Intermittent Timeouts
Solution: Implement retry logic with exponential backoff as shown in the error handling examples above.
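If several parts of your code need the same behavior, the backoff logic can also be packaged as a reusable decorator. This is only a sketch under the same assumptions as the earlier example (a TimeoutError is raised on timeout, and app is the FirecrawlApp instance defined above):

import time
from functools import wraps

def retry_on_timeout(max_retries=3, base_delay=1):
    """Retry the wrapped call with exponential backoff when it times out."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except TimeoutError:
                    if attempt == max_retries - 1:
                        raise  # Give up after the final attempt
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

@retry_on_timeout(max_retries=3)
def scrape_page(url):
    return app.scrape_url(url, params={'timeout': 30000})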
Issue 3: Crawl Jobs Timing Out
Solution: Reduce the number of concurrent pages or increase the total crawl timeout:
crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 50,            # Reduce page limit
        'timeout': 40000,       # Increase per-page timeout
        'crawlTimeout': 900000  # 15-minute total timeout
    }
)
Conclusion
Handling timeouts effectively in Firecrawl requires a combination of proper configuration, robust error handling, and adaptive strategies. By implementing the techniques outlined in this guide—including retry logic, progressive timeouts, and monitoring—you can build reliable web scraping applications that handle slow-loading websites gracefully.
Remember to always monitor your timeout statistics to optimize your configuration over time, and adjust timeout values based on the specific characteristics of the websites you're scraping. With proper timeout management, your Firecrawl-based scraping operations will be more resilient and efficient.