How Do I Handle Timeouts When Using Firecrawl?

Timeout handling is crucial when working with Firecrawl, especially when scraping slow-loading websites or dealing with dynamic content. Properly configured timeouts ensure your scraping operations don't hang indefinitely while giving pages enough time to load completely.

Understanding Firecrawl Timeout Parameters

Firecrawl provides several timeout-related parameters that control how long the scraper waits at each stage. The primary one is timeout, which sets the maximum time, in milliseconds, that Firecrawl waits for a page to load before the request fails.

Basic Timeout Configuration

Here's how to set a basic timeout in Firecrawl using Python:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

# Set a 30-second timeout
result = app.scrape_url(
    'https://example.com',
    params={
        'timeout': 30000  # 30 seconds in milliseconds
    }
)

print(result)

In JavaScript/Node.js:

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

// Set a 30-second timeout
const result = await app.scrapeUrl('https://example.com', {
  timeout: 30000  // 30 seconds in milliseconds
});

console.log(result);

Timeout Configuration Options

Firecrawl supports multiple timeout-related settings depending on the operation you're performing:

1. Page Load Timeout

This is the primary timeout that controls how long Firecrawl waits for a page to load completely:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

result = app.scrape_url(
    'https://slow-loading-site.com',
    params={
        'timeout': 60000,  # 60 seconds for slow sites
        'waitFor': 5000    # Additional wait after page load
    }
)

2. Wait For Specific Elements

When scraping dynamic content, you might need to wait for specific elements to appear. Similar to handling AJAX requests using Puppeteer, you can give the page extra time with Firecrawl's waitFor parameter, which takes a delay in milliseconds (not a CSS selector):

const result = await app.scrapeUrl('https://dynamic-site.com', {
  timeout: 30000,
  waitFor: 5000  // Wait 5 seconds for late-loading elements to render
});
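
If a fixed delay is not reliable enough on its own, a client-side option is to re-scrape with progressively longer waits until a known marker appears in the returned content. The sketch below is not part of the Firecrawl API: the scrape_until_marker helper, the marker string, and the assumption that the response exposes the page text under a 'markdown' key are all illustrative.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

def scrape_until_marker(url, marker, waits=(2000, 5000, 10000)):
    """Re-scrape with longer waitFor values until `marker` shows up in the content.

    Assumes the response is a dict with a 'markdown' key, as in the other
    examples in this guide; adjust for your SDK version.
    """
    last_result = None
    for wait in waits:
        last_result = app.scrape_url(
            url,
            params={'timeout': 30000, 'waitFor': wait}
        )
        content = (last_result or {}).get('markdown', '')
        if marker in content:
            return last_result  # The dynamic content rendered in time
        print(f"Marker not found after waitFor={wait}ms, trying a longer wait...")
    return last_result  # Return the last attempt even if the marker never appeared

# Usage: wait for a table rendered by client-side JavaScript (hypothetical marker)
result = scrape_until_marker('https://dynamic-site.com', marker='Quarterly results')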

3. Crawl Job Timeouts

When using Firecrawl's crawling feature to scrape multiple pages, you can set a per-page timeout and cap the duration of the whole job (the crawlTimeout value below; if your SDK version doesn't expose it, enforce the overall deadline yourself while polling, as shown later in this guide):

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='your_api_key')

crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'timeout': 45000,      # Per-page timeout
        'crawlTimeout': 300000 # Total crawl timeout (5 minutes)
    }
)

Implementing Robust Timeout Error Handling

Timeouts can occur for various reasons, and proper error handling ensures your application handles them gracefully:

Python Error Handling

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key')

def scrape_with_retry(url, max_retries=3):
    """Scrape with automatic retry on timeout"""

    for attempt in range(max_retries):
        try:
            result = app.scrape_url(
                url,
                params={
                    'timeout': 30000,
                    'waitFor': 3000
                }
            )
            return result

        except TimeoutError as e:
            # Depending on the SDK version, a timeout may surface as a generic
            # exception whose message mentions the timeout; adjust this check as needed.
            print(f"Timeout on attempt {attempt + 1}: {e}")
            if attempt < max_retries - 1:
                # Exponential backoff
                wait_time = 2 ** attempt
                print(f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print("Max retries reached. Scraping failed.")
                raise

        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

# Usage
try:
    data = scrape_with_retry('https://example.com')
    print("Scraping successful:", data)
except Exception as e:
    print(f"Failed to scrape: {e}")

JavaScript/Node.js Error Handling

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'your_api_key' });

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await app.scrapeUrl(url, {
        timeout: 30000,
        waitFor: 3000
      });
      return result;

    } catch (error) {
      if (error.message.includes('timeout')) {
        console.log(`Timeout on attempt ${attempt + 1}: ${error.message}`);

        if (attempt < maxRetries - 1) {
          const waitTime = Math.pow(2, attempt) * 1000;
          console.log(`Retrying in ${waitTime/1000} seconds...`);
          await new Promise(resolve => setTimeout(resolve, waitTime));
        } else {
          console.log('Max retries reached. Scraping failed.');
          throw error;
        }
      } else {
        console.log(`Unexpected error: ${error.message}`);
        throw error;
      }
    }
  }
}

// Usage
try {
  const data = await scrapeWithRetry('https://example.com');
  console.log('Scraping successful:', data);
} catch (error) {
  console.error('Failed to scrape:', error);
}

Best Practices for Timeout Configuration

1. Choose Appropriate Timeout Values

Different types of websites require different timeout values:

# Fast static sites
static_site_params = {
    'timeout': 15000  # 15 seconds
}

# Dynamic sites with JavaScript
dynamic_site_params = {
    'timeout': 30000,  # 30 seconds
    'waitFor': 5000    # Wait 5 seconds after load
}

# Very slow or heavy sites
heavy_site_params = {
    'timeout': 60000,  # 60 seconds
    'waitFor': 10000   # Wait 10 seconds after load
}
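
As a usage sketch, here is one way to plug these profiles into scrape_url. The choose_params helper and the site categories are illustrative, not part of Firecrawl, and the code assumes the app instance and the parameter dictionaries defined above:

def choose_params(site_type):
    """Pick a timeout profile for a rough site category (hypothetical helper)."""
    profiles = {
        'static': static_site_params,
        'dynamic': dynamic_site_params,
        'heavy': heavy_site_params,
    }
    return profiles.get(site_type, dynamic_site_params)  # Default to the middle profile

result = app.scrape_url(
    'https://example.com',
    params=choose_params('dynamic')
)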

2. Implement Progressive Timeout Strategy

Start with shorter timeouts and increase progressively on retry:

def progressive_timeout_scrape(url):
    """Retry with progressively longer timeouts; assumes `app` from the earlier examples."""
    timeouts = [20000, 40000, 60000]  # Progressive timeouts

    for timeout in timeouts:
        try:
            result = app.scrape_url(
                url,
                params={'timeout': timeout}
            )
            return result
        except TimeoutError:
            if timeout == timeouts[-1]:
                raise
            print(f"Timeout at {timeout}ms, trying longer timeout...")
            continue

3. Use Timeout Monitoring and Logging

Track timeout occurrences to optimize your configuration:

// Assumes `app` is the FirecrawlApp instance created in the earlier examples
class TimeoutMonitor {
  constructor() {
    this.timeoutStats = {
      total: 0,
      timeouts: 0,
      avgResponseTime: 0
    };
  }

  async scrapeWithMonitoring(url, timeout = 30000) {
    const startTime = Date.now();

    try {
      const result = await app.scrapeUrl(url, { timeout });

      const responseTime = Date.now() - startTime;
      this.updateStats(responseTime, false);

      console.log(`Success: ${url} (${responseTime}ms)`);
      return result;

    } catch (error) {
      const responseTime = Date.now() - startTime;

      if (error.message.includes('timeout')) {
        this.updateStats(responseTime, true);
        console.log(`Timeout: ${url} (${responseTime}ms)`);
      }

      throw error;
    }
  }

  updateStats(responseTime, isTimeout) {
    this.timeoutStats.total++;
    if (isTimeout) this.timeoutStats.timeouts++;

    this.timeoutStats.avgResponseTime =
      (this.timeoutStats.avgResponseTime * (this.timeoutStats.total - 1) + responseTime) /
      this.timeoutStats.total;
  }

  getStats() {
    return {
      ...this.timeoutStats,
      timeoutRate: (this.timeoutStats.timeouts / this.timeoutStats.total) * 100
    };
  }
}

// Usage
const monitor = new TimeoutMonitor();
await monitor.scrapeWithMonitoring('https://example.com');
console.log('Stats:', monitor.getStats());

Handling Crawl Timeouts

When crawling multiple pages, implementing proper timeout handling becomes even more critical, much like handling timeouts in Puppeteer:

from firecrawl import FirecrawlApp
import time

app = FirecrawlApp(api_key='your_api_key')

def crawl_with_timeout_handling(base_url, max_pages=100):
    """Crawl with comprehensive timeout handling"""

    try:
        # Start the crawl
        crawl_id = app.crawl_url(
            base_url,
            params={
                'limit': max_pages,
                'timeout': 30000,       # Per-page timeout
                'crawlTimeout': 600000, # 10-minute total timeout
                'waitFor': 3000
            },
            wait_until_done=False  # Don't wait, poll instead
        )

        print(f"Crawl started with ID: {crawl_id}")

        # Poll for results with timeout
        max_poll_time = 700  # 700 seconds (slightly more than crawlTimeout)
        poll_interval = 5    # Check every 5 seconds
        elapsed_time = 0

        while elapsed_time < max_poll_time:
            status = app.check_crawl_status(crawl_id)

            if status['status'] == 'completed':
                print(f"Crawl completed successfully after {elapsed_time}s")
                return status['data']

            elif status['status'] == 'failed':
                print(f"Crawl failed: {status.get('error', 'Unknown error')}")
                return None

            print(f"Crawl in progress... ({elapsed_time}s elapsed)")
            time.sleep(poll_interval)
            elapsed_time += poll_interval

        print("Crawl polling timeout reached")
        return None

    except Exception as e:
        print(f"Crawl error: {e}")
        return None

# Usage
results = crawl_with_timeout_handling('https://example.com', max_pages=50)
if results:
    print(f"Successfully crawled {len(results)} pages")

Advanced Timeout Strategies

Adaptive Timeout Adjustment

Automatically adjust timeouts based on website performance:

import FirecrawlApp from '@mendable/firecrawl-js';

class AdaptiveTimeoutScraper {
  constructor(apiKey) {
    this.app = new FirecrawlApp({ apiKey });
    this.baseTimeout = 30000;
    this.performanceHistory = [];
  }

  async scrape(url) {
    const adaptiveTimeout = this.calculateTimeout();
    const startTime = Date.now();

    try {
      const result = await this.app.scrapeUrl(url, {
        timeout: adaptiveTimeout
      });

      const responseTime = Date.now() - startTime;
      this.recordPerformance(responseTime, true);

      return result;

    } catch (error) {
      const responseTime = Date.now() - startTime;
      this.recordPerformance(responseTime, false);
      throw error;
    }
  }

  calculateTimeout() {
    if (this.performanceHistory.length === 0) {
      return this.baseTimeout;
    }

    // Calculate 95th percentile of successful response times
    const successfulTimes = this.performanceHistory
      .filter(h => h.success)
      .map(h => h.time)
      .sort((a, b) => a - b);

    if (successfulTimes.length === 0) {
      return this.baseTimeout * 1.5; // Increase if no successes
    }

    const p95Index = Math.floor(successfulTimes.length * 0.95);
    const p95Time = successfulTimes[p95Index];

    // Set timeout to 1.5x the 95th percentile
    return Math.max(this.baseTimeout, Math.min(p95Time * 1.5, 120000));
  }

  recordPerformance(time, success) {
    this.performanceHistory.push({ time, success });

    // Keep only last 100 records
    if (this.performanceHistory.length > 100) {
      this.performanceHistory.shift();
    }
  }
}

// Usage
const scraper = new AdaptiveTimeoutScraper('your_api_key');
const result = await scraper.scrape('https://example.com');

Common Timeout Issues and Solutions

Issue 1: Timeouts on JavaScript-Heavy Sites

Solution: Increase the waitFor value to give JavaScript time to execute:

result = app.scrape_url(
    'https://spa-website.com',
    params={
        'timeout': 45000,
        'waitFor': 10000  # Wait 10 seconds for JS to execute
    }
)

Issue 2: Intermittent Timeouts

Solution: Implement retry logic with exponential backoff, as shown in the error handling examples above. Adding random jitter to the backoff also helps when many requests retry at the same moment; a minimal sketch follows.
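
This sketch layers jitter on top of the earlier retry pattern. The jitter is an addition to the examples above, and the helper assumes the same app instance and parameters used throughout this guide:

import random
import time

def scrape_with_jittered_backoff(url, max_retries=3):
    """Retry on timeout with exponential backoff plus random jitter."""
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url, params={'timeout': 30000, 'waitFor': 3000})
        except Exception as e:
            # Treat anything mentioning a timeout as retryable; re-raise everything else
            if 'timeout' not in str(e).lower() or attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.uniform(0, 1)  # ~1s, ~2s, ~4s plus jitter
            print(f"Timeout on attempt {attempt + 1}, retrying in {delay:.1f}s...")
            time.sleep(delay)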

Issue 3: Crawl Jobs Timing Out

Solution: Reduce the number of pages per crawl (the limit) or increase the total crawl timeout:

crawl_result = app.crawl_url(
    'https://example.com',
    params={
        'limit': 50,            # Reduce page limit
        'timeout': 40000,       # Increase per-page timeout
        'crawlTimeout': 900000  # 15-minute total timeout
    }
)
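
If a single large job still hits the deadline, another option is to split the crawl into several smaller jobs, one per site section, and merge the results. This sketch reuses the crawl_with_timeout_handling helper from the crawl section above; the section URLs are hypothetical:

sections = [
    'https://example.com/blog',
    'https://example.com/docs',
    'https://example.com/products',
]

all_pages = []
for section_url in sections:
    # Each section gets its own, smaller crawl job and its own deadline
    pages = crawl_with_timeout_handling(section_url, max_pages=25)
    if pages:
        all_pages.extend(pages)

print(f"Collected {len(all_pages)} pages across {len(sections)} sections")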

Conclusion

Handling timeouts effectively in Firecrawl requires a combination of proper configuration, robust error handling, and adaptive strategies. By implementing the techniques outlined in this guide—including retry logic, progressive timeouts, and monitoring—you can build reliable web scraping applications that handle slow-loading websites gracefully.

Remember to always monitor your timeout statistics to optimize your configuration over time, and adjust timeout values based on the specific characteristics of the websites you're scraping. With proper timeout management, your Firecrawl-based scraping operations will be more resilient and efficient.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
