How does Crawlee handle request retries and error recovery?

Crawlee provides robust built-in mechanisms for handling request retries and error recovery, making it an excellent choice for production web scraping. The framework automatically retries failed requests, manages error states, and provides developers with fine-grained control over retry behavior. This guide explores Crawlee's retry system, configuration options, and best practices for building resilient web scrapers.

Default Retry Behavior

Crawlee automatically retries failed requests without requiring explicit configuration. By default, each request that fails will be retried up to 3 times before being marked as permanently failed. This automatic retry mechanism handles common transient errors such as network timeouts, connection issues, and temporary server errors.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Crawlee automatically retries failed requests
        const title = $('title').text();
        log.info(`Title: ${title}`);
    },
    // Default maxRequestRetries is 3
});

await crawler.run(['https://example.com']);

When a request fails, Crawlee:

  1. Catches the error automatically
  2. Increments the retry counter for that request
  3. Re-queues the request for another attempt
  4. Waits before retrying (with exponential backoff)
  5. Marks the request as failed after exceeding max retries
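
To observe this lifecycle, you can combine the errorHandler callback (invoked after each failed attempt, before the retry) with failedRequestHandler (invoked only once retries are exhausted). The sketch below is illustrative: the simulated failure and the example URL are placeholders, not Crawlee defaults.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Simulated transient failure so the retry cycle is visible (illustrative only)
        if (request.retryCount < 2) {
            throw new Error('Simulated transient error');
        }
        log.info(`Succeeded after ${request.retryCount} retries: ${$('title').text()}`);
    },

    // Runs after each failed attempt, before the request is re-queued
    async errorHandler({ request, log }, error) {
        log.warning(`Attempt failed for ${request.url}: ${error.message}`);
    },

    // Runs only once all retries are exhausted
    async failedRequestHandler({ request, log }) {
        log.error(`Giving up on ${request.url} after ${request.retryCount} retries`);
    },
});

await crawler.run(['https://example.com']);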

Configuring Maximum Retries

You can customize the number of retry attempts using the maxRequestRetries option in the crawler configuration:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5, // Retry up to 5 times
    async requestHandler({ request, page, log }) {
        const content = await page.content();
        log.info(`Current retry count: ${request.retryCount}`);
    },
});

await crawler.run(['https://example.com']);

The equivalent configuration in Crawlee for Python:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_request_retries=5,  # Retry up to 5 times
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        content = await context.page.content()
        context.log.info(f'Current retry count: {context.request.retry_count}')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())

Custom Error Handling

While Crawlee handles retries automatically, you can implement custom error handling logic using the failedRequestHandler callback. This is useful for logging failures, implementing custom recovery logic, or storing failed URLs for later analysis:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,

    async requestHandler({ request, $, log }) {
        const title = $('title').text();
        log.info(`Scraped: ${title}`);
    },

    async failedRequestHandler({ request, log }) {
        // Called after all retries are exhausted
        log.error(`Request ${request.url} failed after ${request.retryCount} retries`);

        // Store failed URL for manual review
        await saveFailedUrl(request.url, request.errorMessages);
    },
});

await crawler.run(['https://example.com']);

The Python version uses the same option and handler names in snake_case:

from crawlee.playwright_crawler import PlaywrightCrawler

async def failed_request_handler(context):
    """Handle requests that failed after all retries."""
    context.log.error(
        f'Request {context.request.url} failed after {context.request.retry_count} retries'
    )
    # Store failed URL for manual review
    await save_failed_url(context.request.url, context.request.error_messages)

crawler = PlaywrightCrawler(
    max_request_retries=3,
    failed_request_handler=failed_request_handler,
)

Error Messages and Tracking

Crawlee tracks all errors that occur during request processing. You can access error information through the request object:

async failedRequestHandler({ request, log }) {
    log.error(`Failed URL: ${request.url}`);
    log.error(`Retry count: ${request.retryCount}`);
    log.error(`Error messages:`, request.errorMessages);

    // errorMessages is an array of all error messages from retry attempts
    request.errorMessages.forEach((error, index) => {
        log.error(`Attempt ${index + 1}: ${error}`);
    });
}

Selective Retry Logic

You can implement conditional retry logic by throwing errors selectively based on specific conditions. For certain errors, you might want to skip retries entirely:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,

    async requestHandler({ request, $, response, log }) {
        // The HTTP response object exposes the status code
        const { statusCode } = response;

        // Don't retry 404 errors
        if (statusCode === 404) {
            log.warning(`Page not found: ${request.url}`);
            request.noRetry = true;
            throw new Error('Page not found');
        }

        // Don't retry authentication errors
        if ($('title').text().includes('Login Required')) {
            request.noRetry = true;
            throw new Error('Authentication required');
        }

        // Process the page normally
        const title = $('title').text();
        log.info(`Title: ${title}`);
    },
});

Session Management and Error Recovery

Crawlee's session management system works hand in hand with error recovery. Sessions maintain cookies, user agents, and proxy assignments across requests, and when a session accumulates too many errors it is retired and replaced automatically, which helps scrapers recover from blocks, rate limits, and failed authentication flows in Playwright, Puppeteer, and similar browser automation scenarios:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxErrorScore: 3, // Retire session after 3 errors
            errorScoreDecrement: 0.5,
            maxUsageCount: 50,
        },
    },

    async requestHandler({ request, page, session, log }) {
        // Session is automatically managed and rotated on errors
        log.info(`Using session: ${session.id}`);
        const content = await page.content();
    },
});

Request Queue Persistence

Crawlee's request queue is persistent by default, which means failed requests and retry state survive process restarts. This is critical for long-running scraping jobs:

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named queue persists to disk
const requestQueue = await RequestQueue.open('my-scraper');

// Seed the queue; after a crash and restart, pending requests and their
// retry counts are loaded back from the same named queue
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    requestQueue,

    async requestHandler({ request, $, log }) {
        // If the process crashes and restarts, Crawlee will:
        // 1. Resume from the saved queue state
        // 2. Maintain retry counts for each request
        // 3. Continue retrying failed requests
        const title = $('title').text();
        log.info(`Title: ${title}`);
    },
});

await crawler.run();

Handling Specific Error Types

Different types of errors may require different handling strategies. Here's how to implement error-type-specific logic:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 4,

    async requestHandler({ request, $, log, crawler }) {
        try {
            const title = $('title').text();

            if (!title) {
                throw new Error('EMPTY_TITLE');
            }

            log.info(`Title: ${title}`);

        } catch (error) {
            // Handle timeout errors differently
            if (error.message.includes('timeout')) {
                log.warning(`Timeout on ${request.url}, will retry`);
                throw error; // Let Crawlee retry
            }

            // Don't retry for parsing errors
            if (error.message.includes('EMPTY_TITLE')) {
                request.noRetry = true;
                log.error(`Invalid page structure: ${request.url}`);
                throw error;
            }

            // Retry for all other errors
            throw error;
        }
    },
});

Exponential Backoff

Crawlee implements exponential backoff for retries, which helps avoid overwhelming servers and reduces the likelihood of being blocked. The delay between retries increases exponentially:

  • 1st retry: ~1 second
  • 2nd retry: ~2 seconds
  • 3rd retry: ~4 seconds
  • 4th retry: ~8 seconds

This behavior is automatic and doesn't require configuration, but you can layer your own delay logic on top if needed, as sketched below.
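
One possible approach, sketched here under the assumption that holding the request's concurrency slot during the wait is acceptable, is to add a delay inside the errorHandler callback before Crawlee re-queues the request. The doubling formula and the 30-second cap are arbitrary choices to tune for your targets, not Crawlee defaults.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 4,

    async requestHandler({ $, log }) {
        log.info(`Title: ${$('title').text()}`);
    },

    // Custom backoff sketch: wait before the next attempt is scheduled.
    // The formula and ceiling below are assumptions, not built-in behavior.
    async errorHandler({ request, log }) {
        const delayMs = Math.min(1000 * 2 ** request.retryCount, 30_000);
        log.warning(`Waiting ${delayMs} ms before retrying ${request.url}`);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    },
});

await crawler.run(['https://example.com']);

Keep in mind that waiting inside the handler typically occupies one of the crawler's concurrency slots for the duration of the delay, so this trades throughput for politeness.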

Monitoring Retry Statistics

Crawlee collects statistics about request successes, failures, and retries. The crawler.run() call resolves with a final statistics object that you can inspect after the crawl:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,

    async requestHandler({ request, $, log }) {
        const title = $('title').text();
        log.info(`Title: ${title}`);
    },
});

// run() resolves with the final statistics for the crawl
const stats = await crawler.run(['https://example.com']);

console.log('Requests finished:', stats.requestsFinished);
console.log('Requests failed:', stats.requestsFailed);
console.log('Retry histogram:', stats.retryHistogram);

Best Practices for Error Recovery

  1. Set appropriate retry limits: Balance between persistence and efficiency. For most use cases, 3-5 retries is optimal.

  2. Implement failedRequestHandler: Always log and track failed requests for debugging and monitoring.

  3. Use selective retries: Don't retry errors that won't resolve (404s, authentication errors, parsing issues).

  4. Monitor error patterns: Track error types and frequencies to identify systemic issues.

  5. Implement circuit breakers: For scraping multiple domains, consider stopping crawls temporarily if error rates spike (a sketch follows this list).

  6. Handle timeouts appropriately: When handling timeouts in Puppeteer or Playwright, ensure timeout values are realistic for your target sites.

  7. Use sessions wisely: Enable session rotation to recover from IP-based blocks or rate limits.
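
As a sketch of the circuit-breaker idea from point 5: count permanent failures per hostname and stop spending retries on hosts that cross a threshold. The Map, the threshold, and the skip-instead-of-pause behavior are illustrative assumptions, not Crawlee built-ins.

import { CheerioCrawler } from 'crawlee';

// Illustrative per-host failure tracking; not part of Crawlee itself
const failuresPerHost = new Map();
const FAILURE_THRESHOLD = 10; // arbitrary cut-off per hostname

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,

    async requestHandler({ request, $, log }) {
        const host = new URL(request.url).hostname;

        // Breaker is open: stop spending retries on a host that keeps failing.
        // (A pre-navigation hook could skip even earlier, before the fetch.)
        if ((failuresPerHost.get(host) ?? 0) >= FAILURE_THRESHOLD) {
            request.noRetry = true;
            throw new Error(`Circuit breaker open for ${host}`);
        }

        log.info(`Title: ${$('title').text()}`);
    },

    async failedRequestHandler({ request }) {
        const host = new URL(request.url).hostname;
        failuresPerHost.set(host, (failuresPerHost.get(host) ?? 0) + 1);
    },
});

await crawler.run(['https://example.com', 'https://example.org']);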

Combining Retries with Error Handling

Here's a complete example showing robust error handling with retries:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 4,
    useSessionPool: true,

    async requestHandler({ request, page, log, session }) {
        try {
            // Wait for content to load
            await page.waitForSelector('h1', { timeout: 10000 });

            const title = await page.title();
            const heading = await page.$eval('h1', el => el.textContent);

            await Dataset.pushData({
                url: request.url,
                title,
                heading,
                timestamp: new Date().toISOString(),
            });

            log.info(`✓ Scraped: ${title}`);

        } catch (error) {
            // Mark the session as bad so the pool rotates it out sooner
            session.markBad();

            // Log detailed error information
            log.error(`Error on ${request.url}: ${error.message}`);

            // Re-throw to trigger retry
            throw error;
        }
    },

    async failedRequestHandler({ request, log }) {
        log.error(`
            ❌ Failed after ${request.retryCount} attempts
            URL: ${request.url}
            Errors: ${JSON.stringify(request.errorMessages, null, 2)}
        `);

        // Save to a separate dataset for analysis
        await Dataset.pushData({
            url: request.url,
            failed: true,
            retryCount: request.retryCount,
            errors: request.errorMessages,
            timestamp: new Date().toISOString(),
        }, { datasetId: 'failed-requests' });
    },
});

await crawler.run([
    'https://example.com',
    'https://example.org',
]);

Conclusion

Crawlee's automatic retry and error recovery mechanisms make it highly reliable for production web scraping. By understanding and properly configuring these features, you can build scrapers that gracefully handle failures, maintain data quality, and recover from transient errors. The combination of automatic retries, session management, persistent queues, and custom error handling provides a solid foundation for robust web scraping applications.

For more advanced error handling scenarios, consider exploring how to handle errors in Puppeteer for additional browser-specific error recovery techniques that complement Crawlee's built-in capabilities.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
