How do I configure proxy rotation in Crawlee?
Proxy rotation is essential for large-scale web scraping projects to avoid IP bans, bypass rate limits, and distribute requests across multiple IP addresses. Crawlee provides built-in support for proxy rotation through its ProxyConfiguration class, making it easy to implement sophisticated proxy management strategies.
Understanding Crawlee's Proxy System
Crawlee's proxy rotation system automatically rotates through a list of proxies for each request, ensuring that your scraper doesn't overload any single IP address. The framework supports multiple proxy sources including custom proxy lists, proxy services, and residential proxy providers.
When combined with Crawlee's session pool, proxy rotation happens at the session level: each session is tied to a single proxy. This is particularly useful when you need to maintain consistency across multiple requests to the same domain while still distributing load across different proxies.
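You can see the rotation in isolation by asking a ProxyConfiguration for proxy URLs directly. Here's a minimal sketch (the hostnames are placeholders; note that newUrl() returns a promise in current Crawlee versions):

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
  ],
});

// Each call without a session ID moves on to the next proxy in the list
console.log(await proxyConfiguration.newUrl());
console.log(await proxyConfiguration.newUrl());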
Basic Proxy Configuration
To configure proxy rotation in Crawlee, you use the ProxyConfiguration class. Here's a basic example in JavaScript/TypeScript:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Create a proxy configuration with a list of proxies
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
  ],
});

// Use the proxy configuration in your crawler
const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, $, log }) {
    const title = $('title').text();
    log.info(`Title of ${request.url}: ${title}`);
  },
});

await crawler.run(['https://example.com']);
For Python users, the implementation is similar:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create proxy configuration
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8000',
            'http://proxy2.example.com:8000',
            'http://proxy3.example.com:8000',
        ]
    )

    # Initialize crawler with proxy configuration
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        context.log.info(f'Title: {title}')

    await crawler.run(['https://example.com'])


asyncio.run(main())
Authenticated Proxies
When working with authenticated proxies (proxies that require username and password), you can include the credentials directly in the proxy URL:
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://username:password@proxy1.example.com:8000',
    'http://username:password@proxy2.example.com:8000',
    'http://username:password@proxy3.example.com:8000',
  ],
});
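If a username or password contains special characters such as @ or :, the URL will not parse correctly. A small sketch of one way to handle this, using the standard encodeURIComponent (the credentials shown are placeholders):

// Percent-encode credentials so the proxy URL parses correctly
const username = encodeURIComponent('user@example.com');
const password = encodeURIComponent('p@ss:word');

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    `http://${username}:${password}@proxy1.example.com:8000`,
  ],
});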
Session-Based Proxy Rotation
Crawlee's session management system works seamlessly with proxy rotation. Each session can be assigned to a specific proxy, ensuring that all requests within that session use the same IP address. This is crucial for websites that track user sessions:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
  ],
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  persistCookiesPerSession: true,
  async requestHandler({ request, $, session, proxyInfo, log }) {
    log.info(`Using proxy: ${proxyInfo?.url}`);
    const title = $('title').text();
    log.info(`Title: ${title}`);
    // All subsequent requests in this session will use the same proxy
  },
});
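The session pool size indirectly controls how many proxies are in active use at once. If you want roughly one session per proxy, you can cap the pool; a sketch using the sessionPoolOptions crawler option (the 1:1 sizing is just a heuristic, not a requirement):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  sessionPoolOptions: {
    maxPoolSize: 3, // e.g. one session per proxy in the list above
  },
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
});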
Dynamic Proxy Lists
For more advanced scenarios, you can dynamically load proxies from a file or external API:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
import { readFile } from 'fs/promises';
// Load proxies from a file
async function loadProxies() {
const proxyList = await readFile('proxies.txt', 'utf-8');
return proxyList.split('\n')
.filter(line => line.trim())
.map(line => `http://${line.trim()}`);
}
const proxyUrls = await loadProxies();
const proxyConfiguration = new ProxyConfiguration({
proxyUrls,
});
const crawler = new CheerioCrawler({
proxyConfiguration,
async requestHandler({ request, $, log }) {
// Your scraping logic here
},
});
Proxy Rotation with Custom Logic
Crawlee allows you to implement custom proxy rotation logic using the newUrlFunction option. This gives you fine-grained control over which proxy is selected for each request:
const proxyList = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000',
];

let counter = 0;

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: () => {
    // Round-robin selection: cycle through the list in order
    const proxy = proxyList[counter % proxyList.length];
    counter += 1;
    return proxy;
  },
});
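The function also receives the session ID when the session pool is in use, so you can pin a proxy to a session yourself. A sketch under that assumption, reusing proxyList from the previous snippet (the pinning strategy here is illustrative, not a Crawlee built-in):

// Remember which proxy each session was given
const sessionProxies = new Map();

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: (sessionId) => {
    if (sessionId === undefined || sessionId === null) {
      // No session: any proxy will do
      return proxyList[Math.floor(Math.random() * proxyList.length)];
    }
    // Pin each session to one proxy so its requests keep the same IP
    if (!sessionProxies.has(sessionId)) {
      sessionProxies.set(sessionId, proxyList[sessionProxies.size % proxyList.length]);
    }
    return sessionProxies.get(sessionId);
  },
});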
Testing Proxy Configuration
Before running a large-scale scraping job, it's important to test your proxy configuration. Here's how to verify that your proxies are working:
import { ProxyConfiguration } from 'crawlee';
import { gotScraping } from 'got-scraping';
const proxyUrls = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
];

// Test each proxy by fetching your public IP through it
for (const proxyUrl of proxyUrls) {
  try {
    console.log(`Testing proxy: ${proxyUrl}`);
    const response = await gotScraping({
      url: 'https://api.ipify.org?format=json',
      proxyUrl,
      responseType: 'json',
    });
    console.log(`Proxy IP: ${response.body.ip}`);
  } catch (error) {
    console.error(`Proxy ${proxyUrl} failed: ${error.message}`);
  }
}

// Once verified, build the configuration from the working list
const proxyConfiguration = new ProxyConfiguration({ proxyUrls });
Handling Proxy Errors
When a request fails, Crawlee automatically retries it; with the session pool enabled, the retry typically runs through a different session, and therefore a different proxy. You can customize what happens once retries are exhausted:

const crawler = new CheerioCrawler({
  proxyConfiguration,
  maxRequestRetries: 5,
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
  async failedRequestHandler({ request, log }, error) {
    // Runs only after all retries have been exhausted
    log.error(`Request ${request.url} failed: ${error.message}`);
    if (error.message.includes('proxy')) {
      log.warning('Proxy-related error detected');
      // Replace the proxy, alert, or record the failure here
    }
  },
});
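Beyond retries, you can retire a session proactively when you detect a soft block, which takes its proxy out of rotation for new requests. A sketch using the session pool's retire() method (the CAPTCHA check is just an example heuristic):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  async requestHandler({ request, $, session, log }) {
    // Example heuristic: treat a CAPTCHA page as a soft block
    if ($('title').text().toLowerCase().includes('captcha')) {
      session.retire(); // stop using this session and its proxy
      throw new Error(`Blocked on ${request.url}, retrying with a new session`);
    }
    // Normal scraping logic
  },
});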
Best Practices for Proxy Rotation
Use Multiple Proxy Providers: Don't rely on a single proxy source. Combine different proxy providers to ensure reliability.
Monitor Proxy Health: Regularly check proxy performance and remove non-functional proxies from your rotation pool.
Match Proxy Type to Website: Use residential proxies for websites with strict anti-bot measures, and datacenter proxies for less restrictive sites.
Implement Rate Limiting: Even with proxy rotation, throttle your crawler so you don't overwhelm target websites; see the sketch after this list.
Geographic Targeting: If scraping geo-specific content, use proxies from the relevant geographic region.
Session Persistence: For websites that require login or maintain user sessions, ensure that the same proxy is used throughout the session.
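As a sketch of the rate-limiting advice above: Crawlee crawlers accept maxConcurrency and maxRequestsPerMinute options, so you can throttle globally regardless of how many proxies you rotate through (the numbers here are arbitrary starting points):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  maxConcurrency: 10,        // at most 10 requests in flight
  maxRequestsPerMinute: 120, // global cap across all proxies
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
});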
Integrating with Proxy Services
Many proxy services provide API access for rotating proxies. Here's an example of integrating with a proxy service:
import { ProxyConfiguration } from 'crawlee';
import axios from 'axios';

async function getProxiesFromService() {
  const response = await axios.get('https://api.proxyservice.com/get-proxies', {
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
    },
  });
  return response.data.proxies.map(
    (proxy) => `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`,
  );
}

const proxyUrls = await getProxiesFromService();

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls,
});
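Proxy lists from such services go stale, so instead of fetching once at startup you can refresh lazily from an async newUrlFunction (it may return a promise). A sketch reusing the hypothetical getProxiesFromService() helper above:

let pool = [];
let lastFetch = 0;
const TEN_MINUTES = 10 * 60 * 1000;

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: async () => {
    // Re-fetch the pool when it is empty or older than ten minutes
    if (pool.length === 0 || Date.now() - lastFetch > TEN_MINUTES) {
      pool = await getProxiesFromService();
      lastFetch = Date.now();
    }
    return pool[Math.floor(Math.random() * pool.length)];
  },
});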
Advanced Configuration Options
Crawlee's ProxyConfiguration supports more than a static proxyUrls list. You can pass a newUrlFunction for fully custom selection (as shown earlier), and recent versions also accept tieredProxyUrls, where Crawlee starts with the first (typically cheapest) tier and escalates to later tiers only when it detects blocking:

const proxyConfiguration = new ProxyConfiguration({
  // Tier 1: cheap datacenter proxies, tried first
  // Tier 2: pricier residential proxies, used when tier 1 gets blocked
  tieredProxyUrls: [
    ['http://datacenter-proxy1.example.com:8000', 'http://datacenter-proxy2.example.com:8000'],
    ['http://residential-proxy1.example.com:8000'],
  ],
});
Monitoring Proxy Performance
To track which proxies are performing well, you can log proxy usage and success rates:
const proxyStats = new Map();

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, proxyInfo, log }) {
    // proxyInfo describes the proxy Crawlee chose for this request
    const proxyUrl = proxyInfo?.url;
    if (proxyUrl) {
      const stats = proxyStats.get(proxyUrl) || { success: 0, total: 0 };
      stats.total++;
      stats.success++;
      proxyStats.set(proxyUrl, stats);
    }
    // Your scraping logic
  },
  async failedRequestHandler({ request, proxyInfo }, error) {
    const proxyUrl = proxyInfo?.url;
    if (proxyUrl) {
      const stats = proxyStats.get(proxyUrl) || { success: 0, total: 0 };
      stats.total++;
      proxyStats.set(proxyUrl, stats);
    }
  },
});
// After crawling, analyze proxy performance
await crawler.run(['https://example.com']);

console.log('Proxy Performance:');
for (const [proxy, stats] of proxyStats) {
  const successRate = ((stats.success / stats.total) * 100).toFixed(2);
  console.log(`${proxy}: ${successRate}% success rate (${stats.success}/${stats.total})`);
}
Conclusion
Proxy rotation in Crawlee is a powerful feature that helps you build robust, scalable web scrapers. By properly configuring proxy rotation, handling errors gracefully, and monitoring performance, you can ensure that your scraping projects run smoothly even at large scales. Whether you're using a simple list of proxies or integrating with sophisticated proxy services, Crawlee's flexible proxy configuration system provides the tools you need to succeed.
Remember to always respect website terms of service and robots.txt files, and use proxies responsibly. Proper proxy rotation is not about bypassing security measures, but about distributing load and ensuring reliable data collection for legitimate purposes.