How do I implement IP rotation with Crawlee?

IP rotation is a critical technique in web scraping that helps you avoid rate limiting, IP bans, and detection by distributing requests across multiple IP addresses. Crawlee provides built-in support for proxy rotation, making it straightforward to implement IP rotation in your web scraping projects.

Understanding IP Rotation in Crawlee

Crawlee handles proxy management through its ProxyConfiguration class, which automatically rotates through a list of proxies for each request. This ensures that your scraper appears to come from different IP addresses, reducing the risk of being blocked or throttled by target websites.

The framework supports various proxy configurations including:

  • HTTP/HTTPS proxies: Standard web proxies
  • SOCKS proxies: More flexible protocol support
  • Proxy rotation strategies: Round-robin, random, or custom logic
  • Session-based proxies: Maintain the same IP for related requests

Basic IP Rotation Setup

Here's how to implement basic IP rotation with Crawlee using the ProxyConfiguration class:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Create a proxy configuration with multiple proxy URLs
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
        'http://username:password@proxy4.example.com:8080', // With authentication
    ],
});

// Initialize the crawler with proxy configuration
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, page, log }) => {
        log.info(`Scraping ${request.url}`);

        // Your scraping logic here
        const title = await page.title();
        log.info(`Title: ${title}`);
    },
});

// Start crawling
await crawler.run(['https://example.com']);

Using Proxy Services with Crawlee

For production environments, you'll typically use commercial proxy services. Here's how to integrate popular proxy providers:

Bright Data (Luminati) Integration

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://username-session-random123:password@zproxy.lum-superproxy.io:22225',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text();
        log.info(`Scraped: ${title} from ${request.url}`);
    },
});

await crawler.run(['https://example.com']);

Rotating Proxies from a Proxy List

If you have a list of proxies from a file or API, you can load them dynamically:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { readFileSync } from 'fs';

// Load proxies from a file (one proxy per line)
const proxyList = readFileSync('proxies.txt', 'utf-8')
    .split('\n')
    .filter(line => line.trim())
    .map(proxy => `http://${proxy.trim()}`);

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: proxyList,
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, page, log, proxyInfo }) => {
        // Log which proxy is being used
        log.info(`Using proxy: ${proxyInfo.url}`);

        const content = await page.content();
        // Process your data here
    },
});

await crawler.run(['https://example.com']);

Advanced IP Rotation Strategies

Session-Based IP Rotation

For scenarios where you need to maintain the same IP address across multiple related requests (like maintaining a login session), use session-based proxies:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Use session pool to maintain same proxy per session
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
        },
    },
    requestHandler: async ({ request, page, log, session }) => {
        log.info(`Session ID: ${session.id} - URL: ${request.url}`);

        // All requests in this session are routed through the same proxy,
        // so the target site sees a consistent IP for the whole session
        const data = await page.evaluate(() => ({
            userAgent: navigator.userAgent,
        }));

        log.info(`Data: ${JSON.stringify(data)}`);
    },
});

await crawler.run([
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]);

Custom Proxy Selection Logic

You can implement custom logic for proxy selection using the newUrlFunction option:

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://us-proxy1.example.com:8080',
        'http://us-proxy2.example.com:8080',
        'http://eu-proxy1.example.com:8080',
        'http://eu-proxy2.example.com:8080',
    ],
    newUrlFunction: (sessionId) => {
        // Custom logic: pick a region deterministically from the session ID.
        // Session IDs are strings, so derive a number from their characters first.
        const hash = String(sessionId)
            .split('')
            .reduce((sum, char) => sum + char.charCodeAt(0), 0);
        const proxies = hash % 2 === 0
            ? ['http://us-proxy1.example.com:8080', 'http://us-proxy2.example.com:8080']
            : ['http://eu-proxy1.example.com:8080', 'http://eu-proxy2.example.com:8080'];

        return proxies[Math.floor(Math.random() * proxies.length)];
    },
});

Testing and Verifying IP Rotation

It's crucial to verify that your IP rotation is working correctly. Here's a test script:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: 10,
    requestHandler: async ({ request, json, log, proxyInfo }) => {
        // Using a service that returns your IP address
        log.info(`Request ${request.id}: IP = ${json.ip}, Proxy = ${proxyInfo.url}`);
    },
});

// Use a service that returns your IP in JSON format.
// Identical URLs are deduplicated by default, so give each request a unique key.
await crawler.run([
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-1' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-2' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-3' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-4' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-5' },
]);

Handling Proxy Failures and Retries

Crawlee automatically handles proxy failures and retries with different proxies. You can customize this behavior:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestRetries: 5, // Retry up to 5 times
    requestHandlerTimeoutSecs: 60,

    // Custom error handling
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed after retries: ${error.message}`);
        // Log failed requests for later processing
    },

    requestHandler: async ({ request, page, log, proxyInfo }) => {
        try {
            // PlaywrightCrawler has already navigated to request.url at this point
            log.info(`Processing ${request.url} with proxy ${proxyInfo.url}`);

            const title = await page.title();
            log.info(`Successfully scraped: ${title}`);
        } catch (error) {
            log.warning(`Error with proxy ${proxyInfo.url}: ${error.message}`);
            throw error; // Let Crawlee retry the request, rotating to a different proxy
        }
    },
});

await crawler.run(['https://example.com']);

Python Implementation with Crawlee

If you're using Crawlee for Python, the implementation is similar:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create proxy configuration
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
            'http://proxy3.example.com:8080',
        ]
    )

    # Initialize crawler with proxy configuration
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        max_request_retries=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        """Handle each request with automatic IP rotation."""
        proxy_url = context.proxy_info.url if context.proxy_info else 'none'
        context.log.info(f'Scraping {context.request.url} using proxy {proxy_url}')

        title = await context.page.title()
        context.log.info(f'Page title: {title}')

    # Run the crawler
    await crawler.run(['https://example.com'])


asyncio.run(main())

Best Practices for IP Rotation

  1. Proxy Pool Size: Maintain a pool of at least 10 proxies, and ideally 50 or more, for effective rotation
  2. Proxy Quality: Use residential or datacenter proxies from reputable providers
  3. Request Timing: Combine IP rotation with request delays and throttling to mimic human behavior (see the sketch after this list)
  4. Session Management: Use session-based rotation when handling browser sessions
  5. Monitoring: Log proxy performance and automatically remove failing proxies
  6. Geographic Distribution: Use proxies from different regions if targeting geo-restricted content
  7. Proxy Authentication: Secure your proxies with username/password authentication
  8. Cost Optimization: Monitor proxy usage to balance cost and scraping effectiveness
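
Best practice 3 above pairs IP rotation with request throttling. The snippet below is a minimal sketch of how the two can be combined in a single crawler; the proxy URLs are placeholders, and the throttling options (maxConcurrency, maxRequestsPerMinute, sameDomainDelaySecs, available in recent Crawlee versions) are illustrative values you should tune for the target site.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Throttle the crawl so the rotated IPs are not burned by aggressive traffic.
    // These limits are illustrative; adjust them for the target site.
    maxConcurrency: 5,
    maxRequestsPerMinute: 60,
    sameDomainDelaySecs: 2, // delay between requests to the same domain
    requestHandler: async ({ request, $, log, proxyInfo }) => {
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);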

Integrating with WebScraping.AI

While Crawlee provides excellent proxy management, you can also leverage managed scraping services that handle IP rotation automatically. For example, WebScraping.AI provides built-in proxy rotation and handles all the complexity:

import axios from 'axios';

const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://example.com';

// WebScraping.AI automatically rotates IPs for each request
const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
        api_key: apiKey,
        url: targetUrl,
        proxy: 'residential', // Use residential proxy with automatic rotation
    }
});

console.log(response.data);

This approach combines the power of Crawlee for orchestration with managed proxy services for reliable IP rotation, giving you the best of both worlds.
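
As a rough sketch of that combination, the example below uses Crawlee only as the request queue and retry engine, while each request is routed through the WebScraping.AI HTML endpoint shown in the axios example above (same api_key, url, and proxy parameters); the target URLs and API key are placeholders.

import { CheerioCrawler } from 'crawlee';

const apiKey = 'YOUR_API_KEY';
const targets = ['https://example.com', 'https://example.com/about'];

// Wrap each target URL in a WebScraping.AI request. The API handles IP rotation,
// while Crawlee handles queueing, concurrency, and retries.
const requests = targets.map((url) => ({
    url: `https://api.webscraping.ai/html?${new URLSearchParams({
        api_key: apiKey,
        url,
        proxy: 'residential',
    })}`,
    uniqueKey: url, // one request per original target
    userData: { target: url },
}));

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    requestHandler: async ({ request, $, log }) => {
        log.info(`Scraped ${request.userData.target}: ${$('title').text()}`);
    },
});

await crawler.run(requests);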

Conclusion

IP rotation is essential for large-scale web scraping projects, and Crawlee makes it straightforward to implement through its ProxyConfiguration class. Whether you're using a simple proxy list or integrating with commercial proxy providers, Crawlee handles the rotation automatically while providing flexibility for custom implementations.

Remember to always respect website terms of service and robots.txt files, use reasonable request rates, and ensure your scraping activities are legal and ethical. Combined with proper error handling and timeout management, IP rotation will help you build robust and reliable web scrapers that can operate at scale without interruption.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
