How do I implement IP rotation with Crawlee?

IP rotation is a critical technique in web scraping that helps you avoid rate limiting, IP bans, and detection by distributing requests across multiple IP addresses. Crawlee provides built-in support for proxy rotation, making it straightforward to implement IP rotation in your web scraping projects.

Understanding IP Rotation in Crawlee

Crawlee handles proxy management through its ProxyConfiguration class, which automatically rotates through a list of proxies for each request. This ensures that your scraper appears to come from different IP addresses, reducing the risk of being blocked or throttled by target websites.

The framework supports various proxy configurations including:

  • HTTP/HTTPS proxies: Standard web proxies
  • SOCKS proxies: More flexible protocol support
  • Proxy rotation strategies: Round-robin, random, or custom logic
  • Session-based proxies: Maintain the same IP for related requests

Basic IP Rotation Setup

Here's how to implement basic IP rotation with Crawlee using the ProxyConfiguration class:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Create a proxy configuration with multiple proxy URLs
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
        'http://username:password@proxy4.example.com:8080', // With authentication
    ],
});

// Initialize the crawler with proxy configuration
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, page, log }) => {
        log.info(`Scraping ${request.url}`);

        // Your scraping logic here
        const title = await page.title();
        log.info(`Title: ${title}`);
    },
});

// Start crawling
await crawler.run(['https://example.com']);

Using Proxy Services with Crawlee

For production environments, you'll typically use commercial proxy services. Here's how to integrate popular proxy providers:

Bright Data (Luminati) Integration

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://username-session-random123:password@zproxy.lum-superproxy.io:22225',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text();
        log.info(`Scraped: ${title} from ${request.url}`);
    },
});

await crawler.run(['https://example.com']);

Rotating Proxies from a Proxy List

If you have a list of proxies from a file or API, you can load them dynamically:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { readFileSync } from 'fs';

// Load proxies from a file (one proxy per line)
const proxyList = readFileSync('proxies.txt', 'utf-8')
    .split('\n')
    .filter(line => line.trim())
    .map(proxy => `http://${proxy.trim()}`);

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: proxyList,
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, page, log, proxyInfo }) => {
        // Log which proxy is being used
        log.info(`Using proxy: ${proxyInfo.url}`);

        const content = await page.content();
        // Process your data here
    },
});

await crawler.run(['https://example.com']);

Advanced IP Rotation Strategies

Session-Based IP Rotation

For scenarios where you need to maintain the same IP address across multiple related requests (like maintaining a login session), use session-based proxies:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Use session pool to maintain same proxy per session
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
        },
    },
    requestHandler: async ({ request, page, log, session }) => {
        log.info(`Session ID: ${session.id} - URL: ${request.url}`);

        // All requests in this session are routed through the same proxy,
        // so the target site sees a consistent IP for the whole session
        const data = await page.evaluate(() => ({
            userAgent: navigator.userAgent,
        }));

        log.info(`Data: ${JSON.stringify(data)}`);
    },
});

await crawler.run([
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]);

Custom Proxy Selection Logic

You can implement custom logic for proxy selection using the newUrlFunction option:

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://us-proxy1.example.com:8080',
        'http://us-proxy2.example.com:8080',
        'http://eu-proxy1.example.com:8080',
        'http://eu-proxy2.example.com:8080',
    ],
    newUrlFunction: (sessionId) => {
        // Custom logic: pick a region deterministically from the session ID.
        // Session IDs are strings, so derive a number from their characters first.
        const hash = String(sessionId)
            .split('')
            .reduce((sum, char) => sum + char.charCodeAt(0), 0);
        const proxies = hash % 2 === 0
            ? ['http://us-proxy1.example.com:8080', 'http://us-proxy2.example.com:8080']
            : ['http://eu-proxy1.example.com:8080', 'http://eu-proxy2.example.com:8080'];

        return proxies[Math.floor(Math.random() * proxies.length)];
    },
});

Testing and Verifying IP Rotation

It's crucial to verify that your IP rotation is working correctly. Here's a test script:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: 10,
    requestHandler: async ({ request, json, log, proxyInfo }) => {
        // Using a service that returns your IP address
        log.info(`Request ${request.id}: IP = ${json.ip}, Proxy = ${proxyInfo.url}`);
    },
});

// Use a service that returns your IP in JSON format.
// Identical URLs are deduplicated by default, so give each request a unique key.
await crawler.run([
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-1' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-2' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-3' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-4' },
    { url: 'https://api.ipify.org?format=json', uniqueKey: 'ip-check-5' },
]);

Handling Proxy Failures and Retries

Crawlee automatically handles proxy failures and retries with different proxies. You can customize this behavior:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestRetries: 5, // Retry up to 5 times
    requestHandlerTimeoutSecs: 60,

    // Custom error handling
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed after retries: ${error.message}`);
        // Log failed requests for later processing
    },

    requestHandler: async ({ request, page, log, proxyInfo }) => {
        try {
            // PlaywrightCrawler has already navigated to request.url at this point
            log.info(`Processing ${request.url} with proxy ${proxyInfo.url}`);

            const title = await page.title();
            log.info(`Successfully scraped: ${title}`);
        } catch (error) {
            log.warning(`Error with proxy ${proxyInfo.url}: ${error.message}`);
            throw error; // Let Crawlee retry the request, rotating to a different proxy
        }
    },
});

await crawler.run(['https://example.com']);

Python Implementation with Crawlee

If you're using Crawlee for Python, the implementation is similar:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create proxy configuration
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
            'http://proxy3.example.com:8080',
        ]
    )

    # Initialize crawler with proxy configuration
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        max_request_retries=5,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        """Handle each request with automatic IP rotation."""
        proxy_url = context.proxy_info.url if context.proxy_info else 'none'
        context.log.info(f'Scraping {context.request.url} using proxy {proxy_url}')

        title = await context.page.title()
        context.log.info(f'Page title: {title}')

    # Run the crawler
    await crawler.run(['https://example.com'])


asyncio.run(main())

Best Practices for IP Rotation

  1. Proxy Pool Size: Maintain a pool of at least 10 proxies, and ideally 50 or more, for effective rotation
  2. Proxy Quality: Use residential or datacenter proxies from reputable providers
  3. Request Timing: Combine IP rotation with request delays and throttling to mimic human behavior (see the sketch after this list)
  4. Session Management: Use session-based rotation when handling browser sessions
  5. Monitoring: Log proxy performance and automatically remove failing proxies
  6. Geographic Distribution: Use proxies from different regions if targeting geo-restricted content
  7. Proxy Authentication: Secure your proxies with username/password authentication
  8. Cost Optimization: Monitor proxy usage to balance cost and scraping effectiveness
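
Best practice 3 above pairs IP rotation with request throttling. The snippet below is a minimal sketch of how the two can be combined in a single crawler; the proxy URLs are placeholders, and the throttling options (maxConcurrency, maxRequestsPerMinute, sameDomainDelaySecs, available in recent Crawlee versions) are illustrative values you should tune for the target site.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Throttle the crawl so the rotated IPs are not burned by aggressive traffic.
    // These limits are illustrative; adjust them for the target site.
    maxConcurrency: 5,
    maxRequestsPerMinute: 60,
    sameDomainDelaySecs: 2, // delay between requests to the same domain
    requestHandler: async ({ request, $, log, proxyInfo }) => {
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);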

Integrating with WebScraping.AI

While Crawlee provides excellent proxy management, you can also leverage managed scraping services that handle IP rotation automatically. For example, WebScraping.AI provides built-in proxy rotation and handles all the complexity:

import axios from 'axios';

const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://example.com';

// WebScraping.AI automatically rotates IPs for each request
const response = await axios.get('https://api.webscraping.ai/html', {
    params: {
        api_key: apiKey,
        url: targetUrl,
        proxy: 'residential', // Use residential proxy with automatic rotation
    }
});

console.log(response.data);

This approach combines the power of Crawlee for orchestration with managed proxy services for reliable IP rotation, giving you the best of both worlds.
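
As a rough sketch of that combination, the example below uses Crawlee only as the request queue and retry engine, while each request is routed through the WebScraping.AI HTML endpoint shown in the axios example above (same api_key, url, and proxy parameters); the target URLs and API key are placeholders.

import { CheerioCrawler } from 'crawlee';

const apiKey = 'YOUR_API_KEY';
const targets = ['https://example.com', 'https://example.com/about'];

// Wrap each target URL in a WebScraping.AI request. The API handles IP rotation,
// while Crawlee handles queueing, concurrency, and retries.
const requests = targets.map((url) => ({
    url: `https://api.webscraping.ai/html?${new URLSearchParams({
        api_key: apiKey,
        url,
        proxy: 'residential',
    })}`,
    uniqueKey: url, // one request per original target
    userData: { target: url },
}));

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    requestHandler: async ({ request, $, log }) => {
        log.info(`Scraped ${request.userData.target}: ${$('title').text()}`);
    },
});

await crawler.run(requests);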

Conclusion

IP rotation is essential for large-scale web scraping projects, and Crawlee makes it straightforward to implement through its ProxyConfiguration class. Whether you're using a simple proxy list or integrating with commercial proxy providers, Crawlee handles the rotation automatically while providing flexibility for custom implementations.

Remember to always respect website terms of service and robots.txt files, use reasonable request rates, and ensure your scraping activities are legal and ethical. Combined with proper error handling and timeout management, IP rotation will help you build robust and reliable web scrapers that can operate at scale without interruption.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
