How do I configure proxy rotation in Crawlee?
Proxy rotation is essential for large-scale web scraping projects to avoid IP bans, bypass rate limits, and distribute requests across multiple IP addresses. Crawlee provides built-in support for proxy rotation through its ProxyConfiguration class, making it easy to implement sophisticated proxy management strategies.
Understanding Crawlee's Proxy System
Crawlee's proxy rotation system automatically rotates through a list of proxies for each request, ensuring that your scraper doesn't overload any single IP address. The framework supports multiple proxy sources including custom proxy lists, proxy services, and residential proxy providers.
When combined with Crawlee's session pool, proxy rotation happens at the session level: each session is tied to a single proxy. This is particularly useful when you need to maintain consistency across multiple requests to the same domain while still distributing load across different proxies.
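You can see the rotation in isolation by asking a ProxyConfiguration for proxy URLs directly. Here's a minimal sketch (the hostnames are placeholders; note that newUrl() returns a promise in current Crawlee versions):

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
  ],
});

// Each call without a session ID moves on to the next proxy in the list
console.log(await proxyConfiguration.newUrl());
console.log(await proxyConfiguration.newUrl());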
Basic Proxy Configuration
To configure proxy rotation in Crawlee, you use the ProxyConfiguration class. Here's a basic example in JavaScript/TypeScript:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Create a proxy configuration with a list of proxies
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
  ],
});

// Use the proxy configuration in your crawler
const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, $, log }) {
    const title = $('title').text();
    log.info(`Title of ${request.url}: ${title}`);
  },
});

await crawler.run(['https://example.com']);
For Python users, the implementation is similar:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # Create proxy configuration
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy1.example.com:8000',
            'http://proxy2.example.com:8000',
            'http://proxy3.example.com:8000',
        ]
    )

    # Initialize crawler with proxy configuration
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        context.log.info(f'Title: {title}')

    await crawler.run(['https://example.com'])


asyncio.run(main())
Authenticated Proxies
When working with authenticated proxies (proxies that require username and password), you can include the credentials directly in the proxy URL:
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://username:password@proxy1.example.com:8000',
    'http://username:password@proxy2.example.com:8000',
    'http://username:password@proxy3.example.com:8000',
  ],
});
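If a username or password contains special characters such as @ or :, the URL will not parse correctly. A small sketch of one way to handle this, using the standard encodeURIComponent (the credentials shown are placeholders):

// Percent-encode credentials so the proxy URL parses correctly
const username = encodeURIComponent('user@example.com');
const password = encodeURIComponent('p@ss:word');

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    `http://${username}:${password}@proxy1.example.com:8000`,
  ],
});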
Session-Based Proxy Rotation
Crawlee's session management system works seamlessly with proxy rotation. Each session can be assigned to a specific proxy, ensuring that all requests within that session use the same IP address. This is crucial for websites that track user sessions:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
  ],
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  persistCookiesPerSession: true,
  async requestHandler({ request, $, session, proxyInfo, log }) {
    log.info(`Using proxy: ${proxyInfo?.url}`);
    const title = $('title').text();
    log.info(`Title: ${title}`);
    // All subsequent requests in this session will use the same proxy
  },
});
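The session pool size indirectly controls how many proxies are in active use at once. If you want roughly one session per proxy, you can cap the pool; a sketch using the sessionPoolOptions crawler option (the 1:1 sizing is just a heuristic, not a requirement):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  sessionPoolOptions: {
    maxPoolSize: 3, // e.g. one session per proxy in the list above
  },
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
});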
Dynamic Proxy Lists
For more advanced scenarios, you can dynamically load proxies from a file or external API:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
import { readFile } from 'fs/promises';
// Load proxies from a file
async function loadProxies() {
const proxyList = await readFile('proxies.txt', 'utf-8');
return proxyList.split('\n')
.filter(line => line.trim())
.map(line => `http://${line.trim()}`);
}
const proxyUrls = await loadProxies();
const proxyConfiguration = new ProxyConfiguration({
proxyUrls,
});
const crawler = new CheerioCrawler({
proxyConfiguration,
async requestHandler({ request, $, log }) {
// Your scraping logic here
},
});
Proxy Rotation with Custom Logic
Crawlee allows you to implement custom proxy rotation logic using the newUrlFunction option. This gives you fine-grained control over which proxy is selected for each request:
const proxyList = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000',
];

let counter = 0;

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: () => {
    // Round-robin selection: cycle through the list in order
    const proxy = proxyList[counter % proxyList.length];
    counter += 1;
    return proxy;
  },
});
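The function also receives the session ID when the session pool is in use, so you can pin a proxy to a session yourself. A sketch under that assumption, reusing proxyList from the previous snippet (the pinning strategy here is illustrative, not a Crawlee built-in):

// Remember which proxy each session was given
const sessionProxies = new Map();

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: (sessionId) => {
    if (sessionId === undefined || sessionId === null) {
      // No session: any proxy will do
      return proxyList[Math.floor(Math.random() * proxyList.length)];
    }
    // Pin each session to one proxy so its requests keep the same IP
    if (!sessionProxies.has(sessionId)) {
      sessionProxies.set(sessionId, proxyList[sessionProxies.size % proxyList.length]);
    }
    return sessionProxies.get(sessionId);
  },
});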
Testing Proxy Configuration
Before running a large-scale scraping job, it's important to test your proxy configuration. Here's how to verify that your proxies are working:
import { ProxyConfiguration } from 'crawlee';
import { gotScraping } from 'got-scraping';
const proxyUrls = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
];

// Test each proxy by fetching your public IP through it
for (const proxyUrl of proxyUrls) {
  try {
    console.log(`Testing proxy: ${proxyUrl}`);
    const response = await gotScraping({
      url: 'https://api.ipify.org?format=json',
      proxyUrl,
      responseType: 'json',
    });
    console.log(`Proxy IP: ${response.body.ip}`);
  } catch (error) {
    console.error(`Proxy ${proxyUrl} failed: ${error.message}`);
  }
}

// Once verified, build the configuration from the working list
const proxyConfiguration = new ProxyConfiguration({ proxyUrls });
Handling Proxy Errors
When a request fails, Crawlee automatically retries it; with the session pool enabled, the retry typically runs through a different session, and therefore a different proxy. You can customize what happens once retries are exhausted:

const crawler = new CheerioCrawler({
  proxyConfiguration,
  maxRequestRetries: 5,
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
  async failedRequestHandler({ request, log }, error) {
    // Runs only after all retries have been exhausted
    log.error(`Request ${request.url} failed: ${error.message}`);
    if (error.message.includes('proxy')) {
      log.warning('Proxy-related error detected');
      // Replace the proxy, alert, or record the failure here
    }
  },
});
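Beyond retries, you can retire a session proactively when you detect a soft block, which takes its proxy out of rotation for new requests. A sketch using the session pool's retire() method (the CAPTCHA check is just an example heuristic):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  async requestHandler({ request, $, session, log }) {
    // Example heuristic: treat a CAPTCHA page as a soft block
    if ($('title').text().toLowerCase().includes('captcha')) {
      session.retire(); // stop using this session and its proxy
      throw new Error(`Blocked on ${request.url}, retrying with a new session`);
    }
    // Normal scraping logic
  },
});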
Best Practices for Proxy Rotation
Use Multiple Proxy Providers: Don't rely on a single proxy source. Combine different proxy providers to ensure reliability.
Monitor Proxy Health: Regularly check proxy performance and remove non-functional proxies from your rotation pool.
Match Proxy Type to Website: Use residential proxies for websites with strict anti-bot measures, and datacenter proxies for less restrictive sites.
Implement Rate Limiting: Even with proxy rotation, throttle your crawler so you don't overwhelm target websites; see the sketch after this list.
Geographic Targeting: If scraping geo-specific content, use proxies from the relevant geographic region.
Session Persistence: For websites that require login or maintain user sessions, ensure that the same proxy is used throughout the session.
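As a sketch of the rate-limiting advice above: Crawlee crawlers accept maxConcurrency and maxRequestsPerMinute options, so you can throttle globally regardless of how many proxies you rotate through (the numbers here are arbitrary starting points):

const crawler = new CheerioCrawler({
  proxyConfiguration,
  maxConcurrency: 10,        // at most 10 requests in flight
  maxRequestsPerMinute: 120, // global cap across all proxies
  async requestHandler({ request, $, log }) {
    // Your scraping logic
  },
});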
Integrating with Proxy Services
Many proxy services provide API access for rotating proxies. Here's an example of integrating with a proxy service:
import { ProxyConfiguration } from 'crawlee';
import axios from 'axios';

async function getProxiesFromService() {
  const response = await axios.get('https://api.proxyservice.com/get-proxies', {
    headers: {
      Authorization: 'Bearer YOUR_API_KEY',
    },
  });
  return response.data.proxies.map(
    (proxy) => `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`,
  );
}

const proxyUrls = await getProxiesFromService();

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls,
});
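Proxy lists from such services go stale, so instead of fetching once at startup you can refresh lazily from an async newUrlFunction (it may return a promise). A sketch reusing the hypothetical getProxiesFromService() helper above:

let pool = [];
let lastFetch = 0;
const TEN_MINUTES = 10 * 60 * 1000;

const proxyConfiguration = new ProxyConfiguration({
  newUrlFunction: async () => {
    // Re-fetch the pool when it is empty or older than ten minutes
    if (pool.length === 0 || Date.now() - lastFetch > TEN_MINUTES) {
      pool = await getProxiesFromService();
      lastFetch = Date.now();
    }
    return pool[Math.floor(Math.random() * pool.length)];
  },
});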
Advanced Configuration Options
Crawlee's ProxyConfiguration supports more than a static proxyUrls list. You can pass a newUrlFunction for fully custom selection (as shown earlier), and recent versions also accept tieredProxyUrls, where Crawlee starts with the first (typically cheapest) tier and escalates to later tiers only when it detects blocking:

const proxyConfiguration = new ProxyConfiguration({
  // Tier 1: cheap datacenter proxies, tried first
  // Tier 2: pricier residential proxies, used when tier 1 gets blocked
  tieredProxyUrls: [
    ['http://datacenter-proxy1.example.com:8000', 'http://datacenter-proxy2.example.com:8000'],
    ['http://residential-proxy1.example.com:8000'],
  ],
});
Monitoring Proxy Performance
To track which proxies are performing well, you can log proxy usage and success rates:
const proxyStats = new Map();

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, proxyInfo, log }) {
    // proxyInfo describes the proxy Crawlee chose for this request
    const proxyUrl = proxyInfo?.url;
    if (proxyUrl) {
      const stats = proxyStats.get(proxyUrl) || { success: 0, total: 0 };
      stats.total++;
      stats.success++;
      proxyStats.set(proxyUrl, stats);
    }
    // Your scraping logic
  },
  async failedRequestHandler({ request, proxyInfo }, error) {
    const proxyUrl = proxyInfo?.url;
    if (proxyUrl) {
      const stats = proxyStats.get(proxyUrl) || { success: 0, total: 0 };
      stats.total++;
      proxyStats.set(proxyUrl, stats);
    }
  },
});
// After crawling, analyze proxy performance
await crawler.run(['https://example.com']);

console.log('Proxy Performance:');
for (const [proxy, stats] of proxyStats) {
  const successRate = ((stats.success / stats.total) * 100).toFixed(2);
  console.log(`${proxy}: ${successRate}% success rate (${stats.success}/${stats.total})`);
}
Conclusion
Proxy rotation in Crawlee is a powerful feature that helps you build robust, scalable web scrapers. By properly configuring proxy rotation, handling errors gracefully, and monitoring performance, you can ensure that your scraping projects run smoothly even at large scales. Whether you're using a simple list of proxies or integrating with sophisticated proxy services, Crawlee's flexible proxy configuration system provides the tools you need to succeed.
Remember to always respect website terms of service and robots.txt files, and use proxies responsibly. Proper proxy rotation is not about bypassing security measures, but about distributing load and ensuring reliable data collection for legitimate purposes.