How to Use Cheerio with Proxy Servers for Web Scraping

When web scraping at scale, using proxy servers becomes essential to avoid IP blocking, rate limiting, and geographic restrictions. While Cheerio itself is a server-side HTML parser that doesn't handle HTTP requests directly, you can easily combine it with HTTP libraries that support proxy configuration to create powerful scraping solutions.

Understanding Cheerio and Proxy Integration

Cheerio is a jQuery-like server-side HTML parser for Node.js that doesn't make HTTP requests on its own. To use proxies with Cheerio, you need to:

  1. Use an HTTP client library (like Axios, node-fetch, or the built-in http module) to fetch web pages through proxies
  2. Pass the received HTML content to Cheerio for parsing
  3. Extract the desired data using Cheerio's jQuery-like syntax
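For example, here is a minimal sketch of this three-step flow using the fetch implementation and ProxyAgent from the undici package (assuming Node 18+ with undici installed; the proxy URL is a placeholder):

const cheerio = require('cheerio');
const { fetch, ProxyAgent } = require('undici');

async function fetchTitleThroughProxy(url) {
  // 1. Fetch the page through the proxy (placeholder proxy address)
  const dispatcher = new ProxyAgent('http://proxy-server.com:8080');
  const response = await fetch(url, { dispatcher });
  const html = await response.text();

  // 2. Pass the received HTML to Cheerio
  const $ = cheerio.load(html);

  // 3. Extract the desired data with jQuery-like selectors
  return $('title').text();
}

The sections below show the same pattern with Axios, which is the more common choice for proxy-based scraping.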

Setting Up Cheerio with Axios and Proxy Support

The most common approach is to pair Axios with the https-proxy-agent package, which provides a proxy-aware HTTP agent you can pass to every request:

const axios = require('axios');
const cheerio = require('cheerio');
const { HttpsProxyAgent } = require('https-proxy-agent');

// Configure proxy
const proxyUrl = 'http://username:password@proxy-server.com:8080';
const agent = new HttpsProxyAgent(proxyUrl);

async function scrapeWithProxy(url) {
  try {
    // Make request through proxy
    const response = await axios.get(url, {
      httpsAgent: agent,
      httpAgent: agent,
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    // Parse HTML with Cheerio
    const $ = cheerio.load(response.data);

    // Extract data
    const title = $('title').text();
    const links = [];

    $('a[href]').each((i, element) => {
      links.push({
        text: $(element).text().trim(),
        href: $(element).attr('href')
      });
    });

    return { title, links };
  } catch (error) {
    console.error('Scraping failed:', error.message);
    throw error;
  }
}

// Usage
scrapeWithProxy('https://example.com')
  .then(data => console.log(data))
  .catch(err => console.error(err));

Using SOCKS Proxies with Cheerio

For SOCKS proxies, you can use the socks-proxy-agent library:

const axios = require('axios');
const cheerio = require('cheerio');
const { SocksProxyAgent } = require('socks-proxy-agent');

// Configure SOCKS proxy
const socksProxyUrl = 'socks5://username:password@proxy-server.com:1080';
const agent = new SocksProxyAgent(socksProxyUrl);

async function scrapeWithSocksProxy(url) {
  const response = await axios.get(url, {
    httpsAgent: agent,
    httpAgent: agent,
    timeout: 10000
  });

  const $ = cheerio.load(response.data);

  // Extract product information
  const products = [];
  $('.product-item').each((i, element) => {
    products.push({
      name: $(element).find('.product-name').text().trim(),
      price: $(element).find('.product-price').text().trim(),
      image: $(element).find('img').attr('src')
    });
  });

  return products;
}

Implementing Proxy Rotation

For large-scale scraping, rotating between multiple proxies helps distribute requests and avoid detection:

const axios = require('axios');
const cheerio = require('cheerio');
const { HttpsProxyAgent } = require('https-proxy-agent');

class ProxyRotator {
  constructor(proxies) {
    this.proxies = proxies;
    this.currentIndex = 0;
  }

  getNextProxy() {
    const proxy = this.proxies[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
    return proxy;
  }

  createAgent(proxy) {
    return new HttpsProxyAgent(`http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`);
  }
}

// Initialize proxy rotator
const proxies = [
  { host: 'proxy1.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.com', port: 8080, username: 'user2', password: 'pass2' },
  { host: 'proxy3.com', port: 8080, username: 'user3', password: 'pass3' }
];

const proxyRotator = new ProxyRotator(proxies);

async function scrapeWithRotatingProxy(url) {
  const proxy = proxyRotator.getNextProxy();
  const agent = proxyRotator.createAgent(proxy);

  console.log(`Using proxy: ${proxy.host}:${proxy.port}`);

  try {
    const response = await axios.get(url, {
      httpsAgent: agent,
      httpAgent: agent,
      timeout: 15000,
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    const $ = cheerio.load(response.data);

    return {
      title: $('h1').first().text().trim(),
      description: $('meta[name="description"]').attr('content'),
      proxy: `${proxy.host}:${proxy.port}`
    };
  } catch (error) {
    console.error(`Proxy ${proxy.host}:${proxy.port} failed:`, error.message);
    throw error;
  }
}

// Scrape multiple URLs with rotation
async function scrapeMultipleUrls(urls) {
  const results = [];

  for (const url of urls) {
    try {
      const data = await scrapeWithRotatingProxy(url);
      results.push(data);

      // Add delay between requests
      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (error) {
      console.error(`Failed to scrape ${url}:`, error.message);
    }
  }

  return results;
}

Using Built-in HTTP Module with Proxies

For more control, you can use Node.js's built-in https module together with the tunnel package, which tunnels requests through an HTTP proxy:

const https = require('https');
const cheerio = require('cheerio');
const tunnel = require('tunnel');

// Configure HTTP tunnel through proxy
const tunnelingAgent = tunnel.httpsOverHttp({
  proxy: {
    host: 'proxy-server.com',
    port: 8080,
    proxyAuth: 'username:password'
  }
});

function scrapeWithTunnel(url) {
  return new Promise((resolve, reject) => {
    const options = {
      agent: tunnelingAgent,
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; Node.js scraper)'
      }
    };

    https.get(url, options, (response) => {
      let data = '';

      response.on('data', (chunk) => {
        data += chunk;
      });

      response.on('end', () => {
        const $ = cheerio.load(data);

        // Extract structured data
        const jsonLd = [];
        $('script[type="application/ld+json"]').each((i, element) => {
          try {
            jsonLd.push(JSON.parse($(element).html()));
          } catch (e) {
            // Skip invalid JSON-LD
          }
        });

        resolve({
          title: $('title').text(),
          jsonLd,
          statusCode: response.statusCode
        });
      });
    }).on('error', reject);
  });
}

Advanced Proxy Configuration with Error Handling

Here's a robust implementation with error handling and retry logic:

const axios = require('axios');
const cheerio = require('cheerio');
const { HttpsProxyAgent } = require('https-proxy-agent');

class RobustProxyScraper {
  constructor(proxies, options = {}) {
    this.proxies = proxies;
    this.maxRetries = options.maxRetries || 3;
    this.timeout = options.timeout || 10000;
    this.retryDelay = options.retryDelay || 1000;
  }

  async scrapeWithRetry(url, selector) {
    let lastError;

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      const proxy = this.proxies[attempt % this.proxies.length];

      try {
        const data = await this.scrapeWithProxy(url, proxy, selector);
        return data;
      } catch (error) {
        lastError = error;
        console.warn(`Attempt ${attempt + 1} failed with proxy ${proxy.host}:`, error.message);

        if (attempt < this.maxRetries - 1) {
          await this.delay(this.retryDelay * (attempt + 1));
        }
      }
    }

    throw new Error(`All retry attempts failed. Last error: ${lastError.message}`);
  }

  async scrapeWithProxy(url, proxy, selector) {
    const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
    const agent = new HttpsProxyAgent(proxyUrl);

    const response = await axios.get(url, {
      httpsAgent: agent,
      httpAgent: agent,
      timeout: this.timeout,
      headers: {
        'User-Agent': this.getRandomUserAgent()
      }
    });

    const $ = cheerio.load(response.data);

    if (selector) {
      return $(selector).map((i, el) => $(el).text().trim()).get();
    }

    return {
      title: $('title').text().trim(),
      headings: $('h1, h2, h3').map((i, el) => $(el).text().trim()).get(),
      links: $('a[href]').map((i, el) => ({
        text: $(el).text().trim(),
        href: $(el).attr('href')
      })).get()
    };
  }

  getRandomUserAgent() {
    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ];
    return userAgents[Math.floor(Math.random() * userAgents.length)];
  }

  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage example
const proxies = [
  { host: 'proxy1.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.com', port: 8080, username: 'user2', password: 'pass2' }
];

const scraper = new RobustProxyScraper(proxies, {
  maxRetries: 3,
  timeout: 15000,
  retryDelay: 2000
});

scraper.scrapeWithRetry('https://example.com', '.product-title')
  .then(data => console.log('Scraped data:', data))
  .catch(err => console.error('Scraping failed:', err.message));

Best Practices for Proxy-Based Scraping

1. Proxy Pool Management

Maintain a healthy pool of working proxies and regularly test their functionality:

# Test proxy connectivity
curl --proxy http://username:password@proxy-server.com:8080 https://httpbin.org/ip
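The same check can be automated in Node. Below is a minimal sketch that probes each proxy against https://httpbin.org/ip and keeps only the ones that respond in time (the proxy URLs are placeholders):

const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

// Returns only the proxies that successfully complete a test request
async function filterWorkingProxies(proxyUrls) {
  const checks = proxyUrls.map(async (proxyUrl) => {
    const agent = new HttpsProxyAgent(proxyUrl);
    try {
      await axios.get('https://httpbin.org/ip', {
        httpsAgent: agent,
        httpAgent: agent,
        timeout: 5000
      });
      return proxyUrl;   // proxy responded in time
    } catch {
      return null;       // mark dead or slow proxies for removal
    }
  });

  return (await Promise.all(checks)).filter(Boolean);
}

// Usage (placeholder proxy URLs)
filterWorkingProxies([
  'http://user1:pass1@proxy1.com:8080',
  'http://user2:pass2@proxy2.com:8080'
]).then(working => console.log('Working proxies:', working));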

2. Request Rate Limiting

Implement delays between requests to avoid overwhelming target servers:

// Add random delays between requests
const delay = Math.random() * 3000 + 1000; // 1-4 seconds
await new Promise(resolve => setTimeout(resolve, delay));

3. Error Handling and Fallbacks

Always implement comprehensive error handling for network failures, proxy timeouts, and blocked requests.
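As a rough sketch, you can branch on the error shape Axios exposes to decide whether to retry with another proxy, back off, or give up (the categories and status codes below are illustrative, not exhaustive):

function classifyScrapingError(error) {
  // Proxy or network failure: no response ever arrived
  if (error.code === 'ECONNABORTED' || error.code === 'ETIMEDOUT') {
    return 'timeout';      // retry with a longer timeout or another proxy
  }
  if (!error.response) {
    return 'network';      // connection refused, DNS failure, dead proxy
  }

  // The target responded, but not with usable content
  const status = error.response.status;
  if (status === 407) return 'proxy-auth';  // proxy rejected the credentials
  if (status === 429) return 'rate-limit';  // back off before retrying
  if (status === 403) return 'blocked';     // rotate to a different proxy/IP
  return 'http-error';
}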

4. Monitoring and Logging

Track proxy performance and success rates to identify problematic proxies:

const proxyStats = new Map();

function updateProxyStats(proxyHost, success) {
  if (!proxyStats.has(proxyHost)) {
    proxyStats.set(proxyHost, { success: 0, failure: 0 });
  }

  const stats = proxyStats.get(proxyHost);
  success ? stats.success++ : stats.failure++;
}
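Those counters can then feed back into proxy selection, for example by dropping proxies whose success rate falls below a threshold (the 80% cutoff here is an arbitrary example, using the proxyStats map above):

// Keep only proxies with an acceptable success rate
function getHealthyProxies(minSuccessRate = 0.8) {
  const healthy = [];
  for (const [host, stats] of proxyStats) {
    const total = stats.success + stats.failure;
    if (total === 0 || stats.success / total >= minSuccessRate) {
      healthy.push(host);
    }
  }
  return healthy;
}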

Troubleshooting Common Issues

Connection Timeouts

Increase timeout values and implement retry logic for unreliable proxies.

Authentication Failures

Verify proxy credentials and ensure proper URL encoding for special characters in usernames/passwords.
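If a username or password contains characters like @, :, or /, encode each component before building the proxy URL, for example:

// Encode credentials so special characters don't break the proxy URL
const username = encodeURIComponent('user@example.com');
const password = encodeURIComponent('p@ss:word/123');
const proxyUrl = `http://${username}:${password}@proxy-server.com:8080`;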

IP Blocking

Rotate proxies more frequently and implement longer delays between requests.

Alternative Approaches

For more complex scenarios requiring JavaScript execution, consider using headless browsers with proxy support, which can handle dynamic content that Cheerio alone cannot process.
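As a rough sketch, a headless browser such as Puppeteer can route its traffic through the same proxies and still hand the rendered HTML to Cheerio for extraction (the proxy address and credentials are placeholders):

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeDynamicPage(url) {
  // Route all browser traffic through the proxy
  const browser = await puppeteer.launch({
    args: ['--proxy-server=proxy-server.com:8080']
  });

  const page = await browser.newPage();
  // Supply proxy credentials if the proxy requires authentication
  await page.authenticate({ username: 'username', password: 'password' });

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Hand the fully rendered HTML to Cheerio for jQuery-like extraction
  const html = await page.content();
  const $ = cheerio.load(html);
  const title = $('title').text();

  await browser.close();
  return title;
}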

When dealing with heavily protected sites, you might need to implement more sophisticated techniques like handling browser sessions or managing complex authentication flows.

Conclusion

Using Cheerio with proxy servers requires combining Cheerio's HTML parsing capabilities with HTTP clients that support proxy configuration. By implementing proper proxy rotation, error handling, and rate limiting, you can create robust web scraping solutions that can handle large-scale data extraction while avoiding common blocking mechanisms.

Remember to always respect robots.txt files, implement appropriate delays between requests, and comply with the terms of service of the websites you're scraping. Proper proxy usage not only helps avoid detection but also ensures your scraping operations remain ethical and sustainable.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
