How to Handle Rate Limiting When Scraping Multiple Pages with Cheerio

Rate limiting is a critical consideration when scraping multiple pages: it keeps you from overwhelming target servers and helps prevent your IP from being blocked. While Cheerio itself is a server-side HTML parser that doesn't make HTTP requests, implementing proper rate limiting in your scraping workflow is essential for responsible and sustainable web scraping.

Understanding Rate Limiting in Web Scraping

Rate limiting refers to controlling the frequency of requests sent to a server. Most websites implement rate limiting to protect their servers from excessive traffic and ensure fair usage among all users. When scraping multiple pages, you need to respect these limits to avoid:

  • IP address blocking
  • CAPTCHA challenges
  • Temporary or permanent bans
  • Server overload

Basic Rate Limiting with Delays

The simplest approach to rate limiting is implementing delays between requests. Here's how to add delays when scraping multiple pages with Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

// Simple delay function
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeMultiplePages(urls) {
    const results = [];

    for (const url of urls) {
        try {
            console.log(`Scraping: ${url}`);

            // Make HTTP request
            const response = await axios.get(url);
            const $ = cheerio.load(response.data);

            // Extract data
            const data = {
                title: $('title').text(),
                headings: $('h1').map((i, el) => $(el).text()).get()
            };

            results.push(data);

            // Add delay between requests (1-3 seconds)
            await delay(Math.random() * 2000 + 1000);

        } catch (error) {
            console.error(`Error scraping ${url}:`, error.message);
        }
    }

    return results;
}

// Usage
const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

scrapeMultiplePages(urls).then(results => {
    console.log('Scraping completed:', results);
});

Advanced Rate Limiting with Queue Management

For more sophisticated rate limiting, implement a queue system that controls the number of concurrent requests:

const axios = require('axios');
const cheerio = require('cheerio');

class RateLimitedScraper {
    constructor(options = {}) {
        this.maxConcurrent = options.maxConcurrent || 2;
        this.delayBetweenRequests = options.delay || 1000;
        this.queue = [];
        this.running = 0;
        this.results = [];
    }

    async scrape(urls) {
        return new Promise((resolve) => {
            this.queue = [...urls];
            this.onComplete = resolve;
            this.processQueue();
        });
    }

    async processQueue() {
        while (this.queue.length > 0 && this.running < this.maxConcurrent) {
            const url = this.queue.shift();
            this.running++;
            this.processUrl(url);
        }
    }

    async processUrl(url) {
        try {
            await this.delay(this.delayBetweenRequests);

            console.log(`Processing: ${url}`);
            const response = await axios.get(url, {
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
                }
            });

            const $ = cheerio.load(response.data);
            const data = {
                url,
                title: $('title').text().trim(),
                description: $('meta[name="description"]').attr('content') || '',
                links: $('a[href]').map((i, el) => $(el).attr('href')).get()
            };

            this.results.push(data);

        } catch (error) {
            console.error(`Error processing ${url}:`, error.message);
            this.results.push({ url, error: error.message });
        }

        this.running--;

        if (this.queue.length > 0) {
            this.processQueue();
        } else if (this.running === 0) {
            this.onComplete(this.results);
        }
    }

    delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

// Usage
async function main() {
    const scraper = new RateLimitedScraper({
        maxConcurrent: 3,
        delay: 2000
    });

    const urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
        'https://example.com/page4',
        'https://example.com/page5'
    ];

    const results = await scraper.scrape(urls);
    console.log('All pages scraped:', results.length);
}

main();

Implementing Exponential Backoff

For handling rate limiting errors (HTTP 429), implement exponential backoff:

const axios = require('axios');
const cheerio = require('cheerio');

class RetryableScraper {
    constructor() {
        this.maxRetries = 3;
        this.baseDelay = 1000;
    }

    async scrapeWithRetry(url, retryCount = 0) {
        try {
            const response = await axios.get(url, {
                timeout: 10000,
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
                }
            });

            // Note: axios rejects non-2xx responses by default, so a 429
            // response is handled in the catch block below rather than here.

            const $ = cheerio.load(response.data);
            return {
                url,
                success: true,
                data: {
                    title: $('title').text(),
                    paragraphs: $('p').map((i, el) => $(el).text()).get()
                }
            };

        } catch (error) {
            if (retryCount < this.maxRetries && error.response?.status === 429) {

                const delay = this.baseDelay * Math.pow(2, retryCount);
                console.log(`Rate limited, retrying in ${delay}ms...`);

                await this.delay(delay);
                return this.scrapeWithRetry(url, retryCount + 1);
            }

            return {
                url,
                success: false,
                error: error.message
            };
        }
    }

    delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

// Usage
async function scrapeWithExponentialBackoff(urls) {
    const scraper = new RetryableScraper();
    const results = [];

    for (const url of urls) {
        const result = await scraper.scrapeWithRetry(url);
        results.push(result);

        // Base delay between requests
        await scraper.delay(1500);
    }

    return results;
}
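
As with the earlier examples, here is a minimal call site for this helper (the URLs are placeholders):

scrapeWithExponentialBackoff([
    'https://example.com/page1',
    'https://example.com/page2'
]).then(results => console.log('Done:', results));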

Using Request Pools and Connection Management

Optimize your scraping by reusing connections and managing request pools:

const axios = require('axios');
const cheerio = require('cheerio');
const { Agent } = require('https');

// Create a custom axios instance with connection pooling
const httpClient = axios.create({
    httpsAgent: new Agent({
        keepAlive: true,
        maxSockets: 5,
        maxFreeSockets: 2
    }),
    timeout: 15000,
    headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive'
    }
});

class PooledScraper {
    constructor(options = {}) {
        this.requestsPerSecond = options.requestsPerSecond || 2;
        this.intervalMs = 1000 / this.requestsPerSecond;
        this.lastRequestTime = 0;
    }

    async throttledRequest(url) {
        const now = Date.now();
        const timeSinceLastRequest = now - this.lastRequestTime;

        if (timeSinceLastRequest < this.intervalMs) {
            const waitTime = this.intervalMs - timeSinceLastRequest;
            await this.delay(waitTime);
        }

        this.lastRequestTime = Date.now();
        return httpClient.get(url);
    }

    async scrapePage(url) {
        try {
            const response = await this.throttledRequest(url);
            const $ = cheerio.load(response.data);

            return {
                url,
                status: response.status,
                title: $('title').text(),
                metaDescription: $('meta[name="description"]').attr('content'),
                headings: $('h1, h2, h3').map((i, el) => ({
                    tag: el.tagName,
                    text: $(el).text().trim()
                })).get()
            };

        } catch (error) {
            return {
                url,
                error: error.message,
                status: error.response?.status
            };
        }
    }

    delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}
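
Because throttledRequest spaces calls using a single lastRequestTime timestamp, drive it from a sequential loop. A minimal usage sketch (URLs are placeholders):

// Usage
async function runPooledScraper() {
    const scraper = new PooledScraper({ requestsPerSecond: 2 });
    const urls = [
        'https://example.com/page1',
        'https://example.com/page2'
    ];

    const results = [];
    for (const url of urls) {
        // Sequential loop so each request is throttled relative to the previous one
        results.push(await scraper.scrapePage(url));
    }

    console.log(results);
}

runPooledScraper();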

Monitoring and Adaptive Rate Limiting

Implement monitoring to adjust rate limiting based on server responses:

const axios = require('axios');
const cheerio = require('cheerio');

class AdaptiveScraper {
    constructor() {
        this.currentDelay = 1000;
        this.minDelay = 500;
        this.maxDelay = 10000;
        this.successCount = 0;
        this.errorCount = 0;
    }

    adjustDelay(wasSuccessful) {
        if (wasSuccessful) {
            this.successCount++;
            this.errorCount = 0;

            // Gradually decrease delay on success
            if (this.successCount >= 5) {
                this.currentDelay = Math.max(
                    this.minDelay, 
                    this.currentDelay * 0.9
                );
                this.successCount = 0;
            }
        } else {
            this.errorCount++;
            this.successCount = 0;

            // Increase delay on errors
            this.currentDelay = Math.min(
                this.maxDelay, 
                this.currentDelay * (1.5 + this.errorCount * 0.5)
            );
        }

        console.log(`Adjusted delay to: ${this.currentDelay}ms`);
    }

    async scrapeWithAdaptiveRate(urls) {
        const results = [];

        for (const url of urls) {
            await this.delay(this.currentDelay);

            try {
                const response = await axios.get(url);
                const $ = cheerio.load(response.data);

                const data = {
                    url,
                    title: $('title').text(),
                    success: true
                };

                results.push(data);
                this.adjustDelay(true);

            } catch (error) {
                results.push({
                    url,
                    error: error.message,
                    success: false
                });
                this.adjustDelay(false);
            }
        }

        return results;
    }

    delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}
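
A minimal call site for the adaptive scraper (URLs are placeholders):

// Usage
new AdaptiveScraper()
    .scrapeWithAdaptiveRate([
        'https://example.com/page1',
        'https://example.com/page2'
    ])
    .then(results => console.log('Finished:', results));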

Best Practices for Rate Limiting

  1. Respect robots.txt: Always check the website's robots.txt file for crawl delays and restrictions (see the sketch after this list).

  2. Use realistic delays: Implement delays of 1-5 seconds between requests for most websites.

  3. Monitor response codes: Watch for HTTP 429 (Too Many Requests) and adjust accordingly.

  4. Rotate user agents: Use different user agents to appear more like natural traffic.

  5. Implement circuit breakers: Stop scraping temporarily if error rates become too high.
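
For the robots.txt point above, here is a minimal sketch (assuming axios and a hypothetical getCrawlDelayMs helper) that reads a site's Crawl-delay directive and falls back to a conservative default. A production-grade parser would also match the directive against your specific User-agent group:

const axios = require('axios');

// Hypothetical helper: fetch robots.txt and look for a Crawl-delay directive
async function getCrawlDelayMs(siteOrigin, fallbackMs = 2000) {
    try {
        const { data } = await axios.get(`${siteOrigin}/robots.txt`, { timeout: 5000 });
        const match = data.match(/^crawl-delay:\s*(\d+(\.\d+)?)/im);
        return match ? Number(match[1]) * 1000 : fallbackMs;
    } catch (error) {
        // robots.txt missing or unreachable: fall back to a conservative default
        return fallbackMs;
    }
}

// Usage: wait at least the site's requested delay between requests
// getCrawlDelayMs('https://example.com').then(ms => console.log(`Delay: ${ms}ms`));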

Integration with Modern Web Scraping

When working with dynamic content that requires JavaScript execution, Cheerio alone isn't sufficient because it doesn't render pages. In those cases you can integrate with browser automation tools: see how to handle browser sessions in Puppeteer for managing rendered pages, and the sketch below for one way to combine the two.
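
As a minimal sketch (assuming Puppeteer is installed alongside Cheerio), you can let a headless browser render each page, grab the resulting HTML, and hand it to Cheerio for parsing, keeping the same delay-based rate limiting as above:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: render each URL in a headless browser, then parse with Cheerio
async function scrapeRenderedPages(urls) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const results = [];

    for (const url of urls) {
        await page.goto(url, { waitUntil: 'networkidle0' });
        const html = await page.content(); // fully rendered HTML
        const $ = cheerio.load(html);

        results.push({ url, title: $('title').text() });

        await delay(1500); // keep rate limiting between navigations
    }

    await browser.close();
    return results;
}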

For complex navigation patterns across many pages, the strategies in how to run multiple pages in parallel with Puppeteer can complement your Cheerio-based scraping approach.

Conclusion

Effective rate limiting is crucial for sustainable web scraping with Cheerio. By implementing proper delays, queue management, retry logic, and adaptive rate limiting, you can scrape multiple pages efficiently while being respectful to target servers. Remember to always monitor your scraping performance and adjust your rate limiting strategy based on the specific requirements and constraints of each website you're scraping.

The key is finding the right balance between scraping speed and server courtesy, ensuring your scraping operations remain undetected and sustainable over time.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
