How do you use Cheerio with HTTP request libraries like Axios or Fetch?

Cheerio is a fast, lightweight HTML parsing library that implements a subset of core jQuery for server-side use in Node.js. While Cheerio excels at parsing and manipulating HTML content, it doesn't make HTTP requests itself. This is where HTTP request libraries like Axios and Fetch come in, creating a powerful combination for web scraping projects.

Understanding the Cheerio + HTTP Library Workflow

The typical workflow involves three main steps:

  1. Fetch HTML content using an HTTP library (Axios, Fetch, or others)
  2. Parse the HTML using Cheerio to create a jQuery-like object
  3. Extract and manipulate data using Cheerio's selectors and methods

This approach gives you the flexibility of modern HTTP clients combined with jQuery's familiar DOM manipulation syntax.

Using Cheerio with Axios

Axios is a popular promise-based HTTP client with a clean API and rich configuration options. Here's how to combine it with Cheerio:

Basic Setup and Installation

npm install cheerio axios

Simple Web Scraping Example

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWebsite(url) {
  try {
    // Step 1: Fetch HTML content
    const response = await axios.get(url);

    // Step 2: Load HTML into Cheerio
    const $ = cheerio.load(response.data);

    // Step 3: Extract data using jQuery-like selectors
    const title = $('title').text();
    const headings = [];

    $('h1, h2, h3').each((index, element) => {
      headings.push($(element).text().trim());
    });

    return {
      title,
      headings,
      url
    };
  } catch (error) {
    console.error('Scraping failed:', error.message);
    throw error;
  }
}

// Usage
scrapeWebsite('https://example.com')
  .then(data => console.log(data))
  .catch(error => console.error(error));

Advanced Axios Configuration

For production web scraping, you'll often need to configure headers, timeouts, and other options:

const axios = require('axios');
const cheerio = require('cheerio');

// Create an Axios instance with custom configuration
const httpClient = axios.create({
  timeout: 10000,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
  }
});

async function scrapeWithCustomHeaders(url) {
  try {
    const response = await httpClient.get(url);
    const $ = cheerio.load(response.data);

    // Extract product information (e-commerce example)
    const products = [];
    $('.product-item').each((index, element) => {
      const $product = $(element);
      products.push({
        name: $product.find('.product-name').text().trim(),
        price: $product.find('.price').text().trim(),
        image: $product.find('img').attr('src'),
        link: $product.find('a').attr('href')
      });
    });

    return products;
  } catch (error) {
    if (error.response) {
      console.error(`HTTP Error: ${error.response.status} - ${error.response.statusText}`);
    } else {
      console.error('Request failed:', error.message);
    }
    throw error;
  }
}
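
Axios interceptors are another production-oriented option: they let you log or transform every request and response in one place. Here's a minimal sketch, assuming the httpClient instance created above:

// Log successful responses and annotate failures with the offending URL
httpClient.interceptors.response.use(
  (response) => {
    console.log(`${response.status} ${response.config.url}`);
    return response;
  },
  (error) => {
    // error.config may be undefined if the request never left the client
    console.error(`Request to ${error.config?.url} failed:`, error.message);
    return Promise.reject(error);
  }
);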

Using Cheerio with Fetch API

The Fetch API is built into browsers and Node.js 18+, offering a promise-based, standards-native alternative to Axios:

Basic Fetch + Cheerio Example

const cheerio = require('cheerio');
const fetch = require('node-fetch'); // Only needed on Node.js < 18; newer versions ship a global fetch

async function scrapeWithFetch(url) {
  try {
    // Step 1: Fetch HTML content
    const response = await fetch(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)',
        'Accept': 'text/html,application/xhtml+xml'
      }
    });

    // Check if request was successful
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    // Step 2: Get HTML text
    const html = await response.text();

    // Step 3: Parse with Cheerio
    const $ = cheerio.load(html);

    // Extract navigation links
    const navLinks = [];
    $('nav a, .navigation a, .menu a').each((index, element) => {
      const $link = $(element);
      const href = $link.attr('href');
      const text = $link.text().trim();

      if (href && text) {
        navLinks.push({
          text,
          href: new URL(href, url).href // Convert relative URLs to absolute
        });
      }
    });

    return navLinks;
  } catch (error) {
    console.error('Fetch error:', error.message);
    throw error;
  }
}

Browser Environment Example

In browser environments the Fetch API is available natively, though Cheerio itself must be bundled into your front-end build (for example with webpack or Vite), and cross-origin requests are subject to the target site's CORS policy:

// Browser-compatible version
async function scrapeBrowserContent(url) {
  try {
    const response = await fetch(url, {
      mode: 'cors', // 'cors' is the default; the server must still allow cross-origin access
      credentials: 'same-origin'
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const html = await response.text();
    const $ = cheerio.load(html);

    // Extract metadata
    const metadata = {
      title: $('title').text(),
      description: $('meta[name="description"]').attr('content'),
      keywords: $('meta[name="keywords"]').attr('content'),
      ogTitle: $('meta[property="og:title"]').attr('content'),
      ogDescription: $('meta[property="og:description"]').attr('content'),
      ogImage: $('meta[property="og:image"]').attr('content')
    };

    return metadata;
  } catch (error) {
    console.error('Browser scraping error:', error);
    throw error;
  }
}

Handling Different Content Types and Encodings

Sometimes you'll encounter pages served in non-UTF-8 character encodings. The example below uses the iconv-lite package (npm install iconv-lite) to decode the raw response bytes:

const axios = require('axios');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

async function handleEncodingIssues(url) {
  try {
    const response = await axios.get(url, {
      responseType: 'arraybuffer', // Get raw binary data
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      }
    });

    // Detect encoding from Content-Type header
    const contentType = response.headers['content-type'] || '';
    let encoding = 'utf-8';

    const charsetMatch = contentType.match(/charset=([^;]+)/i);
    if (charsetMatch) {
      encoding = charsetMatch[1].toLowerCase();
    }

    // Convert buffer to string with proper encoding
    const html = iconv.decode(Buffer.from(response.data), encoding);
    const $ = cheerio.load(html);

    // Now extract data normally
    const content = {
      title: $('title').text(),
      paragraphs: $('p').map((i, el) => $(el).text().trim()).get()
    };

    return content;
  } catch (error) {
    console.error('Encoding handling error:', error);
    throw error;
  }
}
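
If the Content-Type header doesn't declare a charset, a common fallback is to sniff the <meta charset> tag from the start of the document. Here's a minimal sketch (the 1024-byte prefix and the helper name detectCharsetFromMeta are assumptions for illustration):

function detectCharsetFromMeta(buffer) {
  // Decode a prefix as latin1 so every byte survives the round trip,
  // then look for <meta charset="..."> or the http-equiv variant
  const head = buffer.slice(0, 1024).toString('latin1');
  const match = head.match(/<meta[^>]+charset=["']?([\w-]+)/i);
  return match ? match[1].toLowerCase() : null;
}

You could call this on the raw response buffer in handleEncodingIssues above before falling back to utf-8.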

Error Handling and Retry Logic

Robust web scraping requires proper error handling and retry mechanisms:

const axios = require('axios');
const cheerio = require('cheerio');

class WebScraper {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.retryDelay = options.retryDelay || 1000;
    this.timeout = options.timeout || 10000;
  }

  async scrapeWithRetry(url, retryCount = 0) {
    try {
      const response = await axios.get(url, {
        timeout: this.timeout,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        }
      });

      const $ = cheerio.load(response.data);
      return this.extractData($);

    } catch (error) {
      if (retryCount < this.maxRetries) {
        console.log(`Retry ${retryCount + 1}/${this.maxRetries} for ${url}`);
        await this.delay(this.retryDelay * 2 ** retryCount); // Exponential backoff: delay doubles each retry
        return this.scrapeWithRetry(url, retryCount + 1);
      }

      throw new Error(`Failed to scrape ${url} after ${this.maxRetries} retries: ${error.message}`);
    }
  }

  extractData($) {
    return {
      title: $('title').text().trim(),
      links: $('a').map((i, el) => ({
        text: $(el).text().trim(),
        href: $(el).attr('href')
      })).get().filter(link => link.href && link.text)
    };
  }

  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new WebScraper({ maxRetries: 3, retryDelay: 2000 });
scraper.scrapeWithRetry('https://example.com')
  .then(data => console.log(data))
  .catch(error => console.error(error));

Performance Optimization and Best Practices

Concurrent Scraping with Rate Limiting

const axios = require('axios');
const cheerio = require('cheerio');

class ConcurrentScraper {
  constructor(concurrency = 5, delay = 1000) {
    this.concurrency = concurrency;
    this.delay = delay;
  }

  async scrapeUrls(urls) {
    const results = [];

    for (let i = 0; i < urls.length; i += this.concurrency) {
      const batch = urls.slice(i, i + this.concurrency);
      const batchPromises = batch.map(url => this.scrapeSingle(url));

      const batchResults = await Promise.allSettled(batchPromises);
      results.push(...batchResults);

      // Add delay between batches to respect rate limits
      if (i + this.concurrency < urls.length) {
        await new Promise(resolve => setTimeout(resolve, this.delay));
      }
    }

    return results;
  }

  async scrapeSingle(url) {
    try {
      const response = await axios.get(url, {
        timeout: 10000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        }
      });

      const $ = cheerio.load(response.data);

      return {
        url,
        success: true,
        data: {
          title: $('title').text().trim(),
          headings: $('h1, h2, h3').map((i, el) => $(el).text().trim()).get(),
          links: $('a[href]').length
        }
      };
    } catch (error) {
      return {
        url,
        success: false,
        error: error.message
      };
    }
  }
}

// Usage
const scraper = new ConcurrentScraper(3, 2000); // 3 concurrent requests, 2s delay
const urls = ['https://example1.com', 'https://example2.com', 'https://example3.com'];

scraper.scrapeUrls(urls)
  .then(results => {
    results.forEach(result => {
      // scrapeSingle catches its own errors, so every result should be 'fulfilled'
      if (result.status === 'fulfilled' && result.value.success) {
        console.log('Success:', result.value.data);
      } else if (result.status === 'fulfilled') {
        console.log('Failed:', result.value.error);
      } else {
        console.log('Failed:', result.reason);
      }
    });
  });

Alternative Approaches for Dynamic Content

While Cheerio + HTTP libraries work well for static content, some websites render their content dynamically with JavaScript. For those cases you may need a browser automation tool such as Puppeteer, which can handle AJAX requests and crawl single-page applications (SPAs); a minimal hybrid sketch follows below.
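
As a minimal sketch of that hybrid approach (assuming npm install puppeteer), you can let a headless browser execute the page's JavaScript and then hand the rendered HTML to Cheerio:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeDynamicPage(url) {
  // Launch a headless browser so the page's JavaScript actually runs
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so dynamic content has loaded
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Hand the fully rendered HTML to Cheerio for jQuery-style parsing
    const html = await page.content();
    const $ = cheerio.load(html);

    return {
      title: $('title').text().trim(),
      headings: $('h1, h2, h3').map((i, el) => $(el).text().trim()).get()
    };
  } finally {
    await browser.close();
  }
}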

Conclusion

Combining Cheerio with HTTP request libraries like Axios or Fetch provides a powerful, lightweight solution for web scraping. This approach offers excellent performance for static content while maintaining the familiar jQuery syntax that many developers appreciate. The key advantages include:

  • Lightweight: No browser overhead compared to headless browser solutions
  • Fast: Direct HTTP requests are much faster than browser automation
  • Familiar: jQuery-like syntax for DOM manipulation
  • Flexible: Easy to customize headers, handle authentication, and manage sessions

Remember to always respect websites' robots.txt files, implement proper rate limiting, and be mindful of the terms of service when scraping web content. For complex JavaScript-heavy applications, consider combining this approach with browser automation tools when necessary.
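
For the robots.txt part, here's a minimal sketch using the robots-parser npm package (one of several parsers with a similar API; the default user agent string is an assumption):

const axios = require('axios');
const robotsParser = require('robots-parser');

async function isScrapingAllowed(url, userAgent = 'WebScraper/1.0') {
  const robotsUrl = new URL('/robots.txt', url).href;
  try {
    const response = await axios.get(robotsUrl);
    const robots = robotsParser(robotsUrl, response.data);
    return robots.isAllowed(url, userAgent);
  } catch (error) {
    // No robots.txt (or it failed to load): there are no rules to violate,
    // but throttle your requests anyway
    return true;
  }
}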

Whether you choose Axios for its rich feature set or Fetch for its modern API and native support, both work excellently with Cheerio to create robust web scraping solutions. The examples provided here should give you a solid foundation for building your own scraping applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
