Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server to parse, manipulate, and render HTML. Cheerio itself cannot fetch remote web pages, but it can be used together with an HTTP request library to parse and manipulate the HTML of a remote page once it has been retrieved.
To work with remote web pages using Cheerio, you would typically follow these steps:
- Use an HTTP client such as `axios`, `node-fetch`, or the native `http` module in Node.js to request the remote web page and retrieve its HTML content.
- Pass the retrieved HTML content to Cheerio to create a loaded document.
- Use Cheerio's jQuery-like API to traverse, parse, and manipulate the HTML document.
Here's a basic example in Node.js using `axios` and `cheerio`:
```js
const axios = require('axios');
const cheerio = require('cheerio');

// URL of the remote web page
const url = 'http://example.com';

// Fetch the remote web page
axios.get(url)
  .then(response => {
    // Load the web page HTML content into Cheerio
    const $ = cheerio.load(response.data);

    // Now you can use the standard jQuery methods on the loaded document
    // For example, change the text of the first <h1> element
    $('h1').first().text('Hello, World!');

    // Output the modified HTML
    console.log($.html());
  })
  .catch(error => {
    console.error('Error fetching the web page:', error);
  });
```
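The same pattern works with the other HTTP clients mentioned above. As a minimal sketch, here is the equivalent flow using Node's built-in `fetch` (available globally since Node.js 18; `node-fetch` works as a drop-in substitute on older versions):

```js
const cheerio = require('cheerio');

async function scrape(url) {
  // Fetch the remote page with the built-in fetch API (Node.js 18+)
  const response = await fetch(url);
  const html = await response.text();

  // Load the HTML into Cheerio and query it
  const $ = cheerio.load(html);
  return $('title').text();
}

scrape('http://example.com')
  .then(title => console.log('Page title:', title))
  .catch(error => console.error('Error fetching the web page:', error));
```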
To run the axios example, you'll need to install `axios` and `cheerio` using npm:

```bash
npm install axios cheerio
```
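Beyond modifying markup, the same jQuery-like API is typically used to extract data from a page. The sketch below is illustrative only; the inline HTML and the `a` selector stand in for whatever structure the page you are scraping actually has:

```js
const cheerio = require('cheerio');

// Stand-in markup; in practice this would be the HTML fetched as shown above
const html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>';
const $ = cheerio.load(html);

// Collect the text and href of every link
const links = $('a')
  .map((i, el) => ({
    text: $(el).text().trim(),
    href: $(el).attr('href'),
  }))
  .get();

console.log(links); // [{ text: 'First', href: '/a' }, { text: 'Second', href: '/b' }]
```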
Cheerio does not execute JavaScript, so if a page's content is generated dynamically on the client, you may need a headless browser such as Puppeteer, Playwright, or Selenium to execute the JavaScript and render the page, and then parse the resulting HTML with Cheerio.
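As a rough sketch of that approach, assuming Puppeteer has been installed with `npm install puppeteer`, you could let the headless browser render the page and then hand the rendered HTML to Cheerio:

```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  // Launch a headless browser and let it execute the page's JavaScript
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://example.com', { waitUntil: 'networkidle0' });

  // Grab the fully rendered HTML and hand it to Cheerio
  const html = await page.content();
  await browser.close();

  const $ = cheerio.load(html);
  console.log($('h1').first().text());
})();
```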
Remember to always respect the terms of service and `robots.txt` of the website you are scraping, and ensure that your scraping activities are legal and ethical.