When scraping with Cheerio, you will often encounter relative URLs in the HTML. A relative URL omits part of the full address (the scheme, the host, or leading path segments) and is interpreted relative to the URL of the page it appears on.
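To handle relative URLs when using Cheerio, you can use the url.resolve() method from Node.js's built-in url module, or construct absolute URLs manually by combining the base URL of the website with the relative path. Note that the Node.js documentation marks url.resolve() as a legacy API and recommends the WHATWG URL constructor, new URL(relative, base), for new code.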
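To make the resolution rules concrete, here is a minimal sketch using the WHATWG URL constructor, which is available as a global in Node.js. The page URL and the paths are made-up values chosen only for illustration.

```javascript
// Hypothetical page URL, used only for illustration.
const pageUrl = 'https://example.com/blog/posts/index.html';

// new URL(relative, base) resolves `relative` against `base`
// and throws a TypeError if the result is not a valid URL.
const resolve = (relative) => new URL(relative, pageUrl).href;

console.log(resolve('about.html'));               // https://example.com/blog/posts/about.html
console.log(resolve('../images/a.png'));          // https://example.com/blog/images/a.png
console.log(resolve('/contact'));                 // https://example.com/contact
console.log(resolve('//cdn.example.com/lib.js')); // https://cdn.example.com/lib.js
console.log(resolve('?page=2'));                  // https://example.com/blog/posts/index.html?page=2
```

Notice that the result depends on the full URL of the page, not just the site root: a path without a leading slash is resolved against the page's directory.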
Here's a detailed example of how to handle relative URLs using Cheerio in Node.js:
First, you need to install cheerio and axios (or any other HTTP request library) via npm:
npm install cheerio axios
Then, you can use the following code snippet to scrape content and handle relative URLs:
const cheerio = require('cheerio');
const axios = require('axios');
const url = require('url');

// The base URL of the page you are scraping
const BASE_URL = 'https://example.com';

// Function to convert a relative URL to an absolute URL
const toAbsoluteUrl = (relativeUrl, baseUrl) => {
  return url.resolve(baseUrl, relativeUrl);
};

// Function to scrape the website
const scrapeWebsite = async (baseUrl) => {
  try {
    // Fetch the content of the page
    const response = await axios.get(baseUrl);
    const html = response.data;

    // Load the HTML content into Cheerio
    const $ = cheerio.load(html);

    // Go through every link on the page
    $('a').each((index, element) => {
      let href = $(element).attr('href');

      // Check if the URL is relative
      if (href && !href.startsWith('http://') && !href.startsWith('https://')) {
        // Convert the relative URL to an absolute URL
        href = toAbsoluteUrl(href, baseUrl);
        console.log(href); // Log the absolute URL
      }
    });
  } catch (error) {
    console.error(`Error: ${error.message}`);
  }
};

// Call the function with the base URL
scrapeWebsite(BASE_URL);
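In the above code, toAbsoluteUrl() is a helper function that takes a relative URL and a base URL and uses url.resolve() to create an absolute URL. The scraper iterates over all anchor elements on the page and calls this helper on each href attribute that does not start with http:// or https://. Note that this simple check also treats values such as mailto: links, javascript: links, and fragment-only links (#section) as relative, so you may want to filter those out.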
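A stricter variant can be sketched as follows: resolve every href first, then keep only the results whose scheme is http or https. The helper names toAbsoluteHttpUrl and resolveLinks are hypothetical, and dropping fragments and duplicates is an assumption about what a typical crawler wants rather than a requirement. The hrefs array stands in for the values you would collect with $(element).attr('href').

```javascript
// Hypothetical helper: returns an absolute http(s) URL as a string,
// or null if the href is empty, unparsable, or uses another scheme.
const toAbsoluteHttpUrl = (href, pageUrl) => {
  if (!href) return null;

  let resolved;
  try {
    // Absolute hrefs pass through unchanged; relative ones are
    // resolved against the URL of the page they came from.
    resolved = new URL(href.trim(), pageUrl);
  } catch {
    return null; // not a valid URL
  }

  // Skip mailto:, javascript:, tel:, data: and similar schemes.
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
    return null;
  }

  resolved.hash = ''; // assumption: fragments are irrelevant for crawling
  return resolved.href;
};

// Hypothetical helper: resolves a list of hrefs and removes duplicates,
// preserving the order of first appearance.
const resolveLinks = (hrefs, pageUrl) => [
  ...new Set(hrefs.map((href) => toAbsoluteHttpUrl(href, pageUrl)).filter(Boolean)),
];

const hrefs = ['/a', 'b.html', 'mailto:someone@example.com', 'https://other.org/x', '/a#top'];
console.log(resolveLinks(hrefs, 'https://example.com/dir/page.html'));
// [ 'https://example.com/a', 'https://example.com/dir/b.html', 'https://other.org/x' ]
```

If the page was reached through a redirect, passing the final URL of the response as pageUrl, rather than the URL originally requested, is likely to give more accurate results.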
Please note that when building a web scraper, you should always check the website's robots.txt file and Terms of Service to ensure that you are allowed to scrape its content and that you comply with its usage policies. Additionally, it is good practice to respect the website's servers by not sending too many requests in a short period of time.