Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server in Node.js. To scrape data from a table using Cheerio, follow three steps:
- Fetch the HTML content of the page that contains the table.
- Load the HTML content into Cheerio.
- Use Cheerio's jQuery-like selectors to find the table and extract the data you need.
Below is a step-by-step guide with an example of how you could use Cheerio to scrape data from a table:
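Step 1: Install Cheerio and Request-Promise
Before you start, install Cheerio and request-promise, which is used to make the HTTP requests. request-promise declares the request package as a peer dependency, so install it alongside. Both request and request-promise have been deprecated since 2020; they still work for a simple example like this one, and the notes at the end list alternatives. Install the packages using npm:
npm install cheerio request request-promise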
Step 2: Request the HTML page
Use request-promise to fetch the HTML content of the page. Here's how you do this in a Node.js script:
const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'YOUR_TARGET_URL';
rp(url)
  .then(function(html) {
    // The HTML content of the page is now in 'html'
    // You can now load it into Cheerio and scrape the data from the table
  })
  .catch(function(err) {
    // Handle errors
  });
Step 3: Load HTML content into Cheerio
Once you have the HTML content, you can load it into Cheerio to access the DOM elements:
// Inside the .then function after fetching the HTML content
const $ = cheerio.load(html);
Step 4: Select the table and extract data
Use Cheerio's jQuery-like selectors to select the table and iterate over its rows to extract the data. Assuming you have a simple table structure, here's how you can scrape data from it:
// Inside the .then function after loading the content into Cheerio
const tableData = [];
// Visit every row in the table body
$('table > tbody > tr').each(function() {
  // Collect the text of each td cell in this row into a plain array
  const row = $(this).find('td').map((i, el) => $(el).text()).get();
  tableData.push(row);
});
console.log(tableData);
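This script logs an array of rows, each being an array of cell text values. Because only td cells are selected, a header row made up of th cells that sits inside the tbody shows up as an empty array.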
Here's a complete example that puts it all together:
const rp = require('request-promise');
const cheerio = require('cheerio');
const url = 'YOUR_TARGET_URL';
rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);
    const tableData = [];
    $('table > tbody > tr').each(function() {
      const row = $(this).find('td').map((i, el) => $(el).text()).get();
      tableData.push(row);
    });
    console.log(tableData);
  })
  .catch(function(err) {
    console.error(err);
  });
Remember to replace 'YOUR_TARGET_URL' with the actual URL of the page you want to scrape.
Important Notes:
- Make sure you have the legal right to scrape the website. Check the website's robots.txt file and terms of service before you start scraping.
- Websites' structures change over time, so the selectors you use today might not work tomorrow. Always make your scraper resilient to changes.
- Using request-promise is just one method to fetch the HTML content. You can use other libraries like Axios, node-fetch, or the native http module in Node.js.
- Respect the website's server load and use appropriate rate limiting or time delays between requests.
- If the website is rendered dynamically using JavaScript, Cheerio might not be sufficient as it does not execute JavaScript. In such cases, consider using Puppeteer or a similar tool that can handle JavaScript rendering.
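The rate-limiting note above can be sketched as a loop that fetches pages one at a time with a pause between requests. The helper names (sleep, fetchAllSequentially, fetchWithBuiltin) and the one-second default delay are assumptions for illustration. fetchWithBuiltin relies on the global fetch available in Node.js 18 and later, which is one of several possible replacements for request-promise:

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch each URL in order, waiting delayMs between consecutive requests.
// fetchHtml is any async function that takes a URL and returns an HTML string.
async function fetchAllSequentially(urls, fetchHtml, delayMs = 1000) {
  const pages = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs); // no pause before the first request
    }
    pages.push(await fetchHtml(urls[i]));
  }
  return pages;
}

// One possible fetcher, assuming Node.js 18+ where fetch is a global.
async function fetchWithBuiltin(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return response.text();
}

// Example usage (placeholder URLs, not executed here):
// fetchAllSequentially(['URL_1', 'URL_2'], fetchWithBuiltin, 2000)
//   .then((pages) => console.log(pages.length))
//   .catch(console.error);
```

Passing the fetcher in as a parameter keeps the pacing logic independent of the HTTP library, so the same loop works with Axios or any other client that returns the page body as a string.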