How do you use Cheerio to scrape data from a table?

Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server in Node.js. If you want to scrape data from a table using Cheerio, you'll need to follow several steps:

  1. Fetch the HTML content of the page that contains the table.
  2. Load the HTML content into Cheerio.
  3. Use Cheerio's jQuery-like selectors to find the table and extract the data you need.

Below is a step-by-step guide with an example of how you could use Cheerio to scrape data from a table:

Step 1: Install Cheerio and Request-Promise

Before you start, you need to install Cheerio and request-promise, which will be used to make HTTP requests. Note that request-promise wraps the now-deprecated request library; it still works for simple scripts, and alternatives such as Axios are covered in the notes below. You can install both packages using npm:

npm install cheerio request-promise

Step 2: Request the HTML page

Use request-promise to fetch the HTML content of the page. Here's how you do this in a Node.js script:

const rp = require('request-promise');
const cheerio = require('cheerio');

const url = 'YOUR_TARGET_URL';

rp(url)
  .then(function(html) {
    // The HTML content of the page is now in 'html'
    // You can now load it into Cheerio and scrape the data from the table
  })
  .catch(function(err) {
    // Handle errors
  });

Step 3: Load HTML content into Cheerio

Once you have the HTML content, you can load it into Cheerio to access the DOM elements:

// Inside the .then function after fetching the HTML content
const $ = cheerio.load(html);

Step 4: Select the table and extract data

Use Cheerio's jQuery-like selectors to select the table and iterate over its rows to extract the data. Assuming you have a simple table structure, here's how you can scrape data from it:

// Inside the .then function after loading the content into Cheerio
const tableData = [];
$('table > tbody > tr').each(function() {
  const row = $(this).find('td').map((i, el) => $(el).text()).get();
  tableData.push(row);
});

console.log(tableData);

This script will log an array of rows, each being an array of cell values.

Here's a complete example that puts it all together:

const rp = require('request-promise');
const cheerio = require('cheerio');

const url = 'YOUR_TARGET_URL';

rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);
    const tableData = [];
    $('table > tbody > tr').each(function() {
      const row = $(this).find('td').map((i, el) => $(el).text()).get();
      tableData.push(row);
    });

    console.log(tableData);
  })
  .catch(function(err) {
    console.error(err);
  });

Remember to replace 'YOUR_TARGET_URL' with the actual URL of the page you want to scrape.

Important Notes:

  • Make sure you have the legal right to scrape the website. Check the website's robots.txt file and terms of service before you start scraping.
  • Websites' structures change over time, so the selectors you use today might not work tomorrow. Make your scraper resilient: handle missing elements gracefully and fail loudly when the expected structure disappears.
  • Using request-promise is just one way to fetch the HTML content, and its underlying request library has been deprecated. Libraries like Axios or node-fetch, or the native http/https modules in Node.js, work equally well.
  • Respect the website's server load and use appropriate rate limiting or time delays between requests.
  • If the website is rendered dynamically using JavaScript, Cheerio might not be sufficient as it does not execute JavaScript. In such cases, consider using Puppeteer or a similar tool that can handle JavaScript rendering.
