How can Puppeteer be used for web scraping?

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.
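Headless is the default, but you can flip it off when you want to watch the browser work, which helps when debugging selectors. A minimal sketch of the launch options (the specific values here are illustrative):

```javascript
// Launch-options sketch: headless defaults to true; set it to false
// to open a visible browser window while the script runs.
const launchOptions = {
  headless: false, // show the browser instead of running headless
  slowMo: 100,     // slow each operation down by 100 ms to follow along
};

// Usage: const browser = await puppeteer.launch(launchOptions);
```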

Setting up Puppeteer

Before starting, install Puppeteer in your project:

npm i puppeteer

Using Puppeteer for Web Scraping

Here's a basic example of how Puppeteer can be used for web scraping:

const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    // Note: page.$x() was deprecated and later removed in recent Puppeteer
    // releases; on current versions, query XPath with page.$$('xpath/...') instead.
    const [el] = await page.$x('//*[@id="imgBlkFront"]');
    const src = await el.getProperty('src');
    const imageUrl = await src.jsonValue();

    const [el2] = await page.$x('//*[@id="productTitle"]');
    const txt = await el2.getProperty('textContent');
    const title = await txt.jsonValue();

    const [el3] = await page.$x('//*[@id="priceblock_ourprice"]');
    const txt2 = await el3.getProperty('textContent');
    const price = await txt2.jsonValue();

    console.log({ imageUrl, title, price });

    await browser.close();
}

scrapeProduct('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X');

This script tells Puppeteer to launch a browser and navigate to the given URL. It then locates three elements on the page using XPath expressions and reads their src and textContent properties.

In the example above, it retrieves the image link, title, and price of a product on Amazon, and logs them to the console.
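XPath is only one way to target these elements; the same data can be pulled with CSS id selectors via page.$eval, which is often more concise. A sketch, assuming the same Amazon element ids as in the example above:

```javascript
// Same extraction as above, but using CSS id selectors via page.$eval.
// The ids (#imgBlkFront, #productTitle, #priceblock_ourprice) are taken
// from the XPath example and may change on Amazon's side.
async function extractProduct(page) {
  const imageUrl = await page.$eval('#imgBlkFront', el => el.src);
  const title = await page.$eval('#productTitle', el => el.textContent.trim());
  const price = await page.$eval('#priceblock_ourprice', el => el.textContent.trim());
  return { imageUrl, title, price };
}

// Usage inside scrapeProduct(): const data = await extractProduct(page);
```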

Notes

While Puppeteer is a powerful tool for web scraping, keep in mind that it is resource-intensive: it launches a full Chrome browser in the background. If performance is a concern, consider lighter options like Cheerio or jsdom, which only parse HTML and do not involve a full browser.

Also, remember to respect the website's robots.txt file and do not overload the website with too many requests in a short period of time.
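One simple way to keep the request rate polite is to add a fixed delay between page visits. A sketch (the delay value and the per-URL scrape call are assumptions to adapt to your site):

```javascript
// Pause between requests so the target site is not hammered.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeAll(urls, delayMs = 2000) {
  const results = [];
  for (const url of urls) {
    // Replace with your real per-page scrape, e.g.:
    // results.push(await scrapeProduct(url));
    results.push(url);
    await sleep(delayMs); // wait before the next request
  }
  return results;
}
```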

And finally, when you're scraping a site, keep in mind that the site's structure might change over time, so you need to update your selectors.
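A small defensive helper makes missing elements fail soft instead of throwing, which is useful when a selector stops matching after a site redesign. A sketch (the helper name and selector are illustrative):

```javascript
// Return the trimmed text of the first match, or null if the selector
// no longer matches anything on the page.
async function textOrNull(page, selector) {
  const el = await page.$(selector);
  if (!el) return null;
  return el.evaluate(node => node.textContent.trim());
}

// Usage: const title = await textOrNull(page, '#productTitle');
//        if (title === null) { /* selector is stale; log and move on */ }
```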
