Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.
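As a minimal sketch of the headless setting described above: the mode is chosen through the headless field of the options object passed to puppeteer.launch(). The buildLaunchOptions helper below is a hypothetical name used only for illustration and is not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): builds the options object
// that would be passed to puppeteer.launch().
function buildLaunchOptions(showBrowser) {
  // headless: false opens a visible browser window;
  // headless: true keeps the browser in the background.
  return { headless: !showBrowser };
}

// Usage, assuming Puppeteer is already installed:
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(buildLaunchOptions(true));
```

Running with a visible window is mostly useful while debugging a scraper, since you can watch each navigation as it happens.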
Setting up Puppeteer
Before starting, install Puppeteer in your project by running the following command:
npm i puppeteer
Using Puppeteer for Web Scraping
Here's a basic example of how Puppeteer can be used for web scraping:
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // page.$x() has been removed from recent Puppeteer releases;
    // XPath expressions now go through the 'xpath/' selector prefix.
    const [el] = await page.$$('xpath///*[@id="imgBlkFront"]');
    const src = await el.getProperty('src');
    const imageUrl = await src.jsonValue();

    const [el2] = await page.$$('xpath///*[@id="productTitle"]');
    const txt = await el2.getProperty('textContent');
    const title = await txt.jsonValue();

    const [el3] = await page.$$('xpath///*[@id="priceblock_ourprice"]');
    const txt2 = await el3.getProperty('textContent');
    const price = await txt2.jsonValue();

    console.log({imageUrl, title, price});
  } finally {
    // Close the browser even if navigation or a lookup throws.
    await browser.close();
  }
}

scrapeProduct('https://www.amazon.com/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X')
  .catch(console.error);
This script tells Puppeteer to launch a new browser and go to a specific URL. It then finds three elements on the page using XPath and retrieves the src property of the image and the textContent property of the title and price elements.
In the example above, it retrieves the image link, title, and price of a product on Amazon, and logs them to the console.
Notes
While Puppeteer is a powerful tool for web scraping, it is also resource-intensive because it launches a full Chrome browser in the background. If performance is a concern, consider alternatives such as Cheerio or jsdom, which work on the HTML directly and do not launch a full browser.
Also, remember to respect the website's robots.txt file and do not overload the website with too many requests in a short period of time.
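One simple way to avoid bursts of requests is to visit pages one at a time with a pause in between. The sketch below assumes a scraping function that takes a URL and returns its result; sleep, scrapeSequentially, and the 2000 ms default are illustrative choices, not part of Puppeteer.

```javascript
// Pause for the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Visit URLs one at a time, waiting delayMs between consecutive requests.
// scrapeOne is any async function that takes a URL and returns a result.
async function scrapeSequentially(urls, scrapeOne, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no delay before the first request
    results.push(await scrapeOne(urls[i]));
  }
  return results;
}
```

A suitable delay depends on the site. This sketch does not read robots.txt, so checking which paths the site allows remains a separate step.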
Finally, keep in mind that a site's structure might change over time, so you may need to update your selectors when it does.
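To soften the impact of such changes, a lookup can try several selectors in order and return null instead of throwing when none of them match. This is a minimal sketch: readFirstMatch is a hypothetical helper, and it assumes a Puppeteer release that accepts XPath through the 'xpath/' selector prefix.

```javascript
// Try each XPath expression in order and return the requested property of
// the first matching element, or null when no expression matches.
// `page` is expected to be a Puppeteer Page (any object with a compatible
// $$ method works, which keeps the helper easy to test).
async function readFirstMatch(page, xpaths, property) {
  for (const xpath of xpaths) {
    const [el] = await page.$$('xpath/' + xpath);
    if (!el) continue; // this selector no longer matches; try the next one
    const handle = await el.getProperty(property);
    return handle.jsonValue();
  }
  return null;
}

// Usage inside a scraping function (the second XPath is a made-up fallback):
//   const price = await readFirstMatch(
//     page,
//     ['//*[@id="priceblock_ourprice"]', '//*[@id="price"]'],
//     'textContent'
//   );
```

A null result is a useful signal that the page layout has changed and the selector list needs attention.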