How do I extract data from a website using Puppeteer?

Scraping data from a website using Puppeteer involves several steps, including installing Puppeteer, launching a browser, opening a new page, navigating to the required website, and then extracting the needed data.

Here's a step-by-step guide on how you can do this:

Step 1: Install Puppeteer

Puppeteer is a Node.js library, so you'll need Node.js installed on your system. You can install Puppeteer using npm (the Node package manager) with the following command:

npm i puppeteer

Step 2: Launch a Browser

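Create a new JavaScript file (for example, scraper.js), and use the following code to launch a browser and open a new page. By default, puppeteer.launch() starts the browser in headless mode, so no window appears: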

const puppeteer = require('puppeteer');

async function scrape() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Rest of the code
}

scrape();

Step 3: Navigate to the Page

To navigate to a page, call the page.goto() method inside scrape(), after the page has been created. Let's say you want to scrape data from a Wikipedia page:

await page.goto('https://en.wikipedia.org/wiki/Web_scraping');
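
page.goto() also accepts an options object and resolves to the page's main response, so navigation can be made a little more defensive. The following is a minimal sketch rather than a definitive implementation: the helper name openPage and the 30-second timeout are assumptions chosen for illustration, while waitUntil and timeout are Puppeteer's own option names.

```javascript
// Hypothetical helper: navigate and fail early on a non-2xx response.
// `page` is the object returned by browser.newPage().
async function openPage(page, url) {
    const response = await page.goto(url, {
        waitUntil: 'domcontentloaded', // resolve once the DOM has been parsed
        timeout: 30000,                // milliseconds; illustrative value
    });

    // goto() can resolve to null (for example, when navigating to about:blank),
    // so only check the status when a response is present.
    if (response && !response.ok()) {
        throw new Error(`Request for ${url} failed with status ${response.status()}`);
    }
    return response;
}

// Usage inside scrape():
//     await openPage(page, 'https://en.wikipedia.org/wiki/Web_scraping');
```

Checking the status is useful because page.goto() does not reject on an HTTP error such as 404; without the check, the script would go on to scrape the error page.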

Step 4: Extract the Data

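To extract data, you'll use the page.evaluate() method. It runs a function inside the page, where the DOM is available, and sends the function's return value back to your Node.js script. Here's an example of how to extract the title and the first paragraph of the Wikipedia page: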

const result = await page.evaluate(() => {
    const title = document.querySelector('h1').innerText;
    const firstParagraph = document.querySelector('p').innerText;
    return { title, firstParagraph };
});

console.log(result);
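
document.querySelector() returns only the first match. To collect text from every matching element, Puppeteer also provides page.$$eval(), which runs document.querySelectorAll() in the page and passes the matched elements to a callback. A minimal sketch, assuming a hypothetical helper name collectText and that the elements of interest are the page's h2 headings:

```javascript
// Hypothetical helper: return the trimmed text of every element matching `selector`.
// `page` is the object returned by browser.newPage().
async function collectText(page, selector) {
    // The callback runs inside the browser page, not in Node.js.
    return page.$$eval(selector, (elements) =>
        elements.map((el) => el.textContent.trim())
    );
}

// Usage inside scrape():
//     const headings = await collectText(page, 'h2');
//     console.log(headings);
```

If nothing matches the selector, the callback receives an empty array, so the helper returns an empty array rather than throwing.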

Full Code

Here's the full code combining all the steps:

const puppeteer = require('puppeteer');

async function scrape() {
    const browser = await puppeteer.launch();

    try {
        const page = await browser.newPage();
        await page.goto('https://en.wikipedia.org/wiki/Web_scraping');

        // Runs in the page context; the returned object is sent back to Node.js
        const result = await page.evaluate(() => {
            const title = document.querySelector('h1').innerText;
            const firstParagraph = document.querySelector('p').innerText;
            return { title, firstParagraph };
        });

        console.log(result);
    } finally {
        // Close the browser even if navigation or extraction throws
        await browser.close();
    }
}

scrape();

You can run this script using the following command:

node scraper.js

The script prints an object containing the page's title and the text of its first p element.

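You will usually need to adjust the selectors ('h1' and 'p' in this example) to match the structure of the page you are scraping. You can find suitable selectors by inspecting the page with your browser's developer tools. Keep in mind that document.querySelector() returns null when nothing matches, so reading innerText from a missing element throws an error.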
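
A selector's first match can be missing, and it is not always the element you expect (the first p element on a page may be empty, for instance). One way to guard against both cases is sketched below. The helper name firstNonEmptyText is hypothetical, and the approach (scan all matches, return the first non-empty text or null) is an assumption about what is wanted, not the only reasonable choice.

```javascript
// Hypothetical helper: text of the first non-empty element matching `selector`,
// or null when nothing suitable is found.
async function firstNonEmptyText(page, selector) {
    // The callback runs in the browser, so Node.js variables are not visible
    // there; `selector` is passed explicitly as an argument to evaluate().
    return page.evaluate((sel) => {
        const texts = Array.from(document.querySelectorAll(sel))
            .map((el) => el.textContent.trim());
        const found = texts.find((text) => text.length > 0);
        return found === undefined ? null : found;
    }, selector);
}

// Usage inside scrape():
//     const firstParagraph = await firstNonEmptyText(page, 'p');
//     if (firstParagraph === null) console.warn('No paragraph text found');
```

Returning null instead of throwing lets the caller decide whether a missing value is an error or simply a page with a different layout.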
