How can I extract data from a webpage using Playwright?

Playwright is a Node.js library for automating browser tasks, including data extraction, in Chromium, Firefox and WebKit. Here's how you can use it to extract data from a webpage.

To start off, you need to install Playwright. You can do this by running the following command in your console:

npm install playwright

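This installs the Playwright library. Depending on the Playwright version, it may also download the browser binaries for Chromium, Firefox and WebKit. Recent versions no longer download them during npm install; if launching a browser fails because the binaries are missing, run npx playwright install to fetch them.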

Next, let's say we want to extract the title of a webpage. Here is how you can do it:

const playwright = require('playwright');

(async () => {
  const browser = await playwright['chromium'].launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('http://example.com');
  const title = await page.title();
  console.log(title);
  await browser.close();
})();

In the above code, we first require the playwright module. We then launch a new Chromium browser, create a new browser context, and open a new page. We navigate to 'http://example.com' using page.goto(). The page.title() function is used to get the title of the webpage. We then log the title to the console and close the browser.
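
One thing the script above does not handle is a failed navigation: if page.goto() throws, browser.close() is never reached and the browser process can be left running. The following is a minimal sketch of one way to guard against that with try/finally. The function name fetchTitle and the choice to pass the browser type in as a parameter are illustrative assumptions, not part of Playwright.

```javascript
// Sketch: fetch a page title and close the browser even if a step fails.
// `browserType` is assumed to be a Playwright browser type, for example
// require('playwright').chromium. The name fetchTitle is hypothetical.
async function fetchTitle(browserType, url) {
  const browser = await browserType.launch();
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // Runs on both success and failure, so the browser is not left open.
    await browser.close();
  }
}
```

With Playwright installed, this could be called as fetchTitle(require('playwright').chromium, 'http://example.com') and the returned promise resolves to the page title.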

If you want to extract specific elements from the webpage, you can use the page.$(selector) function to get the first element that matches the selector; it resolves to null when nothing matches. The page.$$(selector) function returns all elements that match the selector, or an empty array when there are none. For example, if you want to extract all paragraph texts from the webpage, you can do it like this:

const playwright = require('playwright');

(async () => {
  const browser = await playwright['chromium'].launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('http://example.com');
  const paragraphs = await page.$$('p');
  for (const paragraph of paragraphs) {
    const text = await paragraph.textContent();
    console.log(text);
  }
  await browser.close();
})();

In the above code, page.$$('p') returns all paragraph elements on the webpage. We then loop over these elements and get their text content using the element.textContent() function.
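
The loop above makes one call to the browser per element. As an alternative, page.$$eval(selector, callback) runs the callback inside the page against all matching elements and returns the result in a single call. The sketch below assumes `page` is a Playwright Page that has already navigated to the target URL; the helper names cleanTexts and extractTexts are hypothetical, and the whitespace cleanup is an optional extra step rather than something Playwright does.

```javascript
// Collapse runs of whitespace and drop empty entries.
// Plain JavaScript, no Playwright needed. The name is hypothetical.
function cleanTexts(rawTexts) {
  return rawTexts
    .map((text) => (text || '').replace(/\s+/g, ' ').trim())
    .filter((text) => text.length > 0);
}

// `page` is assumed to be a Playwright Page that has already loaded a URL.
async function extractTexts(page, selector) {
  // The callback runs in the browser; only the array of strings comes back.
  const raw = await page.$$eval(selector, (elements) =>
    elements.map((element) => element.textContent)
  );
  return cleanTexts(raw);
}
```

In the paragraph example, the loop could then be replaced by something like const texts = await extractTexts(page, 'p');.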

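Remember that page.$(selector) and page.$$(selector) return handle objects (ElementHandle) that reference elements in the page, not the DOM elements themselves. To work with an element, use the methods the handle provides, such as element.textContent(), element.innerText(), element.innerHTML() and element.getAttribute(name). Handles do not have an outerHTML() method; to read a DOM property such as outerHTML, run a function in the page with element.evaluate(el => el.outerHTML).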
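
To illustrate working through handles, here is a small sketch that collects the text, an attribute and the outer HTML of each matching element. It assumes `page` is a Playwright Page that has already loaded a URL; the function name describeElements and the choice of the href attribute are illustrative only. The outer HTML is read with handle.evaluate(), since it is a DOM property rather than a handle method.

```javascript
// `page` is assumed to be a Playwright Page that has already loaded a URL.
// describeElements is a hypothetical helper name.
async function describeElements(page, selector) {
  const handles = await page.$$(selector);
  const results = [];
  for (const handle of handles) {
    results.push({
      text: await handle.textContent(),
      // getAttribute resolves to null when the attribute is absent.
      href: await handle.getAttribute('href'),
      // DOM properties are read by running a function inside the page.
      outerHTML: await handle.evaluate((el) => el.outerHTML),
    });
    // Release the handle once it is no longer needed.
    await handle.dispose();
  }
  return results;
}
```

For example, describeElements(page, 'a') would return one object per link on the page.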
