How to scrape data from nested tags using XPath?

To scrape data from nested tags using XPath, you need to understand how to navigate the DOM (Document Object Model) of the webpage you're trying to scrape. XPath (XML Path Language) is a query language that allows you to navigate through elements and attributes in an XML or HTML document. Here's how you can use XPath expressions to scrape data from nested tags:

Basic XPath Syntax:

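  • // - Selects matching nodes at any depth (anywhere in the document when it starts the expression, or anywhere below the current node between steps)
  • / - Selects from the root node when it starts the expression; between steps, it selects direct children of the current node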
  • . - Selects the current node
  • .. - Selects the parent of the current node
  • @ - Selects attributes
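
As a minimal sketch of the syntax above, the following runs each operator through lxml against a small inline document. The markup and the variable names are illustrative assumptions, not taken from a real site.

```python
from lxml import html

# Illustrative markup (an assumption for this sketch, not a real page)
PAGE = """<html>
  <body>
    <div class="container">
      <ul class="items">
        <li class="item"><a href="link1.html">Item 1</a></li>
        <li class="item"><a href="link2.html">Item 2</a></li>
      </ul>
    </div>
  </body>
</html>"""

tree = html.fromstring(PAGE)

# // : match <a> elements at any depth in the document
links = tree.xpath("//a")

# /  : walk down from the root one child step at a time
container = tree.xpath("/html/body/div")[0]

# .  : the current node (here, the first <a> itself)
first_link = links[0]
same_node = first_link.xpath(".")[0]

# .. : the parent of the current node (the enclosing <li>)
parent = first_link.xpath("..")[0]

# @  : select attribute values instead of elements
hrefs = tree.xpath("//a/@href")

print(len(links), container.get("class"), same_node.tag, parent.tag, hrefs)
```

Calling xpath() on an element rather than on the tree is what makes "." and ".." meaningful: the element becomes the current node for that query.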

Examples:

Let's say you have the following HTML structure:

<html>
  <body>
    <div class="container">
      <ul class="items">
        <li class="item">
          <a href="link1.html">Item 1</a>
          <span class="price">$10</span>
        </li>
        <li class="item">
          <a href="link2.html">Item 2</a>
          <span class="price">$20</span>
        </li>
        <!-- More items... -->
      </ul>
    </div>
  </body>
</html>

To scrape the text of each item, you can use the following XPath expression:

//ul[@class='items']/li/a/text()

To scrape the prices, you would use:

//ul[@class='items']/li/span[@class='price']/text()
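
One caveat with text(): it returns only the text nodes that are direct children of the selected element. When the text is split across further nested tags, a descendant query or string(.) may be a better fit. The sketch below assumes a hypothetical variant of the markup in which the link text is partly wrapped in a b tag.

```python
from lxml import html

# Hypothetical markup: the link text is split across a nested <b> tag
doc = html.fromstring(
    "<html><body><ul class='items'>"
    "<li><a href='link1.html'><b>Item</b> 1</a></li>"
    "</ul></body></html>"
)

link = doc.xpath("//ul[@class='items']/li/a")[0]

# Only text nodes that are direct children of <a>
direct = link.xpath("text()")

# Text nodes at any depth below <a>
nested = link.xpath(".//text()")

# All descendant text concatenated into a single string
joined = link.xpath("string(.)")

print(direct)
print(nested)
print(joined)
```

Here direct misses the word inside the b tag, nested returns the pieces separately, and joined gives the full "Item 1" string.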

If you want to get both the item text and price together, you might use:

//ul[@class='items']/li

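Then, for each li element, run the relative queries a/text() and span[@class='price']/text(). Because each query is scoped to a single li, every name stays paired with its own price, even when an item is missing one of the two.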
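
The per-item approach can be sketched in Python with lxml as follows. The sample markup includes a third item without a price, which is an assumption added here to show how a missing field is handled; the rows list and its keys are likewise illustrative.

```python
from lxml import html

# Sample markup; the third item (no price) is an assumption for this sketch
SAMPLE_HTML = """<html>
  <body>
    <ul class="items">
      <li class="item"><a href="link1.html">Item 1</a><span class="price">$10</span></li>
      <li class="item"><a href="link2.html">Item 2</a><span class="price">$20</span></li>
      <li class="item"><a href="link3.html">Item 3</a></li>
    </ul>
  </body>
</html>"""

tree = html.fromstring(SAMPLE_HTML)

rows = []
for li in tree.xpath("//ul[@class='items']/li"):
    # Relative paths (no leading slash) are evaluated against this <li> only
    names = li.xpath("a/text()")
    prices = li.xpath("span[@class='price']/text()")
    rows.append({
        "name": names[0].strip() if names else None,
        "price": prices[0].strip() if prices else None,
    })

for row in rows:
    print(row)
```

Querying two separate flat lists and zipping them would pair by position instead, so a single missing price would shift every later price onto the wrong item.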

Python Example with lxml:

from lxml import html
import requests

# Fetch the webpage; raise on HTTP errors instead of parsing an error page
url = 'http://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the content
tree = html.fromstring(response.content)

# Scrape item names
items = tree.xpath('//ul[@class="items"]/li/a/text()')

# Scrape prices
prices = tree.xpath('//ul[@class="items"]/li/span[@class="price"]/text()')

# Print results (zip pairs by position, so this assumes every item has exactly one price)
for item, price in zip(items, prices):
    print(f'{item.strip()}: {price.strip()}')

JavaScript Example with Puppeteer and XPath:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://example.com');

  // Evaluate an XPath expression inside the page and return the matched text values
  const xpathTexts = (expression) =>
    page.evaluate((expr) => {
      const result = document.evaluate(
        expr, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
      );
      const values = [];
      for (let i = 0; i < result.snapshotLength; i++) {
        values.push(result.snapshotItem(i).textContent.trim());
      }
      return values;
    }, expression);

  // Scrape item names
  const itemNames = await xpathTexts("//ul[@class='items']/li/a/text()");

  // Scrape prices
  const itemPrices = await xpathTexts("//ul[@class='items']/li/span[@class='price']/text()");

  // Output results
  for (let i = 0; i < itemNames.length; i++) {
    console.log(`${itemNames[i]}: ${itemPrices[i]}`);
  }

  await browser.close();
})();

Note:

  • When scraping websites, always ensure you have permission to scrape and that you are in compliance with the website's robots.txt file and terms of service.
  • Websites may change their structure, so XPath expressions may need to be updated accordingly.
  • Some websites load content dynamically with JavaScript. In that case, a headless browser such as Puppeteer has to render the page before the XPath query runs.
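
As a rough sketch of the robots.txt check mentioned in the first note, Python's standard library includes urllib.robotparser. The rules, the "my-scraper" user agent, and the URLs below are hypothetical; a real check would point the parser at the site's own robots.txt with set_url() and read() rather than parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-scraper" is a made-up user agent name for illustration
allowed = parser.can_fetch("my-scraper", "http://example.com/items")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data")

print(allowed, blocked)
```

This only covers robots.txt rules; the site's terms of service still need to be read separately.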

The provided examples show how to extract text from nested tags using XPath in both Python and JavaScript. Remember that web scraping can be a complex task, and you may need to adjust the XPath expressions to target the specific content you need from a webpage.
