XPath (XML Path Language) is a powerful query language for navigating and selecting nodes in XML/HTML documents. In web scraping, XPath excels at traversing the DOM tree to find parent, child, and sibling elements relative to a known node. This guide covers the essential techniques for navigating these relationships with practical examples.
XPath Axes Overview
XPath provides several axes for navigating the DOM tree structure. The most commonly used axes for parent and sibling navigation are:
- parent:: - Selects the parent node
- .. - Shorthand for the parent axis
- preceding-sibling:: - Selects all preceding siblings
- following-sibling:: - Selects all following siblings
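As a quick orientation, the sketch below evaluates each of these four axes from the same context node. It assumes the lxml library is installed; the small <root> document and its element names are made up purely for illustration.

```python
from lxml import etree

# Hypothetical five-element document; <c> is the context node.
doc = etree.fromstring("<root><a/><b/><c/><d/><e/></root>")
context = doc.xpath("//c")[0]

# Evaluate each axis relative to <c> and record the tag names it selects.
axes = ["parent::*", "..", "preceding-sibling::*", "following-sibling::*"]
selected = {axis: [node.tag for node in context.xpath(axis)] for axis in axes}

for axis, tags in selected.items():
    print(f"{axis:<22} -> {tags}")
```

Both parent forms return the same single node, while the two sibling axes split the remaining children around the context node.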
Navigate to Parent Node
To select the parent of the current node, use the .. shorthand or the explicit parent:: axis.
Syntax Options:
//element/.. # Parent using shorthand
//element/parent::* # Parent using explicit axis
//element/parent::tagname # Specific parent element type
Example: Find the parent of a div with class "my-class":
//div[@class='my-class']/..
//div[@class='my-class']/parent::*
//div[@class='my-class']/parent::section # Only if parent is a section
Real-world example: Get the table row containing a specific cell:
//td[text()='Total:']/parent::tr
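The table-row lookup can be tried end to end with a short lxml sketch. The invoice markup, the SAMPLE constant, and the row_cells_for_label helper are hypothetical names introduced here; the label is passed as an XPath variable so it does not need manual quoting.

```python
from lxml import html

# Hypothetical invoice table, used only for illustration.
SAMPLE = """
<table>
  <tr><td>Widget</td><td>$10</td></tr>
  <tr><td>Total:</td><td>$10</td></tr>
</table>
"""

def row_cells_for_label(markup, label):
    """Return the text of every cell in the row that holds a cell equal to `label`."""
    tree = html.fromstring(markup)
    # Step from the matching cell up to its parent row.
    rows = tree.xpath("//td[text()=$label]/parent::tr", label=label)
    if not rows:
        return []
    return [cell.text for cell in rows[0].xpath("./td")]

print(row_cells_for_label(SAMPLE, "Total:"))
```

Returning an empty list when no row matches keeps the caller from indexing into an empty result.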
Navigate to Sibling Nodes
Sibling navigation allows you to select elements at the same level in the DOM tree. XPath provides two main axes for sibling selection.
Preceding Siblings
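The preceding-sibling:: axis selects all siblings that appear before the current node in document order. It is a reverse axis, so positional predicates count backward from the current node: [1] is the nearest preceding sibling and [last()] is the farthest.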
Syntax:
//element/preceding-sibling::* # All preceding siblings
//element/preceding-sibling::tagname # Specific tag type only
//element/preceding-sibling::*[1] # Nearest preceding sibling (immediate previous)
//element/preceding-sibling::*[last()] # Farthest preceding sibling (first in document order)
Examples:
# Get all preceding div siblings
//div[@class='target']/preceding-sibling::div
# Get the immediately preceding sibling
//div[@class='target']/preceding-sibling::*[1]
# Get all preceding siblings with specific class
//div[@class='target']/preceding-sibling::*[@class='item']
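Positional predicates on preceding-sibling:: are easy to get backward, because positions on a reverse axis are counted from the context node outward. A minimal check, assuming lxml and a made-up four-item list:

```python
from lxml import etree

# Hypothetical list; the item with id="t" is the context node.
doc = etree.fromstring(
    "<ul><li>a</li><li>b</li><li>c</li><li id='t'>d</li></ul>"
)

# [1] on a reverse axis is the sibling closest to the context node.
nearest = doc.xpath("//li[@id='t']/preceding-sibling::li[1]/text()")
# [last()] is the sibling farthest away, i.e. the first one in the document.
farthest = doc.xpath("//li[@id='t']/preceding-sibling::li[last()]/text()")
# Without a positional predicate, lxml returns the matches in document order.
all_previous = doc.xpath("//li[@id='t']/preceding-sibling::li/text()")

print(nearest, farthest, all_previous)
```

Note that the returned list is in document order even though the positions inside the predicate run in reverse.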
Following Siblings
The following-sibling:: axis selects all siblings that appear after the current node in document order.
Syntax:
//element/following-sibling::* # All following siblings
//element/following-sibling::tagname # Specific tag type only
//element/following-sibling::*[1] # First following sibling (immediate next)
//element/following-sibling::*[2] # Second following sibling
Examples:
# Get all following div siblings
//div[@class='target']/following-sibling::div
# Get the immediately following sibling
//div[@class='target']/following-sibling::*[1]
# Get next 3 siblings
//div[@class='target']/following-sibling::*[position() <= 3]
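The position() range can be parameterised rather than hard-coded. The sketch below assumes lxml; the SAMPLE markup and the following_texts helper are hypothetical, and both the heading id and the limit are passed in as XPath variables.

```python
from lxml import html

# Hypothetical section: one heading followed by four paragraphs.
SAMPLE = """
<div>
  <h2 id="intro">Intro</h2>
  <p>One</p>
  <p>Two</p>
  <p>Three</p>
  <p>Four</p>
</div>
"""

def following_texts(tree, heading_id, limit):
    """Return the text of up to `limit` siblings that follow the given heading."""
    return tree.xpath(
        "//h2[@id=$hid]/following-sibling::*[position() <= $limit]/text()",
        hid=heading_id,
        limit=limit,
    )

tree = html.fromstring(SAMPLE)
print(following_texts(tree, "intro", 1))
print(following_texts(tree, "intro", 3))
```

Asking for more siblings than exist simply returns all of them, so no bounds check is needed.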
Advanced Sibling Selection
Select siblings with conditions:
# Following siblings with specific attributes
//h2[text()='Section 1']/following-sibling::p[@class='content']
# Div siblings that precede the footer (the inner predicate only restates that the footer div follows them)
//div[@class='footer']/preceding-sibling::div[following-sibling::div[@class='footer']]
# Siblings between two elements
//h2[@id='start']/following-sibling::*[preceding-sibling::h2[@id='start'] and following-sibling::h2[@id='end']]
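The "between two elements" pattern is the hardest of these to read, so here is a small runnable sketch of it. It assumes lxml; the markup, the BETWEEN constant, and the texts_between helper are illustrative names that do not come from any library.

```python
from lxml import html

# Hypothetical page fragment with two marker headings.
SAMPLE = """
<div>
  <h2 id="start">Start</h2>
  <p>First</p>
  <p>Second</p>
  <h2 id="end">End</h2>
  <p>After</p>
</div>
"""

# Keep a following sibling only if the start heading is before it
# and the end heading is still ahead of it.
BETWEEN = (
    "//h2[@id=$start]/following-sibling::*"
    "[preceding-sibling::h2[@id=$start] and following-sibling::h2[@id=$end]]"
)

def texts_between(tree, start_id, end_id):
    """Return the text of the siblings located between the two headings."""
    nodes = tree.xpath(BETWEEN, start=start_id, end=end_id)
    return [node.text_content().strip() for node in nodes]

tree = html.fromstring(SAMPLE)
print(texts_between(tree, "start", "end"))
```

The end heading itself is excluded because it has no following sibling with that id, and everything after it fails the same test.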
Practical Examples
Here are complete working examples demonstrating parent and sibling navigation in popular web scraping libraries.
Python with lxml
from lxml import html
# Sample HTML structure
html_content = """
<div class="container">
<h1>Title</h1>
<div class="item">Item 1</div>
<div class="item target">Item 2 (Target)</div>
<div class="item">Item 3</div>
<p class="description">Description</p>
</div>
"""
tree = html.fromstring(html_content)
# Navigate to parent
parent = tree.xpath('//div[@class="item target"]/parent::*')[0]
print(f"Parent tag: {parent.tag}, class: {parent.get('class')}")
# Get preceding siblings
preceding = tree.xpath('//div[@class="item target"]/preceding-sibling::*')
print(f"Preceding siblings: {[elem.tag for elem in preceding]}")
# Get following siblings
following = tree.xpath('//div[@class="item target"]/following-sibling::*')
print(f"Following siblings: {[elem.tag for elem in following]}")
# Get immediate next sibling
next_sibling = tree.xpath('//div[@class="item target"]/following-sibling::*[1]')[0]
print(f"Next sibling: {next_sibling.tag}, text: {next_sibling.text}")
# Get all sibling divs with class 'item'
sibling_items = tree.xpath('//div[@class="item target"]/preceding-sibling::div[@class="item"] | //div[@class="item target"]/following-sibling::div[@class="item"]')
print(f"Sibling items: {[elem.text for elem in sibling_items]}")
Python with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://example.com')
# Navigate to parent
parent = driver.find_element(By.XPATH, '//div[@class="target"]/parent::*')
# Get preceding siblings
preceding_siblings = driver.find_elements(By.XPATH, '//div[@class="target"]/preceding-sibling::*')
# Get following siblings
following_siblings = driver.find_elements(By.XPATH, '//div[@class="target"]/following-sibling::*')
# Get immediate previous sibling
prev_sibling = driver.find_element(By.XPATH, '//div[@class="target"]/preceding-sibling::*[1]')
# Extract text from elements
print(f"Parent text: {parent.text}")
print(f"Previous sibling text: {prev_sibling.text}")
driver.quit()
JavaScript with Puppeteer
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Set HTML content for demonstration
await page.setContent(`
<div class="container">
<h1>Title</h1>
<div class="item">Item 1</div>
<div class="item target">Item 2 (Target)</div>
<div class="item">Item 3</div>
<p class="description">Description</p>
</div>
`);
// Helper function to evaluate XPath and get element details
const getElementInfo = async (xpath) => {
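// Note: page.$x() is not available in newer Puppeteer releases (it was removed in v22);
// there, XPath queries go through page.$$() with an XPath selector prefix instead.
const elements = await page.$x(xpath);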
const info = [];
for (const element of elements) {
const tagName = await element.evaluate(el => el.tagName.toLowerCase());
const text = await element.evaluate(el => el.textContent.trim());
const className = await element.evaluate(el => el.className);
info.push({ tagName, text, className });
}
return info;
};
// Navigate to parent
const parent = await getElementInfo('//div[contains(@class,"target")]/parent::*');
console.log('Parent:', parent);
// Get preceding siblings
const preceding = await getElementInfo('//div[contains(@class,"target")]/preceding-sibling::*');
console.log('Preceding siblings:', preceding);
// Get following siblings
const following = await getElementInfo('//div[contains(@class,"target")]/following-sibling::*');
console.log('Following siblings:', following);
// Get immediate next sibling
const nextSibling = await getElementInfo('//div[contains(@class,"target")]/following-sibling::*[1]');
console.log('Next sibling:', nextSibling);
await browser.close();
})();
JavaScript in Browser Console
// For testing XPath in browser developer tools
function testXPath(xpath) {
const result = document.evaluate(
xpath,
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null
);
const elements = [];
for (let i = 0; i < result.snapshotLength; i++) {
elements.push(result.snapshotItem(i));
}
return elements;
}
// Usage examples:
const parents = testXPath('//div[@class="target"]/parent::*');
const siblings = testXPath('//div[@class="target"]/following-sibling::*');
console.log('Found elements:', parents, siblings);
Common Use Cases
Table Navigation
# Get the header row for a data cell (assumes the header is the first tr among its siblings)
//td[text()='$1,234']/parent::tr/preceding-sibling::tr[last()]
# Get all cells in the same column
//td[text()='Price']/parent::tr/following-sibling::tr/td[position()=count(//td[text()='Price']/preceding-sibling::td)+1]
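The single-expression column lookup is compact but hard to debug. One alternative, sketched here under the assumption that lxml is available and that the header row uses td cells as in the expression above, is to compute the column index in Python first. The SAMPLE table and the column_values helper are hypothetical.

```python
from lxml import html

# Hypothetical price table whose first row acts as the header.
SAMPLE = """
<table>
  <tr><td>Name</td><td>Price</td></tr>
  <tr><td>Widget</td><td>$10</td></tr>
  <tr><td>Gadget</td><td>$20</td></tr>
</table>
"""

def column_values(tree, header):
    """Return the cell texts found below the header cell named `header`."""
    header_cells = tree.xpath("//td[text()=$header]", header=header)
    if not header_cells:
        return []
    cell = header_cells[0]
    # The 1-based column index is the number of cells to the left, plus one.
    index = len(cell.xpath("preceding-sibling::td")) + 1
    return cell.xpath(
        "parent::tr/following-sibling::tr/td[position()=$i]/text()", i=index
    )

tree = html.fromstring(SAMPLE)
print(column_values(tree, "Price"))
```

Splitting the work into two queries also gives a natural place to handle a missing header.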
List Navigation
# Get the next list item
//li[contains(text(),'Current Item')]/following-sibling::li[1]
# Get all items in the first list that follows a section heading
//h2[text()='Section A']/following-sibling::ul[1]/li
Form Element Navigation
# Get the label for an input field
//input[@name='email']/preceding-sibling::label[1]
# Get error message following an input
//input[@name='password']/following-sibling::div[@class='error'][1]
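Both form lookups can be combined into one helper. This is a minimal sketch assuming lxml and a layout where each input shares a wrapper with its label and optional error div; the FORM markup and the field_context name are invented for the example.

```python
from lxml import html

# Hypothetical form: each field sits in a wrapper next to its label.
FORM = """
<form>
  <div>
    <label>Email</label>
    <input name="email">
  </div>
  <div>
    <label>Password</label>
    <input name="password">
    <div class="error">Too short</div>
  </div>
</form>
"""

def field_context(tree, name):
    """Return the label text and error text (or None) for the named input."""
    # string() yields an empty string when the node-set is empty.
    label = tree.xpath(
        "string(//input[@name=$n]/preceding-sibling::label[1])", n=name
    )
    error = tree.xpath(
        "string(//input[@name=$n]/following-sibling::div[@class='error'][1])",
        n=name,
    )
    return {"label": label.strip(), "error": error.strip() or None}

tree = html.fromstring(FORM)
print(field_context(tree, "email"))
print(field_context(tree, "password"))
```

Wrapping the path in string() avoids indexing into an empty list when a field has no error message.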
Best Practices
- Use specific selectors: Combine axes with predicates for precise targeting
- Handle missing elements: Always check if elements exist before accessing properties
- Consider performance: on large documents, an expression that starts with // scans the whole tree, so narrow the starting node (for example by id) before stepping along sibling axes
- Test thoroughly: XPath behavior can vary between parsers and browsers
- Use position predicates wisely: [1] for the nearest node on the axis, [last()] for the farthest, [position() <= n] for ranges
Error Handling
# Python example with error handling
from lxml import etree

def safe_xpath(tree, xpath, default=None):
    """Return the first match, or default if nothing matches or the expression is invalid."""
    try:
        result = tree.xpath(xpath)
        return result[0] if result else default
    except etree.XPathError as e:
        print(f"XPath error: {e}")
        return default

# Usage (tree is a parsed document, e.g. from lxml.html.fromstring)
parent = safe_xpath(tree, '//div[@class="target"]/parent::*')
if parent is not None:
    print(f"Parent found: {parent.tag}")
else:
    print("No parent found or XPath error")