How to Use XPath Axes Like Following-Sibling and Preceding-Sibling
XPath axes are powerful navigation tools that allow you to traverse HTML documents in different directions from a context node. The following-sibling and preceding-sibling axes are particularly useful for web scraping scenarios where you need to navigate horizontally between elements at the same hierarchical level.
Understanding XPath Sibling Axes
XPath sibling axes operate on elements that share the same parent node. These axes are essential when you need to:
- Extract data from table rows or columns
- Navigate between form fields
- Process lists or menu items
- Handle dynamic content where element relationships matter more than absolute positions
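Before diving into each axis, here is a minimal sketch of the same-parent rule, using lxml and a made-up fragment (the markup and ids are hypothetical): elements only count as siblings when they share a parent.
from lxml import html

doc = html.fromstring("""
<div>
  <h2>Menu</h2>
  <ul>
    <li id="first">Home</li>
    <li>About</li>
    <li>Contact</li>
  </ul>
</div>
""")

# Siblings share a parent: both remaining <li> items are returned
print(doc.xpath("//li[@id='first']/following-sibling::li/text()"))
# ['About', 'Contact']

# The <h2> has a different parent, so it is NOT a sibling of the <li> items
print(doc.xpath("//li[@id='first']/preceding-sibling::h2"))
# []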
Following-Sibling Axis
The following-sibling axis selects all siblings that appear after the current node in document order.
Syntax: following-sibling::node-test[predicate]
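Reading left to right, the parts of that syntax combine like this (illustrative expressions, evaluated from an arbitrary context node):
following-sibling::*       # every element sibling after the context node
following-sibling::p       # node test narrowed to <p> siblings only
following-sibling::p[1]    # predicate added: just the first following <p>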
Preceding-Sibling Axis
The preceding-sibling axis selects all siblings that appear before the current node in document order.
Syntax: preceding-sibling::node-test[predicate]
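One subtlety worth flagging early: preceding-sibling is a reverse axis, so positional predicates count backwards from the context node. A minimal sketch, assuming the made-up markup shown in the string:
from lxml import html

doc = html.fromstring("""
<div>
  <p>first</p>
  <p>second</p>
  <span id="here">context</span>
</div>
""")

# [1] on a reverse axis means "nearest preceding", not "first in the document"
print(doc.xpath("//span[@id='here']/preceding-sibling::p[1]/text()"))      # ['second']
# last() walks all the way back to the earliest preceding sibling
print(doc.xpath("//span[@id='here']/preceding-sibling::p[last()]/text()")) # ['first']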
Practical Examples with Code
HTML Structure for Examples
Let's work with this sample HTML structure, in which every element inside the product-info container is a sibling of the others:
<div class="product-info">
  <h2>Product Title</h2>
  <p class="price">$29.99</p>
  <p class="description">Product description here</p>
  <div class="rating">4.5 stars</div>
  <button class="add-to-cart">Add to Cart</button>
  <span class="availability">In Stock</span>
</div>
Python Examples with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Setup Chrome driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

try:
    driver.get("https://example.com")

    # Find all elements following the price element
    following_elements = driver.find_elements(
        By.XPATH,
        "//p[@class='price']/following-sibling::*"
    )
    print("Elements following the price:")
    for element in following_elements:
        print(f"Tag: {element.tag_name}, Text: {element.text}")

    # Find the first paragraph following the title
    next_paragraph = driver.find_element(
        By.XPATH,
        "//h2[text()='Product Title']/following-sibling::p[1]"
    )
    print(f"First paragraph after title: {next_paragraph.text}")

    # Find all elements preceding the rating
    preceding_elements = driver.find_elements(
        By.XPATH,
        "//div[@class='rating']/preceding-sibling::*"
    )
    print("Elements preceding the rating:")
    for element in preceding_elements:
        print(f"Tag: {element.tag_name}, Text: {element.text}")

    # Find the last element before the button
    last_before_button = driver.find_element(
        By.XPATH,
        "//button[@class='add-to-cart']/preceding-sibling::*[1]"
    )
    print(f"Element just before button: {last_before_button.text}")
finally:
    driver.quit()
Python with lxml
from lxml import html
import requests

# Fetch and parse HTML
response = requests.get("https://example.com")
tree = html.fromstring(response.content)

# Find following siblings of price element
following_siblings = tree.xpath("//p[@class='price']/following-sibling::*")
print("Following siblings of price:")
for sibling in following_siblings:
    print(f"Tag: {sibling.tag}, Text: {sibling.text_content().strip()}")

# Find preceding siblings of rating
preceding_siblings = tree.xpath("//div[@class='rating']/preceding-sibling::*")
print("Preceding siblings of rating:")
for sibling in preceding_siblings:
    print(f"Tag: {sibling.tag}, Text: {sibling.text_content().strip()}")
# More specific queries
next_two_siblings = tree.xpath("//p[@class='price']/following-sibling::*[position() <= 2]")
# preceding-sibling is a reverse axis, so [1] is the nearest preceding <p>
previous_paragraph = tree.xpath("//div[@class='rating']/preceding-sibling::p[1]")
JavaScript Examples
// Using XPath in browser console or with libraries like Puppeteer
// Function to evaluate XPath
function getElementByXPath(xpath, contextNode = document) {
  return document.evaluate(
    xpath,
    contextNode,
    null,
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null
  ).singleNodeValue;
}

function getElementsByXPath(xpath, contextNode = document) {
  const result = [];
  const query = document.evaluate(
    xpath,
    contextNode,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  for (let i = 0; i < query.snapshotLength; i++) {
    result.push(query.snapshotItem(i));
  }
  return result;
}
// Find following siblings
const followingSiblings = getElementsByXPath("//p[@class='price']/following-sibling::*");
console.log("Following siblings:", followingSiblings);
// Find preceding siblings
const precedingSiblings = getElementsByXPath("//div[@class='rating']/preceding-sibling::*");
console.log("Preceding siblings:", precedingSiblings);
// Find specific sibling by position
const secondFollowing = getElementByXPath("//p[@class='price']/following-sibling::*[2]");
console.log("Second following sibling:", secondFollowing);
Advanced Usage Patterns
Working with Tables
# Extract data from table rows using sibling axes
table_xpath_queries = [
    # Get all cells in the same row after finding a specific cell
    "//td[text()='Product A']/following-sibling::td",
    # Get the previous row's data
    "//tr[td[text()='Current Row']]/preceding-sibling::tr[1]/td",
    # Get specific column data from following rows
    "//tr[td[text()='Header']]/following-sibling::tr/td[2]"
]

for xpath in table_xpath_queries:
    elements = driver.find_elements(By.XPATH, xpath)
    print(f"XPath: {xpath}")
    for elem in elements:
        print(f"  Text: {elem.text}")
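The loop above assumes a live Selenium driver session. For a self-contained illustration of the first query, here is a hypothetical table processed with lxml alone:
from lxml import html

table = html.fromstring("""
<table>
  <tr><td>Product A</td><td>$10</td><td>In Stock</td></tr>
  <tr><td>Product B</td><td>$15</td><td>Sold Out</td></tr>
</table>
""")

# All cells in the same row after the "Product A" cell
cells = table.xpath("//td[text()='Product A']/following-sibling::td/text()")
print(cells)  # ['$10', 'In Stock']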
Form Field Navigation
# Navigate between form fields
form_navigation_examples = [
    # Find the label text for an input field
    "//input[@name='email']/preceding-sibling::label",
    # Find error message after an input
    "//input[@name='password']/following-sibling::span[@class='error']",
    # Get all form fields after a specific field
    "//input[@name='firstname']/following-sibling::input"
]
Advanced Techniques and Best Practices
Combining Axes with Predicates
# Complex XPath expressions combining axes and predicates
advanced_examples = [
    # Find the second paragraph following an h2 with specific text
    "//h2[contains(text(), 'Features')]/following-sibling::p[2]",
    # Find the earliest preceding form-group div
    # ([last()] on a reverse axis reaches furthest back in the document)
    "//button[@class='submit']/preceding-sibling::div[@class='form-group'][last()]",
    # Find following sibling that contains specific text
    "//span[@class='label']/following-sibling::*[contains(text(), 'Available')]",
    # Get all following siblings that are not h3 headings
    # (note: this does NOT stop at the next h3; see the sketch below)
    "//h3[@class='section-title']/following-sibling::*[not(self::h3)]"
]
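As noted in the last comment, [not(self::h3)] filters out h3 siblings but keeps collecting past the next heading. For a genuine "collect until the next h3" grouping, a plain Python loop over lxml's itersiblings() is easier to read than the equivalent XPath 1.0 expression; a sketch with made-up section markup:
from lxml import html

page = html.fromstring("""
<div>
  <h3 class="section-title">Specs</h3>
  <p>Weight: 1kg</p>
  <p>Color: red</p>
  <h3>Reviews</h3>
  <p>Great product!</p>
</div>
""")

heading = page.xpath("//h3[@class='section-title']")[0]
section = []
for sibling in heading.itersiblings():
    if sibling.tag == "h3":  # stop at the next section heading
        break
    section.append(sibling.text_content())
print(section)  # ['Weight: 1kg', 'Color: red']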
Performance Optimization
When using sibling axes, consider these performance tips:
- Be specific with node tests: Use specific element names instead of * when possible
- Limit scope with predicates: Use position predicates to limit results
- Cache context nodes: Store frequently used context nodes in variables
# Optimized approach
price_element = driver.find_element(By.XPATH, "//p[@class='price']")

# Reuse the context element for multiple queries by searching from it
# (the leading "./" anchors the XPath at the context element)
description = price_element.find_element(By.XPATH, "./following-sibling::p[@class='description']")
rating = price_element.find_element(By.XPATH, "./following-sibling::div[@class='rating']")
Common Use Cases in Web Scraping
E-commerce Product Pages
def scrape_product_details(driver):
    """Extract product information using sibling navigation"""
    # Find product title and get related information
    title_element = driver.find_element(By.XPATH, "//h1[@class='product-title']")

    # Get price (usually follows title)
    price = driver.find_element(
        By.XPATH,
        "//h1[@class='product-title']/following-sibling::*//span[@class='price']"
    ).text

    # Get description (often in next sibling paragraph)
    description = driver.find_element(
        By.XPATH,
        "//h1[@class='product-title']/following-sibling::p[1]"
    ).text

    # Get availability status
    availability = driver.find_element(
        By.XPATH,
        "//span[@class='price']/following-sibling::span[@class='stock-status']"
    ).text

    return {
        'title': title_element.text,
        'price': price,
        'description': description,
        'availability': availability
    }
News Article Processing
When working with dynamic content loading, you might need to handle AJAX requests using Puppeteer or wait for elements to load with functions like Puppeteer's waitForSelector before navigating between siblings.
def extract_article_metadata(driver):
    """Extract article metadata using sibling relationships"""
    # Find author and publication date that typically follow the title
    author = driver.find_element(
        By.XPATH,
        "//h1[@class='article-title']/following-sibling::div[@class='byline']//span[@class='author']"
    ).text

    date = driver.find_element(
        By.XPATH,
        "//span[@class='author']/following-sibling::time"
    ).get_attribute('datetime')

    # Get article tags that usually precede or follow content
    tags = [tag.text for tag in driver.find_elements(
        By.XPATH,
        "//div[@class='article-content']/following-sibling::div[@class='tags']//a"
    )]

    return {
        'author': author,
        'date': date,
        'tags': tags
    }
Troubleshooting Common Issues
Element Not Found Errors
def safe_sibling_extraction(driver, xpath):
    """Safely extract sibling elements with error handling"""
    try:
        elements = driver.find_elements(By.XPATH, xpath)
        if elements:
            return [elem.text for elem in elements]
        else:
            print(f"No elements found for XPath: {xpath}")
            return []
    except Exception as e:
        print(f"Error extracting elements: {e}")
        return []

# Usage
following_data = safe_sibling_extraction(
    driver,
    "//p[@class='price']/following-sibling::*"
)
Dynamic Content Handling
For pages with dynamic content, consider waiting for elements to load:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for sibling elements to be present
wait = WebDriverWait(driver, 10)
sibling_elements = wait.until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='loaded-content']/following-sibling::div")
    )
)
Integration with Web Scraping APIs
When building scalable scraping solutions, you might need to combine XPath sibling navigation with robust scraping infrastructure. The WebScraping.AI API provides powerful XPath support for complex element selection and data extraction workflows.
import requests

# Example using WebScraping.AI API with XPath
api_url = "https://api.webscraping.ai/html"
params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://example.com',
    'selector': '//p[@class="price"]/following-sibling::*',
    'selector_type': 'xpath'
}

response = requests.get(api_url, params=params)
selected_elements = response.json()
Conclusion
XPath sibling axes are powerful tools for navigating HTML documents horizontally, enabling precise element selection based on structural relationships. The following-sibling and preceding-sibling axes are particularly valuable in web scraping scenarios where you need to extract related data points or navigate between form elements.
Key takeaways:
- Use sibling axes when element position relationships are more reliable than absolute paths
- Combine axes with predicates for precise element selection
- Consider performance implications and optimize XPath expressions
- Implement proper error handling for robust scraping applications
- Practice with different HTML structures to master these navigation techniques
By mastering XPath sibling navigation, you'll be able to create more flexible and maintainable web scraping solutions that can adapt to various HTML structures and layout changes.