Combining multiple XPath expressions in web scraping allows you to efficiently select elements that match different criteria in a single query. This technique is essential for complex data extraction scenarios where you need to target multiple types of elements or apply various filtering conditions.
1. Union Operator (|) - Selecting Multiple Element Sets
The union operator | combines results from multiple XPath expressions, returning all elements that match any of the provided expressions.
Basic Syntax
expression1 | expression2 | expression3
Example: Selecting Different Element Types
//h1 | //h2 | //h3 // Select all heading elements
//div[@class='title'] | //span[@class='title'] // Select titles from different elements
Sample HTML
<div>
  <h1>Main Title</h1>
  <div class="content">
    <h2>Section Title</h2>
    <p class="highlight">Important text</p>
    <span class="highlight">Another important text</span>
  </div>
</div>
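To see the union operator against the sample markup above, here is a minimal sketch using Python and lxml (the variable name sample_html is simply an assumption for where the markup is stored):

from lxml import html

# The sample HTML from above, kept in a string so the sketch is self-contained
sample_html = """
<div>
  <h1>Main Title</h1>
  <div class="content">
    <h2>Section Title</h2>
    <p class="highlight">Important text</p>
    <span class="highlight">Another important text</span>
  </div>
</div>
"""

tree = html.fromstring(sample_html)

# Union of two expressions: every h1 and h2, returned in document order
headings = tree.xpath("//h1 | //h2")
for heading in headings:
    print(heading.tag, "->", heading.text_content().strip())
# h1 -> Main Title
# h2 -> Section Title

Note that the union returns each matching node once, in document order, regardless of the order in which the sub-expressions are written.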
2. Logical Operators in Predicates
Combine conditions within square brackets using logical operators.
OR Operator (or)
//div[@class='highlight' or @class='important']
//*[@id='main' or @id='secondary']
//input[@type='text' or @type='email']
AND Operator (and)
//div[@class='content' and @data-type='article']
//a[@href and @title] // Links with both href and title attributes
//img[@src and @alt and @width]
NOT Operator (not())
//div[not(@class='hidden')]
//p[not(contains(@class, 'advertisement'))]
//a[not(starts-with(@href, 'javascript:'))]
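As a quick check of the predicate operators, the sketch below reuses the sample_html string from the earlier example:

from lxml import html

tree = html.fromstring(sample_html)  # sample_html defined in the earlier sketch

# OR: paragraphs or spans that carry the highlight class
highlighted = tree.xpath("//*[self::p or self::span][@class='highlight']")
print(len(highlighted))  # 2

# NOT: div elements that do not have class='content'
plain_divs = tree.xpath("//div[not(@class='content')]")
print(len(plain_divs))  # 1 (the outer wrapper div)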
3. Implementation Examples
Python with lxml
from lxml import html
import requests
# Fetch and parse HTML
response = requests.get('https://example.com')
tree = html.fromstring(response.content)
# Method 1: Union operator for different elements
titles = tree.xpath("//h1 | //h2 | //h3 | //div[@class='title']")
for title in titles:
    print(f"Title: {title.text_content().strip()}")
# Method 2: Logical operators in predicates
highlights = tree.xpath("//p[@class='highlight' or @class='important']")
for highlight in highlights:
    print(f"Highlighted text: {highlight.text}")
# Method 3: Complex combinations
content_elements = tree.xpath("""
//div[@class='content']//p[not(@class='ads')] |
//article//p |
//section[@data-type='main']//span[@class='text']
""")
# Method 4: Combining with position
first_items = tree.xpath("(//li)[1] | (//div[@class='item'])[1]")
JavaScript (Browser Environment)
// Helper function for XPath evaluation
function evaluateXPath(expression, contextNode = document) {
  const result = document.evaluate(
    expression,
    contextNode,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
// Union operator example
const allTitles = evaluateXPath("//h1 | //h2 | //h3 | //*[@class='title']");
allTitles.forEach(title => console.log(title.textContent));
// Logical operators example
const highlights = evaluateXPath("//p[@class='highlight' or contains(@class, 'important')]");
highlights.forEach(element => console.log(element.textContent));
// Complex combination
const contentElements = evaluateXPath(`
//div[contains(@class, 'content') and not(@class='hidden')]//p |
//article[@data-published='true']//span
`);
JavaScript with Puppeteer
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // page.$x runs XPath queries (deprecated in recent Puppeteer releases,
  // which use "xpath/..." selectors with page.$$ instead)
  const elements = await page.$x(`
    //h1 | //h2 | //h3 |
    //div[@class='title'] |
    //span[@class='subtitle']
  `);

  for (const element of elements) {
    const text = await page.evaluate(el => el.textContent, element);
    console.log('Found:', text.trim());
  }

  await browser.close();
})();
Java with Selenium
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import java.util.List;
public class XPathCombination {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("https://example.com");

        // Union operator
        List<WebElement> titles = driver.findElements(By.xpath(
            "//h1 | //h2 | //h3 | //div[@class='title']"
        ));
        for (WebElement title : titles) {
            System.out.println("Title: " + title.getText());
        }

        // Logical operators
        List<WebElement> highlights = driver.findElements(By.xpath(
            "//p[@class='highlight' or contains(@class, 'important')] | " +
            "//div[@data-priority='high' and not(@class='hidden')]"
        ));

        driver.quit();
    }
}
4. Advanced Combination Techniques
Combining with Functions
// Elements containing specific text OR having specific attributes
//*[contains(text(), 'important') or @data-priority='high']
// Multiple text conditions
//p[contains(text(), 'error') or contains(text(), 'warning') or contains(text(), 'alert')]
// Positional combinations
(//div[@class='item'])[position() <= 3] | (//span[@class='featured'])[1]
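A short sketch of the text/attribute combination with lxml; the markup here is a made-up placeholder used only to exercise the expression:

from lxml import html

# Hypothetical markup for demonstration purposes only
doc = html.fromstring("""
<div>
  <p>This is an important notice</p>
  <p data-priority="high">Deadline reminder</p>
  <p>Nothing special here</p>
</div>
""")

# Elements containing specific text OR carrying a specific attribute
matches = doc.xpath("//*[contains(text(), 'important') or @data-priority='high']")
for el in matches:
    print(el.text)
# This is an important notice
# Deadline reminder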
Parent-Child Relationships
// Multiple parent-child combinations
//div[@class='header']//a | //nav[@class='menu']//a | //footer//a
// Complex nested conditions
//article[.//h2 and .//p[@class='summary']] |
//section[.//h3 and count(.//p) > 2]
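The nested-condition union can be run the same way; this sketch assumes the page has already been parsed into tree, as in the earlier lxml example:

# Articles containing both an h2 and a summary paragraph, plus
# sections containing an h3 and more than two paragraphs
rich_blocks = tree.xpath(
    "//article[.//h2 and .//p[@class='summary']] | "
    "//section[.//h3 and count(.//p) > 2]"
)
for block in rich_blocks:
    print(block.tag, "with", len(block.xpath(".//p")), "paragraphs")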
Performance Considerations
- Single Complex Expression vs Multiple Simple Ones (see the timing sketch after this list):
# More efficient: single complex query
elements = tree.xpath("//div[@class='a'] | //div[@class='b'] | //span[@class='c']")
# Less efficient: multiple queries
elements_a = tree.xpath("//div[@class='a']")
elements_b = tree.xpath("//div[@class='b']")
elements_c = tree.xpath("//span[@class='c']")
combined = elements_a + elements_b + elements_c
- Use Specific Paths When Possible:
// More specific (faster)
/html/body/div[@class='main']//p[@class='content'] |
/html/body/aside//span[@class='sidebar-text']
// Less specific (slower)
//p[@class='content'] | //span[@class='sidebar-text']
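To confirm the difference on your own pages, here is a rough timing sketch with Python's timeit; page_source and the class names are placeholders:

import timeit
from lxml import html

tree = html.fromstring(page_source)  # page_source: markup of the page you are profiling

def single():
    return tree.xpath("//div[@class='a'] | //div[@class='b'] | //span[@class='c']")

def multiple():
    return (
        tree.xpath("//div[@class='a']")
        + tree.xpath("//div[@class='b']")
        + tree.xpath("//span[@class='c']")
    )

print("single union :", timeit.timeit(single, number=1000))
print("three queries:", timeit.timeit(multiple, number=1000))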
5. Programmatic Result Combination
When XPath combination isn't sufficient, combine results in code:
from lxml import html

def combine_xpath_results(tree, expressions):
    """Combine multiple XPath expressions, removing duplicates while preserving order."""
    seen = set()
    combined = []
    for expr in expressions:
        elements = tree.xpath(expr)
        for element in elements:
            # Use the element's identity as a unique identifier
            element_id = id(element)
            if element_id not in seen:
                seen.add(element_id)
                combined.append(element)
    return combined

# Usage
expressions = [
    "//h1[@class='title']",
    "//h2[@class='subtitle']",
    "//div[@class='content']//strong",
    "//p[contains(@class, 'highlight')]",
]

tree = html.fromstring(html_content)  # html_content: the page markup you fetched earlier
all_elements = combine_xpath_results(tree, expressions)
Best Practices
- Optimize for Performance: Use specific paths rather than //* when possible
- Handle Edge Cases: Always check whether elements exist before processing them
- Avoid Overly Complex Expressions: Break down complex logic for maintainability
- Test Thoroughly: Validate expressions against various HTML structures
- Consider CSS Selectors: CSS selectors can be more readable for simple combinations (see the sketch after this list)
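On the CSS-selector point, a comma-separated selector group plays the same role as the XPath union. This minimal sketch uses lxml's cssselect support (it requires the cssselect package; page_source and the class names are placeholders):

from lxml import html

tree = html.fromstring(page_source)  # page_source: markup you have already fetched

# XPath union
xpath_results = tree.xpath("//div[@class='title'] | //span[@class='subtitle']")

# Equivalent CSS selector group: the comma acts like the XPath union operator
css_results = tree.cssselect("div.title, span.subtitle")

# Caveat: div.title matches any element whose class list contains 'title',
# while @class='title' requires the attribute to equal 'title' exactly.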
Combining XPath expressions effectively allows you to extract exactly the data you need in a single query, making your web scraping more efficient and maintainable.