Multi-valued attributes in HTML contain multiple space-separated values, with the class attribute being the most common example. When web scraping, you need specialized XPath techniques to accurately select elements based on these attributes while avoiding false matches.
Understanding Multi-Valued Attributes
Consider this HTML structure:
<div class="btn primary large active">Button 1</div>
<div class="btn-secondary small">Button 2</div>
<div class="container btn-primary">Button 3</div>
<div data-tags="red blue green">Color Box</div>
Each element has attributes with multiple values that require careful handling to avoid incorrect matches.
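To make the problem concrete, here is a minimal sketch (assuming lxml is installed; the wrapping div with id "root" is added only so the sample parses as a single tree) that splits each class attribute into its tokens and shows what a plain substring test picks up:

```python
from lxml import html

# The sample markup from above, wrapped in one container element.
SAMPLE = '''
<div id="root">
  <div class="btn primary large active">Button 1</div>
  <div class="btn-secondary small">Button 2</div>
  <div class="container btn-primary">Button 3</div>
  <div data-tags="red blue green">Color Box</div>
</div>
'''

tree = html.fromstring(SAMPLE)

# Each class attribute is one string; splitting on whitespace gives its tokens.
class_tokens = [el.get("class").split() for el in tree.xpath("//div[@class]")]
print(class_tokens)

# A substring test matches any element whose class string merely contains "btn".
substring_hits = tree.xpath("//div[contains(@class, 'btn')]/text()")
print(substring_hits)
```

Only the first element actually carries the class btn as a token, yet the substring test returns all three buttons.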
Core XPath Functions
1. Using contains() (Most Common)
The contains() function checks if an attribute contains a specific substring:
//div[contains(@class, 'btn')]
Real-world example:
from lxml import html
html_content = '''
<div class="btn primary large">Submit</div>
<div class="button secondary">Cancel</div>
<div class="btn-group">Group</div>
'''
tree = html.fromstring(html_content)
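# Matches only two elements: "button" does not contain the substring "btn",
# but "btn-group" does, which is a false positive if you want the btn class
buttons = tree.xpath("//div[contains(@class, 'btn')]")
print(len(buttons))  # Output: 2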
2. Using starts-with()
Matches attributes that begin with a specific value:
//div[starts-with(@class, 'btn')]
Example:
# Matches elements whose class attribute string begins with 'btn'
prefix_matches = tree.xpath("//div[starts-with(@class, 'btn')]")
# Matches: "btn primary large" and "btn-group"
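Because starts-with() looks at the whole attribute string, the result depends on the order in which classes are written. The following sketch (assuming lxml; the markup and the labels A, B, C are made up for illustration) compares it with a token-based test:

```python
from lxml import html

tree = html.fromstring('''
<div id="root">
  <div class="btn primary">A</div>
  <div class="primary btn">B</div>
  <div class="btn-group">C</div>
</div>
''')

# Prefix test on the raw attribute string: order-sensitive, and it also
# accepts "btn-group" because that string begins with "btn".
starts = tree.xpath("//div[starts-with(@class, 'btn')]/text()")
print(starts)

# Token test: order-independent and limited to the exact class name.
token = tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]/text()"
)
print(token)
```

Element B has the btn class but is missed by starts-with(), while C is matched without having it.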
3. Using ends-with() (XPath 2.0+)
Note: ends-with() is available only in XPath 2.0 and later. Browsers and lxml (which is built on libxml2) implement XPath 1.0, so it is not supported there:
//div[ends-with(@class, 'active')]
For XPath 1.0 compatibility, use this workaround:
//div[substring(@class, string-length(@class) - string-length('active') + 1) = 'active']
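The workaround can be tried out in lxml, which evaluates XPath 1.0. This is a sketch under assumptions: the helper names ends_with_xpath and ends_with_token_xpath and the sample markup are hypothetical. It also shows that a plain suffix test has the same substring problem as contains():

```python
from lxml import html

tree = html.fromstring('''
<div id="root">
  <div class="btn active">A</div>
  <div class="active btn">B</div>
  <div class="btn inactive">C</div>
</div>
''')

def ends_with_xpath(suffix):
    # XPath 1.0 emulation of ends-with(@class, suffix).
    return (
        "//div[substring(@class, string-length(@class) - "
        f"string-length('{suffix}') + 1) = '{suffix}']"
    )

def ends_with_token_xpath(token):
    # Token-aware variant: prefix a space so the last token must match exactly.
    padded = "concat(' ', normalize-space(@class))"
    needle = f" {token}"
    return (
        f"//div[substring({padded}, string-length({padded}) - {len(needle)} + 1)"
        f" = '{needle}']"
    )

naive = [el.text for el in tree.xpath(ends_with_xpath("active"))]
token_aware = [el.text for el in tree.xpath(ends_with_token_xpath("active"))]
print(naive, token_aware)
```

The naive form also accepts "inactive" because that string ends in "active"; the padded form requires the final class to be exactly "active".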
Precise Matching Techniques
Exact Class Matching
To match exact class values (avoiding substring issues), use space normalization:
//div[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]
Explanation:
1. normalize-space(@class) trims leading and trailing whitespace and collapses internal runs of whitespace into a single space
2. concat(' ', ..., ' ') adds a space at the beginning and end
3. contains(..., ' btn ') looks for the class name surrounded by spaces
Python implementation:
def get_exact_class_xpath(class_name):
    return f"//div[contains(concat(' ', normalize-space(@class), ' '), ' {class_name} ')]"
# Usage
exact_buttons = tree.xpath(get_exact_class_xpath('btn'))
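As an alternative to building the expression with an f-string, lxml lets you pass XPath variables as keyword arguments to xpath(), which avoids quoting problems in the class name. A minimal sketch, assuming lxml; the names EXACT_CLASS and find_by_class are hypothetical:

```python
from lxml import html

tree = html.fromstring('''
<div id="root">
  <div class="btn primary">Submit</div>
  <div class="btn-group">Group</div>
  <div class="primary   btn">Reset</div>
</div>
''')

# $token is an XPath variable; lxml binds it from the keyword argument.
EXACT_CLASS = (
    "//div[contains(concat(' ', normalize-space(@class), ' '), "
    "concat(' ', $token, ' '))]"
)

def find_by_class(root, class_name):
    # Returns the div elements that carry class_name as a whole class token.
    return root.xpath(EXACT_CLASS, token=class_name)

texts = [el.text for el in find_by_class(tree, "btn")]
print(texts)
```

The expression stays constant and only the variable changes, so the same string can be reused for any class name.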
Multiple Class Requirements
Select elements that have multiple specific classes:
//div[contains(concat(' ', @class, ' '), ' btn ') and
contains(concat(' ', @class, ' '), ' primary ')]
Python helper function:
def xpath_multiple_classes(classes):
    conditions = []
    for cls in classes:
        conditions.append(f"contains(concat(' ', @class, ' '), ' {cls} ')")
    return f"//div[{' and '.join(conditions)}]"
# Find elements with both 'btn' and 'primary' classes
elements = tree.xpath(xpath_multiple_classes(['btn', 'primary']))
Advanced Patterns
OR Logic for Multiple Classes
Find elements with any of several classes:
//div[contains(concat(' ', @class, ' '), ' btn ') or
contains(concat(' ', @class, ' '), ' button ')]
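The AND and OR forms differ only in the operator that joins the per-class tests, so one helper can cover both. This is a sketch assuming lxml; the function names and the mode and tag parameters are hypothetical:

```python
from lxml import html

def class_test(cls):
    # One exact-token test for a single class name.
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')"

def xpath_for_classes(classes, mode="all", tag="div"):
    # mode="all" joins the tests with "and"; mode="any" joins them with "or".
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    joiner = " and " if mode == "all" else " or "
    return f"//{tag}[{joiner.join(class_test(c) for c in classes)}]"

tree = html.fromstring('''
<div id="root">
  <div class="btn primary">A</div>
  <div class="button">B</div>
  <div class="btn">C</div>
  <div class="link">D</div>
</div>
''')

any_hits = [el.text for el in tree.xpath(xpath_for_classes(["btn", "button"], mode="any"))]
all_hits = [el.text for el in tree.xpath(xpath_for_classes(["btn", "primary"], mode="all"))]
print(any_hits, all_hits)
```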
Handling Other Multi-valued Attributes
The same pattern works with any space-separated attribute:
//div[contains(concat(' ', @data-tags, ' '), ' red ')]
//input[contains(concat(' ', @data-categories, ' '), ' electronics ')]
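The pattern can be parameterized by attribute name. The sketch below assumes lxml; token_xpath is a hypothetical helper and the markup (including the "darkred" value and the rel attribute) is invented to show a near-miss:

```python
from lxml import html

def token_xpath(attr, value, tag="*"):
    # Exact-token test on an arbitrary space-separated attribute.
    return f"//{tag}[contains(concat(' ', normalize-space(@{attr}), ' '), ' {value} ')]"

tree = html.fromstring('''
<div id="root">
  <div data-tags="red blue green">Color Box</div>
  <div data-tags="darkred">Dark Box</div>
  <a rel="nofollow noopener" href="#">Link</a>
</div>
''')

red = [el.text for el in tree.xpath(token_xpath("data-tags", "red"))]
red_substring = [el.text for el in tree.xpath("//*[contains(@data-tags, 'red')]")]
noopener = [el.text for el in tree.xpath(token_xpath("rel", "noopener", tag="a"))]
print(red, red_substring, noopener)
```

Elements that lack the attribute are skipped automatically, because normalize-space() of a missing attribute is an empty string.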
Practical Examples
Selenium with Python
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
# Exact class matching
buttons = driver.find_elements(
    By.XPATH,
    "//button[contains(concat(' ', normalize-space(@class), ' '), ' btn ')]"
)
# Multiple class requirements
primary_buttons = driver.find_elements(
    By.XPATH,
    "//button[contains(concat(' ', @class, ' '), ' btn ') and "
    "contains(concat(' ', @class, ' '), ' primary ')]"
)
JavaScript in Browser
// Exact class matching function
function findByExactClass(className) {
  const xpath = `//div[contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')]`;
  return document.evaluate(
    xpath,
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
}
// Usage
const result = findByExactClass('btn');
for (let i = 0; i < result.snapshotLength; i++) {
  console.log(result.snapshotItem(i).textContent);
}
Java with Selenium
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import java.util.List;
public class MultiValuedAttributes {
    public static List<WebElement> findByExactClass(WebDriver driver, String className) {
        String xpath = String.format(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
            className
        );
        return driver.findElements(By.xpath(xpath));
    }
}
Common Pitfalls and Solutions
Problem: Substring Matching
<div class="btn-large">Large Button</div>
<div class="btn">Regular Button</div>
Wrong: //div[contains(@class, 'btn')] (matches both)
Right: //div[contains(concat(' ', @class, ' '), ' btn ')] (matches only the second)
Problem: Case Sensitivity
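XPath string comparisons are case-sensitive. For case-insensitive matching in XPath 1.0, lowercase the attribute with translate() before testing it (note that this example is still a substring match):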
//div[contains(translate(@class, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'btn')]
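Lowercasing can be combined with the space-padding technique so the match is both case-insensitive and limited to whole class tokens. A sketch assuming lxml and ASCII class names; ci_class_xpath is a hypothetical helper:

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

def ci_class_xpath(cls, tag="div"):
    # Lowercase the normalized class string, then test for the padded token.
    lowered = f"translate(normalize-space(@class), '{UPPER}', '{LOWER}')"
    return f"//{tag}[contains(concat(' ', {lowered}, ' '), ' {cls.lower()} ')]"

tree = html.fromstring('''
<div id="root">
  <div class="BTN Primary">A</div>
  <div class="Btn-Large">B</div>
  <div class="btn">C</div>
</div>
''')

ci_hits = [el.text for el in tree.xpath(ci_class_xpath("btn"))]
print(ci_hits)
```

translate() only maps the characters listed, so non-ASCII letters would need to be added to both alphabets.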
Problem: Dynamic Classes
For dynamically generated classes that follow a pattern (this form assumes the generated class is the first one in the attribute):
//div[starts-with(@class, 'btn-') and contains(@class, 'primary')]
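When the generated class can appear anywhere in the attribute, a token-prefix test is one option: pad the class string with a leading space and look for the prefix preceded by a space. A sketch assuming lxml; the generated class names below are invented:

```python
from lxml import html

tree = html.fromstring('''
<div id="root">
  <div class="btn-x7f3a primary">A</div>
  <div class="primary btn-9c2">B</div>
  <div class="subbtn-1 primary">C</div>
</div>
''')

# Only works when the generated class is the first token.
naive = "//div[starts-with(@class, 'btn-') and contains(@class, 'primary')]"

# Matches a token beginning with "btn-" at any position, plus the exact class "primary".
token_prefix = (
    "//div[contains(concat(' ', normalize-space(@class)), ' btn-') and "
    "contains(concat(' ', normalize-space(@class), ' '), ' primary ')]"
)

naive_hits = [el.text for el in tree.xpath(naive)]
token_hits = [el.text for el in tree.xpath(token_prefix)]
print(naive_hits, token_hits)
```

Element C is rejected by both forms because "subbtn-1" does not start with "btn-" as a token.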
Performance Considerations
- Specific selectors first: Use more specific element names (for example //button) instead of //div
- Avoid deep nesting: Use // sparingly in complex expressions
- Index when possible: Add a position predicate [1] when you only need the first match
# First match in the whole document
(//button[contains(concat(' ', @class, ' '), ' btn ')])[1]
# Not equivalent: without parentheses, [1] is evaluated per parent, so several nodes can be returned
//button[contains(concat(' ', @class, ' '), ' btn ')][1]
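A position predicate behaves differently with and without parentheses, which matters more for correctness than for speed. The following sketch (assuming lxml; the markup is made up) shows the two results side by side:

```python
from lxml import html

tree = html.fromstring('''
<div id="root">
  <div><button class="btn">A</button><button class="btn">B</button></div>
  <div><button class="btn">C</button></div>
</div>
''')

CLASS_TEST = "contains(concat(' ', normalize-space(@class), ' '), ' btn ')"

# Without parentheses, [1] picks the first matching button under each parent.
per_parent = [el.text for el in tree.xpath(f"//button[{CLASS_TEST}][1]")]

# With parentheses, [1] picks the first node of the whole result set.
first_overall = [el.text for el in tree.xpath(f"(//button[{CLASS_TEST}])[1]")]

print(per_parent, first_overall)
```

If you need exactly one element, the parenthesized form is the one that guarantees it.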
Summary
When handling multi-valued attributes in XPath:
- Use contains(concat(' ', @attribute, ' '), ' value ') for exact matching
- Combine with normalize-space() for robust whitespace handling
- Use logical operators (and, or) for complex conditions
- Be aware of XPath version limitations (ends-with() availability)
- Test your expressions thoroughly to avoid false positives
These techniques ensure accurate element selection when scraping modern web applications with complex CSS class structures.