XPath case sensitivity depends on whether you're working with HTML or XML documents, and which parser you're using. Understanding this distinction is crucial for writing reliable web scraping code.
Key Differences: HTML vs XML
XML Documents (Case-Sensitive)
In XML, tag names are strictly case-sensitive:
- <Element> and <element> are completely different elements
- XPath queries must match the exact case of tags in the document
HTML Documents (Usually Case-Insensitive)
HTML tags are inherently case-insensitive, and most HTML parsers normalize tag names to lowercase in the DOM representation:
- <DIV>, <div>, and <Div> all represent the same element
- XPath queries typically use lowercase tag names regardless of the original HTML case
How Different Parsers Handle Case
Python with lxml (HTML Parser)
The lxml HTML parser normalizes all tag names to lowercase:
from lxml import html
# HTML with mixed case tags
html_content = """
<!DOCTYPE html>
<html>
<head>
<TITLE>Example Page</TITLE>
</head>
<body>
<DIV class="content">Uppercase DIV</DIV>
<div class="content">Lowercase div</div>
<P>Mixed case paragraph</P>
</body>
</html>
"""
# Parse as HTML (normalizes to lowercase)
tree = html.fromstring(html_content)
# All XPath queries use lowercase, regardless of original case
divs = tree.xpath('//div[@class="content"]')
titles = tree.xpath('//title')
paragraphs = tree.xpath('//p')
print(f"Found {len(divs)} div elements") # Output: Found 2 div elements
print(f"Found {len(titles)} title elements") # Output: Found 1 title elements
print(f"Found {len(paragraphs)} p elements") # Output: Found 1 p elements
# This won't work because lxml normalized everything to lowercase
uppercase_divs = tree.xpath('//DIV')
print(f"Found {len(uppercase_divs)} DIV elements") # Output: Found 0 DIV elements
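One way to confirm the normalization is to inspect the parsed tree directly rather than infer it from query results. The sketch below assumes lxml is installed and uses a made-up fragment; the .tag attribute holds the name that XPath name tests are compared against.

```python
from lxml import html

# Hypothetical fragment with mixed-case tags (not taken from a real page)
fragment = html.fromstring("<DIV><P>One</P><Span>Two</Span></DIV>")

# Collect the element names as lxml stored them after parsing
tag_names = [el.tag for el in fragment.iter()]
print(tag_names)  # ['div', 'p', 'span']
```

Whatever names appear in this list are the ones to use in XPath expressions against the same tree.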
Python with lxml (XML Parser)
When parsing as XML, case is preserved and must match exactly:
from lxml import etree
xml_content = """
<root>
<Element>Case sensitive</Element>
<element>Different element</element>
<ELEMENT>Another different element</ELEMENT>
</root>
"""
# Parse as XML (preserves case)
tree = etree.fromstring(xml_content)
# Each case variation is treated as a different element
elements_upper = tree.xpath('//Element') # Finds 1 element
elements_lower = tree.xpath('//element') # Finds 1 element
elements_caps = tree.xpath('//ELEMENT') # Finds 1 element
print(f"//Element: {len(elements_upper)}") # Output: //Element: 1
print(f"//element: {len(elements_lower)}") # Output: //element: 1
print(f"//ELEMENT: {len(elements_caps)}") # Output: //ELEMENT: 1
JavaScript in Browser DOM
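For HTML documents, the browser DOM stores element names in lowercase (the localName property), so lowercase XPath name tests match regardless of how the source was written: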
// This works regardless of the original HTML case
const divs = document.evaluate('//div', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// Even if HTML source has <DIV>, <Div>, etc., this query finds them all
for (let i = 0; i < divs.snapshotLength; i++) {
const div = divs.snapshotItem(i);
console.log(div.textContent);
}
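// Unlike lxml, current major browsers compare unprefixed name tests against HTML
// elements case-insensitively in HTML documents, so '//DIV' usually matches too.
// In XML or XHTML documents the comparison is case-sensitive, so prefer lowercase.
const upperDivs = document.evaluate('//DIV', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
console.log(`Found ${upperDivs.snapshotLength} elements for //DIV`); // Depends on document type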
Selenium WebDriver
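Selenium passes XPath expressions to the browser for evaluation, so the browser's handling of HTML tag names applies; lowercase tag names are the safe choice: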
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("http://example.com")
# Use lowercase in XPath expressions
divs = driver.find_elements(By.XPATH, "//div[@class='content']")
headers = driver.find_elements(By.XPATH, "//h1 | //h2 | //h3")
# This works even if the HTML source uses uppercase tags
nav_elements = driver.find_elements(By.XPATH, "//nav//a")
# Close the browser when finished
driver.quit()
Practical Recommendations
For HTML Documents
- Always use lowercase tag names in XPath expressions
- Don't worry about the case in the original HTML source
- Focus on other selectors (attributes, text content) for precision
# Good practices for HTML
tree.xpath('//div[@class="header"]')
tree.xpath('//input[@type="text"]')
tree.xpath('//a[contains(@href, "example.com")]')
For XML Documents
- Match the exact case of tags in the source document
- Inspect the XML structure first to understand the casing convention
- XPath 1.0 has no built-in lowercase function, so use translate() to fold case when you need case-insensitive matching
# For XML, match exact case
tree.xpath('//Product[@ID="123"]')
tree.xpath('//customerData/firstName')
# Case-insensitive alternative using translate(); //* restricts the search to element nodes
tree.xpath('//*[translate(local-name(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") = "product"]')
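The translate() pattern gets verbose when repeated, so it can be wrapped in a small helper. This is a minimal sketch, assuming lxml and ASCII-only tag names; find_ci and the sample catalog document are hypothetical names introduced for illustration. It relies on lxml's support for passing XPath variables as keyword arguments.

```python
import string
from lxml import etree

def find_ci(tree, name):
    """Return elements whose local name equals `name`, ignoring ASCII case."""
    return tree.xpath(
        "//*[translate(local-name(), $upper, $lower) = $name]",
        upper=string.ascii_uppercase,
        lower=string.ascii_lowercase,
        name=name.lower(),
    )

# Hypothetical XML document with inconsistent casing
doc = etree.fromstring("<catalog><Product/><product/><PRODUCT/><item/></catalog>")
matches = find_ci(doc, "Product")
print([el.tag for el in matches])  # ['Product', 'product', 'PRODUCT']
```

Passing the alphabet strings as variables avoids quoting problems inside the expression. Non-ASCII tag names would need a longer translation table.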
When in Doubt
Test your XPath expressions with sample data to ensure they work correctly with your specific parser and document type.
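A quick check of this kind could look like the sketch below, which runs the same expressions against one sample parsed both ways. It assumes lxml; the sample markup and the names sample, expressions, and results are placeholders for your own data.

```python
from lxml import etree, html

# Hypothetical sample: the same markup parsed as HTML and as XML
sample = "<root><Item>a</Item><item>b</item></root>"
expressions = ["//Item", "//item", "//ITEM"]

trees = {
    "html": html.fromstring(sample),   # tag names lowercased
    "xml": etree.fromstring(sample),   # tag names preserved
}

# Count matches for every expression under each parser
results = {
    label: {expr: len(tree.xpath(expr)) for expr in expressions}
    for label, tree in trees.items()
}

for label, counts in results.items():
    print(label, counts)
# html {'//Item': 0, '//item': 2, '//ITEM': 0}
# xml {'//Item': 1, '//item': 1, '//ITEM': 0}
```

Comparing the two rows makes it obvious which parser a given expression was written for.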
Common Pitfalls
- Assuming XML behavior in HTML: Writing //DIV when working with HTML parsers
- Assuming HTML behavior in XML: Using lowercase when XML tags are actually capitalized
- Mixed document types: Not recognizing when XHTML is being parsed as XML vs HTML
Understanding these distinctions will help you write more reliable XPath expressions for your web scraping projects.