How does XPath handle case sensitivity in HTML tags?

XPath case sensitivity depends on whether you're working with HTML or XML documents, and which parser you're using. Understanding this distinction is crucial for writing reliable web scraping code.

Key Differences: HTML vs XML

XML Documents (Case-Sensitive)

In XML, tag names are strictly case-sensitive:

  • <Element> and <element> are completely different elements
  • XPath queries must match the exact case of tags in the document

HTML Documents (Usually Case-Insensitive)

HTML tags are inherently case-insensitive, and most HTML parsers normalize tag names to lowercase in the DOM representation:

  • <DIV>, <div>, and <Div> all represent the same element
  • XPath queries typically use lowercase tag names regardless of the original HTML case

How Different Parsers Handle Case

Python with lxml (HTML Parser)

The lxml HTML parser normalizes all tags to lowercase:

from lxml import html

# HTML with mixed case tags
html_content = """
<!DOCTYPE html>
<html>
<head>
    <TITLE>Example Page</TITLE>
</head>
<body>
    <DIV class="content">Uppercase DIV</DIV>
    <div class="content">Lowercase div</div>
    <P>Mixed case paragraph</P>
</body>
</html>
"""

# Parse as HTML (normalizes to lowercase)
tree = html.fromstring(html_content)

# All XPath queries use lowercase, regardless of original case
divs = tree.xpath('//div[@class="content"]')
titles = tree.xpath('//title')
paragraphs = tree.xpath('//p')

print(f"Found {len(divs)} div elements")      # Output: Found 2 div elements
print(f"Found {len(titles)} title elements")  # Output: Found 1 title elements
print(f"Found {len(paragraphs)} p elements")  # Output: Found 1 p elements

# This won't work because lxml normalized everything to lowercase
uppercase_divs = tree.xpath('//DIV')
print(f"Found {len(uppercase_divs)} DIV elements")  # Output: Found 0 DIV elements

Python with lxml (XML Parser)

When parsing as XML, case is preserved and must match exactly:

from lxml import etree

xml_content = """
<root>
    <Element>Case sensitive</Element>
    <element>Different element</element>
    <ELEMENT>Another different element</ELEMENT>
</root>
"""

# Parse as XML (preserves case)
tree = etree.fromstring(xml_content)

# Each case variation is treated as a different element
elements_upper = tree.xpath('//Element')    # Finds 1 element
elements_lower = tree.xpath('//element')    # Finds 1 element  
elements_caps = tree.xpath('//ELEMENT')     # Finds 1 element

print(f"//Element: {len(elements_upper)}")  # Output: //Element: 1
print(f"//element: {len(elements_lower)}")  # Output: //element: 1
print(f"//ELEMENT: {len(elements_caps)}")   # Output: //ELEMENT: 1

JavaScript in Browser DOM

Browser DOM normalizes HTML tag names to lowercase:

// This works regardless of the original HTML case
const divs = document.evaluate('//div', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

// Even if HTML source has <DIV>, <Div>, etc., this query finds them all
for (let i = 0; i < divs.snapshotLength; i++) {
    const div = divs.snapshotItem(i);
    console.log(div.textContent);
}

// This won't work because DOM uses lowercase tag names
const upperDivs = document.evaluate('//DIV', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
console.log(`Found ${upperDivs.snapshotLength} uppercase DIV elements`); // Usually 0

Selenium WebDriver

Selenium also normalizes HTML tag names to lowercase:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")

# Use lowercase in XPath expressions
divs = driver.find_elements(By.XPATH, "//div[@class='content']")
headers = driver.find_elements(By.XPATH, "//h1 | //h2 | //h3")

# This works even if the HTML source uses uppercase tags
nav_elements = driver.find_elements(By.XPATH, "//nav//a")

# Close the browser when finished
driver.quit()

Practical Recommendations

For HTML Documents

  1. Always use lowercase tag names in XPath expressions
  2. Don't worry about the case in the original HTML source
  3. Focus on other selectors (attributes, text content) for precision

# Good practices for HTML
tree.xpath('//div[@class="header"]')
tree.xpath('//input[@type="text"]')
tree.xpath('//a[contains(@href, "example.com")]')

For XML Documents

  1. Match the exact case of tags in the source document
  2. Inspect the XML structure first to understand the casing convention
  3. Consider using case-insensitive functions when appropriate

# For XML, match exact case
tree.xpath('//Product[@ID="123"]')
tree.xpath('//customerData/firstName')

# Case-insensitive alternative using translate() function
tree.xpath('//node()[translate(local-name(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") = "product"]')
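The translate() trick above can be run end-to-end against a small sample document. This is a minimal sketch; the <Product> element names and attributes are made up for illustration:

```python
from lxml import etree

xml_content = """
<root>
    <Product ID="123">Widget</Product>
    <PRODUCT ID="456">Gadget</PRODUCT>
    <product ID="789">Gizmo</product>
</root>
"""

tree = etree.fromstring(xml_content)

# Lowercase each element's local name before comparing, so every
# case variant of "product" matches a single query
upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lower = "abcdefghijklmnopqrstuvwxyz"
matches = tree.xpath(
    f'//*[translate(local-name(), "{upper}", "{lower}") = "product"]'
)
print(len(matches))  # 3
```

Note that XPath 1.0 (which lxml implements) has no lower-case() function, so translate() is the standard workaround.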

When in Doubt

Test your XPath expressions with sample data to ensure they work correctly with your specific parser and document type.
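One quick sanity check is to run the same query through both lxml parsers and compare the counts. This is a minimal sketch; the <Item> markup is invented for illustration:

```python
from lxml import etree, html

sample = "<root><Item>a</Item><item>b</item></root>"

# The HTML parser lowercases tag names; the XML parser preserves them
html_tree = html.fromstring(sample)
xml_tree = etree.fromstring(sample)

print(len(html_tree.xpath('//item')))  # HTML parser: both elements match
print(len(xml_tree.xpath('//item')))   # XML parser: only the lowercase one
```

If the two counts differ, case normalization (or the lack of it) is affecting your query.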

Common Pitfalls

  1. Assuming XML behavior in HTML: Writing //DIV when working with HTML parsers
  2. Assuming HTML behavior in XML: Using lowercase when XML tags are actually capitalized
  3. Mixed document types: Not recognizing when XHTML is being parsed as XML vs HTML
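The third pitfall is easy to demonstrate: the same XHTML markup behaves differently depending on which lxml parser receives it. When parsed as XML, elements live in the XHTML namespace, so a plain //div silently matches nothing. A minimal sketch (the markup is invented for illustration):

```python
from lxml import etree, html

xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><div class="content">Hello</div></body>
</html>"""

# Parsed as HTML: the namespace is ignored and //div just works
html_tree = html.fromstring(xhtml)
print(len(html_tree.xpath('//div')))  # 1

# Parsed as XML: elements are namespaced, so a plain //div finds nothing
xml_tree = etree.fromstring(xhtml.encode())
print(len(xml_tree.xpath('//div')))  # 0

# Register the namespace with a prefix to match the same element
ns = {"x": "http://www.w3.org/1999/xhtml"}
print(len(xml_tree.xpath('//x:div', namespaces=ns)))  # 1
```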

Understanding these distinctions will help you write more reliable XPath expressions for your web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
