Finding elements by class or ID is a fundamental task in web scraping with lxml. The library provides two primary methods: XPath expressions and CSS selectors. Both approaches are effective for targeting specific elements in HTML and XML documents.
Method 1: Using XPath
XPath provides powerful expressions for navigating document structures. Here are the most common patterns:
Finding Elements by ID
from lxml import html
# Sample HTML
content = """
<html>
<body>
<div id="header" class="main-header">
<h1>Welcome</h1>
</div>
<div id="content" class="container large">
<p class="text highlight">Important text</p>
<p class="text">Regular text</p>
</div>
</body>
</html>
"""
tree = html.fromstring(content)
# Find by exact ID
header = tree.xpath("//div[@id='header']")[0]
print(header.tag) # Output: div
# Alternative: the XPath id() function (an ID lookup rather than a scan of every div)
content_div = tree.xpath("id('content')")[0]
print(content_div.get('class')) # Output: container large
Finding Elements by Class
Class selection requires careful handling since elements can have multiple classes:
# Method 1: Exact attribute match (only matches class="text", nothing else)
elements = tree.xpath("//p[@class='text']")
# Method 2: Substring match (handles multiple classes, but also matches names such as "textbox")
elements = tree.xpath("//p[contains(@class, 'text')]")
# Method 3: Whole-token class matching (recommended)
elements = tree.xpath("//p[contains(concat(' ', normalize-space(@class), ' '), ' text ')]")
for elem in elements:
    print(f"Text: {elem.text}, Classes: {elem.get('class')}")
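The difference between the three patterns is easiest to see on markup where one class name is a substring of another. The following is a minimal sketch; the `doc` markup and the variable names are made up for illustration and are not part of the sample page above.

```python
from lxml import html

# Hypothetical markup chosen to expose the difference between the patterns.
doc = html.fromstring("""
<div>
  <p class="text">plain</p>
  <p class="text highlight">two classes</p>
  <p class="textbox">substring only</p>
</div>
""")

# Exact attribute comparison: the whole class attribute must equal "text".
exact = doc.xpath("//p[@class='text']")
# Substring test: anything whose class attribute contains the letters "text".
loose = doc.xpath("//p[contains(@class, 'text')]")
# Token test: "text" must appear as a whole space-separated class name.
token = doc.xpath(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' text ')]"
)

exact_texts = [p.text for p in exact]
loose_texts = [p.text for p in loose]
token_texts = [p.text for p in token]
print(exact_texts)  # ['plain']
print(loose_texts)  # ['plain', 'two classes', 'substring only']
print(token_texts)  # ['plain', 'two classes']
```

Only the token test returns both paragraphs that carry the `text` class while skipping `textbox`.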
Advanced XPath Examples
# Find elements with multiple specific classes
highlight_text = tree.xpath("//p[contains(@class, 'text') and contains(@class, 'highlight')]")
# Find by class and get specific attributes
class_values = tree.xpath("//div[contains(@class, 'container')]/@class")
# Find by ID and get child elements
child_elements = tree.xpath("//div[@id='content']//p")
# Combine conditions
specific_elem = tree.xpath("//div[@id='content']//p[contains(@class, 'highlight')]")
Method 2: Using CSS Selectors
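CSS selectors provide a more familiar syntax for those with web development experience. lxml translates them to XPath through the separate cssselect package, which must be installed alongside lxml: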
from lxml import html
from lxml.cssselect import CSSSelector
tree = html.fromstring(content)
# Find by ID
header_selector = CSSSelector('#header')
header = header_selector(tree)[0]
print(header.tag) # Output: div
# Find by class
text_selector = CSSSelector('.text')
text_elements = text_selector(tree)
for elem in text_elements:
    print(elem.text)
# Multiple classes
highlight_selector = CSSSelector('.text.highlight')
highlight_elem = highlight_selector(tree)[0]
print(highlight_elem.text) # Output: Important text
# Advanced CSS selectors
complex_selector = CSSSelector('#content p.text:first-child')
first_text = complex_selector(tree)[0]
Method 3: Direct CSS Selection (Simplified)
For simpler use cases, elements parsed with lxml.html provide a cssselect() method, so no selector object is needed:
# Direct CSS selection on elements
content_div = tree.cssselect('#content')[0]
text_paragraphs = content_div.cssselect('p.text')
# Chain selections
highlight = tree.cssselect('#content')[0].cssselect('.highlight')[0]
Performance Comparison and Best Practices
When to Use XPath vs CSS Selectors
Use XPath when:
- You need complex conditional logic
- Working with XML namespaces
- Performing text-based searches
- Need parent/sibling navigation
Use CSS Selectors when:
- You're familiar with CSS syntax
- Simple class/ID selections
- Working primarily with HTML
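As a rough illustration of the XPath-only cases listed above, the sketch below runs a text-based search and walks to a parent and a sibling. The `page` markup repeats part of the sample document; the variable names are illustrative.

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="content" class="container large">
    <p class="text highlight">Important text</p>
    <p class="text">Regular text</p>
  </div>
</body></html>
""")

# Text-based search: paragraphs whose text contains a given word.
by_text = page.xpath("//p[contains(text(), 'Important')]")

# Parent navigation: from the highlighted paragraph up to its containing div.
parent_id = page.xpath("//p[contains(@class, 'highlight')]/parent::div/@id")

# Sibling navigation: the paragraph immediately after the highlighted one.
next_text = page.xpath(
    "//p[contains(@class, 'highlight')]/following-sibling::p[1]/text()"
)

print(by_text[0].get("class"))  # text highlight
print(parent_id)                # ['content']
print(next_text)                # ['Regular text']
```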
Performance Tips
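# Compile selectors once and reuse them across documents
text_selector = CSSSelector('.text')
results1 = text_selector(tree)
another_tree = html.fromstring("<div><p class='text'>Other document</p></div>")  # stand-in for a second page
results2 = text_selector(another_tree)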
# Use specific paths instead of //
faster = tree.xpath("/html/body/div[@id='content']//p")
slower = tree.xpath("//p") # Searches entire document
# Cache frequently used elements
content_div = tree.xpath("//div[@id='content']")[0]
paragraphs = content_div.xpath(".//p") # Relative search
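XPath expressions can be precompiled in the same way as CSS selectors. The sketch below uses `etree.XPath` with an XPath variable (`$cls`) so that one compiled expression serves any class name; the two small documents and the name `find_by_class` are assumptions made for the example.

```python
from lxml import etree, html

# Compile once; $cls is an XPath variable supplied at call time.
find_by_class = etree.XPath(
    "//*[contains(concat(' ', normalize-space(@class), ' '),"
    " concat(' ', $cls, ' '))]"
)

# Two hypothetical documents standing in for separately fetched pages.
first = html.fromstring(
    '<div><p class="text">one</p><p class="note">skip</p></div>'
)
second = html.fromstring('<div><span class="big text">two</span></div>')

# The same compiled expression is applied to each document.
found = [
    el.text
    for doc in (first, second)
    for el in find_by_class(doc, cls="text")
]
print(found)  # ['one', 'two']
```

Whether compilation gives a measurable speedup depends on how often the expression is reused, so it is worth timing on your own workload.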
Error Handling
# Safe element selection
def safe_find_by_id(tree, element_id):
    # Returns the first match or None; assumes element_id contains no single quote
    elements = tree.xpath(f"//*[@id='{element_id}']")
    return elements[0] if elements else None

def safe_find_by_class(tree, class_name):
    # cssselect() already returns an empty list when nothing matches
    return tree.cssselect(f'.{class_name}')

# Usage
header = safe_find_by_id(tree, 'header')
if header is not None:
    print(f"Found header: {header.tag}")
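Building the XPath string with an f-string breaks when the value contains a quote character. One way around this, sketched below, is to pass the value as an XPath variable, which lxml accepts as a keyword argument to `xpath()`. For HTML documents, `get_element_by_id()` with a default value is another option. The markup and the helper name `find_by_id` are hypothetical.

```python
from lxml import html

# Hypothetical markup with an id that contains a single quote.
doc = html.fromstring(
    '<div><p id="it\'s">quoted id</p><p id="plain">plain id</p></div>'
)

def find_by_id(root, element_id):
    """Return the first element with this id, or None if there is none."""
    # $eid is bound by lxml, so no manual quoting or escaping is needed.
    matches = root.xpath("//*[@id=$eid]", eid=element_id)
    return matches[0] if matches else None

quoted = find_by_id(doc, "it's")
missing = find_by_id(doc, "nope")

# lxml.html elements also offer a lookup that takes a fallback value.
fallback = doc.get_element_by_id("nope", None)

print(quoted.text)        # quoted id
print(missing, fallback)  # None None
```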
Both XPath and CSS selectors are powerful tools for element selection in lxml. Choose the method that best fits your use case: XPath for complex queries and CSS selectors for familiar, web-standard syntax.