Iterating over elements in an lxml tree is a fundamental operation for processing XML and HTML documents in Python. The lxml library offers several methods for tree traversal, each suited to different use cases. This guide covers the most useful iteration techniques.
Installation
First, ensure you have the lxml package installed:
pip install lxml
Sample XML Document
Throughout this guide, we'll use this sample XML document:
from lxml import etree
xml_content = """
<catalog>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <price>12.99</price>
    </book>
    <book id="2" category="non-fiction">
        <title>Sapiens</title>
        <author>Yuval Noah Harari</author>
        <price>15.99</price>
    </book>
    <magazine id="3" category="tech">
        <title>Wired</title>
        <issue>March 2024</issue>
    </magazine>
</catalog>
"""
# Parse the XML; fromstring() returns the root element (<catalog>)
tree = etree.fromstring(xml_content)
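Note that etree.fromstring() returns the root element rather than a document wrapper, so the variable named tree above holds the catalog element. The sketch below contrasts it with etree.parse(); the small stand-in document and the in-memory file are assumptions made for illustration.

```python
import io
from lxml import etree

xml_bytes = b"<catalog><book id='1'/></catalog>"

# fromstring() parses a string and returns the root Element
root = etree.fromstring(xml_bytes)

# parse() reads from a file-like object and returns an ElementTree wrapper
document = etree.parse(io.BytesIO(xml_bytes))

print(root.tag)                              # catalog
print(document.getroot().tag)                # catalog
print(root.getroottree().getroot().tag)      # catalog
```

Every iteration method in this guide is called on an element, so either entry point works once you have the root.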
Method 1: Iterating Over All Elements (.iter())
The .iter() method traverses every element in the tree in document order (depth-first, each parent before its children):
# Iterate over all elements in the tree
for element in tree.iter():
    # .text is None or whitespace-only for elements that only contain children
    text = (element.text or "").strip()
    print(f"Tag: {element.tag}, Text: {text}")
# Output:
# Tag: catalog, Text:
# Tag: book, Text:
# Tag: title, Text: The Great Gatsby
# Tag: author, Text: F. Scott Fitzgerald
# Tag: price, Text: 12.99
# ... (continues for all elements)
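One detail worth knowing: called without arguments, .iter() also yields comments and processing instructions, not just elements. A minimal sketch, assuming a small stand-in document with one comment, shows how passing etree.Element as the tag restricts the result to real elements.

```python
from lxml import etree

doc = etree.fromstring("<root><!-- note --><item>a</item><item>b</item></root>")

# Default iteration includes the comment node
all_nodes = list(doc.iter())

# Passing the Element factory keeps only element nodes
elements_only = list(doc.iter(etree.Element))

print(len(all_nodes))                        # 4
print([el.tag for el in elements_only])      # ['root', 'item', 'item']
```

This matters when the loop body assumes element.tag is a string, because a comment's tag is a function object instead.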
Filtering by Tag Name
Pass a tag name to .iter() to filter specific elements:
# Find all book elements
for book in tree.iter('book'):
    title = book.find('title').text
    author = book.find('author').text
    print(f"Book: {title} by {author}")
# Output:
# Book: The Great Gatsby by F. Scott Fitzgerald
# Book: Sapiens by Yuval Noah Harari
Multiple Tag Types
You can iterate over multiple tag types by passing several tag names to .iter() in one call (supported since lxml 3.0), which keeps the results in document order:
# Process both books and magazines
for item in tree.iter('book', 'magazine'):
    title = item.find('title').text
    item_type = item.tag
    print(f"{item_type.title()}: {title}")
Method 2: Direct Child Iteration
Iterating over an element directly yields only its immediate children:
# Get the root element (catalog)
root = tree
# Iterate over direct children only
for child in root:
    item_id = child.get('id')
    category = child.get('category')
    title = child.find('title').text
    print(f"ID: {item_id}, Category: {category}, Title: {title}")
# Output:
# ID: 1, Category: fiction, Title: The Great Gatsby
# ID: 2, Category: non-fiction, Title: Sapiens
# ID: 3, Category: tech, Title: Wired
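Because an element behaves much like a list of its children, the usual sequence operations apply as well. The following sketch uses a trimmed-down stand-in for the catalog (an assumption, to keep the block self-contained) to show length, indexing, slicing, and reversed iteration.

```python
from lxml import etree

root = etree.fromstring(
    "<catalog><book id='1'/><book id='2'/><magazine id='3'/></catalog>"
)

print(len(root))                                        # 3
print(root[0].get("id"))                                # 1
print(root[-1].tag)                                     # magazine
print([child.tag for child in root[:2]])                # ['book', 'book']
print([child.get("id") for child in reversed(root)])    # ['3', '2', '1']
```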
Method 3: XPath Expressions
XPath provides the most powerful and flexible element selection:
# Find all title elements with XPath
all_titles = tree.xpath('//title')
for title in all_titles:
    print(f"Title: {title.text}")
# Find elements with specific attributes
fiction_books = tree.xpath('//book[@category="fiction"]')
for book in fiction_books:
    title = book.find('title').text
    print(f"Fiction book: {title}")
# Complex XPath queries: any element whose <price> child is greater than 14
expensive_items = tree.xpath('//*[price > 14]')
for item in expensive_items:
    title = item.find('title').text
    price = item.find('price').text
    print(f"Expensive item: {title} - ${price}")
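XPath expressions do not have to return elements. Depending on the expression, .xpath() can return a list of strings, a float, or a boolean, which sometimes removes the need for a loop altogether. A short sketch, assuming a two-book stand-in document:

```python
from lxml import etree

root = etree.fromstring(
    "<catalog>"
    "<book><title>A</title><price>12.99</price></book>"
    "<book><title>B</title><price>15.99</price></book>"
    "</catalog>"
)

titles = root.xpath("//book/title/text()")            # list of strings
book_count = root.xpath("count(//book)")              # float
has_cheap = root.xpath("boolean(//book[price < 13])") # bool

print(titles)       # ['A', 'B']
print(book_count)   # 2.0
print(has_cheap)    # True
```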
Advanced XPath Examples
# Find elements by position
first_book = tree.xpath('//book[1]')[0] # First book element
last_book = tree.xpath('//book[last()]')[0] # Last book element
# Find elements with specific text content
gatsby_book = tree.xpath('//book[title="The Great Gatsby"]')[0]
# Find parent elements
authors = tree.xpath('//author/..') # Get book elements that contain authors
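When part of a query comes from user input or a variable, building the expression with string formatting is fragile. lxml lets you reference XPath variables with a $name placeholder and pass their values as keyword arguments. The sketch below assumes a small stand-in catalog; the variable name cat is arbitrary.

```python
from lxml import etree

root = etree.fromstring(
    "<catalog>"
    "<book id='1' category='fiction'><title>The Great Gatsby</title></book>"
    "<book id='2' category='non-fiction'><title>Sapiens</title></book>"
    "</catalog>"
)

# $cat is bound to the keyword argument of the same name
fiction_titles = root.xpath(
    "//book[@category = $cat]/title/text()", cat="fiction"
)

# Attribute steps return the attribute values as strings
book_ids = root.xpath("//book/@id")

print(fiction_titles)   # ['The Great Gatsby']
print(book_ids)         # ['1', '2']
```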
Method 4: ElementPath with findall()
Use findall() for ElementPath expressions, a simplified subset of XPath:
# Find all titles anywhere in the document
titles = tree.findall('.//title')
for title in titles:
    print(f"Found title: {title.text}")
# Find direct children with specific tag
direct_books = tree.findall('./book')  # Only direct book children
print(f"Found {len(direct_books)} direct book children")
# Combine path elements
book_authors = tree.findall('.//book/author')
for author in book_authors:
    print(f"Book author: {author.text}")
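findall() builds the full result list before returning. When you only need the first few matches, iterfind() accepts the same ElementPath expressions and yields matches lazily. A minimal sketch, assuming a small stand-in document:

```python
from lxml import etree

root = etree.fromstring(
    "<catalog>"
    "<book><title>A</title><author>X</author></book>"
    "<book><title>B</title><author>Y</author></book>"
    "</catalog>"
)

# Stop after the first match instead of collecting every title
first_title = next(root.iterfind(".//title"), None)

# iterfind() can still be consumed fully, like any iterator
authors = [author.text for author in root.iterfind(".//book/author")]

print(first_title.text)   # A
print(authors)            # ['X', 'Y']
```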
Method 5: Attribute-Based Filtering
Filter elements based on attributes using different approaches:
# Approach 1: XPath with attribute predicates
tech_items = tree.xpath('//*[@category="tech"]')
for item in tech_items:
    print(f"Tech item: {item.find('title').text}")
# Approach 2: Iterate and filter
for element in tree.iter():
    if element.get('category') == 'fiction':
        title = element.find('title')
        if title is not None:
            print(f"Fiction item: {title.text}")
# Approach 3: Multiple attribute conditions
for book in tree.iter('book'):
    if book.get('category') == 'non-fiction' and int(book.get('id')) > 1:
        print(f"Non-fiction book with ID > 1: {book.find('title').text}")
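Attribute lookups also combine well with ordinary Python data structures. As an illustration, the sketch below groups titles by their category attribute, using the default argument of .get() for items that lack one. The stand-in document, including its note element, is an assumption.

```python
from collections import defaultdict
from lxml import etree

root = etree.fromstring(
    "<catalog>"
    "<book category='fiction'><title>A</title></book>"
    "<book category='fiction'><title>B</title></book>"
    "<magazine category='tech'><title>C</title></magazine>"
    "<note><title>D</title></note>"
    "</catalog>"
)

titles_by_category = defaultdict(list)
for item in root:
    # .get() returns the fallback when the attribute is absent
    category = item.get("category", "uncategorized")
    titles_by_category[category].append(item.findtext("title"))

print(dict(titles_by_category))
# {'fiction': ['A', 'B'], 'tech': ['C'], 'uncategorized': ['D']}
```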
Method 6: Sibling Navigation
Navigate between sibling elements:
# Find the first book
first_book = tree.find('.//book')
# Iterate forward through the first book and its following siblings
current = first_book
while current is not None:
    if current.tag in ['book', 'magazine']:
        title = current.find('title').text
        print(f"Sibling: {current.tag} - {title}")
    current = current.getnext()
# Reverse iteration through siblings, starting from the root's last child
last_item = tree[-1]
current = last_item
while current is not None:
    print(f"Reverse sibling: {current.tag}")
    current = current.getprevious()
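The manual getnext() and getprevious() loops can usually be replaced with itersiblings(), which yields the following siblings by default and the preceding ones (closest first) when called with preceding=True. A short sketch, assuming a trimmed-down stand-in catalog:

```python
from lxml import etree

root = etree.fromstring(
    "<catalog><book id='1'/><book id='2'/><magazine id='3'/></catalog>"
)
first, last = root[0], root[-1]

# Following siblings of the first child, in document order
following = [s.get("id") for s in first.itersiblings()]

# Preceding siblings of the last child, nearest sibling first
preceding = [s.get("id") for s in last.itersiblings(preceding=True)]

# An optional tag argument filters the siblings
magazines = [s.get("id") for s in first.itersiblings("magazine")]

print(following)   # ['2', '3']
print(preceding)   # ['2', '1']
print(magazines)   # ['3']
```

Unlike the while loops, itersiblings() does not include the starting element itself.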
Method 7: Tree Walking with Parent/Child Navigation
For complex tree traversal patterns:
def walk_tree(element, level=0):
    """Recursively walk the tree with indentation"""
    indent = "  " * level
    text = element.text.strip() if element.text else ""
    print(f"{indent}{element.tag}: {text}")
    # Process all children
    for child in element:
        walk_tree(child, level + 1)
# Walk the entire tree
walk_tree(tree)
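Recursion is easy to read, but very deeply nested documents can run into Python's recursion limit. One non-recursive alternative is to let .iter() do the traversal and derive each node's depth from iterancestors(). The helper name iter_with_depth below is hypothetical, and the stand-in document is an assumption.

```python
from lxml import etree

root = etree.fromstring(
    "<catalog><book><title>A</title></book><magazine/></catalog>"
)

def iter_with_depth(element):
    """Yield (depth, node) pairs in document order without recursion."""
    # Depth of the starting element, so results are relative to it
    base = sum(1 for _ in element.iterancestors())
    for node in element.iter(etree.Element):
        depth = sum(1 for _ in node.iterancestors()) - base
        yield depth, node

for depth, node in iter_with_depth(root):
    print("  " * depth + node.tag)
# catalog
#   book
#     title
#   magazine
```

Counting ancestors costs time proportional to the depth of each node, so this trades some speed for the absence of recursion.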
Performance Considerations
- .iter(): Most efficient for processing all elements
- XPath: Powerful but can be slower for simple queries
- findall(): Good balance of functionality and performance
- Direct iteration: Fastest for immediate children only
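These are rules of thumb, and the actual ranking depends on document size and query shape. A rough way to check them on your own data is a small timeit harness like the sketch below. The synthetic 1,000-book document and the run count of 50 are arbitrary assumptions, and no timings are shown because they vary by machine and lxml version.

```python
import timeit
from lxml import etree

# Synthetic document: 1,000 books, each with one title
root = etree.fromstring(
    "<catalog>" + "<book><title>T</title></book>" * 1000 + "</catalog>"
)

# Three ways to collect the same set of <title> elements
candidates = {
    "iter": lambda: list(root.iter("title")),
    "findall": lambda: root.findall(".//title"),
    "xpath": lambda: root.xpath("//title"),
}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=50)
    print(f"{name:8s} {len(func())} matches, {seconds:.4f}s for 50 runs")
```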
Error Handling and Best Practices
def safe_iterate_with_attributes(tree, tag_name, required_attrs=None):
    """Yield matching elements that carry every required attribute"""
    required_attrs = required_attrs or []
    for element in tree.iter(tag_name):
        # A membership test also accepts attributes whose value is an empty string
        if all(attr in element.attrib for attr in required_attrs):
            yield element
# Usage example
for book in safe_iterate_with_attributes(tree, 'book', ['id', 'category']):
    # findtext() returns the default instead of failing when <title> is missing
    print(f"Safe book iteration: {book.findtext('title', default='(untitled)')}")
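Iteration can only be as safe as the parse that precedes it. Malformed input makes etree.fromstring() raise etree.XMLSyntaxError, so it is worth catching that at the boundary. The wrapper name parse_or_none below is hypothetical, and returning None is just one possible policy.

```python
from lxml import etree

def parse_or_none(xml_text):
    """Return the root element, or None if the input is not well-formed."""
    try:
        return etree.fromstring(xml_text)
    except etree.XMLSyntaxError as exc:
        print(f"Could not parse document: {exc}")
        return None

good = parse_or_none("<catalog><book/></catalog>")
bad = parse_or_none("<catalog><book></catalog>")  # <book> is never closed

print(good.tag)      # catalog
print(bad is None)   # True
```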
Choose the iteration method that best matches your use case. For simple traversal, use .iter(). For complex queries, use XPath expressions. For performance-critical applications, profile the different approaches with your actual data.
HTML Documents
The same techniques work with HTML documents using lxml.html:
from lxml import html
html_content = "<html><body><div class='content'>Hello</div></body></html>"
doc = html.fromstring(html_content)
# Same iteration methods apply
for element in doc.iter():
    print(f"HTML element: {element.tag}")
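A common HTML task is collecting links. The sketch below filters by tag with .iter('a') and reads each link's target and visible text; text_content() is specific to lxml.html elements and includes the text of nested tags. The sample markup is an assumption.

```python
from lxml import html

page = html.fromstring(
    "<html><body>"
    "<a href='/docs'>Docs</a>"
    "<p>No link here</p>"
    "<a href='/blog'>Our <b>blog</b></a>"
    "</body></html>"
)

# Collect (href, visible text) for every anchor in document order
links = [(a.get("href"), a.text_content()) for a in page.iter("a")]

for href, text in links:
    print(href, text)
# /docs Docs
# /blog Our blog
```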