What is the correct way to iterate over elements in an lxml tree?

Iterating over elements in an lxml tree is a fundamental operation when processing XML and HTML documents in Python. The lxml library offers several traversal methods, each suited to different use cases. This guide walks through the most useful ones.

Installation

First, ensure you have the lxml package installed:

pip install lxml

Sample XML Document

Throughout this guide, we'll use this sample XML document:

from lxml import etree

xml_content = """
<catalog>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <price>12.99</price>
    </book>
    <book id="2" category="non-fiction">
        <title>Sapiens</title>
        <author>Yuval Noah Harari</author>
        <price>15.99</price>
    </book>
    <magazine id="3" category="tech">
        <title>Wired</title>
        <issue>March 2024</issue>
    </magazine>
</catalog>
"""

# Parse the XML; fromstring() returns the root element (<catalog>), not an ElementTree
tree = etree.fromstring(xml_content)

Method 1: Iterating Over All Elements (.iter())

The .iter() method visits an element and all of its descendants in document order, which makes it a good default for full-tree traversal:

# Iterate over all elements in the tree
for element in tree.iter():
    # Container elements hold only indentation whitespace here, so strip it
    text = (element.text or "").strip()
    print(f"Tag: {element.tag}, Text: {text!r}")

# Output:
# Tag: catalog, Text: ''
# Tag: book, Text: ''
# Tag: title, Text: 'The Great Gatsby'
# Tag: author, Text: 'F. Scott Fitzgerald'
# Tag: price, Text: '12.99'
# ... (continues for all elements)
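One detail worth illustrating: a plain .iter() call also yields comments and processing instructions, whose .tag is a function rather than a string. The sketch below uses a small hypothetical snippet (not the sample catalog) to show how passing etree.Element as the tag restricts iteration to real elements.

```python
from lxml import etree

# Hypothetical snippet containing a comment node, used only for this illustration
root = etree.fromstring(
    "<catalog><!-- new arrivals --><book/><magazine/></catalog>"
)

# Plain .iter() yields the comment too; its .tag is a function, not a string
all_nodes = list(root.iter())
non_string_tags = [node for node in all_nodes if not isinstance(node.tag, str)]

# Passing etree.Element as the tag keeps only real elements
element_tags = [el.tag for el in root.iter(etree.Element)]

print(len(all_nodes))        # 4 nodes: catalog, comment, book, magazine
print(len(non_string_tags))  # 1 node: the comment
print(element_tags)          # ['catalog', 'book', 'magazine']
```

Filtering this way avoids surprises when later code calls string methods on element.tag.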

Filtering by Tag Name

Pass a tag name to .iter() to filter specific elements:

# Find all book elements
for book in tree.iter('book'):
    title = book.find('title').text
    author = book.find('author').text
    print(f"Book: {title} by {author}")

# Output:
# Book: The Great Gatsby by F. Scott Fitzgerald
# Book: Sapiens by Yuval Noah Harari

Multiple Tag Types

You can iterate over several tag types in a single pass by passing multiple tag names to .iter(). Unlike concatenating separate .iter() calls, this keeps the results in document order:

# Process both books and magazines
for item in tree.iter('book', 'magazine'):
    title = item.find('title').text
    item_type = item.tag
    print(f"{item_type.title()}: {title}")

Method 2: Direct Child Iteration

Iterate over direct children using simple iteration:

# Get the root element (catalog)
root = tree

# Iterate over direct children only
for child in root:
    item_id = child.get('id')
    category = child.get('category')
    title = child.find('title').text
    print(f"ID: {item_id}, Category: {category}, Title: {title}")

# Output:
# ID: 1, Category: fiction, Title: The Great Gatsby
# ID: 2, Category: non-fiction, Title: Sapiens
# ID: 3, Category: tech, Title: Wired
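Direct children can also be filtered by tag, walked in reverse, or accessed by index. The following is a minimal sketch on a trimmed-down, hypothetical version of the catalog; it relies on iterchildren(), len(), and list-style indexing, all of which lxml elements support.

```python
from lxml import etree

# Trimmed-down catalog, used only for this illustration
root = etree.fromstring(
    "<catalog><book id='1'/><book id='2'/><magazine id='3'/></catalog>"
)

# Only the direct children named 'book'
book_ids = [child.get("id") for child in root.iterchildren("book")]

# All direct children, last to first
reversed_ids = [child.get("id") for child in root.iterchildren(reversed=True)]

# Elements behave like lists of their children
first_id = root[0].get("id")
child_count = len(root)

print(book_ids)      # ['1', '2']
print(reversed_ids)  # ['3', '2', '1']
print(first_id, child_count)  # 1 3
```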

Method 3: XPath Expressions

XPath provides the most powerful and flexible element selection:

# Find all elements with XPath
all_titles = tree.xpath('//title')
for title in all_titles:
    print(f"Title: {title.text}")

# Find elements with specific attributes
fiction_books = tree.xpath('//book[@category="fiction"]')
for book in fiction_books:
    title = book.find('title').text
    print(f"Fiction book: {title}")

# Complex XPath queries
expensive_items = tree.xpath('//*[price > 14]')
for item in expensive_items:
    title = item.find('title').text
    price = item.find('price').text
    print(f"Expensive item: {title} - ${price}")
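XPath results are not always elements: text() yields strings and functions such as count() yield numbers. The sketch below, on a shortened copy of the catalog, also shows XPath variables, which lxml accepts as keyword arguments to xpath(); the variable name cat is an arbitrary choice for this example.

```python
from lxml import etree

# Shortened copy of the sample catalog
root = etree.fromstring("""<catalog>
  <book id="1" category="fiction">
    <title>The Great Gatsby</title><price>12.99</price>
  </book>
  <book id="2" category="non-fiction">
    <title>Sapiens</title><price>15.99</price>
  </book>
</catalog>""")

# $cat is bound from the keyword argument, avoiding string formatting
titles = root.xpath("//book[@category=$cat]/title/text()", cat="fiction")

# count() returns a float, not a list
book_count = root.xpath("count(//book)")

# text() results are strings and can be converted directly
prices = [float(p) for p in root.xpath("//price/text()")]

print(titles)      # ['The Great Gatsby']
print(book_count)  # 2.0
print(prices)      # [12.99, 15.99]
```

Binding values through variables rather than building the expression with f-strings sidesteps quoting problems when the value itself contains quotes.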

Advanced XPath Examples

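# Find elements by position
# Note: //book[1] matches every book that is the first book child of its parent;
# in this flat catalog that is a single element
first_book = tree.xpath('//book[1]')[0]  # First book element
last_book = tree.xpath('//book[last()]')[0]  # Last book element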

# Find elements with specific text content
gatsby_book = tree.xpath('//book[title="The Great Gatsby"]')[0]

# Find parent elements
authors = tree.xpath('//author/..')  # Get book elements that contain authors
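Positional predicates are a common source of confusion in nested documents. In this sketch, a hypothetical library/shelf structure (not part of the sample catalog) shows the difference between //book[1], which is evaluated per parent, and (//book)[1], which picks the first match in the whole document.

```python
from lxml import etree

# Hypothetical nested document: books grouped under two shelves
root = etree.fromstring(
    "<library>"
    "<shelf><book id='a'/><book id='b'/></shelf>"
    "<shelf><book id='c'/></shelf>"
    "</library>"
)

# First book child of *each* parent
per_parent = [b.get("id") for b in root.xpath("//book[1]")]

# First book in document order, across all parents
overall = [b.get("id") for b in root.xpath("(//book)[1]")]

print(per_parent)  # ['a', 'c']
print(overall)     # ['a']
```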

Method 4: ElementPath with findall()

Use findall() for ElementPath expressions (simplified XPath):

# Find all titles anywhere in the document
titles = tree.findall('.//title')
for title in titles:
    print(f"Found title: {title.text}")

# Find direct children with specific tag
direct_books = tree.findall('./book')  # Only direct book children
print(f"Found {len(direct_books)} direct book children")

# Combine path elements
book_authors = tree.findall('.//book/author')
for author in book_authors:
    print(f"Book author: {author.text}")
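ElementPath also supports simple attribute predicates, and the related iterfind() and findtext() methods can shorten common patterns. This is a small sketch on a shortened catalog; the 'n/a' default is just a placeholder chosen for the example.

```python
from lxml import etree

# Shortened copy of the sample catalog
root = etree.fromstring("""<catalog>
  <book id="1" category="fiction"><title>The Great Gatsby</title></book>
  <book id="2" category="non-fiction"><title>Sapiens</title></book>
  <magazine id="3" category="tech"><title>Wired</title></magazine>
</catalog>""")

# iterfind() is the lazy counterpart of findall(); [@attr='value'] is supported
fiction_titles = [
    book.findtext("title")
    for book in root.iterfind(".//book[@category='fiction']")
]

# findtext() returns the default instead of raising when the child is missing
magazine = root.find("magazine")
magazine_price = magazine.findtext("price", default="n/a")

print(fiction_titles)   # ['The Great Gatsby']
print(magazine_price)   # n/a
```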

Method 5: Attribute-Based Filtering

Filter elements based on attributes using different approaches:

# Approach 1: XPath with attribute predicates
tech_items = tree.xpath('//*[@category="tech"]')
for item in tech_items:
    print(f"Tech item: {item.find('title').text}")

# Approach 2: Iterate and filter
for element in tree.iter():
    if element.get('category') == 'fiction':
        title = element.find('title')
        if title is not None:
            print(f"Fiction item: {title.text}")

# Approach 3: Multiple attribute conditions
for book in tree.iter('book'):
    if book.get('category') == 'non-fiction' and int(book.get('id')) > 1:
        print(f"Non-fiction book with ID > 1: {book.find('title').text}")

Method 6: Sibling Navigation

Navigate between sibling elements:

# Find the first book
first_book = tree.find('.//book')

# Iterate through all siblings
current = first_book
while current is not None:
    if current.tag in ['book', 'magazine']:
        title = current.find('title').text
        print(f"Sibling: {current.tag} - {title}")
    current = current.getnext()

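# Reverse iteration through siblings, starting from the root's last child
# (an XPath such as '//*[position()=last()]' would return the root itself first)
last_item = tree[-1]
current = last_item
while current is not None:
    # Comments and processing instructions have a non-string tag; skip them
    if isinstance(current.tag, str):
        print(f"Reverse sibling: {current.tag}")
    current = current.getprevious()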
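The manual getnext()/getprevious() loops can often be replaced with itersiblings(), which lxml provides on every element. A minimal sketch, using a trimmed-down catalog for brevity:

```python
from lxml import etree

# Trimmed-down catalog, used only for this illustration
root = etree.fromstring(
    "<catalog><book id='1'/><book id='2'/><magazine id='3'/></catalog>"
)

first, last = root[0], root[-1]

# Siblings after the first child, in document order
following = [s.get("id") for s in first.itersiblings()]

# Siblings before the last child, nearest first (reverse document order)
preceding = [s.get("id") for s in last.itersiblings(preceding=True)]

# A tag name restricts the result to matching siblings
following_books = [s.get("id") for s in first.itersiblings("book")]

print(following)        # ['2', '3']
print(preceding)        # ['2', '1']
print(following_books)  # ['2']
```

Note that itersiblings() never yields the starting element itself, unlike the while loops above, which process it first.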

Method 7: Tree Walking with Parent/Child Navigation

For complex tree traversal patterns:

def walk_tree(element, level=0):
    """Recursively walk the tree with indentation"""
    indent = "  " * level
    text = element.text.strip() if element.text else ""
    print(f"{indent}{element.tag}: {text}")

    # Process all children
    for child in element:
        walk_tree(child, level + 1)

# Walk the entire tree
walk_tree(tree)
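Navigation also works upward. As a rough sketch on a shortened catalog, getparent() and iterancestors() climb from an element toward the root, and getpath() on the owning ElementTree produces a positional XPath string that locates the element again.

```python
from lxml import etree

# Shortened copy of the sample catalog
root = etree.fromstring(
    "<catalog>"
    "<book id='1'><title>The Great Gatsby</title></book>"
    "<book id='2'><title>Sapiens</title></book>"
    "</catalog>"
)

# Pick the second <title> element
title = root.findall(".//title")[1]

# Immediate parent
parent_id = title.getparent().get("id")

# All ancestors, nearest first
ancestors = [a.tag for a in title.iterancestors()]

# Positional XPath for this element, generated by lxml
path = root.getroottree().getpath(title)

print(parent_id)   # 2
print(ancestors)   # ['book', 'catalog']
print(path)
```

Evaluating the generated path with root.xpath(path) returns the same element, which is handy for logging where in a document a value came from.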

Performance Considerations

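  • .iter(): A good default for visiting every element, or every element with a given tag, without building a result list first
  • XPath: The most expressive option, but expression evaluation adds overhead, so it is typically slower for simple queries
  • findall(): Typically faster than XPath for simple paths, at the cost of a more limited expression syntax
  • Direct iteration: The cheapest option when only immediate children are needed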
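Relative speeds depend on document size, structure, and lxml version, so measuring is more reliable than rules of thumb. The following is one possible way to compare approaches with timeit; the synthetic 1000-book catalog and the repetition count are arbitrary assumptions, and real data should be substituted before drawing conclusions.

```python
import timeit
from lxml import etree

# Build a synthetic catalog; the size (1000 books) is an arbitrary choice
root = etree.Element("catalog")
for i in range(1000):
    book = etree.SubElement(root, "book", id=str(i))
    etree.SubElement(book, "title").text = f"Title {i}"

# Three ways of counting every <title> element
approaches = {
    "iter": lambda: sum(1 for _ in root.iter("title")),
    "findall": lambda: len(root.findall(".//title")),
    "xpath": lambda: len(root.xpath("//title")),
}

# Sanity check: all approaches must agree before timing them
counts = {name: fn() for name, fn in approaches.items()}

# Time each approach over 20 runs
timings = {
    name: timeit.timeit(fn, number=20) for name, fn in approaches.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:8s} {seconds:.4f}s for 20 runs ({counts[name]} matches)")
```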

Error Handling and Best Practices

def safe_iterate_with_attributes(tree, tag_name, required_attrs=None):
    """Yield elements with the given tag that carry all required attributes"""
    required_attrs = required_attrs or []

    for element in tree.iter(tag_name):
        # Test for presence in .attrib so empty attribute values still count
        if all(attr in element.attrib for attr in required_attrs):
            yield element

# Usage example
for book in safe_iterate_with_attributes(tree, 'book', ['id', 'category']):
    # findtext() returns the default instead of raising when <title> is missing
    title = book.findtext('title', default='(untitled)')
    print(f"Safe book iteration: {title}")

Choose the iteration method that best matches your specific use case. For simple traversal, use .iter(). For complex queries, leverage XPath expressions. For performance-critical applications, profile different approaches with your actual data.

HTML Documents

The same techniques work with HTML documents using lxml.html:

from lxml import html

html_content = "<html><body><div class='content'>Hello</div></body></html>"
doc = html.fromstring(html_content)

# Same iteration methods apply
for element in doc.iter():
    print(f"HTML element: {element.tag}")
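As a slightly fuller illustration, the sketch below iterates over links in a small hypothetical HTML page. It uses .iter() with a tag filter and text_content(), which lxml.html elements provide for collecting the text of an element together with its descendants.

```python
from lxml import html

# Hypothetical page, used only for this illustration
page = (
    "<html><body><div class='content'>"
    "<a href='/a'>First</a> "
    "<a href='/b'>Second <b>link</b></a>"
    "</div></body></html>"
)
doc = html.fromstring(page)

# Collect (href, visible text) for every anchor, in document order
links = [(a.get("href"), a.text_content()) for a in doc.iter("a")]

print(links)  # [('/a', 'First'), ('/b', 'Second link')]
```

Unlike .text, which stops at the first child element, text_content() includes the text inside nested tags such as the b element above.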
