
How do I use lxml to transform XML documents using XSLT?

XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents into other formats. Python's lxml library supports XSLT transformations, allowing you to convert XML data into HTML, text, or other XML formats programmatically. This guide covers basic transformations, stylesheet parameters, error handling, extension functions, and performance.

Understanding XSLT and lxml

XSLT works by applying transformation rules defined in an XSLT stylesheet to an XML document. lxml builds on the libxslt processor, which implements XSLT 1.0: a stylesheet is compiled once into a transformer object that can then be applied to any number of documents. This is particularly useful in web scraping scenarios where you need to transform scraped XML data into more usable formats.

Basic XSLT Transformation Setup

First, ensure you have lxml installed:

pip install lxml

Here's the basic structure for performing XSLT transformations with lxml:

from lxml import etree

# Load XML document
xml_doc = etree.parse('input.xml')

# Load XSLT stylesheet
xslt_doc = etree.parse('transform.xsl')

# Create XSLT transformer
transform = etree.XSLT(xslt_doc)

# Apply transformation
result = transform(xml_doc)

# Get the transformed output
output = str(result)
print(output)
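
To try the same steps without creating any files, the documents can also be built from in-memory bytes. The following is a minimal sketch; the sample XML, the stylesheet, and the variable names are made up for illustration.

```python
from lxml import etree

# Illustrative input document and stylesheet (not from a real data source)
xml_root = etree.XML(b"<greeting><name>World</name></greeting>")
xslt_root = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:text>Hello, </xsl:text>
        <xsl:value-of select="greeting/name"/>
        <xsl:text>!</xsl:text>
    </xsl:template>
</xsl:stylesheet>
""")

# Compile the stylesheet, then call the transformer like a function
transform = etree.XSLT(xslt_root)
result = transform(xml_root)

# str() serializes the result according to the stylesheet's xsl:output settings
output = str(result)
print(output)
```

The transformer accepts either a parsed tree (from etree.parse) or an element (from etree.XML or etree.fromstring).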

Creating XSLT Stylesheets

An XSLT stylesheet defines how to transform XML elements. Here's a simple example that converts XML to HTML:

transform.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" indent="yes"/>

    <xsl:template match="/">
        <html>
            <head>
                <title>Transformed Document</title>
            </head>
            <body>
                <xsl:apply-templates select="//book"/>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="book">
        <div class="book">
            <h2><xsl:value-of select="title"/></h2>
            <p>Author: <xsl:value-of select="author"/></p>
            <p>Price: $<xsl:value-of select="price"/></p>
        </div>
    </xsl:template>
</xsl:stylesheet>

Practical XSLT Transformation Examples

Example 1: XML to HTML Transformation

Consider this XML document (books.xml):

<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book id="1">
        <title>Python Web Scraping</title>
        <author>John Doe</author>
        <price>29.99</price>
        <category>Programming</category>
    </book>
    <book id="2">
        <title>Data Mining Techniques</title>
        <author>Jane Smith</author>
        <price>39.99</price>
        <category>Data Science</category>
    </book>
</library>

Here's the Python code to transform it:

from lxml import etree

def transform_xml_to_html(xml_file, xslt_file, output_file):
    """Transform XML document to HTML using XSLT stylesheet."""
    try:
        # Parse XML document
        xml_doc = etree.parse(xml_file)

        # Parse XSLT stylesheet
        xslt_doc = etree.parse(xslt_file)

        # Create transformer
        transform = etree.XSLT(xslt_doc)

        # Apply transformation
        result = transform(xml_doc)

        # Save to file
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(str(result))

        print(f"Transformation successful. Output saved to {output_file}")
        return str(result)

    except etree.XSLTApplyError as e:
        print(f"XSLT transformation error: {e}")
    except etree.XMLSyntaxError as e:
        print(f"XML parsing error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
result = transform_xml_to_html('books.xml', 'transform.xsl', 'output.html')

Example 2: XML to CSV Transformation

You can also transform XML to CSV format using XSLT:

xml_to_csv.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <xsl:text>ID,Title,Author,Price,Category&#10;</xsl:text>
        <xsl:for-each select="//book">
            <xsl:value-of select="@id"/>,<xsl:value-of select="title"/>,<xsl:value-of select="author"/>,<xsl:value-of select="price"/>,<xsl:value-of select="category"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
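
The stylesheet above can be applied from Python in the same way as the HTML example. The sketch below embeds the sample books document and the CSV stylesheet as bytes so that it runs on its own; the names BOOKS_XML, CSV_XSLT, and xml_to_csv are illustrative.

```python
from lxml import etree

BOOKS_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book id="1">
        <title>Python Web Scraping</title>
        <author>John Doe</author>
        <price>29.99</price>
        <category>Programming</category>
    </book>
    <book id="2">
        <title>Data Mining Techniques</title>
        <author>Jane Smith</author>
        <price>39.99</price>
        <category>Data Science</category>
    </book>
</library>"""

CSV_XSLT = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:text>ID,Title,Author,Price,Category&#10;</xsl:text>
        <xsl:for-each select="//book">
            <xsl:value-of select="@id"/>,<xsl:value-of select="title"/>,<xsl:value-of select="author"/>,<xsl:value-of select="price"/>,<xsl:value-of select="category"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>"""

def xml_to_csv(xml_bytes, xslt_bytes):
    """Apply a text-output stylesheet and return the result as a string."""
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    return str(transform(etree.fromstring(xml_bytes)))

csv_text = xml_to_csv(BOOKS_XML, CSV_XSLT)
csv_lines = csv_text.splitlines()
print(csv_text)
```

Note that this stylesheet does not quote or escape field values, so a title containing a comma would produce a malformed row. For data like that, it may be simpler to extract the fields with XPath and write them with Python's csv module.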

Working with XSLT Parameters

XSLT transformations can accept parameters, making them more flexible:

from lxml import etree

def transform_with_parameters(xml_data, xslt_data, params=None):
    """Transform XML with XSLT parameters."""
    xml_doc = etree.fromstring(xml_data)
    xslt_doc = etree.fromstring(xslt_data)

    transform = etree.XSLT(xslt_doc)

    # Apply transformation with parameters
    if params:
        result = transform(xml_doc, **params)
    else:
        result = transform(xml_doc)

    return str(result)

# XSLT with parameters (a bytes literal, because lxml rejects str input
# that carries an encoding declaration)
xslt_with_params = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="category-filter" select="'all'"/>
    <xsl:output method="html" indent="yes"/>

    <xsl:template match="/">
        <html>
            <body>
                <h1>Books in category: <xsl:value-of select="$category-filter"/></h1>
                <xsl:choose>
                    <xsl:when test="$category-filter = 'all'">
                        <xsl:apply-templates select="//book"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:apply-templates select="//book[category = $category-filter]"/>
                    </xsl:otherwise>
                </xsl:choose>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="book">
        <div><xsl:value-of select="title"/> - <xsl:value-of select="author"/></div>
    </xsl:template>
</xsl:stylesheet>"""

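# Usage with parameters (xml_data is the books.xml content shown earlier, read as bytes)
with open('books.xml', 'rb') as f:
    xml_data = f.read()

params = {'category-filter': etree.XSLT.strparam('Programming')}
result = transform_with_parameters(xml_data, xslt_with_params, params)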
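
One detail worth illustrating: lxml treats parameter values as XPath expressions, which is why string values go through etree.XSLT.strparam(). The sketch below contrasts a quoted string parameter with plain XPath values; the stylesheet and the parameter names label and limit are made up for this example.

```python
from lxml import etree

xslt_root = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="label" select="'none'"/>
    <xsl:param name="limit" select="0"/>
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:value-of select="$label"/>
        <xsl:text>:</xsl:text>
        <xsl:value-of select="$limit + 1"/>
    </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_root)
doc = etree.XML(b"<root/>")

# strparam() wraps an arbitrary Python string as an XPath string value,
# so quote characters inside it need no manual escaping
quoted = str(transform(doc, label=etree.XSLT.strparam("O'Reilly"), limit="41"))

# A plain string is evaluated as XPath: "2 * 20" is an arithmetic expression,
# and "'text'" (with inner quotes) is a string literal
literal = str(transform(doc, label="'text'", limit="2 * 20"))

print(quoted)
print(literal)
```

Passing an unquoted word such as label="text" would be evaluated as an XPath path selecting child elements named text, which is rarely what is intended.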

Advanced XSLT Features with lxml

Error Handling and Validation

Proper error handling is crucial when working with XSLT transformations:

from lxml import etree
import logging

class XSLTProcessor:
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def validate_xml(self, xml_content):
        """Check that the XML document is well-formed."""
        try:
            etree.fromstring(xml_content)
            return True
        except etree.XMLSyntaxError as e:
            self.logger.error(f"XML validation failed: {e}")
            return False

    def validate_xslt(self, xslt_content):
        """Check that the stylesheet is well-formed XML and compiles as XSLT."""
        try:
            xslt_doc = etree.fromstring(xslt_content)
            etree.XSLT(xslt_doc)
            return True
        except (etree.XMLSyntaxError, etree.XSLTParseError) as e:
            self.logger.error(f"XSLT validation failed: {e}")
            return False

    def transform(self, xml_content, xslt_content, params=None):
        """Safe XSLT transformation with validation."""
        if not self.validate_xml(xml_content):
            raise ValueError("Invalid XML document")

        if not self.validate_xslt(xslt_content):
            raise ValueError("Invalid XSLT stylesheet")

        try:
            xml_doc = etree.fromstring(xml_content)
            xslt_doc = etree.fromstring(xslt_content)
            transform = etree.XSLT(xslt_doc)

            if params:
                result = transform(xml_doc, **params)
            else:
                result = transform(xml_doc)

            return str(result)

        except etree.XSLTApplyError as e:
            self.logger.error(f"XSLT transformation failed: {e}")
            raise
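
Besides exceptions, a transformer keeps an error log of the messages produced during its last run, including those emitted with xsl:message. The sketch below shows one way to read that log; the stylesheet and the "WARN:" prefix are illustrative conventions, not something lxml requires.

```python
from lxml import etree

xslt_root = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:if test="not(//book)">
            <xsl:message terminate="no">WARN: no book elements found</xsl:message>
        </xsl:if>
        <xsl:value-of select="count(//book)"/>
    </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_root)

# The input has no <book> elements, so the stylesheet emits its message
result = transform(etree.XML(b"<library/>"))

# error_log holds the entries recorded during the most recent run
messages = [entry.message for entry in transform.error_log]
for message in messages:
    print(message)
```

Because the log does not distinguish user messages from processor warnings, giving your own messages a recognizable prefix makes them easier to filter.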

Using Extension Functions

lxml allows you to register custom Python functions for use in XSLT:

from lxml import etree

def format_price(context, price):
    """Custom function to format price with currency."""
    try:
        return f"${float(price):.2f}"
    except (ValueError, TypeError):
        return "N/A"

def transform_with_extensions(xml_data, xslt_data):
    """Transform XML using custom extension functions."""
    # Register extension function
    ns = etree.FunctionNamespace("http://example.com/functions")
    ns.prefix = "custom"
    ns["format-price"] = format_price

    xml_doc = etree.fromstring(xml_data)
    xslt_doc = etree.fromstring(xslt_data)

    transform = etree.XSLT(xslt_doc)
    result = transform(xml_doc)

    return str(result)
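
For the registered function to be callable, the stylesheet has to declare the same namespace URI and call the function through its prefix. The sketch below is a self-contained illustration under those assumptions; the sample data is made up. It passes string(.) rather than the node itself, because a node-set argument reaches Python as a list of elements, which float() cannot convert.

```python
from lxml import etree

def format_price(context, price):
    """Format a price string with a currency symbol."""
    try:
        return f"${float(price):.2f}"
    except (ValueError, TypeError):
        return "N/A"

# Register the function under the namespace URI the stylesheet declares
ns = etree.FunctionNamespace("http://example.com/functions")
ns["format-price"] = format_price

xslt_root = etree.XML(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:custom="http://example.com/functions"
    exclude-result-prefixes="custom">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:for-each select="//price">
            <xsl:value-of select="custom:format-price(string(.))"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
""")

xml_root = etree.XML(b"<library><price>29.5</price><price>unknown</price></library>")
formatted = str(etree.XSLT(xslt_root)(xml_root)).splitlines()
print(formatted)
```

FunctionNamespace registers the function globally for the process. To limit a function to one transformer, etree.XSLT also accepts an extensions dictionary keyed by (namespace, name) pairs.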

Performance Optimization Tips

Reusing XSLT Transformers

For better performance when applying the same transformation multiple times:

from lxml import etree

class CachedXSLTProcessor:
    def __init__(self):
        self._transformers = {}

    def get_transformer(self, xslt_content):
        """Get cached transformer or create new one."""
        # Key on the stylesheet content itself; a bare hash() value could
        # collide and return the wrong transformer
        if xslt_content not in self._transformers:
            xslt_doc = etree.fromstring(xslt_content)
            self._transformers[xslt_content] = etree.XSLT(xslt_doc)

        return self._transformers[xslt_content]

    def transform(self, xml_content, xslt_content):
        """Transform using cached transformer."""
        xml_doc = etree.fromstring(xml_content)
        transformer = self.get_transformer(xslt_content)

        result = transformer(xml_doc)
        return str(result)

# Usage (xml_data1, xml_data2, and xslt_stylesheet are bytes objects you provide)
processor = CachedXSLTProcessor()
result1 = processor.transform(xml_data1, xslt_stylesheet)
result2 = processor.transform(xml_data2, xslt_stylesheet)  # Uses cached transformer

Integration with Web Scraping Workflows

Scraping workflows often return XML data that needs transformation before it is useful. Here's how to integrate XSLT processing into your scraping workflow:

from lxml import etree
import requests

def scrape_and_transform_xml(url, xslt_stylesheet):
    """Scrape XML data and apply XSLT transformation."""
    try:
        # Fetch XML data
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Parse XML
        xml_doc = etree.fromstring(response.content)

        # Load XSLT stylesheet
        xslt_doc = etree.fromstring(xslt_stylesheet)
        transform = etree.XSLT(xslt_doc)

        # Apply transformation
        result = transform(xml_doc)

        return str(result)

    except requests.RequestException as e:
        print(f"HTTP request failed: {e}")
    except etree.XSLTApplyError as e:
        print(f"XSLT transformation failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

Common XSLT Patterns and Use Cases

Filtering and Sorting Data

<?xml version="1.0" encoding="UTF-8"?>
<!-- XSLT for filtering and sorting books by price -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" indent="yes"/>

    <xsl:template match="/">
        <html>
            <body>
                <h1>Books under $35 (sorted by price)</h1>
                <xsl:for-each select="//book[price &lt; 35]">
                    <xsl:sort select="price" data-type="number"/>
                    <div>
                        <strong><xsl:value-of select="title"/></strong> - 
                        $<xsl:value-of select="price"/>
                    </div>
                </xsl:for-each>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Grouping Data

<?xml version="1.0" encoding="UTF-8"?>
<!-- XSLT for grouping books by category -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" indent="yes"/>
    <xsl:key name="books-by-category" match="book" use="category"/>

    <xsl:template match="/">
        <html>
            <body>
                <h1>Books by Category</h1>
                <xsl:for-each select="//book[generate-id() = generate-id(key('books-by-category', category)[1])]">
                    <h2><xsl:value-of select="category"/></h2>
                    <ul>
                        <xsl:for-each select="key('books-by-category', category)">
                            <li><xsl:value-of select="title"/> by <xsl:value-of select="author"/></li>
                        </xsl:for-each>
                    </ul>
                </xsl:for-each>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
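
The key-based grouping pattern above (often called Muenchian grouping) can be checked quickly from Python. The sketch below uses a reduced text-output variant of the stylesheet that prints one line per category with its book count; the sample data and the added xsl:sort are illustrative choices, not part of the original stylesheet.

```python
from lxml import etree

xml_root = etree.XML(b"""
<library>
    <book><title>A</title><category>Programming</category></book>
    <book><title>B</title><category>Data Science</category></book>
    <book><title>C</title><category>Programming</category></book>
</library>
""")

xslt_root = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:key name="books-by-category" match="book" use="category"/>
    <xsl:template match="/">
        <!-- Select only the first book of each category -->
        <xsl:for-each select="//book[generate-id() = generate-id(key('books-by-category', category)[1])]">
            <xsl:sort select="category"/>
            <xsl:value-of select="category"/>
            <xsl:text>=</xsl:text>
            <xsl:value-of select="count(key('books-by-category', category))"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
""")

counts = str(etree.XSLT(xslt_root)(xml_root)).splitlines()
print(counts)
```

The outer for-each visits one representative book per category, and the key() lookup then retrieves every book that shares that category.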

Troubleshooting Common Issues

Memory Management

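XSLT operates on a complete in-memory tree, so a very large document cannot be transformed in a single streaming pass. If the file consists of many independent records, you can parse it incrementally and transform one record at a time:

from lxml import etree

def transform_large_xml(xml_file, xslt_file, record_tag):
    """Yield the transformed output of each record_tag element in a large file."""
    # Compile the stylesheet once and reuse it for every record
    transform = etree.XSLT(etree.parse(xslt_file))

    # iterparse builds the tree incrementally; each finished record is
    # handed to the stylesheet and treated as the document root
    for _, elem in etree.iterparse(xml_file, events=('end',), tag=record_tag):
        yield str(transform(elem))

        # Release the processed record and any earlier siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Usage: the stylesheet must be written to handle a single record, e.g. one <book>
for chunk in transform_large_xml('books.xml', 'transform.xsl', 'book'):
    print(chunk)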

Best Practices

  1. Validate Input: Always validate both XML and XSLT documents before transformation
  2. Cache Transformers: Reuse XSLT transformer objects for better performance
  3. Handle Errors Gracefully: Implement proper error handling for various failure scenarios
  4. Use Parameters: Make your XSLT stylesheets flexible with parameters
  5. Memory Management: Be mindful of memory usage with large XML documents
  6. Security: Be cautious with user-provided XSLT stylesheets, since they can read local files, access the network, or consume excessive resources
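
For the security point, lxml exposes an access control object that can be passed when compiling a stylesheet. The following is a minimal sketch of a restrictive setup; the function name compile_untrusted_stylesheet and the sample stylesheet are hypothetical, and this does not limit CPU time or memory, so it is not a complete sandbox.

```python
from lxml import etree

def compile_untrusted_stylesheet(xslt_bytes):
    """Compile a stylesheet that may not touch files or the network."""
    # Avoid DTD loading, entity resolution, and network access while parsing
    parser = etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)
    xslt_root = etree.fromstring(xslt_bytes, parser)

    # DENY_ALL blocks file and network reads and writes during transformation
    return etree.XSLT(xslt_root, access_control=etree.XSLTAccessControl.DENY_ALL)

# A harmless stylesheet still works under the restricted settings
SAFE_XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:value-of select="name(/*)"/>
    </xsl:template>
</xsl:stylesheet>"""

transform = compile_untrusted_stylesheet(SAFE_XSLT)
root_name = str(transform(etree.XML(b"<library/>")))
print(root_name)
```

If a stylesheet only needs read access, etree.XSLTAccessControl can also be constructed with individual flags such as read_file and write_file instead of using DENY_ALL.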

The lxml library provides an efficient way to perform XSLT transformations in Python. Whether you're transforming scraped XML data into more usable formats or implementing complex data processing pipelines, these techniques will help you build robust applications. For very large inputs, consider combining incremental parsing with your XSLT transformations to keep memory usage under control.
