How do I use lxml to transform XML documents using XSLT?
XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents into other formats. Python's lxml library supports XSLT 1.0 through the libxslt processor, so you can convert XML data into HTML, plain text, or other XML formats programmatically. This guide covers basic transformations, parameters, error handling, extension functions, performance, and common stylesheet patterns.
Understanding XSLT and lxml
XSLT works by applying transformation rules defined in an XSLT stylesheet to an XML document. The lxml library provides a robust XSLT processor that can handle complex transformations efficiently. This is particularly useful in web scraping scenarios where you need to transform scraped XML data into more usable formats.
Basic XSLT Transformation Setup
First, ensure you have lxml installed:
pip install lxml
Here's the basic structure for performing XSLT transformations with lxml:
from lxml import etree
# Load XML document
xml_doc = etree.parse('input.xml')
# Load XSLT stylesheet
xslt_doc = etree.parse('transform.xsl')
# Create XSLT transformer
transform = etree.XSLT(xslt_doc)
# Apply transformation
result = transform(xml_doc)
# Get the transformed output
output = str(result)
print(output)
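If you want to try the same steps without creating files first, the following is a minimal sketch that uses in-memory strings. The sample document and the one-template stylesheet are illustrative assumptions, not part of the files referenced above.

```python
from lxml import etree

# Hypothetical inline inputs standing in for input.xml and transform.xsl
xml_root = etree.XML("<greeting><name>World</name></greeting>")
xslt_root = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">Hello, <xsl:value-of select="greeting/name"/>!</xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt_root)  # compile the stylesheet
result = transform(xml_root)       # apply it to the parsed XML
output = str(result)               # serialise the result as text
print(output)
```

The flow is the same as with files: parse both inputs, compile the stylesheet with etree.XSLT, call the transformer, and serialise the result.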
Creating XSLT Stylesheets
An XSLT stylesheet defines how to transform XML elements. Here's a simple example that converts XML to HTML:
transform.xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <head>
        <title>Transformed Document</title>
      </head>
      <body>
        <xsl:apply-templates select="//book"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="book">
    <div class="book">
      <h2><xsl:value-of select="title"/></h2>
      <p>Author: <xsl:value-of select="author"/></p>
      <p>Price: $<xsl:value-of select="price"/></p>
    </div>
  </xsl:template>
</xsl:stylesheet>
Practical XSLT Transformation Examples
Example 1: XML to HTML Transformation
Consider this XML document (books.xml):
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="1">
    <title>Python Web Scraping</title>
    <author>John Doe</author>
    <price>29.99</price>
    <category>Programming</category>
  </book>
  <book id="2">
    <title>Data Mining Techniques</title>
    <author>Jane Smith</author>
    <price>39.99</price>
    <category>Data Science</category>
  </book>
</library>
Here's the Python code to transform it:
from lxml import etree

def transform_xml_to_html(xml_file, xslt_file, output_file):
    """Transform XML document to HTML using XSLT stylesheet."""
    try:
        # Parse XML document
        xml_doc = etree.parse(xml_file)
        # Parse XSLT stylesheet
        xslt_doc = etree.parse(xslt_file)
        # Create transformer (raises XSLTParseError for an invalid stylesheet)
        transform = etree.XSLT(xslt_doc)
        # Apply transformation
        result = transform(xml_doc)
        # Save to file
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(str(result))
        print(f"Transformation successful. Output saved to {output_file}")
        return str(result)
    except etree.XSLTParseError as e:
        print(f"XSLT stylesheet error: {e}")
    except etree.XSLTApplyError as e:
        print(f"XSLT transformation error: {e}")
    except etree.XMLSyntaxError as e:
        print(f"XML parsing error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage (returns None if an error was caught and printed)
result = transform_xml_to_html('books.xml', 'transform.xsl', 'output.html')
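As an alternative to writing str(result) yourself, lxml's XSLT result objects also offer a write_output() method that serialises the result according to the stylesheet's xsl:output settings. The sketch below assumes a reasonably recent lxml release; the inline sample data and the temporary output path are illustrative only.

```python
import os
import tempfile
from lxml import etree

xml_doc = etree.XML(
    "<library><book><title>Python Web Scraping</title></book></library>"
)
xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  <xsl:template match="/">
    <html><body><h2><xsl:value-of select="//title"/></h2></body></html>
  </xsl:template>
</xsl:stylesheet>
""")

result = etree.XSLT(xslt_doc)(xml_doc)

# Illustrative output location; any writable path works
output_path = os.path.join(tempfile.mkdtemp(), "output.html")
# Serialise using the method and encoding declared in xsl:output
result.write_output(output_path)

with open(output_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

This keeps the encoding decision in one place, the stylesheet, instead of splitting it between xsl:output and the open() call.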
Example 2: XML to CSV Transformation
You can also transform XML to CSV format using XSLT:
xml_to_csv.xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>ID,Title,Author,Price,Category&#10;</xsl:text>
    <xsl:for-each select="//book">
      <xsl:value-of select="@id"/>,<xsl:value-of select="title"/>,<xsl:value-of select="author"/>,<xsl:value-of select="price"/>,<xsl:value-of select="category"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
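A sketch of running a CSV stylesheet of this kind from Python follows. It inlines the sample library data and uses the &#10; character reference for line breaks. Note that this simple approach does not quote field values, so it assumes no title or author contains a comma.

```python
from lxml import etree

books_xml = b"""<library>
  <book id="1"><title>Python Web Scraping</title><author>John Doe</author><price>29.99</price><category>Programming</category></book>
  <book id="2"><title>Data Mining Techniques</title><author>Jane Smith</author><price>39.99</price><category>Data Science</category></book>
</library>"""

csv_xslt = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>ID,Title,Author,Price,Category&#10;</xsl:text>
    <xsl:for-each select="//book">
      <xsl:value-of select="@id"/>,<xsl:value-of select="title"/>,<xsl:value-of select="author"/>,<xsl:value-of select="price"/>,<xsl:value-of select="category"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(csv_xslt))
csv_text = str(transform(etree.fromstring(books_xml)))
print(csv_text)
```

Whitespace-only text in the stylesheet is dropped by the processor, which is why the line breaks have to be written explicitly inside xsl:text.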
Working with XSLT Parameters
XSLT transformations can accept parameters, making them more flexible:
from lxml import etree

def transform_with_parameters(xml_data, xslt_data, params=None):
    """Transform XML with XSLT parameters."""
    xml_doc = etree.fromstring(xml_data)
    xslt_doc = etree.fromstring(xslt_data)
    transform = etree.XSLT(xslt_doc)
    # Apply transformation with parameters
    if params:
        result = transform(xml_doc, **params)
    else:
        result = transform(xml_doc)
    return str(result)

# XSLT with parameters. This is a bytes literal because lxml rejects
# str input that carries an XML encoding declaration.
xslt_with_params = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="category-filter" select="'all'"/>
  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <h1>Books in category: <xsl:value-of select="$category-filter"/></h1>
        <xsl:choose>
          <xsl:when test="$category-filter = 'all'">
            <xsl:apply-templates select="//book"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="//book[category = $category-filter]"/>
          </xsl:otherwise>
        </xsl:choose>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="book">
    <div><xsl:value-of select="title"/> - <xsl:value-of select="author"/></div>
  </xsl:template>
</xsl:stylesheet>"""

# Usage with parameters: read books.xml as bytes for the same reason
with open('books.xml', 'rb') as f:
    xml_data = f.read()

# strparam() quotes the value so it is passed as a string, not as XPath
params = {'category-filter': etree.XSLT.strparam('Programming')}
result = transform_with_parameters(xml_data, xslt_with_params, params)
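One detail worth spelling out: lxml evaluates plain parameter values as XPath expressions, which is why string values go through etree.XSLT.strparam(). The sketch below contrasts a numeric parameter passed as an XPath expression with a string parameter. The parameter names max-price and label, and the sample data, are made up for illustration.

```python
from lxml import etree

xml_doc = etree.XML(
    "<library>"
    "<book><title>Python Web Scraping</title><price>29.99</price></book>"
    "<book><title>Data Mining Techniques</title><price>39.99</price></book>"
    "</library>"
)
xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="max-price" select="1000"/>
  <xsl:param name="label" select="'Books'"/>
  <xsl:template match="/">
    <xsl:value-of select="$label"/>: <xsl:value-of select="count(//book[price &lt;= $max-price])"/>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_doc)

params = {
    "max-price": "35",                            # evaluated as the XPath number 35
    "label": etree.XSLT.strparam("Cheap books"),  # passed as a literal string
}
summary = str(transform(xml_doc, **params))
print(summary)
```

Because the names contain hyphens, they cannot be written as Python keyword arguments directly, so the dictionary is unpacked with **.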
Advanced XSLT Features with lxml
Error Handling and Validation
Proper error handling is crucial when working with XSLT transformations:
from lxml import etree
import logging

class XSLTProcessor:
    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def validate_xml(self, xml_content):
        """Validate XML document."""
        try:
            etree.fromstring(xml_content)
            return True
        except etree.XMLSyntaxError as e:
            self.logger.error(f"XML validation failed: {e}")
            return False

    def validate_xslt(self, xslt_content):
        """Validate XSLT stylesheet."""
        try:
            xslt_doc = etree.fromstring(xslt_content)
            etree.XSLT(xslt_doc)
            return True
        except (etree.XMLSyntaxError, etree.XSLTParseError) as e:
            self.logger.error(f"XSLT validation failed: {e}")
            return False

    def transform(self, xml_content, xslt_content, params=None):
        """Safe XSLT transformation with validation."""
        if not self.validate_xml(xml_content):
            raise ValueError("Invalid XML document")
        if not self.validate_xslt(xslt_content):
            raise ValueError("Invalid XSLT stylesheet")
        try:
            xml_doc = etree.fromstring(xml_content)
            xslt_doc = etree.fromstring(xslt_content)
            transform = etree.XSLT(xslt_doc)
            if params:
                result = transform(xml_doc, **params)
            else:
                result = transform(xml_doc)
            return str(result)
        except etree.XSLTApplyError as e:
            self.logger.error(f"XSLT transformation failed: {e}")
            raise
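Besides exceptions, the compiled transformer keeps an error_log that collects messages emitted during a run, including xsl:message output. The following is a small sketch of reading it; the stylesheet and message text are illustrative assumptions.

```python
from lxml import etree

xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:message terminate="no">processing started</xsl:message>
    <xsl:value-of select="count(//book)"/>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_doc)
result = transform(etree.XML("<library><book/><book/></library>"))

# Each log entry carries the message text plus line and column information
messages = [entry.message for entry in transform.error_log]
for entry in transform.error_log:
    print(f"line {entry.line}: {entry.message}")
```

Logging these entries next to any caught exception usually makes stylesheet problems much easier to locate.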
Using Extension Functions
lxml allows you to register custom Python functions for use in XSLT:
from lxml import etree

def format_price(context, price):
    """Custom function to format price with currency."""
    try:
        return f"${float(price):.2f}"
    except (ValueError, TypeError):
        return "N/A"

def transform_with_extensions(xml_data, xslt_data):
    """Transform XML using custom extension functions."""
    # Register extension function
    ns = etree.FunctionNamespace("http://example.com/functions")
    ns.prefix = "custom"
    ns["format-price"] = format_price

    xml_doc = etree.fromstring(xml_data)
    xslt_doc = etree.fromstring(xslt_data)
    transform = etree.XSLT(xslt_doc)
    result = transform(xml_doc)
    return str(result)
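The stylesheet side of an extension function is not shown above, so here is a sketch of what a call could look like. It assumes the stylesheet declares the same namespace URI under the custom prefix, and it wraps the argument in string() because a bare node-set such as price would reach the Python function as a list of elements rather than as text. The sample books are hypothetical.

```python
from lxml import etree

def format_price(context, price):
    """Format a price string with a currency symbol (mirrors the helper above)."""
    try:
        return f"${float(price):.2f}"
    except (ValueError, TypeError):
        return "N/A"

# Register the function under the namespace the stylesheet will reference
ns = etree.FunctionNamespace("http://example.com/functions")
ns["format-price"] = format_price

xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:custom="http://example.com/functions">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//book">
      <xsl:value-of select="title"/>: <xsl:value-of select="custom:format-price(string(price))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")

# Hypothetical sample data, including one price that is not a number
xml_doc = etree.XML(
    "<library>"
    "<book><title>Python Web Scraping</title><price>29.5</price></book>"
    "<book><title>Untitled Draft</title><price>TBD</price></book>"
    "</library>"
)
formatted = str(etree.XSLT(xslt_doc)(xml_doc))
print(formatted)
```

Registering through FunctionNamespace is global to the process. If that is a concern, lxml's XSLT constructor also accepts an extensions mapping scoped to a single transformer.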
Performance Optimization Tips
Reusing XSLT Transformers
For better performance when applying the same transformation multiple times:
from lxml import etree

class CachedXSLTProcessor:
    def __init__(self):
        self._transformers = {}

    def get_transformer(self, xslt_content):
        """Get cached transformer or create new one."""
        # Key on the stylesheet content itself: distinct stylesheets can
        # share a hash() value, which would return the wrong transformer
        if xslt_content not in self._transformers:
            xslt_doc = etree.fromstring(xslt_content)
            self._transformers[xslt_content] = etree.XSLT(xslt_doc)
        return self._transformers[xslt_content]

    def transform(self, xml_content, xslt_content):
        """Transform using cached transformer."""
        xml_doc = etree.fromstring(xml_content)
        transformer = self.get_transformer(xslt_content)
        result = transformer(xml_doc)
        return str(result)

# Usage
processor = CachedXSLTProcessor()
result1 = processor.transform(xml_data1, xslt_stylesheet)
result2 = processor.transform(xml_data2, xslt_stylesheet)  # Uses cached transformer
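If a class feels heavy for this, the same idea can be sketched with functools.lru_cache from the standard library. The function name get_transformer, the cache size, and the sample stylesheet below are arbitrary choices for illustration.

```python
from functools import lru_cache
from lxml import etree

@lru_cache(maxsize=32)
def get_transformer(xslt_bytes):
    """Compile a stylesheet once per distinct bytes value."""
    return etree.XSLT(etree.fromstring(xslt_bytes))

stylesheet = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/"><xsl:value-of select="count(//book)"/></xsl:template>
</xsl:stylesheet>"""

first = get_transformer(stylesheet)
second = get_transformer(stylesheet)  # served from the cache, no recompilation

count_a = str(first(etree.XML("<library><book/></library>")))
count_b = str(second(etree.XML("<library><book/><book/><book/></library>")))
print(count_a, count_b)
```

The cache key is the stylesheet content, so two different stylesheets can never be confused with each other.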
Integration with Web Scraping Workflows
Scraping workflows often return XML data that needs transformation before it is useful. Here's how to integrate XSLT processing into your scraping workflow:
from lxml import etree
import requests

def scrape_and_transform_xml(url, xslt_stylesheet):
    """Scrape XML data and apply XSLT transformation."""
    try:
        # Fetch XML data (the timeout keeps a stalled server from hanging the call)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Parse XML from the raw bytes so the document's own encoding is honoured
        xml_doc = etree.fromstring(response.content)
        # Load XSLT stylesheet
        xslt_doc = etree.fromstring(xslt_stylesheet)
        transform = etree.XSLT(xslt_doc)
        # Apply transformation
        result = transform(xml_doc)
        return str(result)
    except requests.RequestException as e:
        print(f"HTTP request failed: {e}")
    except etree.XSLTApplyError as e:
        print(f"XSLT transformation failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
Common XSLT Patterns and Use Cases
Filtering and Sorting Data
<?xml version="1.0" encoding="UTF-8"?>
<!-- XSLT for filtering and sorting books by price -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <h1>Books under $35 (sorted by price)</h1>
        <xsl:for-each select="//book[price &lt; 35]">
          <xsl:sort select="price" data-type="number"/>
          <div>
            <strong><xsl:value-of select="title"/></strong> -
            $<xsl:value-of select="price"/>
          </div>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
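To see the filter and sort in action from Python, here is a sketch that uses text output so the ordering is easy to check. The third book is hypothetical sample data added to make the sort visible. Note that the less-than sign inside the select attribute has to be written as &lt; for the stylesheet to be well-formed XML.

```python
from lxml import etree

xml_doc = etree.XML(
    "<library>"
    "<book><title>Python Web Scraping</title><price>29.99</price></book>"
    "<book><title>Data Mining Techniques</title><price>39.99</price></book>"
    "<book><title>Pocket Regex</title><price>19.50</price></book>"
    "</library>"
)
xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//book[price &lt; 35]">
      <xsl:sort select="price" data-type="number"/>
      <xsl:value-of select="title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")

cheap_titles = str(etree.XSLT(xslt_doc)(xml_doc)).splitlines()
print(cheap_titles)
```

Without data-type="number" the sort would compare the prices as strings, which gives wrong results once the values have different numbers of digits.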
Grouping Data
<?xml version="1.0" encoding="UTF-8"?>
<!-- XSLT for grouping books by category -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>
  <xsl:key name="books-by-category" match="book" use="category"/>

  <xsl:template match="/">
    <html>
      <body>
        <h1>Books by Category</h1>
        <xsl:for-each select="//book[generate-id() = generate-id(key('books-by-category', category)[1])]">
          <h2><xsl:value-of select="category"/></h2>
          <ul>
            <xsl:for-each select="key('books-by-category', category)">
              <li><xsl:value-of select="title"/> by <xsl:value-of select="author"/></li>
            </xsl:for-each>
          </ul>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
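This key-based pattern is commonly called Muenchian grouping: the outer loop selects only the first book of each category, and the inner loop uses the key to fetch every book in that category. A text-output sketch run from Python follows; the third book is hypothetical sample data added so that one group has two members.

```python
from lxml import etree

xml_doc = etree.XML(
    "<library>"
    "<book><title>Python Web Scraping</title><category>Programming</category></book>"
    "<book><title>Data Mining Techniques</title><category>Data Science</category></book>"
    "<book><title>Pocket Regex</title><category>Programming</category></book>"
    "</library>"
)
xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:key name="books-by-category" match="book" use="category"/>
  <xsl:template match="/">
    <xsl:for-each select="//book[generate-id() = generate-id(key('books-by-category', category)[1])]">
      <xsl:value-of select="category"/>
      <xsl:text>: </xsl:text>
      <xsl:for-each select="key('books-by-category', category)">
        <xsl:value-of select="title"/>
        <xsl:if test="position() != last()">, </xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")

grouped = str(etree.XSLT(xslt_doc)(xml_doc)).splitlines()
print(grouped)
```

Groups appear in the document order of their first member, so Programming comes before Data Science here.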
Troubleshooting Common Issues
Memory Management
XSLT operates on a fully parsed tree, so a very large document can exhaust memory. If the file is a long sequence of independent records, you can parse it incrementally and transform one record at a time:
import copy
from lxml import etree

def transform_large_xml(xml_file, xslt_file, record_tag):
    """Transform a large XML file one record element at a time."""
    # Compile the stylesheet once and reuse it for every record
    transform = etree.XSLT(etree.parse(xslt_file))
    # iterparse yields each record_tag element once it has been fully parsed
    for event, elem in etree.iterparse(xml_file, events=('end',), tag=record_tag):
        # Copy the record into its own document so the stylesheet sees it
        # as the root. This only works if records can be transformed
        # independently, which depends on your specific XML structure.
        record_doc = etree.ElementTree(copy.deepcopy(elem))
        yield str(transform(record_doc))
        # Free the memory held by the processed record
        elem.clear()
Best Practices
- Validate Input: Always validate both XML and XSLT documents before transformation
- Cache Transformers: Reuse XSLT transformer objects for better performance
- Handle Errors Gracefully: Implement proper error handling for various failure scenarios
- Use Parameters: Make your XSLT stylesheets flexible with parameters
- Memory Management: Be mindful of memory usage with large XML documents
- Security: Be cautious with user-provided XSLT stylesheets as they can contain malicious code
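On the security point, lxml provides etree.XSLTAccessControl to restrict what a stylesheet may do with files and the network. A minimal sketch follows, assuming the stylesheet needs no external resources; the stylesheet and input are illustrative.

```python
from lxml import etree

xslt_doc = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/"><xsl:value-of select="count(//book)"/></xsl:template>
</xsl:stylesheet>
""")

# DENY_ALL blocks file and network access from within the stylesheet
locked_down = etree.XSLT(
    xslt_doc, access_control=etree.XSLTAccessControl.DENY_ALL
)

# A custom policy: allow reading local files, deny writes and network access
custom_policy = etree.XSLTAccessControl(
    read_file=True,
    write_file=False,
    create_dir=False,
    read_network=False,
    write_network=False,
)
restricted = etree.XSLT(xslt_doc, access_control=custom_policy)

book_count = str(locked_down(etree.XML("<library><book/></library>")))
print(book_count)
```

Access control limits the stylesheet's I/O, but it is not a complete sandbox, so untrusted stylesheets still deserve resource limits and review.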
The lxml library provides a powerful and efficient way to perform XSLT transformations in Python. Whether you're transforming scraped XML data into more usable formats or implementing complex data processing pipelines, understanding these techniques will help you build robust and efficient applications. For large inputs, consider combining incremental parsing with your XSLT transformations to keep memory use under control.