How do I create new XML or HTML documents from scratch using lxml?
The lxml library provides powerful capabilities for creating new XML and HTML documents programmatically. Whether you need to generate configuration files, RSS feeds, or HTML reports, lxml's ElementTree API offers a clean and intuitive way to build structured documents from the ground up.
Understanding lxml's Document Creation Approach
lxml uses the ElementTree API to represent XML and HTML documents as hierarchical tree structures. Each element in the document is an Element object that can contain text, attributes, and child elements, which makes it straightforward to build complex documents programmatically.
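As a minimal sketch of that model, the snippet below builds a two-element tree and inspects it. The element names `note` and `body` and the `importance` attribute are made up for illustration.

```python
from lxml import etree

# Build a tiny tree: one parent with an attribute and one child with text
note = etree.Element("note", importance="high")
body = etree.SubElement(note, "body")
body.text = "Remember the milk"

print(note.tag)                   # element name
print(note.get("importance"))     # attribute lookup
print(len(note))                  # number of direct children
print(note[0].text)               # children are indexable like a list
print(body.getparent() is note)   # lxml elements know their parent
```

Elements behave much like lists of their children, while attributes are read and written through a dictionary-style interface (`get`, `set`, `attrib`).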
Creating Basic XML Documents
Simple XML Document Creation
Here's how to create a basic XML document using lxml:
from lxml import etree
# Create the root element
root = etree.Element("catalog")
# Add child elements
book = etree.SubElement(root, "book")
book.set("id", "1")
# Add text content to elements
title = etree.SubElement(book, "title")
title.text = "Python Web Scraping Guide"
author = etree.SubElement(book, "author")
author.text = "John Developer"
price = etree.SubElement(book, "price")
price.set("currency", "USD")
price.text = "29.99"
# Create the document tree
tree = etree.ElementTree(root)
# Write to file
tree.write("catalog.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)
This creates an XML file with the following structure:
<?xml version='1.0' encoding='UTF-8'?>
<catalog>
  <book id="1">
    <title>Python Web Scraping Guide</title>
    <author>John Developer</author>
    <price currency="USD">29.99</price>
  </book>
</catalog>
Creating XML with Namespaces
For more complex XML documents that require namespaces:
from lxml import etree
# Namespace URIs
CATALOG_NS = "http://example.com/catalog"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Map prefixes to URIs; the None key declares the default namespace
NSMAP = {
    None: CATALOG_NS,
    "dc": DC_NS,
}
# Create the root in the default namespace using Clark notation ({uri}tag).
# nsmap only declares prefixes; it does not place an element in a namespace.
root = etree.Element(f"{{{CATALOG_NS}}}catalog", nsmap=NSMAP)
# Child elements must be qualified as well
book = etree.SubElement(root, f"{{{CATALOG_NS}}}book")
title = etree.SubElement(book, f"{{{CATALOG_NS}}}title")
title.text = "Web Scraping with Python"
# Add an element in the Dublin Core namespace (serialized as dc:creator)
creator = etree.SubElement(book, f"{{{DC_NS}}}creator")
creator.text = "Jane Developer"
# Output the document
print(etree.tostring(root, encoding="unicode", pretty_print=True))
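Qualified names are easy to get wrong, so it can help to inspect what was actually built. The sketch below assumes every element is created with Clark notation (`{uri}tag`) and then checks the result with `etree.QName` and a namespace-aware `find`. The `c` prefix in the search map is an arbitrary choice for this example.

```python
from lxml import etree

CATALOG_NS = "http://example.com/catalog"
DC_NS = "http://purl.org/dc/elements/1.1/"

root = etree.Element(f"{{{CATALOG_NS}}}catalog", nsmap={None: CATALOG_NS, "dc": DC_NS})
book = etree.SubElement(root, f"{{{CATALOG_NS}}}book")
creator = etree.SubElement(book, f"{{{DC_NS}}}creator")
creator.text = "Jane Developer"

# QName splits a qualified tag into its namespace URI and local name
qname = etree.QName(creator)
print(qname.namespace, qname.localname)

# Path searches need explicit prefixes, so the default namespace
# is bound to an arbitrary prefix ("c") for the lookup
ns = {"c": CATALOG_NS, "dc": DC_NS}
found = root.find("c:book/dc:creator", namespaces=ns)
print(found.text)
```

If `find` returns `None` for a path you expected to match, an element created without its namespace is a likely cause.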
Creating HTML Documents
Basic HTML Document Structure
Creating HTML documents follows a similar pattern but uses HTML-specific elements:
from lxml import etree
# Create HTML document structure
html_doc = etree.Element("html")
head = etree.SubElement(html_doc, "head")
body = etree.SubElement(html_doc, "body")
# Add meta information
title = etree.SubElement(head, "title")
title.text = "Web Scraping Report"
meta_charset = etree.SubElement(head, "meta")
meta_charset.set("charset", "UTF-8")
# Add body content
h1 = etree.SubElement(body, "h1")
h1.text = "Scraping Results"
div = etree.SubElement(body, "div")
div.set("class", "content")
p = etree.SubElement(div, "p")
p.text = "This report contains the latest scraping results."
# Create a table
table = etree.SubElement(div, "table")
table.set("border", "1")
# Table header
thead = etree.SubElement(table, "thead")
tr_head = etree.SubElement(thead, "tr")
th1 = etree.SubElement(tr_head, "th")
th1.text = "URL"
th2 = etree.SubElement(tr_head, "th")
th2.text = "Status"
# Table body
tbody = etree.SubElement(table, "tbody")
tr_body = etree.SubElement(tbody, "tr")
td1 = etree.SubElement(tr_body, "td")
td1.text = "https://example.com"
td2 = etree.SubElement(tr_body, "td")
td2.text = "Success"
# Write HTML file
tree = etree.ElementTree(html_doc)
tree.write("report.html", encoding="utf-8", method="html", pretty_print=True)
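When the table rows come from collected data rather than hand-written values, the same calls can be wrapped in a loop. The following is a sketch only: the helper name `build_results_table` and the sample rows are hypothetical.

```python
from lxml import etree

def build_results_table(rows):
    """Return a <table> element with one body row per (url, status) pair."""
    table = etree.Element("table", border="1")
    header = etree.SubElement(etree.SubElement(table, "thead"), "tr")
    for label in ("URL", "Status"):
        etree.SubElement(header, "th").text = label
    tbody = etree.SubElement(table, "tbody")
    for url, status in rows:
        tr = etree.SubElement(tbody, "tr")
        etree.SubElement(tr, "td").text = url
        etree.SubElement(tr, "td").text = status
    return table

# Hypothetical sample data
results = [("https://example.com", "Success"), ("https://example.org", "Timeout")]
table = build_results_table(results)
print(etree.tostring(table, encoding="unicode", method="html", pretty_print=True))
```

Because the text is assigned through `.text`, characters such as `<` and `&` in the data are escaped on output rather than interpreted as markup.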
Using HTML Builder for Complex Documents
For more complex HTML documents, you can use lxml's HTML builder:
from lxml.html import builder as E
from lxml import etree
# Create HTML using the builder pattern.
# E.CLASS("...") sets the class attribute ("class" is a reserved word in Python).
doc = E.HTML(
    E.HEAD(
        E.TITLE("Data Visualization Dashboard"),
        E.META(charset="UTF-8"),
        E.LINK(rel="stylesheet", href="styles.css")
    ),
    E.BODY(
        E.DIV(
            E.H1("Analytics Dashboard", E.CLASS("header")),
            E.DIV(
                E.H2("Key Metrics"),
                E.UL(
                    E.LI("Total Requests: 1,234"),
                    E.LI("Success Rate: 98.5%"),
                    E.LI("Average Response Time: 245ms")
                ),
                E.CLASS("metrics")
            ),
            E.DIV(
                E.H2("Recent Activity"),
                E.TABLE(
                    E.TR(E.TH("Time"), E.TH("Action"), E.TH("Result")),
                    E.TR(E.TD("10:30"), E.TD("Page Scan"), E.TD("Complete")),
                    E.TR(E.TD("10:35"), E.TD("Data Extract"), E.TD("Complete")),
                    E.CLASS("activity-table")
                ),
                E.CLASS("activity")
            ),
            E.CLASS("container")
        )
    )
)
# Convert to string and save
html_content = etree.tostring(doc, encoding="unicode", method="html", pretty_print=True)
with open("dashboard.html", "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html>\n" + html_content)
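As an alternative to prepending the doctype by hand, `lxml.html.tostring` accepts a `doctype` argument and serializes with the HTML method by default. A small sketch, with a made-up page and class name:

```python
from lxml import html
from lxml.html import builder as E

page = E.HTML(
    E.HEAD(E.TITLE("Minimal Page")),
    E.BODY(E.P("Hello", E.CLASS("greeting"))),
)

# lxml.html.tostring defaults to method="html" and can emit the doctype itself
markup = html.tostring(
    page,
    encoding="unicode",
    pretty_print=True,
    doctype="<!DOCTYPE html>",
)
print(markup)
```

With `encoding="unicode"` the result is a `str`, so it can be written to a file opened in text mode.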
Advanced Document Creation Techniques
Creating Documents with CDATA Sections
For XML documents that need to include unescaped content:
from lxml import etree
root = etree.Element("configuration")
script_section = etree.SubElement(root, "script")
# Add CDATA section
script_content = """
function validateData(data) {
    return data && data.length > 0;
}
"""
script_section.text = etree.CDATA(script_content)
print(etree.tostring(root, encoding="unicode", pretty_print=True))
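To see what CDATA changes, the sketch below serializes the same text twice, once as ordinary text and once wrapped in `etree.CDATA`. The snippet string is an arbitrary example.

```python
from lxml import etree

snippet = "if (a < b && b > 0) { run(); }"

escaped = etree.Element("script")
escaped.text = snippet                # plain text: special characters are escaped

wrapped = etree.Element("script")
wrapped.text = etree.CDATA(snippet)   # CDATA: content is written verbatim

escaped_xml = etree.tostring(escaped, encoding="unicode")
wrapped_xml = etree.tostring(wrapped, encoding="unicode")
print(escaped_xml)
print(wrapped_xml)

# Both forms parse back to the same text
same = etree.fromstring(escaped_xml).text == etree.fromstring(wrapped_xml).text
print(same)
```

Either form carries the same data for a conforming parser. CDATA mainly keeps embedded code readable for people viewing the raw file.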
Dynamic Document Generation
Creating documents based on data structures:
from lxml import etree
def create_product_catalog(products):
    """Create XML catalog from product data"""
    root = etree.Element("catalog")
    for product in products:
        product_elem = etree.SubElement(root, "product")
        product_elem.set("id", str(product["id"]))
        # Every key other than "id" becomes a child element
        for key, value in product.items():
            if key != "id":
                elem = etree.SubElement(product_elem, key)
                elem.text = str(value)
    return etree.ElementTree(root)
# Sample data
products = [
    {"id": 1, "name": "Laptop", "price": 999.99, "category": "Electronics"},
    {"id": 2, "name": "Book", "price": 19.99, "category": "Literature"},
]
catalog = create_product_catalog(products)
catalog.write("products.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)
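Using dictionary keys as tag names only works when the keys are valid XML names. lxml raises `ValueError` for a key such as `"unit price"`. One possible guard is to normalize keys first. The helper `to_xml_name` and the sample keys below are hypothetical, and the replacement rule is deliberately conservative (ASCII only).

```python
import re
from lxml import etree

def to_xml_name(key):
    """Turn an arbitrary dictionary key into a conservative XML element name."""
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", str(key))
    # XML names may not start with a digit, hyphen, or period
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name

record = {"unit price": 19.99, "2nd-hand": False}  # hypothetical keys
product = etree.Element("product")
for key, value in record.items():
    etree.SubElement(product, to_xml_name(key)).text = str(value)

print(etree.tostring(product, encoding="unicode"))
```

Normalizing can make two different keys collide on the same tag name, so a fixed mapping from keys to element names may be the safer choice for data you control.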
Creating RSS Feeds
XML document creation is particularly useful for generating RSS feeds:
from lxml import etree
def create_rss_feed(title, description, link, items):
    """Create RSS 2.0 feed"""
    rss = etree.Element("rss")
    rss.set("version", "2.0")
    channel = etree.SubElement(rss, "channel")
    # Channel metadata
    channel_title = etree.SubElement(channel, "title")
    channel_title.text = title
    channel_desc = etree.SubElement(channel, "description")
    channel_desc.text = description
    channel_link = etree.SubElement(channel, "link")
    channel_link.text = link
    # Add items
    for item_data in items:
        item = etree.SubElement(channel, "item")
        item_title = etree.SubElement(item, "title")
        item_title.text = item_data["title"]
        item_desc = etree.SubElement(item, "description")
        item_desc.text = item_data["description"]
        item_link = etree.SubElement(item, "link")
        item_link.text = item_data["link"]
        # pubDate is expected as a preformatted RFC 822 date string
        pub_date = etree.SubElement(item, "pubDate")
        pub_date.text = item_data["pub_date"]
    return etree.ElementTree(rss)
# Create feed
feed_items = [
    {
        "title": "New Web Scraping Tutorial",
        "description": "Learn advanced scraping techniques",
        "link": "https://example.com/tutorial",
        "pub_date": "Mon, 01 Jan 2024 12:00:00 GMT"
    }
]
rss_feed = create_rss_feed(
    "Tech Blog",
    "Latest web development tutorials",
    "https://example.com",
    feed_items
)
rss_feed.write("feed.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)
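The `pub_date` values above are hard-coded strings. If your items carry `datetime` objects instead, the standard library can produce the same date format. A sketch, where the helper name `rss_date` is an assumption for this example:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def rss_date(moment):
    """Format a timezone-aware datetime as an RFC 822 style date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)

published = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(rss_date(published))  # Mon, 01 Jan 2024 12:00:00 GMT
```

The returned string can be assigned directly to the `pubDate` element's `text`.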
Best Practices and Performance Tips
Memory Management
When creating large documents, keep in mind that a tree built with Element and SubElement stays entirely in memory until it is written out:
from lxml import etree
def create_large_document_efficiently():
    """Build a large XML document batch by batch (the full tree is held in memory)"""
    root = etree.Element("data")
    # Group items into batches to keep the document structure manageable
    for batch in range(100):  # Process 100 batches
        batch_elem = etree.SubElement(root, "batch")
        batch_elem.set("id", str(batch))
        for i in range(1000):  # 1000 items per batch
            item = etree.SubElement(batch_elem, "item")
            item.text = f"Item {batch * 1000 + i}"
        # Do not call batch_elem.clear() here: it would remove the items just
        # added, and the returned tree would contain only empty <batch> elements.
        # To cap memory use, serialize incrementally (e.g. with etree.xmlfile).
    return etree.ElementTree(root)
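For documents too large to hold comfortably in memory, lxml's incremental writer `etree.xmlfile` can serialize each batch as soon as it is built. The sketch below assumes the same `data`/`batch`/`item` layout. The function name `stream_large_document` and its parameters are illustrative.

```python
import io
from lxml import etree

def stream_large_document(target, batches=100, items_per_batch=1000):
    """Write the document batch by batch instead of holding the whole tree."""
    with etree.xmlfile(target, encoding="utf-8") as xf:
        xf.write_declaration()
        with xf.element("data"):
            for batch in range(batches):
                # Only one batch exists as a tree at any time
                batch_elem = etree.Element("batch", id=str(batch))
                for i in range(items_per_batch):
                    item = etree.SubElement(batch_elem, "item")
                    item.text = f"Item {batch * items_per_batch + i}"
                # Serialize now; the batch can be garbage collected afterwards
                xf.write(batch_elem)

# Small in-memory run; pass a file path or a binary file object for real output
buffer = io.BytesIO()
stream_large_document(buffer, batches=2, items_per_batch=3)
print(buffer.getvalue().decode("utf-8"))
```

The trade-off is that a streamed document cannot be revisited: once a batch is written, it can no longer be modified or queried.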
Validation and Error Handling
Check the documents you create before using them. A tree built through the API is normally well-formed already, so the round trip below mainly guards against serialization or encoding problems:
from lxml import etree
def create_and_validate_xml():
    """Create XML and confirm the serialized output parses cleanly"""
    try:
        root = etree.Element("document")
        # Add content
        content = etree.SubElement(root, "content")
        content.text = "Sample content"
        tree = etree.ElementTree(root)
        # Basic check: serialize, then parse the result again
        xml_string = etree.tostring(tree, encoding="unicode")
        etree.fromstring(xml_string)  # Raises XMLSyntaxError if not well-formed
        return tree
    except etree.XMLSyntaxError as e:
        print(f"XML Syntax Error: {e}")
        return None
    except Exception as e:
        print(f"Error creating document: {e}")
        return None
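Well-formedness says nothing about whether the document has the structure its consumers expect. For that, lxml can validate a tree against a schema. The sketch below uses `etree.XMLSchema` with a deliberately tiny, hypothetical XSD. A real project would load its own schema file.

```python
from lxml import etree

# Hypothetical schema: a <document> root holding exactly one <content> string
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="content" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
schema = etree.XMLSchema(etree.fromstring(XSD))

good = etree.Element("document")
etree.SubElement(good, "content").text = "Sample content"

bad = etree.Element("document")
etree.SubElement(bad, "summary").text = "Unexpected element"

print(schema.validate(good))      # validate() returns a bool
print(schema.validate(bad))
for error in schema.error_log:    # details of the most recent failure
    print(error.message)
```

`schema.assertValid(tree)` is the raising variant if you prefer an exception to a boolean result.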
Integration with Web Scraping Workflows
When building documents from scraped data, you might need to combine lxml's document creation with its parsing capabilities. While this article focuses on creating documents from scratch, you may also want to learn about parsing existing XML documents with lxml for more comprehensive XML handling workflows.
For web scraping projects that generate reports or export data, creating structured documents is essential. The techniques shown here work well with data collected through various scraping methods and can be integrated into automated reporting systems.
Conclusion
Creating XML and HTML documents from scratch using lxml is straightforward once you understand the ElementTree API. The library provides excellent performance and flexibility for generating everything from simple configuration files to complex HTML reports and RSS feeds. Remember to handle encoding properly, validate your output, and consider memory usage when working with large documents.
Whether you're generating reports from scraped data, creating configuration files, or building dynamic web content, lxml's document creation capabilities provide a robust foundation for your XML and HTML generation needs.