How do I create new XML or HTML documents from scratch using lxml?

The lxml library provides powerful capabilities for creating new XML and HTML documents programmatically. Whether you need to generate configuration files, RSS feeds, or HTML reports, lxml's ElementTree API offers a clean and intuitive way to build structured documents from the ground up.

Understanding lxml's Document Creation Approach

lxml uses the ElementTree API to represent XML and HTML documents as hierarchical tree structures. Each element in the document is an Element object that can contain text, attributes, and child elements. This approach makes it straightforward to build complex documents programmatically.

Creating Basic XML Documents

Simple XML Document Creation

Here's how to create a basic XML document using lxml:

from lxml import etree

# Create the root element
root = etree.Element("catalog")

# Add child elements
book = etree.SubElement(root, "book")
book.set("id", "1")

# Add text content to elements
title = etree.SubElement(book, "title")
title.text = "Python Web Scraping Guide"

author = etree.SubElement(book, "author")
author.text = "John Developer"

price = etree.SubElement(book, "price")
price.set("currency", "USD")
price.text = "29.99"

# Create the document tree
tree = etree.ElementTree(root)

# Write to file
tree.write("catalog.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)

This creates an XML file with the following structure:

<?xml version='1.0' encoding='UTF-8'?>
<catalog>
  <book id="1">
    <title>Python Web Scraping Guide</title>
    <author>John Developer</author>
    <price currency="USD">29.99</price>
  </book>
</catalog>

Creating XML with Namespaces

For more complex XML documents that require namespaces:

from lxml import etree

# Define namespaces
NSMAP = {
    None: "http://example.com/catalog",  # Default namespace
    "dc": "http://purl.org/dc/elements/1.1/"
}

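# Create the root in the default namespace. Declaring a namespace in nsmap
# does not place an element in it, so the tag uses Clark notation: {uri}name
root = etree.Element("{http://example.com/catalog}catalog", nsmap=NSMAP)

# Child elements must be qualified as well
book = etree.SubElement(root, "{http://example.com/catalog}book")
title = etree.SubElement(book, "{http://example.com/catalog}title")
title.text = "Web Scraping with Python"

# Add an element in the Dublin Core namespace; it serializes as dc:creator
creator = etree.SubElement(book, "{http://purl.org/dc/elements/1.1/}creator")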
creator.text = "Jane Developer"

# Output the document
print(etree.tostring(root, encoding="unicode", pretty_print=True))
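
Writing full namespace URIs into every tag gets repetitive. As a minimal sketch, the same kind of document can be built with `etree.QName`, which combines a namespace URI and a local name. The constant names `CATALOG_NS` and `DC_NS` are illustrative choices, not part of lxml.

```python
from lxml import etree

# Illustrative constants; the URIs are the ones used in the example above
CATALOG_NS = "http://example.com/catalog"
DC_NS = "http://purl.org/dc/elements/1.1/"
NSMAP = {None: CATALOG_NS, "dc": DC_NS}

# QName(uri, localname) can be passed wherever a tag name is expected
root = etree.Element(etree.QName(CATALOG_NS, "catalog"), nsmap=NSMAP)
book = etree.SubElement(root, etree.QName(CATALOG_NS, "book"))
creator = etree.SubElement(book, etree.QName(DC_NS, "creator"))
creator.text = "Jane Developer"

# Round-trip through a string to confirm the tags survive serialization
xml_text = etree.tostring(root, encoding="unicode")
reparsed = etree.fromstring(xml_text)
found = reparsed.findall(".//dc:creator", namespaces={"dc": DC_NS})
print(xml_text)
```

Because every element is created with a qualified tag, the in-memory tree and the reparsed output agree on which namespace each element belongs to.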

Creating HTML Documents

Basic HTML Document Structure

Creating HTML documents follows a similar pattern but uses HTML-specific elements:

from lxml import etree

# Create HTML document structure
html_doc = etree.Element("html")
head = etree.SubElement(html_doc, "head")
body = etree.SubElement(html_doc, "body")

# Add meta information
title = etree.SubElement(head, "title")
title.text = "Web Scraping Report"

meta_charset = etree.SubElement(head, "meta")
meta_charset.set("charset", "UTF-8")

# Add body content
h1 = etree.SubElement(body, "h1")
h1.text = "Scraping Results"

div = etree.SubElement(body, "div")
div.set("class", "content")

p = etree.SubElement(div, "p")
p.text = "This report contains the latest scraping results."

# Create a table
table = etree.SubElement(div, "table")
table.set("border", "1")

# Table header
thead = etree.SubElement(table, "thead")
tr_head = etree.SubElement(thead, "tr")
th1 = etree.SubElement(tr_head, "th")
th1.text = "URL"
th2 = etree.SubElement(tr_head, "th")
th2.text = "Status"

# Table body
tbody = etree.SubElement(table, "tbody")
tr_body = etree.SubElement(tbody, "tr")
td1 = etree.SubElement(tr_body, "td")
td1.text = "https://example.com"
td2 = etree.SubElement(tr_body, "td")
td2.text = "Success"

# Write the HTML file; method="html" emits void elements such as <meta> without a closing slash
tree = etree.ElementTree(html_doc)
tree.write("report.html", encoding="utf-8", method="html", pretty_print=True)
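
The effect of `method="html"` is easiest to see by serializing one small tree both ways. The sketch below also passes the `doctype` argument of `etree.tostring()` so the output starts with an HTML5 doctype. The element contents are placeholders.

```python
from lxml import etree

# A tiny page with one void element (<meta>)
page = etree.Element("html")
head = etree.SubElement(page, "head")
etree.SubElement(head, "meta", charset="UTF-8")
body = etree.SubElement(page, "body")
etree.SubElement(body, "p").text = "Hello"

# HTML serialization: void elements have no closing slash, doctype is prepended
html_text = etree.tostring(
    page, encoding="unicode", method="html", doctype="<!DOCTYPE html>"
)

# Default XML serialization of the same tree: empty elements are self-closed
xml_text = etree.tostring(page, encoding="unicode")

print(html_text)
print(xml_text)
```

For files meant to be opened in a browser, the HTML method is usually the safer choice, since self-closed tags such as `<script/>` are not interpreted the same way by HTML parsers.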

Using HTML Builder for Complex Documents

For more complex HTML documents, you can use lxml's HTML builder:

from lxml.html import builder as E
from lxml import etree

# Create HTML using the builder pattern
doc = E.HTML(
    E.HEAD(
        E.TITLE("Data Visualization Dashboard"),
        E.META(charset="UTF-8"),
        E.LINK(rel="stylesheet", href="styles.css")
    ),
    E.BODY(
        E.DIV(
            # E.CLASS("...") sets the class attribute; a CLASS= keyword would
            # create an attribute literally named "CLASS"
            E.H1("Analytics Dashboard", E.CLASS("header")),
            E.DIV(
                E.H2("Key Metrics"),
                E.UL(
                    E.LI("Total Requests: 1,234"),
                    E.LI("Success Rate: 98.5%"),
                    E.LI("Average Response Time: 245ms")
                ),
                E.CLASS("metrics")
            ),
            E.DIV(
                E.H2("Recent Activity"),
                E.TABLE(
                    E.TR(E.TH("Time"), E.TH("Action"), E.TH("Result")),
                    E.TR(E.TD("10:30"), E.TD("Page Scan"), E.TD("Complete")),
                    E.TR(E.TD("10:35"), E.TD("Data Extract"), E.TD("Complete")),
                    E.CLASS("activity-table")
                ),
                E.CLASS("activity")
            ),
            E.CLASS("container")
        )
    )
)

# Convert to string and save
html_content = etree.tostring(doc, encoding="unicode", method="html", pretty_print=True)
with open("dashboard.html", "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html>\n" + html_content)
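
Builder calls are ordinary Python function calls, so rows can be generated from data with argument unpacking instead of being written out by hand. This is a rough sketch; the `results` list is made-up sample data.

```python
from lxml import etree
from lxml.html import builder as E

# Hypothetical scraped results used only for illustration
results = [
    ("https://example.com", "Success"),
    ("https://example.org", "Timeout"),
]

# E.CLASS() returns a dict that the builder applies as the class attribute;
# the list of data rows is unpacked into positional children
table = E.TABLE(
    E.CLASS("results"),
    E.TR(E.TH("URL"), E.TH("Status")),
    *[E.TR(E.TD(url), E.TD(status)) for url, status in results]
)

table_html = etree.tostring(table, encoding="unicode", method="html")
print(table_html)
```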

Advanced Document Creation Techniques

Creating Documents with CDATA Sections

For XML documents that need to include unescaped content:

from lxml import etree

root = etree.Element("configuration")
script_section = etree.SubElement(root, "script")

# Add CDATA section
script_content = """
function validateData(data) {
    return data && data.length > 0;
}
"""
script_section.text = etree.CDATA(script_content)

print(etree.tostring(root, encoding="unicode", pretty_print=True))
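
To see what the CDATA wrapper changes, the following sketch serializes the same text twice, once as plain text and once wrapped in `etree.CDATA`. The one-line `snippet` is a shortened stand-in for the script above.

```python
from lxml import etree

snippet = "return data && data.length > 0;"

# Plain text: special characters are escaped on output (& becomes &amp;)
plain = etree.Element("script")
plain.text = snippet
plain_xml = etree.tostring(plain, encoding="unicode")

# CDATA: the text is written verbatim inside <![CDATA[ ... ]]>
wrapped = etree.Element("script")
wrapped.text = etree.CDATA(snippet)
wrapped_xml = etree.tostring(wrapped, encoding="unicode")

print(plain_xml)
print(wrapped_xml)
```

Both forms carry the same text for an XML parser; CDATA mainly keeps the file readable for people and for tools that expect the raw script.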

Dynamic Document Generation

Creating documents based on data structures:

from lxml import etree

def create_product_catalog(products):
    """Create XML catalog from product data"""
    root = etree.Element("catalog")

    for product in products:
        product_elem = etree.SubElement(root, "product")
        product_elem.set("id", str(product["id"]))

        for key, value in product.items():
            if key != "id":
                elem = etree.SubElement(product_elem, key)
                elem.text = str(value)

    return etree.ElementTree(root)

# Sample data
products = [
    {"id": 1, "name": "Laptop", "price": 999.99, "category": "Electronics"},
    {"id": 2, "name": "Book", "price": 19.99, "category": "Literature"},
]

catalog = create_product_catalog(products)
catalog.write("products.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)
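
One caveat with using dictionary keys as tag names: lxml rejects names that are not valid XML, such as keys containing spaces. A possible workaround is to normalize keys first. The `safe_tag` helper below is a hypothetical example and not part of lxml, and its replacement rules are deliberately conservative.

```python
import re
from lxml import etree

def safe_tag(key):
    """Hypothetical helper: turn an arbitrary dict key into a usable tag name."""
    # Replace anything outside a conservative ASCII subset with an underscore
    tag = re.sub(r"[^A-Za-z0-9_.-]", "_", str(key))
    # XML names may not start with a digit, dot, or hyphen
    if not re.match(r"[A-Za-z_]", tag):
        tag = "_" + tag
    return tag

product = etree.Element("product")
for key, value in {"unit price": 9.5, "2nd-hand": False}.items():
    child = etree.SubElement(product, safe_tag(key))
    child.text = str(value)

product_xml = etree.tostring(product, encoding="unicode")
print(product_xml)
```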

Creating RSS Feeds

XML document creation is particularly useful for generating RSS feeds:

from lxml import etree

def create_rss_feed(title, description, link, items):
    """Create RSS 2.0 feed"""
    rss = etree.Element("rss")
    rss.set("version", "2.0")

    channel = etree.SubElement(rss, "channel")

    # Channel metadata
    channel_title = etree.SubElement(channel, "title")
    channel_title.text = title

    channel_desc = etree.SubElement(channel, "description")
    channel_desc.text = description

    channel_link = etree.SubElement(channel, "link")
    channel_link.text = link

    # Add items
    for item_data in items:
        item = etree.SubElement(channel, "item")

        item_title = etree.SubElement(item, "title")
        item_title.text = item_data["title"]

        item_desc = etree.SubElement(item, "description")
        item_desc.text = item_data["description"]

        item_link = etree.SubElement(item, "link")
        item_link.text = item_data["link"]

        pub_date = etree.SubElement(item, "pubDate")
        pub_date.text = item_data["pub_date"]

    return etree.ElementTree(rss)

# Create feed
feed_items = [
    {
        "title": "New Web Scraping Tutorial",
        "description": "Learn advanced scraping techniques",
        "link": "https://example.com/tutorial",
        "pub_date": "Mon, 01 Jan 2024 12:00:00 GMT"
    }
]

rss_feed = create_rss_feed(
    "Tech Blog",
    "Latest web development tutorials",
    "https://example.com",
    feed_items
)

rss_feed.write("feed.xml", encoding="utf-8", xml_declaration=True, pretty_print=True)
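
The feed above uses a hard-coded `pubDate` string. When the publication time is available as a `datetime`, the standard library can produce the RFC 822 style date that RSS 2.0 expects. A minimal sketch, assuming the timestamp is timezone-aware and in UTC:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

from lxml import etree

# Assumed publication time; usegmt=True requires a UTC-aware datetime
published = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
pub_date = format_datetime(published, usegmt=True)

item = etree.Element("item")
etree.SubElement(item, "pubDate").text = pub_date

print(pub_date)  # Mon, 01 Jan 2024 12:00:00 GMT
```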

Best Practices and Performance Tips

Memory Management

When creating large documents, keep in mind that the whole tree is held in memory until it is written. Building it batch by batch keeps the code organized, but it does not lower peak memory on its own:

from lxml import etree

def create_large_document_efficiently():
    """Create a large XML document batch by batch"""
    root = etree.Element("data")

    # Build the tree in batches; every element stays in memory until the tree is written
    for batch in range(100):  # Process 100 batches
        batch_elem = etree.SubElement(root, "batch")
        batch_elem.set("id", str(batch))

        for i in range(1000):  # 1000 items per batch
            item = etree.SubElement(batch_elem, "item")
            item.text = f"Item {batch * 1000 + i}"

        # Note: batch_elem.clear() would free memory, but it also removes the
        # items and the id attribute from the output, so it is not called here

    return etree.ElementTree(root)
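
If peak memory matters, one option is lxml's incremental writer, `etree.xmlfile`, which writes each batch to the output as soon as it is built so that only one batch is in memory at a time. The sketch below is one way to arrange this, not the only one. The function name and its parameters are illustrative, and it writes to an in-memory buffer only to keep the example self-contained.

```python
import io
from lxml import etree

def write_large_document(output, batches=100, items_per_batch=1000):
    """Write <data> incrementally; `output` is a file path or a binary file object."""
    with etree.xmlfile(output, encoding="utf-8") as xf:
        xf.write_declaration()
        with xf.element("data"):
            for batch in range(batches):
                # Build one batch as a small standalone tree
                batch_elem = etree.Element("batch", id=str(batch))
                for i in range(items_per_batch):
                    item = etree.SubElement(batch_elem, "item")
                    item.text = f"Item {batch * items_per_batch + i}"
                # Serialize the batch now; it can be garbage-collected afterwards
                xf.write(batch_elem)

# Small demonstration run against an in-memory buffer
buffer = io.BytesIO()
write_large_document(buffer, batches=2, items_per_batch=3)
print(buffer.getvalue().decode("utf-8"))
```

In a real job, `output` would typically be a file path such as `"data.xml"`.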

Validation and Error Handling

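Check your created documents before using them. The round trip below confirms only that the serialized output is well-formed; it does not validate the document against a schema: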

from lxml import etree

def create_and_validate_xml():
    """Create XML with validation"""
    try:
        root = etree.Element("document")

        # Add content
        content = etree.SubElement(root, "content")
        content.text = "Sample content"

        # Wrap the root element in a document tree
        tree = etree.ElementTree(root)

        # Basic validation - check if document is well-formed
        xml_string = etree.tostring(tree, encoding="unicode")
        etree.fromstring(xml_string)  # Will raise exception if not well-formed

        return tree

    except etree.XMLSyntaxError as e:
        print(f"XML Syntax Error: {e}")
        return None
    except Exception as e:
        print(f"Error creating document: {e}")
        return None
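
When the expected structure is known, a schema check catches more than a well-formedness round trip does. As a sketch, the following validates a generated tree against a small XML Schema with `etree.XMLSchema`. The schema itself is an assumption written for this example and mirrors the `document`/`content` structure used above.

```python
from lxml import etree

# Assumed schema for illustration: <document> must contain exactly one <content>
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="content" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
schema = etree.XMLSchema(etree.fromstring(XSD))

# A document that matches the schema
good = etree.Element("document")
etree.SubElement(good, "content").text = "Sample content"
good_is_valid = schema.validate(good)

# A document with an unexpected child element
bad = etree.Element("document")
etree.SubElement(bad, "unexpected")
bad_is_valid = schema.validate(bad)
errors = [entry.message for entry in schema.error_log]

print(good_is_valid, bad_is_valid)
print(errors)
```

`schema.assertValid(tree)` is an alternative that raises an exception instead of returning `False`, which can fit better inside the `try`/`except` structure shown above.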

Integration with Web Scraping Workflows

When building documents from scraped data, you might need to combine lxml's document creation with its parsing capabilities. While this article focuses on creating documents from scratch, you may also want to learn about parsing existing XML documents with lxml for more comprehensive XML handling workflows.

For web scraping projects that generate reports or export data, creating structured documents is essential. The techniques shown here work well with data collected through various scraping methods and can be integrated into automated reporting systems.

Conclusion

Creating XML and HTML documents from scratch using lxml is straightforward once you understand the ElementTree API. The library provides excellent performance and flexibility for generating everything from simple configuration files to complex HTML reports and RSS feeds. Remember to handle encoding properly, validate your output, and consider memory usage when working with large documents.

Whether you're generating reports from scraped data, creating configuration files, or building dynamic web content, lxml's document creation capabilities provide a robust foundation for your XML and HTML generation needs.
