Is there a way to pretty print HTML or XML with lxml?

Yes. lxml has built-in support for pretty-printing both HTML and XML documents. Passing pretty_print=True to tostring() (or to ElementTree.write()) makes the serializer add line breaks and indentation to the output.

Pretty-Printing XML

For XML documents, use etree.tostring() with pretty_print=True:

from lxml import etree

# Parse XML from string
xml_data = '''<root><child1>data1</child1><child2><subchild>nested</subchild></child2></root>'''
root = etree.fromstring(xml_data)

# Pretty-print the XML
pretty_xml = etree.tostring(root, pretty_print=True, encoding='unicode')
print(pretty_xml)

Output:

<root>
  <child1>data1</child1>
  <child2>
    <subchild>nested</subchild>
  </child2>
</root>

Pretty-Printing HTML

For HTML documents, use lxml.html with the same approach:

from lxml import html

# Parse HTML
html_data = '''<html><head><title>Test</title></head><body><div><p>Hello World</p><span>Content</span></div></body></html>'''
root = html.fromstring(html_data)

# Pretty-print HTML
pretty_html = html.tostring(root, pretty_print=True, encoding='unicode', method='html')
print(pretty_html)
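Because lxml.html uses a lenient parser, the pretty-printed result reflects the repaired tree rather than the raw input. The following is a minimal sketch of that behaviour; the malformed `broken` snippet is an invented example, not taken from a real page.

```python
from lxml import html

# Hypothetical malformed input: neither <p> tag is closed.
broken = "<div><p>First paragraph<p>Second paragraph</div>"

# The HTML parser closes each <p> when it meets the next tag.
root = html.fromstring(broken)

# Serialize the repaired tree with line breaks added.
repaired = html.tostring(root, pretty_print=True, encoding="unicode")
print(repaired)
```

The printed markup should contain two properly closed paragraphs inside the div, even though the input had no closing `</p>` tags.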

Working with Files

You can also pretty-print documents loaded from files:

from lxml import etree

# Parse from file
tree = etree.parse('document.xml')

# Pretty-print to file
with open('formatted_document.xml', 'wb') as f:
    tree.write(f, pretty_print=True, encoding='utf-8', xml_declaration=True)

# Or print to console
print(etree.tostring(tree, pretty_print=True, encoding='unicode'))
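One caveat with files: if the source is already indented, or only partly indented, the parser keeps those whitespace-only text nodes and pretty_print will not re-indent around them. A common workaround is to parse with remove_blank_text=True. The sketch below assumes a small sample file written to a temporary directory so that it runs on its own; the file contents are made up for illustration.

```python
import tempfile
from pathlib import Path

from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "document.xml"
    dst = Path(tmp) / "formatted_document.xml"

    # Sample input with inconsistent existing whitespace (illustrative only).
    src.write_text(
        "<root>\n<item>a</item>\n        <item>b</item>\n</root>",
        encoding="utf-8",
    )

    # Drop whitespace-only text nodes so pretty_print can indent cleanly.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(str(src), parser)

    tree.write(str(dst), pretty_print=True, encoding="utf-8", xml_declaration=True)
    formatted = dst.read_text(encoding="utf-8")

print(formatted)
```

Without a DTD or schema, remove_blank_text relies on a heuristic to decide which whitespace is ignorable, so it is worth checking the result on documents where whitespace carries meaning.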

Customizing Output Format

tostring() accepts further options that control the output, such as xml_declaration, method, and doctype:

from lxml import etree

root = etree.fromstring('<root><item>test</item></root>')

# With XML declaration
formatted = etree.tostring(
    root, 
    pretty_print=True, 
    encoding='utf-8',
    xml_declaration=True
)

# HTML serialization with an explicit doctype
from lxml import html
html_root = html.fromstring('<div><p>content</p></div>')
formatted_html = html.tostring(
    html_root,
    pretty_print=True,
    encoding='unicode',
    method='html',
    doctype='<!DOCTYPE html>'
)
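pretty_print always uses two spaces per level. If a different indentation width is needed, one option is etree.indent(), which rewrites the whitespace in the tree in place. This sketch assumes lxml 4.5 or newer, where that function was added; the four-space width is just an example value.

```python
from lxml import etree

root = etree.fromstring("<root><item>test</item></root>")

# Rewrite whitespace in place, using four spaces per nesting level.
etree.indent(root, space="    ")

# The tree now carries its own indentation, so pretty_print is not needed.
formatted = etree.tostring(root, encoding="unicode")
print(formatted)
# <root>
#     <item>test</item>
# </root>
```

Note that etree.indent() modifies the tree itself, whereas pretty_print only affects the serialized output.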

Important Considerations

HTML Auto-correction: When parsing HTML, lxml.html repairs malformed markup (for example, it closes unclosed tags), so the pretty-printed output may differ structurally from the input.
Encoding: Use encoding='unicode' to get a str; specify an encoding such as 'utf-8' to get bytes.
Performance: Pretty-printing adds serialization work and increases output size, so leave it off where the output is only read by machines.
Whitespace: Pretty-printing inserts whitespace between elements, which can change how inline HTML renders and can matter in XML formats where whitespace is significant.
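To illustrate the encoding point, here is a short sketch comparing the two return types. The sample document with a non-ASCII character is an assumption chosen to make the difference visible.

```python
from lxml import etree

root = etree.fromstring("<root><item>café</item></root>")

# encoding='unicode' returns a Python str.
as_text = etree.tostring(root, pretty_print=True, encoding="unicode")

# A concrete encoding such as 'utf-8' returns bytes.
as_bytes = etree.tostring(root, pretty_print=True, encoding="utf-8")

print(type(as_text).__name__)   # str
print(type(as_bytes).__name__)  # bytes
```

The str form is convenient for print() and logging, while the bytes form is what you would write to a file opened in binary mode.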

The pretty-print functionality in lxml is particularly useful for debugging, logging, or when you need human-readable output from your XML/HTML processing tasks.
