Is there a way to pretty print HTML or XML with lxml?

Yes. lxml has built-in support for pretty-printing both HTML and XML documents. Pass pretty_print=True to tostring() (or to ElementTree.write()) and the serializer adds indentation and line breaks to the output.

Pretty-Printing XML

For XML documents, use etree.tostring() with pretty_print=True:

from lxml import etree

# Parse XML from string
xml_data = '''<root><child1>data1</child1><child2><subchild>nested</subchild></child2></root>'''
root = etree.fromstring(xml_data)

# Pretty-print the XML
pretty_xml = etree.tostring(root, pretty_print=True, encoding='unicode')
print(pretty_xml)

Output:

<root>
  <child1>data1</child1>
  <child2>
    <subchild>nested</subchild>
  </child2>
</root>

Pretty-Printing HTML

For HTML documents, use lxml.html with the same approach:

from lxml import html

# Parse HTML
html_data = '''<html><head><title>Test</title></head><body><div><p>Hello World</p><span>Content</span></div></body></html>'''
root = html.fromstring(html_data)

# Pretty-print HTML
pretty_html = html.tostring(root, pretty_print=True, encoding='unicode', method='html')
print(pretty_html)
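
Because lxml.html uses a forgiving HTML parser, the tree you serialize may differ from the markup you fed in. The following is a minimal sketch of that behavior; the input string is a made-up example with two unclosed paragraphs, and the exact line breaks in the printed result depend on the underlying libxml2 serializer, so none are shown here.

```python
from lxml import html

# Deliberately malformed input (illustrative only): neither <p> is closed.
broken = "<div><p>one<p>two</div>"
root = html.fromstring(broken)

# The parser has already closed both paragraphs while building the tree;
# tostring() only prints that repaired tree.
repaired = html.tostring(root, pretty_print=True, encoding="unicode")
print(repaired)
```

The repair happens at parse time, so the same closed paragraphs appear whether or not pretty_print is set.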

Working with Files

You can also pretty-print documents loaded from files:

from lxml import etree

# Parse from file
tree = etree.parse('document.xml')

# Pretty-print to file
with open('formatted_document.xml', 'wb') as f:
    tree.write(f, pretty_print=True, encoding='utf-8', xml_declaration=True)

# Or print to console
print(etree.tostring(tree, pretty_print=True, encoding='unicode'))
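
Files written by hand or by other tools usually already contain line breaks between elements, and in that case pretty_print=True tends to leave the existing layout alone. The sketch below wraps the file workflow in a small helper that parses with remove_blank_text=True first. The helper name reformat_xml_file and the temporary file paths are assumptions made for this illustration, not part of lxml.

```python
import os
import tempfile

from lxml import etree


def reformat_xml_file(src_path, dst_path):
    """Hypothetical helper: re-indent the XML in src_path and write it to dst_path."""
    # Drop whitespace-only text nodes so the serializer is free to re-indent.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(src_path, parser)
    tree.write(dst_path, pretty_print=True, encoding="utf-8", xml_declaration=True)


# Demo on a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "document.xml")
    dst = os.path.join(tmp, "formatted_document.xml")
    with open(src, "w", encoding="utf-8") as f:
        f.write("<root>\n<item>test</item>\n</root>")

    reformat_xml_file(src, dst)

    with open(dst, "rb") as f:
        result = f.read()

print(result.decode("utf-8"))
```

Writing to a separate destination path keeps the original file intact if the input turns out not to be well-formed.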

Customizing Output Format

tostring() accepts further options alongside pretty_print, such as an XML declaration, the serialization method, and a doctype:

from lxml import etree, html

root = etree.fromstring('<root><item>test</item></root>')

# With an XML declaration. A named encoding returns bytes;
# encoding='unicode' cannot be combined with xml_declaration=True.
formatted = etree.tostring(
    root,
    pretty_print=True,
    encoding='utf-8',
    xml_declaration=True
)

# HTML serialization with a doctype prepended
html_root = html.fromstring('<div><p>content</p></div>')
formatted_html = html.tostring(
    html_root,
    pretty_print=True,
    encoding='unicode',
    method='html',
    doctype='<!DOCTYPE html>'
)
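
The pretty_print flag uses a fixed two-space indent. If you need a different indent width, one option is etree.indent(), which rewrites the whitespace in the tree itself. This sketch assumes lxml 4.5 or newer, where that function is available; the four-space width is an arbitrary choice for the example.

```python
from lxml import etree

root = etree.fromstring("<root><a><b>x</b></a></root>")

# indent() modifies the tree in place by inserting whitespace text nodes,
# so a plain tostring() call afterwards already produces indented output.
etree.indent(root, space="    ")
indented = etree.tostring(root, encoding="unicode")
print(indented)
```

Note that indent() changes the tree, whereas pretty_print only affects the serialized output.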

Important Considerations

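  • HTML auto-correction: lxml.html repairs malformed markup when it parses the document (for example, by closing unclosed tags), so the pretty-printed output reflects the corrected tree rather than the original source
  • Encoding: Use encoding='unicode' to get a str, or a named encoding such as 'utf-8' to get bytes; with no encoding argument, tostring() returns bytes
  • Performance: Pretty-printing adds serialization work and makes the output larger, so skip it when the output is only read by machines or throughput matters
  • Whitespace: The added indentation becomes part of the document, which can change rendering in HTML (for example between inline elements) and matters for whitespace-sensitive XML. Conversely, if the parsed XML already contains whitespace between elements, pretty_print=True generally leaves that layout untouched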
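
The whitespace point is the most common reason pretty_print=True appears to do nothing. A minimal sketch of the usual workaround follows, assuming the whitespace between elements carries no meaning in your document: parse with an XMLParser that has remove_blank_text=True, then serialize.

```python
from lxml import etree

# Input that already has line breaks (but no indentation) between elements.
xml_data = "<root>\n<child>\n<sub>text</sub>\n</child>\n</root>"

# Default parser: whitespace-only text nodes are kept, so the serializer
# treats them as content and typically keeps the original layout.
plain = etree.tostring(etree.fromstring(xml_data), pretty_print=True, encoding="unicode")

# Parser that discards whitespace-only text nodes between elements.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml_data, parser)
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")

print(plain)
print(pretty)
```

Only strip blank text when that whitespace is not significant to whatever consumes the document.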

The pretty-print functionality in lxml is particularly useful for debugging, logging, or when you need human-readable output from your XML/HTML processing tasks.
