Is there a way to pretty print HTML or XML with lxml?

Yes, lxml provides a straightforward way to pretty-print HTML or XML. Pretty-printing happens when you serialize the tree, not when you parse it: pass pretty_print=True to a serialization function such as tostring and the output is formatted with indentation.

Here's how you can pretty-print XML with lxml:

from lxml import etree

# Parse the XML
xml_data = '''<root><child1>data1</child1><child2>data2</child2></root>'''
root = etree.fromstring(xml_data)

# Pretty-print the XML
pretty_xml_string = etree.tostring(root, pretty_print=True, encoding='unicode')
print(pretty_xml_string)
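
One caveat: pretty_print may leave the output unchanged when the input already contains whitespace between elements, because the serializer does not touch existing whitespace-only text. The sketch below shows two common ways around this, assuming a reasonably recent lxml (etree.indent needs lxml 4.5 or later). The sample data and variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical input that already has line breaks between elements.
xml_data = "<root>\n<child1>data1</child1>\n\n<child2>data2</child2>\n</root>"

# Option 1: drop whitespace-only text nodes while parsing, so the
# serializer is free to insert its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml_data, parser)
pretty_xml_string = etree.tostring(root, pretty_print=True, encoding="unicode")
print(pretty_xml_string)

# Option 2: re-indent the tree in place with a custom indent width
# (four spaces here), then serialize without pretty_print.
etree.indent(root, space="    ")
four_space_string = etree.tostring(root, encoding="unicode")
print(four_space_string)
```

Option 1 is usually enough for documents where whitespace between elements carries no meaning; avoid it for mixed-content documents where that whitespace is significant.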

And for HTML, you can use lxml.html:

from lxml import html

# Parse the HTML
html_data = '''<html><head><title>Test</title></head><body><p>Hello World</p></body></html>'''
root = html.fromstring(html_data)

# Pretty-print the HTML
pretty_html_string = html.tostring(root, pretty_print=True, encoding='unicode', method='html')
print(pretty_html_string)

Keep in mind that lxml.html uses a lenient parser: html.fromstring repairs malformed markup (for example, by closing unclosed tags) while it builds the tree, before any pretty-printing takes place. As a result, the output may have a slightly different structure from the input if your HTML is not well-formed.
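
To see this repair step in action, here is a minimal sketch that feeds the parser a fragment with unclosed <p> tags. The fragment and variable names are invented for the example, and the exact line breaks in the output can vary between lxml and libxml2 versions.

```python
from lxml import html

# Hypothetical malformed fragment: neither <p> tag is closed.
broken_html = "<div><p>First paragraph<p>Second paragraph</div>"

# The parser closes the open <p> elements while building the tree.
root = html.fromstring(broken_html)

# Serializing shows the repaired structure with explicit </p> tags.
repaired = html.tostring(root, pretty_print=True, encoding="unicode")
print(repaired)
```

If you need to preserve the input byte for byte, compare the serialized result with the original before relying on it.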

lxml is a powerful library and provides much more control over the parsing and serialization of XML and HTML documents, but the above examples should cover the basic use case of pretty-printing.
