Yes, Beautiful Soup provides a built-in method to pretty-print HTML or XML. The prettify() method returns the parsed document as a string in which each tag and each text node sits on its own line, indented according to its depth in the document tree.
Basic Pretty-Printing
Here's how to use the prettify() method:
from bs4 import BeautifulSoup
# Sample HTML content (minified/unformatted)
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p></body></html>"""
# Parse the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
# Pretty-print the HTML
print(soup.prettify())
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
 </body>
</html>
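To make the difference from the normal serialization concrete, here is a minimal sketch that compares str(soup) with soup.prettify() on a small snippet. The snippet and the variable names are illustrative assumptions, not part of the example above.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # serialization without any added whitespace
pretty = soup.prettify()  # a str with one tag or text node per line

print(compact)
print(pretty)
```

Both calls return a plain string, so the result can be printed, logged, or written to a file like any other text.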
Custom Indentation
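You can customize the indentation by passing a formatter with an indent setting through the formatter argument of prettify(). The first positional argument of prettify() is the output encoding, not an indent string, and the indent option requires Beautiful Soup 4.11.0 or later:
from bs4.formatter import HTMLFormatter
# Default indentation (one space per level)
print(soup.prettify())
# Use four spaces for indentation
print(soup.prettify(formatter=HTMLFormatter(indent=4)))
# Use tabs for indentation
print(soup.prettify(formatter=HTMLFormatter(indent="\t")))
# Use two spaces for indentation
print(soup.prettify(formatter=HTMLFormatter(indent=2)))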
Pretty-Printing Specific Elements
You can also pretty-print individual elements rather than the entire document:
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter
html = "<div><p>Hello <strong>world</strong></p><ul><li>Item 1</li><li>Item 2</li></ul></div>"
soup = BeautifulSoup(html, 'html.parser')
# Pretty-print just the div element
div_element = soup.find('div')
print(div_element.prettify())
# Pretty-print just the ul element with a four-space indent
ul_element = soup.find('ul')
print(ul_element.prettify(formatter=HTMLFormatter(indent=4)))
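The same idea extends to several elements at once. As a rough sketch, the loop below pretty-prints every li element found by find_all(). Each element is formatted on its own, so its output starts at column 0 regardless of how deeply it is nested in the original document. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello <strong>world</strong></p><ul><li>Item 1</li><li>Item 2</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

# Format each matching element independently of its parents
pretty_items = [li.prettify() for li in soup.find_all("li")]

for block in pretty_items:
    print(block)
```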
Pretty-Printing XML
The prettify() method works equally well with XML documents (the 'xml' parser requires the lxml package to be installed):
from bs4 import BeautifulSoup
xml_doc = """<?xml version="1.0" encoding="UTF-8"?><catalog><book id="1"><title>Python Web Scraping</title><author>John Doe</author><price>29.99</price></book><book id="2"><title>Beautiful Soup Guide</title><author>Jane Smith</author><price>24.99</price></book></catalog>"""
# Parse XML content
soup = BeautifulSoup(xml_doc, 'xml')
# Pretty-print the XML
print(soup.prettify())
Output:
<?xml version="1.0" encoding="utf-8"?>
<catalog>
 <book id="1">
  <title>
   Python Web Scraping
  </title>
  <author>
   John Doe
  </author>
  <price>
   29.99
  </price>
 </book>
 <book id="2">
  <title>
   Beautiful Soup Guide
  </title>
  <author>
   Jane Smith
  </author>
  <price>
   24.99
  </price>
 </book>
</catalog>
Practical Use Cases
Debugging and Development
from bs4 import BeautifulSoup
import requests
# Fetch and pretty-print a webpage for debugging
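response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # Stop early on HTTP errors instead of printing an error page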
soup = BeautifulSoup(response.content, 'html.parser')
# Print formatted HTML to understand structure
print(soup.prettify())
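Real pages often produce thousands of lines, so when debugging it can help to look at only the top of the tree. The following sketch uses a hypothetical helper, preview, and an inline sample string instead of a network request so that it runs offline.

```python
from bs4 import BeautifulSoup


def preview(markup, max_lines=20):
    """Hypothetical helper: return the first `max_lines` lines of the prettified markup."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = soup.prettify().splitlines()
    return "\n".join(lines[:max_lines])


sample = "<html><body><div><p>One</p><p>Two</p></div></body></html>"
print(preview(sample, max_lines=4))
```

In a scraping script you could pass response.content instead of the sample string.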
Saving Formatted HTML to File
from bs4 import BeautifulSoup
html = "<html><body><h1>Title</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
# Save pretty-printed HTML to file
with open('formatted_output.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())
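An alternative, sketched below, is to let prettify() do the encoding: when an encoding is passed, it returns bytes rather than a string, which can be written in binary mode. The temporary directory and the sample markup are assumptions made so the example does not touch your working directory.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café</p>", "html.parser")

# With an encoding argument, prettify() returns bytes instead of str
data = soup.prettify(encoding="utf-8")

out_path = Path(tempfile.mkdtemp()) / "formatted_output.html"
out_path.write_bytes(data)

print(out_path.read_text(encoding="utf-8"))
```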
Comparing Before and After Modifications
from bs4 import BeautifulSoup
html = "<div><p>Original content</p></div>"
soup = BeautifulSoup(html, 'html.parser')
print("Before modification:")
print(soup.prettify())
# Modify the content
soup.find('p').string = "Modified content"
soup.find('div').append(soup.new_tag('span'))
soup.find('span').string = "New element"
print("\nAfter modification:")
print(soup.prettify())
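Because prettify() puts every node on its own line, its output also works well with line-based diff tools. As one possible approach, the sketch below feeds the before and after versions to difflib.unified_diff from the standard library. The labels "before" and "after" are arbitrary.

```python
import difflib

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Original content</p></div>", "html.parser")

before = soup.prettify().splitlines()
soup.find("p").string = "Modified content"
after = soup.prettify().splitlines()

# lineterm="" because splitlines() already removed the newlines
diff = list(difflib.unified_diff(before, after, fromfile="before", tofile="after", lineterm=""))
print("\n".join(diff))
```

Only the changed text line shows up with - and + markers, and the surrounding tags appear as context.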
Important Notes
- Performance: The prettify() method is primarily for human readability and debugging. Avoid using it in production code where performance matters.
- File Size: Pretty-printed HTML/XML is larger due to added whitespace, which can impact storage and transmission.
- Whitespace Sensitivity: Some HTML elements (like <pre> or <code>) are sensitive to whitespace changes. Beautiful Soup leaves the contents of <pre> and <textarea> untouched, but the line breaks it adds around inline elements can change the text of the document and how a page renders.
- Character Encoding: Always specify proper encoding when writing pretty-printed content to files.
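The whitespace point is easy to check for yourself. The following sketch, using a made-up snippet, re-parses the prettified output and compares the extracted text, then confirms that the contents of a pre element come through unchanged.

```python
from bs4 import BeautifulSoup

html = "<p>Hello<b>world</b></p><pre>a\n  b</pre>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()
reparsed = BeautifulSoup(pretty, "html.parser")

# Text of the original paragraph versus the prettified one
print(repr(soup.p.get_text()))
print(repr(reparsed.p.get_text()))

# The <pre> block keeps its original line break and spaces
print("<pre>a\n  b</pre>" in pretty)
```

The first paragraph reads as a single word in the original, while the prettified version has line breaks between the two parts, which is why prettified output should not be treated as equivalent markup.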
The prettify() method is an invaluable tool for debugging web scraping scripts, understanding document structure, and creating readable HTML/XML output during development.