Is there a way to pretty-print HTML or XML with Beautiful Soup?

Yes, Beautiful Soup provides a built-in method to pretty-print HTML or XML. The prettify() method formats the parsed document with proper indentation, making the structure more readable by indenting each tag according to its level in the document tree.

Basic Pretty-Printing

Here's how to use the prettify() method:

from bs4 import BeautifulSoup

# Sample HTML content (minified/unformatted)
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p></body></html>"""

# Parse the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')

# Pretty-print the HTML
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
 </body>
</html>

Custom Indentation

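prettify() does not take an indent string; its first parameter is the output encoding. In Beautiful Soup 4.11.0 and later, you customize indentation by passing a formatter created with the indent argument, which accepts either a number of spaces or a string:

from bs4.formatter import HTMLFormatter

# Default indentation (single space)
print(soup.prettify())

# Use four spaces for indentation
print(soup.prettify(formatter=HTMLFormatter(indent=4)))

# Use tabs for indentation
print(soup.prettify(formatter=HTMLFormatter(indent="\t")))

# Use two spaces for indentation
print(soup.prettify(formatter=HTMLFormatter(indent=2)))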

Pretty-Printing Specific Elements

You can also pretty-print individual elements rather than the entire document:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

html = "<div><p>Hello <strong>world</strong></p><ul><li>Item 1</li><li>Item 2</li></ul></div>"
soup = BeautifulSoup(html, 'html.parser')

# Pretty-print just the div element
div_element = soup.find('div')
print(div_element.prettify())

# Pretty-print just the ul element with two-space indentation
# (the indent argument requires Beautiful Soup 4.11.0 or later)
ul_element = soup.find('ul')
print(ul_element.prettify(formatter=HTMLFormatter(indent=2)))

Pretty-Printing XML

The prettify() method works the same way on XML documents. Note that the 'xml' parser used below requires the lxml package to be installed:

from bs4 import BeautifulSoup

xml_doc = """<?xml version="1.0" encoding="UTF-8"?><catalog><book id="1"><title>Python Web Scraping</title><author>John Doe</author><price>29.99</price></book><book id="2"><title>Beautiful Soup Guide</title><author>Jane Smith</author><price>24.99</price></book></catalog>"""

# Parse XML content
soup = BeautifulSoup(xml_doc, 'xml')

# Pretty-print the XML
print(soup.prettify())

Output:

<?xml version="1.0" encoding="utf-8"?>
<catalog>
 <book id="1">
  <title>
   Python Web Scraping
  </title>
  <author>
   John Doe
  </author>
  <price>
   29.99
  </price>
 </book>
 <book id="2">
  <title>
   Beautiful Soup Guide
  </title>
  <author>
   Jane Smith
  </author>
  <price>
   24.99
  </price>
 </book>
</catalog>

Practical Use Cases

Debugging and Development

from bs4 import BeautifulSoup
import requests

# Fetch and pretty-print a webpage for debugging
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')

# Print formatted HTML to understand structure
print(soup.prettify())

Saving Formatted HTML to File

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Save pretty-printed HTML to file
with open('formatted_output.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())
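
An alternative is to let Beautiful Soup do the encoding. The sketch below assumes a temporary directory as the output location (out_path is illustrative). Calling prettify() with an encoding returns bytes rather than str, so the file is opened in binary mode.

```python
import os
import tempfile

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Caf\u00e9</p></body></html>", "html.parser")

# Illustrative output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "formatted_output.html")

# prettify(encoding=...) returns bytes, so write in binary mode.
with open(out_path, "wb") as f:
    f.write(soup.prettify(encoding="utf-8"))

# Read the file back as text to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    saved = f.read()

print(saved)
```

Either approach works; the point is that the encoding used to write the file matches the one used to read it later.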

Comparing Before and After Modifications

from bs4 import BeautifulSoup

html = "<div><p>Original content</p></div>"
soup = BeautifulSoup(html, 'html.parser')

print("Before modification:")
print(soup.prettify())

# Modify the content
soup.find('p').string = "Modified content"
soup.find('div').append(soup.new_tag('span'))
soup.find('span').string = "New element"

print("\nAfter modification:")
print(soup.prettify())
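
Because prettify() puts each tag and each string on its own line, the two versions can also be compared mechanically. This is a minimal sketch using the standard library's difflib; the "before" and "after" labels are arbitrary.

```python
import difflib

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Original content</p></div>", "html.parser")
before = soup.prettify().splitlines()

# Modify the document, then capture the new pretty-printed form.
soup.find("p").string = "Modified content"
after = soup.prettify().splitlines()

# A line-based diff shows only the lines that changed.
diff = list(
    difflib.unified_diff(before, after, fromfile="before", tofile="after", lineterm="")
)
print("\n".join(diff))
```

The diff marks the removed line with a leading "-" and the added line with a leading "+", which is easier to scan than two full printouts when the document is large.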

Important Notes

  • Performance: The prettify() method is intended for human readability and debugging. When the output is consumed by a program rather than read by a person, str(soup) serializes the document without the added formatting.
  • File Size: Pretty-printed HTML/XML is larger due to added whitespace, which can impact storage and transmission.
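  • Whitespace Sensitivity: prettify() adds newlines and indentation around tags and strings, which can change how a document renders (for example, text inside or next to inline elements such as <code>). Use it to inspect structure, not to reformat HTML you intend to serve.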
  • Character Encoding: Always specify proper encoding when writing pretty-printed content to files.
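
The whitespace point can be checked directly. The sketch below uses a made-up one-line snippet: it re-parses the pretty-printed output, collapses runs of whitespace roughly the way a browser would, and compares the visible text with the original.

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

# Text as it appears in the original document.
original_text = soup.get_text()

# Re-parse the pretty-printed output and collapse whitespace runs to single spaces.
reparsed = BeautifulSoup(soup.prettify(), "html.parser")
roundtrip_text = " ".join(reparsed.get_text().split())

print(repr(original_text))
print(repr(roundtrip_text))

# str(soup) serializes the tree without adding any whitespace.
compact = str(soup)
```

After the round trip, the text gains a space before the exclamation mark because prettify() moved it onto its own line, while str(soup) reproduces the input unchanged.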

The prettify() method is an invaluable tool for debugging web scraping scripts, understanding document structure, and creating readable HTML/XML output during development.
