Is there a way to pretty-print HTML or XML with Beautiful Soup?

Yes. Beautiful Soup's prettify() method returns the parsed document as a string, with each tag and each text node on its own line and indented according to its depth in the tree, which makes the structure easier to read.

Here's how to use the prettify() method in Beautiful Soup:

from bs4 import BeautifulSoup

# Sample HTML content
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Parse the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')

# Pretty-print the HTML
print(soup.prettify())

When you run this code, you'll receive the HTML content as output, formatted with appropriate indentation to reflect the document's structure.
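prettify() is not limited to the whole document. It can also be called on an individual tag to print just that subtree. The snippet below is a minimal sketch of this; the shortened HTML string and the variable names are illustrative assumptions, not part of the example above.

```python
from bs4 import BeautifulSoup

# A shortened, illustrative fragment (not the full sample document)
html = '<div><p class="story">Once upon a time <a href="http://example.com/elsie">Elsie</a></p></div>'
soup = BeautifulSoup(html, "html.parser")

# prettify() is available on Tag objects too, so a single subtree can be printed
first_link = soup.find("a")
pretty_link = first_link.prettify()
print(pretty_link)
```

This is handy when debugging a selector, since you see only the element you matched rather than the entire page.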

The indentation can be customized, but not through positional arguments: the signature is prettify(encoding=None, formatter="minimal"), so the first positional argument is an encoding, not an indent string. By default, each nesting level is indented by one space. In Beautiful Soup 4.11.0 and later, you can change this by passing a formatter object whose indent argument is either a number of spaces or a literal string such as a tab:

from bs4.formatter import HTMLFormatter

print(soup.prettify())                                      # One space per level (default)
print(soup.prettify(formatter=HTMLFormatter(indent=4)))     # Four spaces per level
print(soup.prettify(formatter=HTMLFormatter(indent='\t')))  # One tab per level

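Note: The prettify() method is intended for human readability. Because it inserts newlines and indentation, it changes the whitespace inside the document, which can alter how inline content renders and what text you get back when you re-parse the output. When saving or processing HTML/XML, serialize with str(soup) instead; pretty-printing is unnecessary there and increases file size.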
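To check how much prettify() affects whitespace, you can compare the plain serialization with the pretty-printed one. The following is a minimal sketch using a one-line fragment chosen for illustration; the variable names are assumptions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

compact = str(soup)       # serialized without any added whitespace
pretty = soup.prettify()  # serialized with added newlines and indentation

# Re-parse both outputs and compare the text they contain
text_from_compact = BeautifulSoup(compact, "html.parser").get_text()
text_from_pretty = BeautifulSoup(pretty, "html.parser").get_text()

print(repr(text_from_compact))
print(repr(text_from_pretty))
```

The second printed string contains extra newlines and spaces, including whitespace between "world" and "!" that the original markup did not have. That is why str(soup) is the safer choice for anything you store or feed back into a parser.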

If you want to pretty-print HTML or XML in JavaScript, you would generally use browser APIs or third-party libraries, because Node.js has no HTML or XML parser in its standard library. For example, here's how you might use the xml-formatter package to pretty-print XML in Node.js:

const formatter = require('xml-formatter');

let xml = `<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>`;

let formattedXml = formatter(xml, {
    indentation: '  ', // Set the indentation to two spaces
    lineSeparator: '\n' // Use newline as line separator
});

console.log(formattedXml);

To use the above JavaScript code, you need to install the xml-formatter package first:

npm install xml-formatter

Remember that these methods are primarily for display and debugging. When scraping web content, extract data from the parsed tree rather than from pretty-printed output: pretty-printing adds whitespace but no information.
