Yes, Beautiful Soup automatically converts HTML entity references to their corresponding Unicode characters when parsing HTML documents. The conversion covers every entity form: named entities (`&amp;`, `&lt;`, `&gt;`), decimal numeric entities (`&#123;`), and hexadecimal entities (`&#x7B;`).
How Entity Conversion Works
When Beautiful Soup parses HTML, it automatically decodes entity references in the text content. This means you get clean, readable text without needing to manually handle entity conversion.
from bs4 import BeautifulSoup
# HTML with various entity types
html_doc = """
<p>Company: Johnson &amp; Johnson</p>
<p>Formula: 5 &lt; 10 &gt; 3</p>
<p>Copyright: &copy; 2024</p>
<p>Numeric: &#8364; (Euro symbol)</p>
<p>Hex: &#x2603; (Snowman)</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Extract text - entities are automatically converted
for p in soup.find_all('p'):
    print(p.get_text())
# Output:
# Company: Johnson & Johnson
# Formula: 5 < 10 > 3
# Copyright: © 2024
# Numeric: € (Euro symbol)
# Hex: ☃ (Snowman)
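
Decoding applies to the text you read out of the tree. When the tree is serialized back to markup, Beautiful Soup escapes characters again so the output stays valid HTML. The sketch below illustrates this with a made-up one-line document; the variable names are illustrative only, and the exact entity names produced by the `"html"` formatter may vary between Beautiful Soup versions.

```python
from bs4 import BeautifulSoup

# Hypothetical input mixing named entities with plain text
soup = BeautifulSoup("<p>Caf&eacute; &amp; bar &copy; 2024</p>", "html.parser")
p = soup.p

# Reading text: every entity is decoded to its Unicode character
text = p.get_text()

# Serializing with the default formatter: only characters that would
# break the markup (such as a bare ampersand) are escaped again
default_markup = str(p)

# Serializing with formatter="html": characters that have a named
# entity are written as that entity where possible
named_markup = p.decode(formatter="html")

print(text)            # Café & bar © 2024
print(default_markup)  # <p>Café &amp; bar © 2024</p>
print(named_markup)
```

Use `get_text()` when you want clean text, and a formatter when you need to write HTML back out with entities intact.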
Common HTML Entities
Beautiful Soup handles all standard HTML entities automatically:
| Entity | Character | Description |
|--------|-----------|-------------|
| `&amp;` | & | Ampersand |
| `&lt;` | < | Less than |
| `&gt;` | > | Greater than |
| `&quot;` | " | Quotation mark |
| `&apos;` | ' | Apostrophe |
| `&nbsp;` | U+00A0 | Non-breaking space |
| `&copy;` | © | Copyright |
| `&reg;` | ® | Registered trademark |
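
One entry in this list deserves care: a non-breaking space decodes to the character U+00A0, which looks like a space but does not compare equal to an ASCII one. The sketch below uses the standard library's `html.unescape` as a stand-in for Beautiful Soup's decoding (an assumption made to keep the example dependency-free); the `raw` string is a hypothetical input.

```python
import html

# Hypothetical scraped fragment using non-breaking spaces
raw = "Weight:&nbsp;10&nbsp;kg"

# Decoding yields U+00A0 characters, not ASCII spaces
decoded = html.unescape(raw)
matches_plain = decoded == "Weight: 10 kg"

# Normalize to ASCII spaces before comparing or storing the text
normalized = decoded.replace("\u00a0", " ")

print(repr(decoded))      # 'Weight:\xa010\xa0kg'
print(matches_plain)      # False
print(repr(normalized))   # 'Weight: 10 kg'
```

The same `replace` call can be applied to the string returned by `get_text()` if exact string comparisons matter.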
Parser Differences
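All three parsers supported by Beautiful Soup decode well-formed entities the same way. They can differ on malformed input, such as a bare `&` or an entity missing its semicolon. Note that `lxml` and `html5lib` must be installed separately.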
from bs4 import BeautifulSoup
html_with_entities = "<p>Price: $5 &amp; up</p>"
# Using different parsers
soup_html = BeautifulSoup(html_with_entities, 'html.parser')
soup_lxml = BeautifulSoup(html_with_entities, 'lxml')
soup_html5lib = BeautifulSoup(html_with_entities, 'html5lib')
# All produce the same result
print(soup_html.get_text()) # Price: $5 & up
print(soup_lxml.get_text()) # Price: $5 & up
print(soup_html5lib.get_text()) # Price: $5 & up
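
Well-formed entities are the easy case. Malformed ones, such as a bare ampersand or a missing semicolon, are where results are most likely to vary. The sketch below does not show what each Beautiful Soup parser does; it uses the standard library's `html.unescape`, which applies the HTML5 rules, as a reference point. The inputs are hypothetical, and `lxml` or `html5lib` may treat the same strings differently.

```python
import html

# Hypothetical inputs: each one breaks the "&name;" pattern differently
malformed = ["AT&T", "&copy 2024", "5 &lt 10", "&bogus; value"]

# Decode each sample using the HTML5 rules implemented by the stdlib
decoded = {source: html.unescape(source) for source in malformed}

for source, result in decoded.items():
    print(f"{source!r} -> {result!r}")

# 'AT&T' -> 'AT&T'                    (no entity name matches, left alone)
# '&copy 2024' -> '© 2024'            (legacy name accepted without ';')
# '5 &lt 10' -> '5 < 10'              (legacy name accepted without ';')
# '&bogus; value' -> '&bogus; value'  (unknown name, left alone)
```

If your input is known to be messy, it is worth running a few such samples through the parser you actually use and checking the output.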
Working with Attributes
Entity conversion also applies to attribute values:
html_with_attr_entities = '<a href="search.php?q=cats&amp;dogs" title="Cats &amp; Dogs">Link</a>'
soup = BeautifulSoup(html_with_attr_entities, 'html.parser')
link = soup.find('a')
print(f"URL: {link['href']}") # URL: search.php?q=cats&dogs
print(f"Title: {link['title']}") # Title: Cats & Dogs
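
Because attribute values come back decoded, an `href` can be passed straight to URL utilities. A minimal sketch, assuming a hypothetical link with two query parameters and using `html.unescape` in place of the attribute lookup so it runs without Beautiful Soup:

```python
import html
from urllib.parse import parse_qs, urlsplit

# The href as written in the HTML source (hypothetical example)
raw_href = "search.php?q=cats&amp;sort=new"

# Stand-in for link['href']: the decoded attribute value
href = html.unescape(raw_href)

# The decoded value splits cleanly into query parameters
params = parse_qs(urlsplit(href).query)

print(href)    # search.php?q=cats&sort=new
print(params)  # {'q': ['cats'], 'sort': ['new']}
```

Parsing the raw, still-encoded string instead would produce a mangled parameter name, which is why the decoding step matters for URLs.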
Manual Entity Handling
For cases where you need to manually encode or decode entities, use Python's built-in `html` module:
import html
# Encoding (converting characters to entities)
text = "Johnson & Johnson's price: $5 < $10"
encoded = html.escape(text)
print(encoded) # Johnson &amp; Johnson&#x27;s price: $5 &lt; $10
# Decoding (converting entities to characters)
entity_text = "Johnson &amp; Johnson&#x27;s price: $5 &lt; $10"
decoded = html.unescape(entity_text)
print(decoded) # Johnson & Johnson's price: $5 < $10
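
By default `html.escape` also escapes quote characters, which is what you want inside attribute values but is unnecessary in body text. The sketch below, using a made-up sample string, compares the two modes and checks that escaping followed by unescaping returns the original text.

```python
import html

# Hypothetical text containing quotes, an ampersand, and an angle bracket
text = 'He said "5 < 10" & left'

# Default (quote=True): safe for attribute values, escapes " and '
full = html.escape(text)

# quote=False: escapes only &, <, and >, enough for element content
body_only = html.escape(text, quote=False)

# Escaping and then unescaping is lossless in both modes
round_trip_ok = html.unescape(full) == text and html.unescape(body_only) == text

print(full)           # He said &quot;5 &lt; 10&quot; &amp; left
print(body_only)      # He said "5 &lt; 10" &amp; left
print(round_trip_ok)  # True
```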
Real-World Example
Here's a practical example extracting product information that contains entities:
from bs4 import BeautifulSoup
# Sample e-commerce HTML
html_content = """
<div class="product">
    <h2>Johnson &amp; Johnson Baby Shampoo</h2>
    <p class="price">Price: $12.99 &amp; up</p>
    <p class="description">Gentle formula - &quot;No more tears&quot; &copy; 2024</p>
</div>
"""
soup = BeautifulSoup(html_content, 'html.parser')
# Extract product details - entities automatically converted
product_name = soup.find('h2').get_text()
price = soup.find('p', class_='price').get_text()
description = soup.find('p', class_='description').get_text()
print(f"Product: {product_name}")
print(f"Price: {price}")
print(f"Description: {description}")
# Output:
# Product: Johnson & Johnson Baby Shampoo
# Price: Price: $12.99 & up
# Description: Gentle formula - "No more tears" © 2024
Installation
To ensure you have the latest version of Beautiful Soup:
pip install beautifulsoup4
# or to upgrade
pip install --upgrade beautifulsoup4
Beautiful Soup's automatic entity conversion makes it effortless to extract clean, readable text from HTML documents without worrying about HTML encoding issues.