Yes, Beautiful Soup automatically converts entity references (like &amp;, &lt;, &gt;, etc.) to the corresponding Unicode characters when it parses an HTML document. This means that when you access the text in a parsed document, you'll see the actual characters rather than the entity references.
For example, if an HTML document contains &amp; as part of the text, Beautiful Soup will convert it to an ampersand (&) character in the output.
Here is a Python example using Beautiful Soup:
from bs4 import BeautifulSoup
html_doc = "The ampersand symbol is written as &amp; in HTML."
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.text) # Output: The ampersand symbol is written as & in HTML.
In this code snippet, the BeautifulSoup object converts the &amp; entity reference in the HTML document to a plain ampersand (&) in the output.
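The conversion applies to the parsed text. When a tag is turned back into markup, Beautiful Soup re-escapes the characters that are unsafe in HTML. The following is a minimal sketch of that round trip; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the original question).
markup = "<p>Fish &amp; chips &lt;fresh&gt;</p>"
soup = BeautifulSoup(markup, "html.parser")

# The parsed text holds the decoded characters.
text = soup.p.get_text()

# Serializing the tag back to markup escapes &, < and > again.
serialized = str(soup.p)

print(text)        # Fish & chips <fresh>
print(serialized)  # <p>Fish &amp; chips &lt;fresh&gt;</p>
```

So if you write the output back into an HTML file, use the serialized form; if you need the plain text, use get_text().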
The example above uses html.parser, the parser included in the Python standard library. If you have lxml or html5lib installed, you can specify one of those instead to use a different parsing strategy. If you do not name a parser at all, Beautiful Soup picks the best one it finds installed, so the result can differ between environments.
To illustrate the parsing with a different library:
soup_lxml = BeautifulSoup(html_doc, 'lxml')
print(soup_lxml.text) # Output: The ampersand symbol is written as & in HTML.
Just like the standard html.parser, both lxml and html5lib will handle entity conversion for you.
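Since lxml and html5lib are optional installs, one way to keep a script portable is to try the preferred parser first and fall back to html.parser. This is only a sketch: parse_with_fallback is a hypothetical helper, not part of Beautiful Soup, and it relies on the FeatureNotFound exception that Beautiful Soup raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_fallback(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")

soup, used = parse_with_fallback("Salt &amp; pepper")
print(used)             # whichever parser was found first
print(soup.get_text())  # Salt & pepper
```

Whichever parser is selected, the entity is decoded in the extracted text.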
If you ever need to encode or decode entities manually, you might want to use Python's built-in html module:
import html
# To encode entities:
encoded = html.escape("This & that")
print(encoded) # Output: This &amp; that
# To decode entities:
decoded = html.unescape("This &amp; that")
print(decoded) # Output: This & that
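A slightly fuller sketch of the same module, showing the quote parameter of html.escape and the fact that html.unescape also handles numeric references. The sample strings are made up for illustration.

```python
import html

raw = 'Tom & "Jerry" <3'

# By default, quotes are escaped too, which is what you want inside attribute values.
escaped_all = html.escape(raw)
# With quote=False, only &, < and > are escaped.
escaped_text = html.escape(raw, quote=False)

print(escaped_all)   # Tom &amp; &quot;Jerry&quot; &lt;3
print(escaped_text)  # Tom &amp; "Jerry" &lt;3

# unescape accepts named, decimal and hexadecimal references.
symbols = html.unescape("&copy; &#169; &#xA9;")
print(symbols)       # © © ©

# Escaping and then unescaping returns the original string.
round_trip = html.unescape(escaped_all)
```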
Remember to always use the latest version of Beautiful Soup to ensure the best compatibility and performance. You can install or upgrade Beautiful Soup with pip:
pip install beautifulsoup4
# or to upgrade
pip install --upgrade beautifulsoup4
Keep in mind that Beautiful Soup is a Python library; for web scraping in a JavaScript environment, you would use libraries like cheerio or jsdom. These libraries also handle HTML entity decoding automatically.