Parsing HTML or XML documents with Beautiful Soup is a common task in web scraping and data extraction. Beautiful Soup is a Python library designed for quick turnaround projects like screen scraping.
Installation
Before you start, install the beautifulsoup4 package along with a parser such as lxml or html.parser (the latter is built into Python). You can install Beautiful Soup and lxml with pip:
pip install beautifulsoup4 lxml
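Since lxml is an optional extra while html.parser ships with Python, it can be handy to check at runtime which parser is available. The following is a minimal sketch; the helper name `pick_parser` and its fallback order are illustrative assumptions, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def pick_parser(preferred=("lxml", "html.parser")):
    """Return the first parser name in `preferred` that Beautiful Soup can load.

    `pick_parser` is a hypothetical helper written for this example.
    """
    for name in preferred:
        try:
            # Beautiful Soup raises FeatureNotFound if the parser is not installed
            BeautifulSoup("<p>probe</p>", name)
            return name
        except FeatureNotFound:
            continue
    raise RuntimeError("none of the requested parsers is installed")


parser_name = pick_parser()
print(parser_name)  # 'lxml' if it is installed, otherwise 'html.parser'
```

The returned name can then be passed as the second argument to BeautifulSoup().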
Basic Usage
Here is a step-by-step guide on how to parse HTML or XML documents with Beautiful Soup.
- Import the library and choose a parser:
from bs4 import BeautifulSoup
# Choose a parser, for example 'lxml'
# You can also use 'html.parser' for HTML documents
- Load your document into Beautiful Soup:
# Assuming `html_doc` is a variable containing your HTML as a string
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
...
</body>
</html>
"""
# Parse the HTML document with Beautiful Soup
soup = BeautifulSoup(html_doc, 'lxml') # or 'html.parser'
- Navigate the parse tree using Beautiful Soup's methods and properties:
# Find the title tag
title_tag = soup.title
print(title_tag) # <title>The Dormouse's story</title>
# Get the text within the title tag
print(title_tag.string) # The Dormouse's story
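Beyond soup.title, tag names can be chained to walk down the tree, and properties such as .parent walk back up. The sketch below assumes a slightly extended sample document (the second paragraph and its link are added for illustration) and uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative sample; the "story" paragraph and its link are assumptions
sample_html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/elsie" id="link1">Elsie</a> lived here.</p>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Walk upward: the tag that encloses <title>
parent_name = soup.title.parent.name

# Walk downward by chaining tag names: first <p> in <body>, then its <b>
bold_tag = soup.body.p.b

# `class` is a multi-valued attribute, so it comes back as a list
bold_parent_classes = bold_tag.parent["class"]

# .get() returns None instead of raising when an attribute is missing
link_href = soup.a.get("href")

# Direct child tags of <body> only (recursive=False skips deeper descendants)
body_tag_names = [child.name for child in soup.body.find_all(True, recursive=False)]

print(parent_name)          # head
print(bold_parent_classes)  # ['title']
print(body_tag_names)       # ['p', 'p']
```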
- Search the tree with methods like find() and find_all():
# Find the first <p> tag with the class "title"
first_p_tag = soup.find('p', class_='title')
print(first_p_tag) # <p class="title"><b>The Dormouse's story</b></p>
# Find all <a> tags
all_a_tags = soup.find_all('a')
for tag in all_a_tags:
    print(tag)
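find() and find_all() also accept attribute filters, a limit, and a list of tag names. The sketch below illustrates a few of these on a made-up fragment (the three links are assumptions for the example); note that find() returns None when nothing matches, while find_all() returns an empty list.

```python
from bs4 import BeautifulSoup

# Illustrative fragment; the links are assumptions for this example
sample_html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# No <div id="unwanted"> exists here, so find() returns None
missing = soup.find("div", id="unwanted")

# Filter by CSS class (class_ avoids the Python keyword `class`)
sisters = soup.find_all("a", class_="sister")

# Stop after the first two matches
first_two = soup.find_all("a", limit=2)

# Search by attribute alone, regardless of tag name
link2 = soup.find(id="link2")

# href=True keeps only tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# A list of names matches any of them, in document order
mixed_names = [tag.name for tag in soup.find_all(["b", "a"])]

print(missing)      # None
print(len(sisters)) # 3
print(mixed_names)  # ['b', 'a', 'a', 'a']
```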
- Modify the tree if needed (like adding, altering, or deleting tags and strings):
# Add a new element
new_tag = soup.new_tag('a', href='http://www.example.com')
new_tag.string = 'Example Link'
soup.body.append(new_tag)
# Modify an element
first_p_tag['class'] = 'newClass'
# Remove an element (find() returns None when nothing matches, so check first)
unwanted_tag = soup.find('div', id='unwanted')
if unwanted_tag is not None:
    unwanted_tag.decompose()
- Output the modified HTML/XML:
print(soup.prettify())
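To see the modification and output steps working together, here is a small self-contained sketch. The input fragment is an assumption made up for the example; it contains a div with id "unwanted" so that the removal step has something to delete.

```python
from bs4 import BeautifulSoup

# Illustrative input; the "unwanted" div is an assumption for this example
sample_html = '<body><p class="title">Hello</p><div id="unwanted">ad</div></body>'
soup = BeautifulSoup(sample_html, "html.parser")

# Add a new element at the end of <body>
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Example Link"
soup.body.append(new_tag)

# Modify an attribute on an existing element
soup.p["class"] = "newClass"

# Remove an element, guarding against find() returning None
unwanted = soup.find("div", id="unwanted")
if unwanted is not None:
    unwanted.decompose()

# str() gives compact markup; prettify() adds line breaks and indentation
compact = str(soup)
pretty = soup.prettify()

print(compact)
print(pretty)
```

str(soup) is usually the better choice when the output is fed to another program, since prettify() inserts whitespace that can change how text nodes read.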
Advanced Queries
Beautiful Soup supports CSS selector queries, similar to document.querySelectorAll() in JavaScript. Use the .select() method to find elements by CSS selector:
# Find elements with the 'title' class
titles = soup.select('.title')
for title in titles:
    print(title.get_text())
# Find all <a> tags within <div> tags
a_within_divs = soup.select('div a')
for a in a_within_divs:
    # .get() returns None instead of raising KeyError when href is missing
    print(a.get('href'))
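A few more selector forms are worth knowing: select_one() returns only the first match (or None), and attribute and child selectors work as they do in CSS. The fragment below is an assumption made up for illustration, including one link without an href.

```python
from bs4 import BeautifulSoup

# Illustrative fragment; the second <a> deliberately has no href
sample_html = """
<div id="links">
  <a href="http://example.com/elsie" class="sister">Elsie</a>
  <a class="sister">No link here</a>
</div>
<p class="title">The Dormouse's story</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# First match only; returns None when nothing matches
first_title = soup.select_one("p.title")
no_match = soup.select_one("table")

# Attribute selector: only <a> tags that have an href
linked = soup.select("div#links a[href]")

# Child combinator: <a class="sister"> directly inside a <div>
direct = soup.select("div > a.sister")

# .get() tolerates the link that has no href
hrefs = [a.get("href") for a in soup.select("div a")]

print(first_title.get_text())  # The Dormouse's story
print(no_match)                # None
print(hrefs)                   # ['http://example.com/elsie', None]
```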
Remember that Beautiful Soup handles the parsing and navigation of HTML/XML documents. To actually retrieve the documents from the web, you would typically use a library such as requests in Python.
Here's an example of fetching a webpage and then parsing it with Beautiful Soup:
import requests
from bs4 import BeautifulSoup
# Fetch the webpage
response = requests.get('http://example.com')
# Check for successful request
if response.status_code == 200:
    # Parse the content of the request with Beautiful Soup
    soup = BeautifulSoup(response.content, 'lxml')
    # Now you can work with `soup` as shown above
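In practice it often helps to separate fetching from parsing, so the parsing logic can be exercised without network access. The following is one possible sketch; the function names `fetch_soup` and `extract_title`, the 10-second timeout, and the choice of html.parser are all assumptions made for this example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup object.

    Hypothetical helper: raises requests.HTTPError for 4xx/5xx responses
    instead of silently skipping them.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Passing bytes lets Beautiful Soup detect the document encoding itself
    return BeautifulSoup(response.content, "html.parser")


def extract_title(soup):
    """Return the stripped text of <title>, or None if there is no title."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


# The parsing half works on any markup, with no network involved
offline_soup = BeautifulSoup(b"<html><head><title> Demo </title></head></html>", "html.parser")
print(extract_title(offline_soup))  # Demo
```

With network access, extract_title(fetch_soup('http://example.com')) would combine the two steps; wrapping that call in a try/except for requests.RequestException handles timeouts and connection errors.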
Beautiful Soup is a versatile library that can handle most HTML/XML parsing tasks with ease. Its simplicity and power make it a go-to tool for developers working on web scraping and data extraction projects.