How can I parse HTML or XML documents with Beautiful Soup?

Beautiful Soup is a Python library that parses HTML and XML documents into a tree you can navigate, search, and modify, which makes it a common choice for web scraping and data extraction. It is designed for quick-turnaround projects such as screen scraping.

Installation

Before you start, install the beautifulsoup4 package. You also need a parser: html.parser is built into Python and needs no installation, while lxml is a separate package that is generally faster and is required for parsing XML. You can install Beautiful Soup and lxml with pip:

pip install beautifulsoup4 lxml

Basic Usage

Here is a step-by-step guide on how to parse HTML or XML documents with Beautiful Soup.

Import the library and choose a parser:

   from bs4 import BeautifulSoup

   # Choose a parser, for example 'lxml'
   # You can also use 'html.parser' for HTML documents

Load your document into Beautiful Soup:

   # Assuming `html_doc` is a variable containing your HTML as a string
   html_doc = """
   <html>
       <head><title>The Dormouse's story</title></head>
       <body>
       <p class="title"><b>The Dormouse's story</b></p>
       ...
       </body>
   </html>
   """

   # Parse the HTML document with Beautiful Soup
   soup = BeautifulSoup(html_doc, 'lxml')  # or 'html.parser'
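XML documents go through the same constructor, with 'xml' passed as the parser name. Beautiful Soup relies on lxml for this, so the sketch below catches the FeatureNotFound error that is raised when lxml is missing. The catalog document and its tag names are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical XML document used only for this example
xml_doc = """
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>
"""

try:
    # 'xml' selects lxml's XML parser
    xml_soup = BeautifulSoup(xml_doc, "xml")
except FeatureNotFound:
    # Raised when no installed parser provides the 'xml' feature
    xml_soup = None

# Map each book's id attribute to the text of its <title> child
books = {}
if xml_soup is not None:
    for book in xml_soup.find_all("book"):
        books[book["id"]] = book.find("title").get_text()

print(books)
```

Unlike the HTML parsers, the XML parser does not wrap the content in html and body tags or lowercase tag names, which is usually what you want for feeds, sitemaps, and configuration files.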

Navigate the parse tree using Beautiful Soup's methods and properties:

   # Find the title tag
   title_tag = soup.title
   print(title_tag)  # <title>The Dormouse's story</title>

   # Get the text within the title tag
   print(title_tag.string)  # The Dormouse's story
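Besides jumping straight to a tag by name, you can move up and down the tree from any element. The following is a minimal sketch that uses a trimmed copy of the sample document and the built-in html.parser so it runs without extra packages.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
    <head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

bold = soup.b                          # first <b> tag in the document
parent_name = bold.parent.name         # name of the tag that encloses <b>
parent_classes = bold.parent["class"]  # class is multi-valued, so this is a list

# .children also yields text nodes, whose .name is None, so filter them out
head_children = [child.name for child in soup.head.children if child.name]

# get_text() collects all text under a tag; strip=True trims whitespace
body_text = soup.body.get_text(strip=True)

print(parent_name, parent_classes, head_children, body_text)
```

Note that .string only returns a value when a tag has a single text child, whereas get_text() works on any subtree.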

Search the tree with methods like find() and find_all():

   # Find the first <p> tag with the class "title"
   first_p_tag = soup.find('p', class_='title')
   print(first_p_tag)  # <p class="title"><b>The Dormouse's story</b></p>

   # Find all <a> tags (an empty list is returned if there are none)
   all_a_tags = soup.find_all('a')
   for tag in all_a_tags:
       print(tag)
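The sample document above elides its links, so here is a small sketch with a hypothetical fragment that does contain <a> tags. It shows a few common filters and the difference between find(), which returns None when nothing matches, and find_all(), which returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three links, one of them lacking an href
links_doc = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(links_doc, "html.parser")

all_links = soup.find_all("a")              # every <a> tag
with_href = soup.find_all("a", href=True)   # only <a> tags that have an href
hrefs = [a["href"] for a in with_href]      # safe: href is guaranteed here
first_only = soup.find_all("a", limit=1)    # stop after the first match
by_id = soup.find(id="link2")               # match on an attribute alone
missing = soup.find("table")                # no match, so this is None

print(hrefs)
print(by_id.get_text(), missing)
```

Checking for None before using the result of find() avoids an AttributeError on pages that do not contain the element you expect.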

Modify the tree if needed (like adding, altering, or deleting tags and strings):

   # Add a new element
   new_tag = soup.new_tag('a', href='http://www.example.com')
   new_tag.string = 'Example Link'
   soup.body.append(new_tag)

   # Modify an element
   first_p_tag['class'] = 'newClass'

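   # Remove an element (find() returns None when nothing matches,
   # so check before calling decompose())
   unwanted_tag = soup.find('div', id='unwanted')
   if unwanted_tag is not None:
       unwanted_tag.decompose()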

Output the modified HTML/XML:

   print(soup.prettify())
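The sample document used in the steps above has no <div id="unwanted"> element, so a self-contained variant may be easier to experiment with. The sketch below assumes a hypothetical one-line document and also contrasts prettify() with str(), which serializes the tree without adding indentation or line breaks.

```python
from bs4 import BeautifulSoup

# Hypothetical document that contains every element the edits touch
html_doc = (
    '<html><body><p class="title">Hello</p>'
    '<div id="unwanted">ads</div></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Add a new element at the end of <body>
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Example Link"
soup.body.append(new_tag)

# Change an attribute on an existing element
paragraph = soup.find("p", class_="title")
paragraph["class"] = "newClass"

# Remove an element, guarding against a missing match
unwanted = soup.find("div", id="unwanted")
if unwanted is not None:
    unwanted.decompose()

compact = str(soup)        # no extra whitespace; suited to further processing
pretty = soup.prettify()   # indented, one tag per line; suited to reading

print(compact)
```

Because prettify() inserts whitespace, it can change how text nodes look when the output is parsed again, so str() is usually the safer choice when the result is fed to another program.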

Advanced Queries

Beautiful Soup also supports CSS selectors. The .select() method takes a selector string and returns a list of all matching elements, much like document.querySelectorAll() in JavaScript:

# Find elements with the 'title' class
titles = soup.select('.title')
for title in titles:
    print(title.get_text())

# Find all <a> tags within <div> tags
a_within_divs = soup.select('div a')
for a in a_within_divs:
    # .get() returns None instead of raising KeyError when href is missing
    print(a.get('href'))
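A few more selector forms are often useful in practice. The sketch below uses a hypothetical fragment to show select_one(), which returns the first match or None, together with the child combinator and an attribute-prefix selector.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustrating selector syntax
html_doc = """
<div id="nav">
  <a href="/home" class="internal">Home</a>
  <a href="https://example.com/docs" class="external">Docs</a>
</div>
<p class="title">Heading</p>
<a name="anchor-only">No href here</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_title = soup.select_one("p.title")        # first match, or None
nav_links = soup.select("div#nav > a")          # <a> tags directly inside #nav
external = soup.select('a[href^="https://"]')   # href starts with https://
hrefs = [a.get("href") for a in soup.select("a")]  # None where href is absent
nothing = soup.select_one("table")              # no match, so None

print(first_title.get_text())
print([a.get_text() for a in nav_links])
print(hrefs)
```

Selectors are convenient when the structure you want is easier to describe as a path than as a chain of find() calls.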

Beautiful Soup only parses and navigates documents; it does not download them. To retrieve a page from the web, you would typically use an HTTP library such as requests.

Here's an example of fetching a webpage and then parsing it with Beautiful Soup:

import requests
from bs4 import BeautifulSoup

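# Fetch the webpage (the timeout keeps the request from hanging indefinitely)
response = requests.get('http://example.com', timeout=10)

# Check for successful request
if response.status_code == 200:
    # Parse the raw bytes of the response with Beautiful Soup,
    # which lets it detect the document encoding itself
    soup = BeautifulSoup(response.content, 'lxml')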

    # Now you can work with `soup` as shown above

Beautiful Soup covers most HTML/XML parsing tasks with a small API: parse the document, navigate or search the tree, modify it if needed, and serialize the result. That simplicity makes it a go-to tool for web scraping and data extraction projects.
