How do I use regular expressions with Beautiful Soup?

Beautiful Soup allows you to use regular expressions for powerful pattern matching when searching HTML and XML documents. This is particularly useful when you need to find elements that match complex patterns rather than exact values.

Prerequisites

Install Beautiful Soup and a parser:

pip install beautifulsoup4
pip install lxml  # Recommended parser (or use built-in 'html.parser')

Basic Setup

from bs4 import BeautifulSoup
import re

# Sample HTML for examples
html_doc = """
<html>
<head>
    <title>Web Scraping Tutorial</title>
</head>
<body>
    <div class="content-main">Main content</div>
    <div class="content-sidebar">Sidebar content</div>
    <span class="highlight-red">Important text</span>
    <span class="highlight-blue">Another highlight</span>
    <a href="https://example.com/page1">Link 1</a>
    <a href="https://example.com/page2">Link 2</a>
    <p id="para1">First paragraph</p>
    <p id="para2">Second paragraph</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

Finding Tags by Name Pattern

Use regex to find tags whose names match specific patterns:

# Find all tags starting with 'p' (p, pre, param, etc.)
p_tags = soup.find_all(re.compile(r'^p'))
print([tag.name for tag in p_tags])  # ['p', 'p'] (only <p> tags appear in this sample)

# Find all heading tags (h1, h2, h3, etc.)
heading_tags = soup.find_all(re.compile(r'^h[1-6]$'))

# Find tags whose names are (or end with) 'div' or 'span'
div_span_tags = soup.find_all(re.compile(r'(div|span)$'))
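A compiled pattern passed as the first argument is tested against each tag's name with search(), which is why the $ anchor matters. A minimal self-contained check of the heading pattern, using the built-in parser and a throwaway mini-document so the main soup above stays untouched:

```python
from bs4 import BeautifulSoup
import re

# Throwaway document for a quick check of the ^h[1-6]$ heading pattern.
demo_html = "<h1>Title</h1><h2>Sub</h2><header>Nav</header><p>Body</p>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

headings = demo_soup.find_all(re.compile(r"^h[1-6]$"))
print([t.name for t in headings])  # ['h1', 'h2'] -- <header> is excluded by the $ anchor
```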

Searching Attributes with Regex

Find elements based on attribute patterns:

# Find elements with class names starting with 'content'
content_divs = soup.find_all('div', class_=re.compile(r'^content'))
for div in content_divs:
    print(div.get('class'))  # ['content-main'], ['content-sidebar']

# Find elements with class names containing 'highlight'
highlights = soup.find_all(attrs={'class': re.compile(r'highlight')})

# Find links with specific URL patterns
external_links = soup.find_all('a', href=re.compile(r'^https://'))
for link in external_links:
    print(link.get('href'))

# Find elements with IDs matching a pattern
paragraphs = soup.find_all(attrs={'id': re.compile(r'^para\d+$')})
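One subtlety worth knowing: class is a multi-valued attribute in Beautiful Soup, so a regex supplied via class_ is tested against each class token individually, not against the joined attribute string. A small sketch with its own mini-document:

```python
from bs4 import BeautifulSoup
import re

# The regex matches the 'content-main' token even though it is not
# the first class in the attribute.
demo_html = '<div class="card content-main">A</div><div class="other">B</div>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

prefix_hits = demo_soup.find_all("div", class_=re.compile(r"^content"))
print([d.get_text() for d in prefix_hits])  # ['A']
```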

Searching Text Content

Use regex to find text that matches specific patterns:

# Find text containing specific words
tutorial_text = soup.find_all(string=re.compile(r'Tutorial'))
for text in tutorial_text:
    print(text.strip())

# Find text matching exact patterns
important_text = soup.find_all(string=re.compile(r'Important.*text'))

# Case-insensitive search
case_insensitive = soup.find_all(string=re.compile(r'CONTENT', re.IGNORECASE))
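Note that find_all(string=...) returns NavigableString objects, not tags; use .parent (or .find_parent()) to reach the enclosing element. A short self-contained sketch:

```python
from bs4 import BeautifulSoup
import re

demo_html = "<p id='a'>Important text</p><p id='b'>Other</p>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Each match is a text node; its .parent is the tag that contains it.
text_matches = demo_soup.find_all(string=re.compile(r"Important"))
for s in text_matches:
    print(s.parent.name, s.parent.get("id"))  # p a
```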

Advanced Regex Examples

Email Pattern Matching

html_with_emails = """
<div>
    <p>Contact us at support@example.com</p>
    <p>Sales: sales@company.org</p>
    <p>Invalid email: not-an-email</p>
</div>
"""

soup = BeautifulSoup(html_with_emails, 'lxml')
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emails = soup.find_all(string=email_pattern)
for email in emails:
    print(email.strip())
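Because find_all(string=...) returns the whole text node (e.g. 'Contact us at support@example.com'), a common follow-up is to run the same pattern's findall() over the document text to extract just the addresses. A sketch with its own sample markup:

```python
from bs4 import BeautifulSoup
import re

demo_html = "<p>Contact us at support@example.com</p><p>Sales: sales@company.org</p>"
demo_soup = BeautifulSoup(demo_html, "html.parser")
email_pattern = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

# get_text(" ") joins text nodes with a space so adjacent nodes
# cannot run together and confuse the pattern.
emails = email_pattern.findall(demo_soup.get_text(" "))
print(emails)  # ['support@example.com', 'sales@company.org']
```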

Phone Number Extraction

html_with_phones = """
<div>
    <p>Call us: (555) 123-4567</p>
    <p>Mobile: 555-987-6543</p>
    <p>International: +1-555-555-5555</p>
</div>
"""

soup = BeautifulSoup(html_with_phones, 'lxml')
phone_pattern = re.compile(r'(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
phones = soup.find_all(string=phone_pattern)
for phone in phones:
    print(phone.strip())

Complex Attribute Matching

html_complex = """
<div data-id="user-123" data-type="admin">Admin User</div>
<div data-id="user-456" data-type="regular">Regular User</div>
<div data-id="product-789" data-type="featured">Product</div>
"""

soup = BeautifulSoup(html_complex, 'lxml')

# Find user elements (data-id starting with 'user-')
users = soup.find_all(attrs={'data-id': re.compile(r'^user-\d+$')})
for user in users:
    print(f"ID: {user.get('data-id')}, Type: {user.get('data-type')}")
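Several attribute patterns can also be combined in one attrs dict; a tag must satisfy all of them. A self-contained sketch reusing the same kind of markup:

```python
from bs4 import BeautifulSoup
import re

demo_html = """
<div data-id="user-123" data-type="admin">Admin</div>
<div data-id="user-456" data-type="regular">Regular</div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Both patterns must match for a tag to be returned.
admins = demo_soup.find_all(attrs={
    "data-id": re.compile(r"^user-\d+$"),
    "data-type": re.compile(r"^admin$"),
})
print([d.get_text() for d in admins])  # ['Admin']
```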

Combining Multiple Conditions

# Rebuild the soup from the original sample document, since `soup` was
# reassigned in the previous example
soup = BeautifulSoup(html_doc, 'lxml')

# Find divs with a class containing 'content' AND text containing 'content'
content_divs = soup.find_all(
    'div',
    class_=re.compile(r'content'),
    string=re.compile(r'content', re.IGNORECASE)
)

# Custom function for complex matching
def complex_match(tag):
    return (tag.name == 'span' and 
            tag.get('class') and 
            re.search(r'highlight', ' '.join(tag.get('class'))))

complex_elements = soup.find_all(complex_match)
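For many of these cases, CSS attribute selectors via select() are a regex-free alternative: [class^=...] (starts with), [class$=...] (ends with), [class*=...] (contains). A minimal sketch:

```python
from bs4 import BeautifulSoup

demo_html = '<span class="highlight-red">A</span><span class="plain">B</span>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

# [class*="highlight"] matches any span whose class value contains 'highlight'.
css_hits = demo_soup.select('span[class*="highlight"]')
print([s.get_text() for s in css_hits])  # ['A']
```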

Performance Considerations

# Pre-compile regex patterns for better performance
CLASS_PATTERN = re.compile(r'^content-')
ID_PATTERN = re.compile(r'^para\d+$')

# Reuse compiled patterns
content_elements = soup.find_all(class_=CLASS_PATTERN)
paragraph_elements = soup.find_all(attrs={'id': ID_PATTERN})

Best Practices

  1. Pre-compile patterns: Use re.compile() for patterns used multiple times
  2. Be specific: Narrow patterns perform better than broad ones
  3. Consider alternatives: Simple string methods might be faster for basic searches
  4. Use raw strings: Always use r'' for regex patterns to avoid escape issues
  5. Test patterns: Verify your regex works with edge cases
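Practice 3 in action: for a simple prefix test, a plain function needs no regex at all and is often clearer:

```python
from bs4 import BeautifulSoup

demo_html = '<div class="content-main">A</div><div class="sidebar">B</div>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

# The function is called once per class token; guard against None for
# tags that have no class attribute at all.
prefix_divs = demo_soup.find_all(
    "div", class_=lambda c: c is not None and c.startswith("content")
)
print([d.get_text() for d in prefix_divs])  # ['A']
```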

Common Patterns

# URLs
URL_PATTERN = re.compile(r'https?://[^\s<>"]{2,}')

# CSS classes with prefixes
CSS_PREFIX = re.compile(r'^(btn|nav|header|footer)-')

# Numeric IDs
NUMERIC_ID = re.compile(r'^\d+$')

# Date patterns (YYYY-MM-DD)
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')

# HTML tags
TAG_PATTERN = re.compile(r'^(div|span|p|a)$')
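A quick sanity check of one of these patterns against a sample document (practice 5 above):

```python
from bs4 import BeautifulSoup
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
demo_html = "<p>Published 2024-01-15</p><p>No date here</p>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Only paragraphs whose text contains a YYYY-MM-DD date are returned.
dated = demo_soup.find_all("p", string=DATE_PATTERN)
print([p.get_text() for p in dated])  # ['Published 2024-01-15']
```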

Regular expressions with Beautiful Soup provide powerful pattern matching capabilities for complex web scraping tasks. Use them when simple string matching isn't sufficient, but remember that they can be slower than basic searches for simple use cases.
