Beautiful Soup allows you to use regular expressions for powerful pattern matching when searching HTML and XML documents. This is particularly useful when you need to find elements that match complex patterns rather than exact values.
Prerequisites
Install Beautiful Soup and a parser:
pip install beautifulsoup4
pip install lxml # Recommended parser (or use built-in 'html.parser')
Basic Setup
from bs4 import BeautifulSoup
import re
# Sample HTML for examples
html_doc = """
<html>
<head>
<title>Web Scraping Tutorial</title>
</head>
<body>
<div class="content-main">Main content</div>
<div class="content-sidebar">Sidebar content</div>
<span class="highlight-red">Important text</span>
<span class="highlight-blue">Another highlight</span>
<a href="https://example.com/page1">Link 1</a>
<a href="https://example.com/page2">Link 2</a>
<p id="para1">First paragraph</p>
<p id="para2">Second paragraph</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
Finding Tags by Name Pattern
Use regex to find tags whose names match specific patterns:
# Find all tags starting with 'p' (p, pre, param, etc.)
p_tags = soup.find_all(re.compile(r'^p'))
print([tag.name for tag in p_tags]) # ['p', 'p']
# Find all heading tags (h1, h2, h3, etc.)
heading_tags = soup.find_all(re.compile(r'^h[1-6]$'))
# Find tags ending with specific pattern
div_span_tags = soup.find_all(re.compile(r'(div|span)$'))
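Beautiful Soup applies a compiled pattern to each tag name with a regex search, so anchors like `^` and `$` matter. A minimal standalone check of this behavior (using the built-in html.parser so no extra install is assumed, and a hypothetical snippet with heading tags, since the sample document above has none):

```python
from bs4 import BeautifulSoup
import re

# Small standalone document to test name patterns against
snippet = "<h1>Title</h1><h2>Sub</h2><header>Nav</header><p>Body</p>"
soup_check = BeautifulSoup(snippet, 'html.parser')

# Anchored pattern: matches h1-h6 only, not <header>
headings = soup_check.find_all(re.compile(r'^h[1-6]$'))
print([t.name for t in headings])  # ['h1', 'h2']

# Without the closing anchor, <header> also matches
# (and <html>/<head> would too, in a full document)
loose = soup_check.find_all(re.compile(r'^h'))
print([t.name for t in loose])  # ['h1', 'h2', 'header']
```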
Searching Attributes with Regex
Find elements based on attribute patterns:
# Find elements with class names starting with 'content'
content_divs = soup.find_all('div', class_=re.compile(r'^content'))
for div in content_divs:
print(div.get('class')) # ['content-main'], ['content-sidebar']
# Find elements with class names containing 'highlight'
highlights = soup.find_all(attrs={'class': re.compile(r'highlight')})
# Find links with specific URL patterns
external_links = soup.find_all('a', href=re.compile(r'^https://'))
for link in external_links:
print(link.get('href'))
# Find elements with IDs matching a pattern
paragraphs = soup.find_all(attrs={'id': re.compile(r'^para\d+$')})
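One subtlety worth knowing: when an element carries several classes, Beautiful Soup tests a `class_` pattern against each class value separately, not against the joined attribute string. A small sketch with hypothetical class names:

```python
from bs4 import BeautifulSoup
import re

snippet = '<div class="card content-main">A</div><div class="card">B</div>'
soup_multi = BeautifulSoup(snippet, 'html.parser')

# The pattern is tried against each class value independently,
# so 'content-main' matches even though it is the second class listed
matches = soup_multi.find_all('div', class_=re.compile(r'^content'))
print(len(matches))  # 1
print(matches[0].get('class'))  # ['card', 'content-main']
```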
Searching Text Content
Use regex to find text that matches specific patterns:
# Find text containing specific words
tutorial_text = soup.find_all(string=re.compile(r'Tutorial'))
for text in tutorial_text:
print(text.strip())
# Find text matching exact patterns
important_text = soup.find_all(string=re.compile(r'Important.*text'))
# Case-insensitive search
case_insensitive = soup.find_all(string=re.compile(r'CONTENT', re.IGNORECASE))
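Keep in mind that `string=` searches return the matching NavigableString objects, not the tags that contain them; to work with the enclosing element, go through `.parent`. A minimal sketch:

```python
from bs4 import BeautifulSoup
import re

snippet = "<title>Web Scraping Tutorial</title><p>Other text</p>"
soup_text = BeautifulSoup(snippet, 'html.parser')

hits = soup_text.find_all(string=re.compile(r'Tutorial'))
# Each hit is a NavigableString; .parent gives the enclosing tag
for hit in hits:
    print(hit.parent.name, '->', hit.strip())
# title -> Web Scraping Tutorial
```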
Advanced Regex Examples
Email Pattern Matching
html_with_emails = """
<div>
<p>Contact us at support@example.com</p>
<p>Sales: sales@company.org</p>
<p>Invalid email: not-an-email</p>
</div>
"""
soup = BeautifulSoup(html_with_emails, 'lxml')
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emails = soup.find_all(string=email_pattern)
for email in emails:
print(email.strip())
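Note that `find_all(string=...)` returns whole text nodes such as "Contact us at support@example.com", not just the matched substring. To isolate the addresses themselves, run the pattern's `findall` over each matching node, as in this sketch:

```python
from bs4 import BeautifulSoup
import re

snippet = "<p>Contact us at support@example.com</p><p>Sales: sales@company.org</p>"
soup_mail = BeautifulSoup(snippet, 'html.parser')
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Collect only the email addresses, not the surrounding text
addresses = []
for node in soup_mail.find_all(string=email_pattern):
    addresses.extend(email_pattern.findall(node))
print(addresses)  # ['support@example.com', 'sales@company.org']
```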
Phone Number Extraction
html_with_phones = """
<div>
<p>Call us: (555) 123-4567</p>
<p>Mobile: 555-987-6543</p>
<p>International: +1-555-555-5555</p>
</div>
"""
soup = BeautifulSoup(html_with_phones, 'lxml')
phone_pattern = re.compile(r'(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
phones = soup.find_all(string=phone_pattern)
for phone in phones:
print(phone.strip())
Complex Attribute Matching
html_complex = """
<div data-id="user-123" data-type="admin">Admin User</div>
<div data-id="user-456" data-type="regular">Regular User</div>
<div data-id="product-789" data-type="featured">Product</div>
"""
soup = BeautifulSoup(html_complex, 'lxml')
# Find user elements (data-id starting with 'user-')
users = soup.find_all(attrs={'data-id': re.compile(r'^user-\d+$')})
for user in users:
print(f"ID: {user.get('data-id')}, Type: {user.get('data-type')}")
Combining Multiple Conditions
# Re-parse the original sample document (soup was reassigned above)
soup = BeautifulSoup(html_doc, 'lxml')
# Find divs with class containing 'content' AND specific text
content_divs = soup.find_all(
'div',
class_=re.compile(r'content'),
string=re.compile(r'content', re.IGNORECASE)
)
# Custom function for complex matching
def complex_match(tag):
return (tag.name == 'span' and
tag.get('class') and
re.search(r'highlight', ' '.join(tag.get('class'))))
complex_elements = soup.find_all(complex_match)
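For attribute substring checks like the one above, CSS selectors via `select()` can be a lighter-weight alternative to a matching function. This relies on the soupsieve package, which ships as a dependency of modern Beautiful Soup releases:

```python
from bs4 import BeautifulSoup

snippet = ('<span class="highlight-red">A</span>'
           '<span class="highlight-blue">B</span>'
           '<span class="plain">C</span>')
soup_css = BeautifulSoup(snippet, 'html.parser')

# [class*="highlight"] matches any span whose class attribute
# contains the substring 'highlight'
spans = soup_css.select('span[class*="highlight"]')
print([s.get('class') for s in spans])  # [['highlight-red'], ['highlight-blue']]
```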
Performance Considerations
# Pre-compile regex patterns for better performance
CLASS_PATTERN = re.compile(r'^content-')
ID_PATTERN = re.compile(r'^para\d+$')
# Reuse compiled patterns
content_elements = soup.find_all(class_=CLASS_PATTERN)
paragraph_elements = soup.find_all(attrs={'id': ID_PATTERN})
Best Practices
- Pre-compile patterns: use re.compile() for patterns used multiple times
- Be specific: narrow patterns perform better than broad ones
- Consider alternatives: simple string methods might be faster for basic searches
- Use raw strings: always use r'' for regex patterns to avoid escape issues
- Test patterns: verify your regex works with edge cases
Common Patterns
# URLs
URL_PATTERN = re.compile(r'https?://[^\s<>"]{2,}')
# CSS classes with prefixes
CSS_PREFIX = re.compile(r'^(btn|nav|header|footer)-')
# Numeric IDs
NUMERIC_ID = re.compile(r'^\d+$')
# Date patterns (YYYY-MM-DD)
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')
# HTML tags
TAG_PATTERN = re.compile(r'^(div|span|p|a)$')
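Any of these can be dropped straight into find_all. For instance, pulling date strings out of text nodes with the date pattern above, again using findall to extract just the dates from each matching node (a sketch with hypothetical markup):

```python
from bs4 import BeautifulSoup
import re

DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')

snippet = "<p>Published 2024-01-15</p><p>Updated 2024-03-02</p><p>No date here</p>"
soup_dates = BeautifulSoup(snippet, 'html.parser')

dates = []
for node in soup_dates.find_all(string=DATE_PATTERN):
    dates.extend(DATE_PATTERN.findall(node))
print(dates)  # ['2024-01-15', '2024-03-02']
```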
Regular expressions with Beautiful Soup provide powerful pattern matching capabilities for complex web scraping tasks. Use them when simple string matching isn't sufficient, but remember that they can be slower than basic searches for simple use cases.