Beautiful Soup provides several methods to remove HTML tags while preserving text content. The approach you choose depends on whether you want to extract all text or selectively remove specific tags.
Method 1: Extract All Text with get_text()
The get_text() method is the most common way to extract all text content from an element, removing all HTML tags:
from bs4 import BeautifulSoup
html_content = '''
<div>
<p>Hello, <b>World</b>!</p>
<p>This is <em>emphasized</em> text with <a href="#">links</a>.</p>
</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
# Extract all text from the div element
div_element = soup.find('div')
text_content = div_element.get_text()
print(text_content)
# Output (preceded and followed by a blank line, because the newlines
# between tags are text nodes too):
# Hello, World!
# This is emphasized text with links.
# Add a separator between text nodes
text_with_separator = div_element.get_text(separator=' ')
print(text_with_separator)
# The separator is inserted between every text node, including those split
# by inline tags, so the output contains fragments such as "World !" and "links ."
# Strip whitespace from each text node and skip empty ones
clean_text = div_element.get_text(separator=' ', strip=True)
print(clean_text)
# Output: Hello, World ! This is emphasized text with links .
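When the goal is a single clean line of text, one option is to call get_text() without a separator and collapse whitespace afterwards with plain string methods. The sketch below reuses the markup from this section; the helper name normalized_text is illustrative and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html_content = '''
<div>
<p>Hello, <b>World</b>!</p>
<p>This is <em>emphasized</em> text with <a href="#">links</a>.</p>
</div>
'''


def normalized_text(element):
    """Return the element's text with every whitespace run collapsed to one space.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    # get_text() with no separator keeps inline text contiguous ("World!"),
    # and str.split() with no arguments splits on any run of whitespace.
    return ' '.join(element.get_text().split())


soup = BeautifulSoup(html_content, 'html.parser')
result = normalized_text(soup.find('div'))
print(result)
# Output: Hello, World! This is emphasized text with links.
```

This keeps punctuation attached to the preceding word, at the cost of losing the line breaks between paragraphs.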
Method 2: Remove Specific Tags with extract()
Use extract() or decompose() to remove specific tags, together with their contents, while keeping the rest of the HTML structure intact:
from bs4 import BeautifulSoup
html_content = '''
<div>
<p>Hello, <b>World</b> <i>and Universe</i>!</p>
<p>Keep this <em>emphasis</em> but remove <script>alert('bad')</script> scripts.</p>
</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
# Remove all <b> tags along with their text content
for b_tag in soup.find_all('b'):
    b_tag.extract()  # Detaches the tag from the tree and returns it
# Remove all <script> tags completely
for script in soup.find_all('script'):
    script.decompose()  # Destroys the tag and its contents
print(soup.get_text(separator=' ', strip=True))
# Output: Hello, and Universe ! Keep this emphasis but remove scripts.
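The practical difference between the two calls is what happens to the removed tag. The following sketch, using a shortened version of the markup above, shows that extract() hands the detached tag back for reuse, whereas decompose() returns nothing. Variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Hello, <b>World</b> <i>and Universe</i>!</p>', 'html.parser'
)

# extract() detaches the tag from the tree and returns it, contents included.
removed = soup.find('b').extract()
removed_markup = str(removed)    # the detached tag is still a usable object
remaining_markup = str(soup)     # the tree no longer contains <b> or its text

# decompose() also removes the tag, but destroys it and returns None.
decompose_result = soup.find('i').decompose()
after_decompose = str(soup)

print(removed_markup)      # <b>World</b>
print(remaining_markup)
print(decompose_result)    # None
print(after_decompose)
```

If the removed element is never needed again, decompose() is the simpler choice; extract() is useful when the tag should be inspected or inserted elsewhere.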
Method 3: Unwrap Tags with unwrap()
The unwrap() method removes a tag but keeps its contents in place:
from bs4 import BeautifulSoup
html_content = '''
<p>This is <b>bold text</b> and <i>italic text</i>.</p>
'''
soup = BeautifulSoup(html_content, 'html.parser')
# Remove <b> tags but keep the text
for b_tag in soup.find_all('b'):
    b_tag.unwrap()
print(soup)
# Output: <p>This is bold text and <i>italic text</i>.</p>
# Get final text content
print(soup.get_text())
# Output: This is bold text and italic text.
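The same loop extends to several tag names at once, including nested ones. A minimal sketch, assuming a hypothetical helper named unwrap_tags that is not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, tag_names):
    """Return html with the named tags removed but their contents kept.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # find_all() accepts a list of names and returns a list of matches,
    # so unwrapping inside the loop does not disturb the iteration.
    for tag in soup.find_all(tag_names):
        tag.unwrap()
    return str(soup)


result = unwrap_tags('<p>This is <b>bold <i>nested</i></b> text.</p>', ['b', 'i'])
print(result)
# Output: <p>This is bold nested text.</p>
```

Returning a string rather than the soup keeps the helper easy to chain with other string processing; returning the soup object would suit further tree edits better.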
Method 4: Using strings and stripped_strings
For more control over text extraction, use the strings and stripped_strings generators:
from bs4 import BeautifulSoup
html_content = '''
<div>
<p> Hello, <b>World</b>! </p>
<p>Beautiful <i>Soup</i> parsing </p>
</div>
'''
soup = BeautifulSoup(html_content, 'html.parser')
div_element = soup.find('div')
# Using .strings (preserves whitespace, including the newlines between tags)
text_with_whitespace = ''.join(div_element.strings)
print(repr(text_with_whitespace))
# Output: '\n Hello, World! \nBeautiful Soup parsing \n'
# Using .stripped_strings (strips each string and skips whitespace-only ones)
clean_text = ' '.join(div_element.stripped_strings)
print(clean_text)
# Output: Hello, World ! Beautiful Soup parsing
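Since both attributes are generators that yield one string per text node, the fragments can also be inspected or filtered individually before joining. The sketch below reuses the markup from this section; the isalpha() filter is only an example criterion.

```python
from bs4 import BeautifulSoup

html_content = '''
<div>
<p> Hello, <b>World</b>! </p>
<p>Beautiful <i>Soup</i> parsing </p>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')
div_element = soup.find('div')

# Each generator yields one string per text node, in document order.
raw_fragments = list(div_element.strings)
clean_fragments = list(div_element.stripped_strings)

# Fragments can be filtered one by one, e.g. keeping only alphabetic ones.
words_only = [s for s in clean_fragments if s.isalpha()]

print(clean_fragments)
# Output: ['Hello,', 'World', '!', 'Beautiful', 'Soup', 'parsing']
print(words_only)
# Output: ['World', 'Beautiful', 'Soup', 'parsing']
```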
Advanced Example: Selective Tag Removal
Here's a practical example that removes unwanted tags while preserving content structure:
from bs4 import BeautifulSoup
html_content = '''
<article>
<h2>Article Title</h2>
<p>This paragraph has <span style="color: red;">colored text</span>
and <a href="http://example.com">a link</a>.</p>
<p>Remove <script>alert('xss')</script> and <style>body{color:red}</style>
but keep <strong>important</strong> content.</p>
</article>
'''
soup = BeautifulSoup(html_content, 'html.parser')
# Remove unwanted tags completely
unwanted_tags = ['script', 'style', 'noscript']
for tag_name in unwanted_tags:
    for tag in soup.find_all(tag_name):
        tag.decompose()
# Unwrap formatting tags to keep text
formatting_tags = ['span', 'font']
for tag_name in formatting_tags:
    for tag in soup.find_all(tag_name):
        tag.unwrap()
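# Extract clean text: one line per block element, with inner whitespace collapsed.
# (get_text(separator='\n', strip=True) would instead put every text node,
# such as "a link" or "important", on its own line.)
lines = [' '.join(el.get_text().split()) for el in soup.find_all(['h2', 'p'])]
clean_text = '\n'.join(lines)
print(clean_text)
# Output:
# Article Title
# This paragraph has colored text and a link.
# Remove and but keep important content.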
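For repeated use, the steps of this example can be wrapped in a function. The following is a sketch under stated assumptions rather than a definitive implementation: the name html_to_text and its drop and blocks parameters are hypothetical, and it assumes block elements are not nested inside one another (a p inside an li would be emitted twice).

```python
from bs4 import BeautifulSoup


def html_to_text(html,
                 drop=('script', 'style', 'noscript'),
                 blocks=('h1', 'h2', 'h3', 'p', 'li')):
    """Return one line of text per block element (hypothetical helper).

    Tags named in `drop` are removed with their contents; inline tags inside
    a block are ignored and only their text is kept.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(list(drop)):
        tag.decompose()
    lines = []
    for block in soup.find_all(list(blocks)):
        # Collapse whitespace so each block becomes a single clean line.
        text = ' '.join(block.get_text().split())
        if text:
            lines.append(text)
    return '\n'.join(lines)


sample = '<h2>Title</h2><p>Keep <strong>this</strong><script>alert(1)</script> text.</p>'
result = html_to_text(sample)
print(result)
# Output:
# Title
# Keep this text.
```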
Key Differences Between Methods
get_text(): Extracts all text content, removing all HTML tags
extract(): Removes the tag and returns it; the tag's text content is lost
unwrap(): Removes the tag but keeps its text content in place
decompose(): Permanently destroys the tag and its content
strings/stripped_strings: Provide fine-grained control over text extraction
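The differences are easiest to see by applying each method to the same snippet. A minimal side-by-side sketch; SNIPPET and the fresh_b helper are illustrative names:

```python
from bs4 import BeautifulSoup

SNIPPET = '<p>Keep <b>bold</b> text</p>'


def fresh_b():
    """Parse SNIPPET anew and return (soup, its <b> tag)."""
    soup = BeautifulSoup(SNIPPET, 'html.parser')
    return soup, soup.find('b')


soup, b = fresh_b()
text_only = soup.get_text()      # tags dropped, all text kept

soup, b = fresh_b()
b.extract()                      # tag and its text leave the tree
after_extract = str(soup)

soup, b = fresh_b()
b.unwrap()                       # tag removed, its text stays in place
after_unwrap = str(soup)

soup, b = fresh_b()
b.decompose()                    # tag and its text destroyed
after_decompose = str(soup)

for label, value in [('get_text', text_only),
                     ('extract', after_extract),
                     ('unwrap', after_unwrap),
                     ('decompose', after_decompose)]:
    print(f'{label:10} {value!r}')
```

Note that extract() and decompose() leave the same markup behind; they differ only in whether the removed tag remains available to the caller.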
Choose the method that best fits your specific text extraction needs.