How do I remove tags from an element while keeping its text with Beautiful Soup?

Beautiful Soup provides several methods to remove HTML tags while preserving text content. The approach you choose depends on whether you want to extract all text or selectively remove specific tags.

Method 1: Extract All Text with get_text()

The get_text() method is the most common way to extract all text content from an element, removing all HTML tags:

from bs4 import BeautifulSoup

html_content = '''
<div>
    <p>Hello, <b>World</b>!</p>
    <p>This is <em>emphasized</em> text with <a href="#">links</a>.</p>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Extract all text from the div element
div_element = soup.find('div')
text_content = div_element.get_text()
print(repr(text_content))
# Output: '\n    Hello, World!\n    This is emphasized text with links.\n'
# (the newlines and indentation from the source HTML are preserved)

# Insert a separator between every text node, including around inline tags
text_with_separator = div_element.get_text(separator=' ')
print(text_with_separator)
# The output keeps the source newlines and indentation and gains an extra
# space at each tag boundary, e.g. "World !" and "links ."

# Strip each text node and drop whitespace-only ones before joining
clean_text = div_element.get_text(separator=' ', strip=True)
print(clean_text)
# Output: Hello, World ! This is emphasized text with links .
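
Because get_text() inserts the separator between every text node, text next to inline tags can pick up unwanted spaces, for example before punctuation. One common workaround, sketched here, is to call get_text() without arguments and collapse whitespace afterwards with str.split(). The helper name normalized_text is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def normalized_text(element):
    """Return the element's text with every whitespace run collapsed to one space."""
    # get_text() with no separator keeps inline text contiguous ("World!"),
    # and split()/join() removes the newlines and indentation from the source.
    return " ".join(element.get_text().split())


html_content = '''
<div>
    <p>Hello, <b>World</b>!</p>
    <p>This is <em>emphasized</em> text with <a href="#">links</a>.</p>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')
result = normalized_text(soup.find('div'))
print(result)
# Output: Hello, World! This is emphasized text with links.
```

This relies on the source HTML having whitespace between block elements. If two paragraphs are written back to back with no whitespace between them, their text will run together.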

Method 2: Remove Specific Tags with extract()

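Use extract() to remove a tag together with everything inside it, leaving the rest of the tree intact. It returns the removed tag, so this is the method for dropping an element's text, not for keeping it (use unwrap() for that):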

from bs4 import BeautifulSoup

html_content = '''
<div>
    <p>Hello, <b>World</b> <i>and Universe</i>!</p>
    <p>Keep this <em>emphasis</em> but remove <script>alert('bad')</script> scripts.</p>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Remove all <b> tags along with their text (extract() returns the removed tag)
for b_tag in soup.find_all('b'):
    b_tag.extract()

# Remove all <script> tags completely
for script in soup.find_all('script'):
    script.decompose()  # Completely removes tag and content

print(soup.get_text(separator=' ', strip=True))
# Output: Hello, and Universe ! Keep this emphasis but remove scripts.
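
The practical difference between extract() and decompose() is what happens to the removed element. The following minimal sketch uses a shortened one-line input chosen for illustration. It shows that the tag returned by extract() is detached from the tree but still usable, while decompose() returns nothing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello, <b>World</b> <i>and Universe</i>!</p>', 'html.parser')

# extract() detaches the tag and hands it back to the caller
removed = soup.find('b').extract()
removed_text = removed.get_text()
print(removed)         # <b>World</b>
print(removed.parent)  # None (no longer attached to the tree)

after_extract = str(soup)
print(after_extract)   # <p>Hello,  <i>and Universe</i>!</p>

# decompose() destroys the tag and its contents; it returns None
returned = soup.find('i').decompose()
after_decompose = str(soup)
print(after_decompose)  # <p>Hello,  !</p>
```

In both cases the text inside the removed tag disappears from the document, which is why neither method is suitable when the goal is to keep the text.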

Method 3: Unwrap Tags with unwrap()

The unwrap() method removes a tag but keeps its contents in place:

from bs4 import BeautifulSoup

html_content = '''
<p>This is <b>bold text</b> and <i>italic text</i>.</p>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Remove <b> tags but keep the text
for b_tag in soup.find_all('b'):
    b_tag.unwrap()

print(soup)
# Output: <p>This is bold text and <i>italic text</i>.</p>

# Get final text content
print(soup.get_text())
# Output: This is bold text and italic text.
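
One detail worth knowing: after unwrap(), the text that was inside the tag remains a separate text node next to its neighbours. That does not affect get_text(), but it does matter if you iterate over .strings or .contents afterwards. The sketch below assumes Beautiful Soup 4.8.0 or later, where smooth() is available to merge adjacent text nodes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>This is <b>bold text</b> and <i>italic text</i>.</p>', 'html.parser')
p_tag = soup.p

soup.b.unwrap()
children_before = len(p_tag.contents)
print(children_before)  # 5: 'This is ', 'bold text', ' and ', <i>...</i>, '.'

# Merge adjacent text nodes left behind by unwrap()
soup.smooth()
children_after = len(p_tag.contents)
first_text = str(p_tag.contents[0])
print(children_after)   # 3
print(first_text)       # This is bold text and
```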

Method 4: Using strings and stripped_strings

For more control over text extraction, use the .strings and .stripped_strings generators, which yield the element's text nodes one at a time:

from bs4 import BeautifulSoup

html_content = '''
<div>
    <p>   Hello,   <b>World</b>!   </p>
    <p>Beautiful   <i>Soup</i>   parsing   </p>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')
div_element = soup.find('div')

# Using .strings (preserves all whitespace, including newlines and indentation)
text_with_whitespace = ''.join(div_element.strings)
print(repr(text_with_whitespace))
# Output: '\n       Hello,   World!   \n    Beautiful   Soup   parsing   \n'

# Using .stripped_strings (strips each string and skips whitespace-only ones)
clean_text = ' '.join(div_element.stripped_strings)
print(clean_text)
# Output: Hello, World ! Beautiful Soup parsing
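
Since both generators yield individual text nodes, you can inspect or filter each one before joining. As an illustrative sketch on a compact input (an assumption made for brevity), the code below lists the stripped pieces and then skips any text whose parent is an <i> tag.

```python
from bs4 import BeautifulSoup

html_content = '<div><p>Hello, <b>World</b>!</p><p>Beautiful <i>Soup</i> parsing</p></div>'
soup = BeautifulSoup(html_content, 'html.parser')

# Each text node becomes one list item
pieces = list(soup.div.stripped_strings)
print(pieces)
# Output: ['Hello,', 'World', '!', 'Beautiful', 'Soup', 'parsing']

# Keep only text nodes that are not direct children of an <i> tag
kept = [s for s in soup.div.strings if s.parent.name != 'i']
filtered_text = ''.join(kept)
print(filtered_text)
# Output: Hello, World!Beautiful  parsing
```

The filtered result keeps the original spacing on both sides of the removed text, so a whitespace clean-up step may still be needed.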

Advanced Example: Selective Tag Removal

Here's a practical example that removes unwanted tags while preserving content structure:

from bs4 import BeautifulSoup

html_content = '''
<article>
    <h2>Article Title</h2>
    <p>This paragraph has <span style="color: red;">colored text</span> 
       and <a href="http://example.com">a link</a>.</p>
    <p>Remove <script>alert('xss')</script> and <style>body{color:red}</style> 
       but keep <strong>important</strong> content.</p>
</article>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Remove unwanted tags completely
unwanted_tags = ['script', 'style', 'noscript']
for tag_name in unwanted_tags:
    for tag in soup.find_all(tag_name):
        tag.decompose()

# Unwrap formatting tags to keep text
formatting_tags = ['span', 'font']
for tag_name in formatting_tags:
    for tag in soup.find_all(tag_name):
        tag.unwrap()

# Extract clean text block by block, collapsing whitespace runs
for block in soup.find_all(['h2', 'p']):
    print(' '.join(block.get_text().split()))
# Output:
# Article Title
# This paragraph has colored text and a link.
# Remove and but keep important content.

Key Differences Between Methods

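  • get_text(): Returns all text inside the element as one string, without any tags; the tree itself is not modified
  • extract(): Removes the tag and everything inside it from the tree and returns the removed tag; its text no longer appears in the document
  • unwrap(): Removes the tag but keeps its children (text and nested tags) in place
  • decompose(): Removes the tag and its contents from the tree and destroys them; nothing is returned
  • strings / stripped_strings: Generators that yield each text node individually, for fine-grained control over text extraction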

Choose the method that best fits your specific text extraction needs.
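
To see the differences side by side, the following sketch applies each method to the same one-line input. The helper fresh_soup is hypothetical and exists only so that every method starts from an unmodified tree.

```python
from bs4 import BeautifulSoup

HTML = '<p>Hello, <b>World</b>!</p>'


def fresh_soup():
    """Parse a new copy of HTML so each method sees the same starting tree."""
    return BeautifulSoup(HTML, 'html.parser')


# get_text(): returns the text, tree unchanged
soup = fresh_soup()
text_only = soup.p.get_text()
unchanged = str(soup)
print(text_only)   # Hello, World!
print(unchanged)   # <p>Hello, <b>World</b>!</p>

# unwrap(): tag removed, text kept
soup = fresh_soup()
soup.b.unwrap()
after_unwrap = str(soup)
print(after_unwrap)     # <p>Hello, World!</p>

# extract(): tag and text removed, tag returned
soup = fresh_soup()
soup.b.extract()
after_extract = str(soup)
print(after_extract)    # <p>Hello, !</p>

# decompose(): tag and text removed and destroyed
soup = fresh_soup()
soup.b.decompose()
after_decompose = str(soup)
print(after_decompose)  # <p>Hello, !</p>
```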
