The lxml library provides several methods to strip HTML tags from documents. Here are the most common approaches for removing tags while preserving text content.
Method 1: Using drop_tag() (Recommended)
The most straightforward approach uses the drop_tag() method to remove tags while keeping their content:
from lxml import html

def strip_all_tags(html_content):
    """Remove all HTML tags while preserving text content."""
    # Parse the HTML content
    parser = html.HTMLParser(remove_comments=True)
    document = html.fromstring(html_content, parser=parser)

    # Get all elements and strip tags in reverse order to avoid issues
    elements = document.xpath('.//*')
    for element in reversed(elements):
        if element is not document:
            element.drop_tag()

    # Return clean text
    return html.tostring(document, encoding='unicode', method='text')

# Example usage
html_data = '''
<html>
<body>
<p>This is a <a href="http://example.com">link</a>.</p>
<div>And this is a <span style="color: red;">red text</span>.</div>
</body>
</html>
'''
clean_text = strip_all_tags(html_data)
print(clean_text)
# Output: This is a link.
# And this is a red text.
Method 2: Extracting Text Only
For simple text extraction without HTML structure:
from lxml import html

def extract_text_only(html_content):
    """Extract only the text content from HTML."""
    document = html.fromstring(html_content)
    return document.text_content()

# Example usage
html_data = '<p>Hello <strong>world</strong>!</p>'
text = extract_text_only(html_data)
print(text)  # Output: Hello world!
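One thing to keep in mind with text_content() is that it also returns the text inside script and style elements. The sketch below removes those subtrees with drop_tree() before extracting text; the function name extract_visible_text and the sample markup are made up for illustration.

```python
from lxml import html

def extract_visible_text(html_content):
    """Return text content after discarding <script> and <style> subtrees."""
    document = html.fromstring(html_content)
    # drop_tree() removes the element and its children but keeps its tail text
    for element in document.xpath('.//script | .//style'):
        element.drop_tree()
    return document.text_content()

# Hypothetical sample input
sample = '<div><p>Hello</p><style>p { color: red; }</style><script>alert(1)</script> world</div>'
print(extract_visible_text(sample))
```

Unlike drop_tag(), which keeps an element's content, drop_tree() discards it, which is usually what you want for code and stylesheets.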
Method 3: Removing Specific Tags
To remove only certain tags while keeping others:
from lxml import html

def strip_specific_tags(html_content, tags_to_remove):
    """Remove specific HTML tags while keeping others."""
    parser = html.HTMLParser(remove_comments=True)
    document = html.fromstring(html_content, parser=parser)

    # Find and remove specific tags; their text stays in the parent element
    for tag in tags_to_remove:
        for element in document.xpath(f'.//{tag}'):
            element.drop_tag()

    return html.tostring(document, encoding='unicode')

# Example usage
html_data = '''
<div>
<p>Keep this paragraph</p>
<span>Remove this span</span>
<a href="#">Remove this link</a>
</div>
'''
# Remove only span and anchor tags
cleaned = strip_specific_tags(html_data, ['span', 'a'])
print(cleaned)
# Output (whitespace between elements omitted): <div><p>Keep this paragraph</p>Remove this spanRemove this link</div>
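The same result can likely be reached with less code through lxml.etree.strip_tags(), which removes the named elements and merges their text and children into the parent. This is an alternative technique, not the method shown above; the function name strip_specific_tags_short is hypothetical.

```python
from lxml import etree, html

def strip_specific_tags_short(html_content, tags_to_remove):
    """Remove the named tags in one call, keeping their text and children."""
    document = html.fromstring(html_content)
    # strip_tags() takes the tag names as separate positional arguments
    etree.strip_tags(document, *tags_to_remove)
    return html.tostring(document, encoding='unicode')

print(strip_specific_tags_short(
    '<div><p>Keep</p><span>one</span><a href="#">two</a></div>',
    ['span', 'a'],
))
```

The root element itself is never removed by strip_tags(), so the enclosing div survives here just as it does with the drop_tag() loop.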
Method 4: Preserving Structure with Whitespace
For better text formatting when stripping tags:
from lxml import html
import re

def strip_tags_preserve_spacing(html_content):
    """Strip tags while preserving readable spacing."""
    # Add newlines before block elements
    block_elements = ['div', 'p', 'br', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
    for tag in block_elements:
        # \b keeps <p> from also matching <pre>; \g<0> re-emits the matched tag unchanged
        html_content = re.sub(rf'<{tag}\b[^>]*>', r'\n\g<0>', html_content, flags=re.IGNORECASE)

    # Parse and extract text
    document = html.fromstring(html_content)
    text = document.text_content()

    # Clean up extra whitespace
    text = re.sub(r'\n\s*\n', '\n\n', text)  # Collapse runs of blank lines
    text = re.sub(r'[ \t]+', ' ', text)  # Collapse runs of spaces and tabs
    return text.strip()

# Example usage
html_data = '''
<div>
<h1>Title</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
'''
formatted_text = strip_tags_preserve_spacing(html_data)
print(formatted_text)
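Editing the raw markup with regular expressions can misfire on unusual input, so it may be worth doing the same thing on the parsed tree instead. The following is a minimal sketch of that idea, assuming one line per block element is the desired layout; BLOCK_TAGS and text_with_block_breaks are illustrative names.

```python
from lxml import html

BLOCK_TAGS = ('div', 'p', 'br', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6')

def text_with_block_breaks(html_content):
    """Return text with a line break after every block-level element."""
    document = html.fromstring(html_content)
    # list() freezes the selection before any tail text is modified
    for element in list(document.iter(*BLOCK_TAGS)):
        # Tail text sits right after the element's closing tag
        element.tail = '\n' + (element.tail or '')
    lines = (line.strip() for line in document.text_content().splitlines())
    # Drop lines that are empty after stripping
    return '\n'.join(line for line in lines if line)

print(text_with_block_breaks('<div><h1>Title</h1><p>First <b>one</b></p><p>Second</p></div>'))
```

Because the newline is attached to the element in the tree, tag names that merely share a prefix, attributes containing ">", and upper-case tags need no special handling.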
Method 5: Using XPath for Complex Scenarios
For advanced tag removal with conditions:
from lxml import html

def strip_tags_with_conditions(html_content):
    """Remove tags based on specific conditions."""
    document = html.fromstring(html_content)

    # Remove all tags except links and emphasis; the root element is always kept
    for element in document.xpath('.//*[not(self::a or self::em or self::strong)]'):
        if element is not document:
            element.drop_tag()

    return html.tostring(document, encoding='unicode')

# Example usage
html_data = '''
<div>
<p>This is <em>important</em> text with a <a href="#">link</a>.</p>
<span>This span will be removed</span>
</div>
'''
result = strip_tags_with_conditions(html_data)
print(result)
# Output (whitespace between elements omitted): <div>This is <em>important</em> text with a <a href="#">link</a>.This span will be removed</div>
Best Practices
- Handle malformed HTML: Use HTMLParser, which recovers from malformed markup by default (recover=True)
- Process in reverse order: When removing multiple elements, iterate in reverse so that descendants are handled before their ancestors
- Preserve whitespace: Consider adding newlines before block elements for better readability
- Test edge cases: Handle empty elements, nested tags, and special characters
- Use appropriate method: Choose text_content() for simple text extraction, drop_tag() when the surrounding markup must be kept
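The first and fourth points can be sketched together as a defensive wrapper. This is only an illustration: the name safe_text is hypothetical, and returning an empty string for unusable input is an assumption about what the caller wants.

```python
from lxml import etree, html

def safe_text(html_content):
    """Return the text of html_content, or '' when there is nothing to parse."""
    if not html_content or not html_content.strip():
        return ''
    try:
        document = html.fromstring(html_content)
    except etree.ParserError:
        # lxml.html raises ParserError when it cannot build a tree from the input
        return ''
    return document.text_content()

print(safe_text('<p>Unclosed <b>bold'))  # the parser closes the open tags itself
print(repr(safe_text('')))
```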
Common Pitfalls
- Lost whitespace: Text from adjacent elements may run together
- Broken HTML: Malformed input is repaired rather than rejected, so the resulting tree may differ from what you expect; empty input raises a parsing error
- Memory usage: Large documents should be processed in chunks
- Encoding issues: Always specify encoding when converting back to strings; html.tostring() returns bytes unless you pass encoding='unicode'
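Two of these pitfalls are easy to reproduce. The snippet below is a small demonstration with made-up sample markup, not a fix:

```python
from lxml import html

# Lost whitespace: adjacent block elements carry no separator of their own
document = html.fromstring('<div><p>One</p><p>Two</p></div>')
joined = document.text_content()  # the two paragraphs run together

# Encoding: tostring() returns bytes unless an encoding is requested
accented = html.fromstring('<p>Caf\u00e9</p>')
as_bytes = html.tostring(accented)                     # ASCII bytes, non-ASCII as character references
as_text = html.tostring(accented, encoding='unicode')  # str, characters kept as-is

print(repr(joined))
print(type(as_bytes).__name__, type(as_text).__name__)
```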
Choose the method that best fits your specific use case and HTML structure requirements.