How do I strip tags from an HTML document using lxml?

The lxml library provides several methods to strip HTML tags from documents. Here are the most common approaches for removing tags while preserving text content.

Method 1: Using drop_tag() (Recommended)

The most straightforward approach uses the drop_tag() method to remove tags while keeping their content:

from lxml import html

def strip_all_tags(html_content):
    """Remove all HTML tags while preserving text content."""
    # Parse the HTML content
    parser = html.HTMLParser(remove_comments=True)
    document = html.fromstring(html_content, parser=parser)

    # Collect all descendant elements first, then unwrap them in reverse
    # document order so that children are unwrapped before their parents
    elements = document.xpath('.//*')
    for element in reversed(elements):
        if element is not document:
            element.drop_tag()

    # Return clean text
    return html.tostring(document, encoding='unicode', method='text')

# Example usage
html_data = '''
<html>
    <body>
        <p>This is a <a href="http://example.com">link</a>.</p>
        <div>And this is a <span style="color: red;">red text</span>.</div>
    </body>
</html>
'''

clean_text = strip_all_tags(html_data)
print(clean_text)
# Output (surrounding blank lines and indentation from the input omitted):
# This is a link.
# And this is a red text.

Method 2: Extracting Text Only

For simple text extraction without HTML structure:

from lxml import html

def extract_text_only(html_content):
    """Extract only the text content from HTML."""
    document = html.fromstring(html_content)
    return document.text_content()

# Example usage
html_data = '<p>Hello <strong>world</strong>!</p>'
text = extract_text_only(html_data)
print(text)  # Output: Hello world!
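
One caveat with this approach: text_content() returns every text node under the element, and that can include the contents of script and style elements. The following is a minimal sketch of one way to discard them first with drop_tree(); the function name extract_visible_text and the sample markup are illustrative assumptions, not part of lxml.

```python
from lxml import html

def extract_visible_text(html_content):
    """Extract text after discarding <script> and <style> elements."""
    document = html.fromstring(html_content)

    # drop_tree() removes the element together with its content
    for element in document.xpath('.//script | .//style'):
        element.drop_tree()

    return document.text_content()

# Example usage (hypothetical markup)
sample = '<div><style>p {color: red;}</style><p>Hello</p><script>var x = 1;</script></div>'
print(extract_visible_text(sample).strip())
```

Without the drop_tree() loop, the CSS rule and the JavaScript statement would most likely appear in the extracted text.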

Method 3: Removing Specific Tags

To remove only certain tags while keeping others:

from lxml import html

def strip_specific_tags(html_content, tags_to_remove):
    """Remove specific HTML tags while keeping others."""
    parser = html.HTMLParser(remove_comments=True)
    document = html.fromstring(html_content, parser=parser)

    # Find and remove specific tags
    for tag in tags_to_remove:
        for element in document.xpath(f'.//{tag}'):
            element.drop_tag()

    return html.tostring(document, encoding='unicode')

# Example usage
html_data = '''
<div>
    <p>Keep this paragraph</p>
    <span>Remove this span</span>
    <a href="#">Remove this link</a>
</div>
'''

# Remove only span and anchor tags
cleaned = strip_specific_tags(html_data, ['span', 'a'])
print(cleaned)
# Output (line breaks and indentation carried over from the input are omitted):
# <div><p>Keep this paragraph</p>Remove this spanRemove this link</div>
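
Note that drop_tag() keeps the text inside the removed tag. If the goal is to discard the content as well, drop_tree() may be the better fit. The sketch below contrasts the two on a small made-up fragment; the helper names unwrap_tags and remove_elements are hypothetical.

```python
from lxml import html

def unwrap_tags(html_content, tag):
    """Remove the tag but keep its text (drop_tag)."""
    document = html.fromstring(html_content)
    for element in document.xpath(f'.//{tag}'):
        element.drop_tag()
    return document.text_content()

def remove_elements(html_content, tag):
    """Remove the tag together with its text (drop_tree)."""
    document = html.fromstring(html_content)
    for element in document.xpath(f'.//{tag}'):
        element.drop_tree()
    return document.text_content()

# Example usage (hypothetical markup)
fragment = '<div>Price: <b>10</b> <span class="note">(approx.)</span> USD</div>'
print(unwrap_tags(fragment, 'span'))      # the note text is kept
print(remove_elements(fragment, 'span'))  # the note text is gone
```

In both cases the text that follows the element (its tail, here " USD") stays in the document.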

Method 4: Preserving Structure with Whitespace

For better text formatting when stripping tags:

from lxml import html
import re

def strip_tags_preserve_spacing(html_content):
    """Strip tags while preserving readable spacing."""
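    # Add a newline before block-level opening tags. The \b keeps '<p' from
    # also matching '<pre>', and \g<0> re-inserts the matched tag unchanged.
    block_elements = ['div', 'p', 'br', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']

    for tag in block_elements:
        html_content = re.sub(rf'<{tag}\b[^>]*>', r'\n\g<0>', html_content, flags=re.IGNORECASE)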

    # Parse and extract text
    document = html.fromstring(html_content)
    text = document.text_content()

    # Clean up extra whitespace
    text = re.sub(r'\n\s*\n', '\n\n', text)  # Replace multiple newlines
    text = re.sub(r'[ \t]+', ' ', text)      # Replace multiple spaces

    return text.strip()

# Example usage
html_data = '''
<div>
    <h1>Title</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
</div>
'''

formatted_text = strip_tags_preserve_spacing(html_data)
print(formatted_text)
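
As an alternative to editing the raw HTML string with regular expressions, the line breaks can be added on the parsed tree instead. This is only a sketch under the assumption that one line per block element is the desired layout; the BLOCK_TAGS set and the function name text_with_line_breaks are illustrative choices.

```python
from lxml import html

# Assumed set of tags that should end a line; adjust as needed
BLOCK_TAGS = {'div', 'p', 'br', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

def text_with_line_breaks(html_content):
    """Extract text with one line per block-level element."""
    document = html.fromstring(html_content)

    # Prepend a newline to the text that follows each block element
    for element in list(document.iter(*BLOCK_TAGS)):
        element.tail = '\n' + (element.tail or '')

    # Trim each line and drop the empty ones
    lines = (line.strip() for line in document.text_content().splitlines())
    return '\n'.join(line for line in lines if line)

# Example usage (hypothetical markup)
sample = '<div><h1>Title</h1><p>First paragraph</p><p>Second<br>line</p></div>'
print(text_with_line_breaks(sample))
```

Working on the tree avoids accidental matches inside attribute values or comments, at the cost of one extra pass over the elements.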

Method 5: Using XPath for Complex Scenarios

For advanced tag removal with conditions:

from lxml import html

def strip_tags_with_conditions(html_content):
    """Remove tags based on specific conditions."""
    document = html.fromstring(html_content)

    # Remove all tags except links and emphasis
    for element in document.xpath('.//*[not(self::a or self::em or self::strong)]'):
        if element is not document:
            element.drop_tag()

    return html.tostring(document, encoding='unicode')

# Example usage
html_data = '''
<div>
    <p>This is <em>important</em> text with a <a href="#">link</a>.</p>
    <span>This span will be removed</span>
</div>
'''

result = strip_tags_with_conditions(html_data)
print(result)
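# Output (the root <div> is kept because './/*' never matches the root;
# line breaks and indentation carried over from the input are omitted):
# <div>This is <em>important</em> text with a <a href="#">link</a>. This span will be removed</div>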

Best Practices

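  1. Handle malformed HTML: lxml's HTMLParser recovers from most malformed markup by default (recover=True), but check the result for input you do not control
  2. Process in reverse order: When unwrapping multiple elements, collect them first and iterate in reverse document order so that children are handled before their parents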
  3. Preserve whitespace: Consider adding newlines before block elements for better readability
  4. Test edge cases: Handle empty elements, nested tags, and special characters
  5. Use appropriate method: Choose text_content() for simple text extraction, drop_tag() for preserving structure
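
As a rough illustration of the first and fourth points, the sketch below guards against empty input, which lxml rejects with a ParserError, and lets the parser recover from unclosed tags. The function name safe_text and the choice to return an empty string on failure are assumptions, not an lxml convention.

```python
from lxml import etree, html

def safe_text(html_content):
    """Return the text of an HTML string, or '' if it cannot be parsed."""
    # Empty or whitespace-only input has no document to parse
    if not html_content or not html_content.strip():
        return ''
    try:
        document = html.fromstring(html_content)
    except etree.ParserError:
        return ''
    return document.text_content()

# Example usage
print(repr(safe_text('')))                     # empty input
print(repr(safe_text('<p>Unclosed <b>bold')))  # unclosed tags are recovered
```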

Common Pitfalls

  • Lost whitespace: Text from adjacent elements may run together
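  • Broken HTML: Most malformed markup is recovered silently, so the resulting tree may differ from what you expect; empty input raises a parser error
  • Memory usage: lxml builds the whole tree in memory, so for very large documents consider incremental parsing instead of loading everything at once
  • Encoding issues: html.tostring() returns bytes unless you pass an encoding; use encoding='unicode' to get a str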

Choose the method that best fits your specific use case and HTML structure requirements.
