How do I extract text content while preserving line breaks?

When scraping web content, preserving the original line breaks keeps the extracted text readable and structured. Simple HTML DOM Parser gives you access to an element's markup and text, and with a little post-processing you can extract plain text with the line breaks intact. The right approach depends on how the HTML is structured and what output format you need.

Understanding HTML Line Breaks

HTML represents line breaks in different ways:

- <br> tags for single line breaks
- <p> tags for paragraph breaks
- Block-level elements like <div> and <h1>-<h6> that naturally create line breaks
- Whitespace characters that may be collapsed by browsers
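The last point is easy to overlook: a newline in the HTML source is not a line break in the rendered page. A minimal sketch, using only built-in PHP functions, of normalizing source whitespace roughly the way a browser does before any tags are converted. The helper name `collapseSourceWhitespace` is hypothetical, and the sketch deliberately ignores `<pre>`, where whitespace is significant.

```php
<?php
// Hypothetical helper: collapse runs of source whitespace (spaces, tabs,
// newlines) into single spaces, roughly as a browser does for normal text.
// Assumption: the input contains no <pre> blocks that must keep their layout.
function collapseSourceWhitespace(string $html): string {
    return trim(preg_replace('/\s+/', ' ', $html));
}

$source = "<p>One\n   sentence split\n   across lines.</p>";
echo collapseSourceWhitespace($source), "\n";
// Prints: <p>One sentence split across lines.</p>
```

Running this step first means that every newline left in the final text was produced by a tag conversion, not inherited from the indentation of the markup.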

Basic Text Extraction with Simple HTML DOM

Method 1: Using innertext Property

The innertext property returns an element's inner HTML. Tags such as <br> and <p> survive, so the line-break information is still available for later conversion:

<?php
require_once 'simple_html_dom.php';

$html = '<div>
    <p>First paragraph with text.</p>
    <p>Second paragraph<br>with a line break.</p>
    <div>Third section in a div.</div>
</div>';

$dom = str_get_html($html);
$content = $dom->find('div', 0)->innertext;

// This preserves HTML tags, including <br> and <p>
echo $content;
?>

Method 2: Converting HTML to Text with Line Breaks

To extract plain text while preserving line breaks, you need to convert HTML elements to text equivalents:

<?php
function htmlToTextWithLineBreaks($html) {
    // Replace <br> tags (any case, with or without attributes) with newlines
    $html = preg_replace('/<br\b[^>]*>/i', "\n", $html);

    // Paragraph endings become a blank line, div endings a single newline
    $html = preg_replace('/<\/p>/i', "\n\n", $html);
    $html = preg_replace('/<\/div>/i', "\n", $html);

    // Heading endings become a blank line
    $html = preg_replace('/<\/h[1-6]>/i', "\n\n", $html);

    // Remove all remaining HTML tags
    $text = strip_tags($html);

    // Drop spaces and tabs around newlines (indentation left over from the source)
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);

    // Clean up multiple consecutive newlines
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    // Trim whitespace
    return trim($text);
}

$html = '<div>
    <h2>Article Title</h2>
    <p>First paragraph of content.</p>
    <p>Second paragraph with<br>a line break in the middle.</p>
    <div>Additional content in a div.</div>
</div>';

$dom = str_get_html($html);
$element = $dom->find('div', 0);
$text = htmlToTextWithLineBreaks($element->innertext);

echo $text;
/*
Output:
Article Title

First paragraph of content.

Second paragraph with
a line break in the middle.

Additional content in a div.
*/
?>

Advanced Text Extraction Techniques

Preserving Specific Formatting Elements

Sometimes you want to preserve certain formatting while converting others:

<?php
function extractFormattedText($element) {
    $html = $element->innertext;

    // Replace different HTML elements with appropriate text formatting.
    // \b keeps <b> from matching <br> or <blockquote>, and <i> from matching <img>.
    $replacements = [
        '/<br\b[^>]*>/i' => "\n",
        '/<\/p>/i' => "\n\n",
        '/<\/div>/i' => "\n",
        '/<\/h[1-6]>/i' => "\n\n",
        '/<\/li>/i' => "\n",
        '/<ul\b[^>]*>/i' => "\n",
        '/<\/ul>/i' => "\n",
        '/<ol\b[^>]*>/i' => "\n",
        '/<\/ol>/i' => "\n",
        '/<strong\b[^>]*>(.*?)<\/strong>/is' => '**$1**',
        '/<b\b[^>]*>(.*?)<\/b>/is' => '**$1**',
        '/<em\b[^>]*>(.*?)<\/em>/is' => '*$1*',
        '/<i\b[^>]*>(.*?)<\/i>/is' => '*$1*',
    ];

    foreach ($replacements as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }

    // Remove remaining HTML tags
    $text = strip_tags($html);

    // Clean up whitespace. Only spaces and tabs are stripped around newlines;
    // matching \s here would also swallow the blank lines between paragraphs.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}

$html = '<article>
    <h1>Main Title</h1>
    <p>This is <strong>important text</strong> with <em>emphasis</em>.</p>
    <ul>
        <li>First list item</li>
        <li>Second list item</li>
    </ul>
    <p>Final paragraph with<br>a line break.</p>
</article>';

$dom = str_get_html($html);
$article = $dom->find('article', 0);
$formattedText = extractFormattedText($article);

echo $formattedText;
/*
Output:
Main Title

This is **important text** with *emphasis*.

First list item
Second list item

Final paragraph with
a line break.
*/
?>
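
The replacement table above turns every list item into a plain line, so ordered lists lose their numbering. A minimal sketch of one way to keep it, using only built-in PHP functions. `numberOrderedLists` is a hypothetical helper, and it assumes `<ol>` elements are not nested.

```php
<?php
// Hypothetical helper: prefix the items of each <ol> with "1. ", "2. ", ...
// Assumption: ordered lists are not nested inside one another.
function numberOrderedLists(string $html): string {
    return preg_replace_callback('/<ol\b[^>]*>(.*?)<\/ol>/is', function ($m) {
        $n = 0;
        // Replace each opening <li> with the next number in this list
        $items = preg_replace_callback('/<li\b[^>]*>/i', function () use (&$n) {
            return ++$n . '. ';
        }, $m[1]);
        // Each closing </li> ends the line
        return "\n" . preg_replace('/<\/li>/i', "\n", $items);
    }, $html);
}

$listHtml = '<ol><li>Parse</li><li>Convert</li><li>Clean</li></ol>';
echo trim(strip_tags(numberOrderedLists($listHtml))), "\n";
// Prints:
// 1. Parse
// 2. Convert
// 3. Clean
```

If you use something like this, run it on the HTML before the generic list replacements, since those remove the `<ol>` and `<li>` tags it relies on.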

Working with Multiple Elements

To extract text from several matching elements while keeping each one visibly separate, join the results with a delimiter:

<?php
function extractMultipleElementsText($dom, $selector) {
    $elements = $dom->find($selector);
    $textContent = [];

    foreach ($elements as $element) {
        $text = htmlToTextWithLineBreaks($element->innertext);
        // The text is already trimmed; empty() would also discard the string "0"
        if ($text !== '') {
            $textContent[] = $text;
        }
    }

    return implode("\n\n---\n\n", $textContent);
}

$html = '<div class="content">
    <div class="section">
        <h3>Section 1</h3>
        <p>Content for section one.</p>
    </div>
    <div class="section">
        <h3>Section 2</h3>
        <p>Content for section two<br>with a line break.</p>
    </div>
</div>';

$dom = str_get_html($html);
$sectionsText = extractMultipleElementsText($dom, '.section');
echo $sectionsText;
?>

Alternative Approaches with Other Tools

Using JavaScript with Puppeteer

For JavaScript-heavy pages, using Puppeteer for dynamic content extraction can be more effective:

const puppeteer = require('puppeteer');

async function extractTextWithLineBreaks(url, selector) {
    const browser = await puppeteer.launch();

    try {
        const page = await browser.newPage();

        // Wait for network activity to settle so script-rendered content is present
        await page.goto(url, { waitUntil: 'networkidle2' });

        return await page.evaluate((sel) => {
            const element = document.querySelector(sel);
            if (!element) return '';

            // Collapse source whitespace first: newlines in the markup are
            // not rendered line breaks
            let text = element.innerHTML.replace(/\s+/g, ' ');

            // Replace HTML elements with text equivalents
            text = text.replace(/<br\b[^>]*>/gi, '\n');
            text = text.replace(/<\/p>/gi, '\n\n');
            text = text.replace(/<\/div>/gi, '\n');
            text = text.replace(/<\/h[1-6]>/gi, '\n');

            // Remove remaining HTML tags (entities such as &amp; stay encoded)
            text = text.replace(/<[^>]+>/g, '');

            // Clean up whitespace without touching the newlines inserted above
            text = text.replace(/[ \t]+/g, ' ');
            text = text.replace(/ ?\n ?/g, '\n');
            text = text.replace(/\n{3,}/g, '\n\n');

            return text.trim();
        }, selector);
    } finally {
        // Close the browser even if navigation or evaluation throws
        await browser.close();
    }
}

// Usage
extractTextWithLineBreaks('https://example.com', '.article-content')
    .then(text => console.log(text));

Using Python with BeautifulSoup

from bs4 import BeautifulSoup
import re

def extract_text_with_breaks(html, selector=None):
    soup = BeautifulSoup(html, 'html.parser')

    if selector:
        element = soup.select_one(selector)
        if not element:
            return ""
    else:
        element = soup

    # Replace line break elements with actual newlines
    for br in element.find_all('br'):
        br.replace_with('\n')

    # Add newlines after block elements
    for tag in element.find_all(['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        tag.append('\n')

    # Extract text
    text = element.get_text()

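    # Clean up whitespace. Strip only spaces and tabs around newlines;
    # matching \s after \n would also remove the blank lines between paragraphs.
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r' ?\n ?', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text)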

    return text.strip()

# Usage
html = '''
<div class="content">
    <h2>Title</h2>
    <p>First paragraph.</p>
    <p>Second paragraph with<br>line break.</p>
</div>
'''

text = extract_text_with_breaks(html, '.content')
print(text)

Working with Simple HTML DOM's Built-in Methods

Using plaintext vs innertext Properties

Simple HTML DOM provides different properties for text extraction:

<?php
$html = '<div>
    <p>First paragraph.</p>
    <p>Second paragraph<br>with line break.</p>
</div>';

$dom = str_get_html($html);
$element = $dom->find('div', 0);

// plaintext strips the tags. How <br> is rendered depends on the library
// version and on the $defaultBRText argument of str_get_html(), so treat the
// output below as approximate and do not rely on it for line breaks
echo "Plaintext: " . $element->plaintext . "\n";
// Output (approximately): "First paragraph. Second paragraph with line break."

// innertext preserves HTML structure
echo "Innertext: " . $element->innertext . "\n";
// Output (whitespace condensed): "<p>First paragraph.</p> <p>Second paragraph<br>with line break.</p>"

// Custom method to preserve line breaks as text
function extractWithBreaks($element) {
    $html = $element->innertext;
    $html = preg_replace('/<br\b[^>]*>/i', "\n", $html);
    $html = preg_replace('/<\/p>/i', "\n", $html);
    $text = strip_tags($html);
    // Drop indentation left over from the source around each newline
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    return trim($text);
}

echo "With breaks: " . extractWithBreaks($element) . "\n";
// Output: "First paragraph.\nSecond paragraph\nwith line break."
?>

Handling Nested Structures

For complex nested HTML, you can map each element to a Markdown-like text equivalent:

<?php
function extractNestedText($element, $preserveStructure = true) {
    $html = $element->innertext;

    if ($preserveStructure) {
        // Turn <h1>..<h6> into Markdown-style headings: one '#' per level
        $html = preg_replace_callback('/<h([1-6])\b[^>]*>/i', function ($m) {
            return str_repeat('#', (int) $m[1]) . ' ';
        }, $html);

        $replacements = [
            '/<\/h[1-6]>/i' => "\n\n",
            '/<p\b[^>]*>/i' => '',
            '/<\/p>/i' => "\n\n",
            '/<br\b[^>]*>/i' => "\n",
            '/<li\b[^>]*>/i' => '• ',
            '/<\/li>/i' => "\n",
            '/<\/?[ou]l\b[^>]*>/i' => "\n",
        ];
    } else {
        // Simple conversion
        $replacements = [
            '/<br\b[^>]*>/i' => "\n",
            '/<\/p>/i' => "\n\n",
            '/<\/div>/i' => "\n",
            '/<\/h[1-6]>/i' => "\n\n",
            '/<\/li>/i' => "\n",
        ];
    }

    foreach ($replacements as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }

    if ($preserveStructure) {
        // Prefix every non-empty line inside a blockquote with "> "
        $html = preg_replace_callback('/<blockquote\b[^>]*>(.*?)<\/blockquote>/is', function ($m) {
            $quoted = preg_replace('/^[ \t]*(?=\S)/m', '> ', trim($m[1]));
            return "\n" . $quoted . "\n\n";
        }, $html);
    }

    $text = strip_tags($html);
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}

$html = '<article>
    <h1>Main Article</h1>
    <p>Introduction paragraph.</p>
    <h2>Subsection</h2>
    <ul>
        <li>First item</li>
        <li>Second item</li>
    </ul>
    <blockquote>
        <p>This is a quoted text<br>with a line break.</p>
    </blockquote>
</article>';

$dom = str_get_html($html);
$article = $dom->find('article', 0);
$structuredText = extractNestedText($article, true);

echo $structuredText;
/*
Output:
# Main Article

Introduction paragraph.

## Subsection

• First item
• Second item

> This is a quoted text
> with a line break.
*/
?>

Best Practices and Considerations

1. Handle Different HTML Structures

Different websites structure their HTML differently. Always inspect the source to understand:

- How line breaks are represented (<br>, <p>, block elements)
- Whether content is dynamically loaded
- If special formatting needs to be preserved
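
As a rough aid for the first of these checks, the sketch below counts the markup that usually signals a line break in a fragment. `countBreakMarkers` is a hypothetical diagnostic built from built-in PHP functions only. It cannot tell you whether content is loaded dynamically, and the list of block tags is an assumption you may need to extend.

```php
<?php
// Hypothetical diagnostic: count the tags that usually signal a line break.
function countBreakMarkers(string $html): array {
    return [
        'br'    => preg_match_all('/<br\b[^>]*>/i', $html),
        'p'     => preg_match_all('/<p\b[^>]*>/i', $html),
        'block' => preg_match_all('/<(?:div|h[1-6]|li|blockquote)\b[^>]*>/i', $html),
        'pre'   => preg_match_all('/<pre\b[^>]*>/i', $html),
    ];
}

$sample = '<div><p>Line one<br>Line two</p><p>Next</p></div>';
print_r(countBreakMarkers($sample));
// Counts for this sample: br = 1, p = 2, block = 1, pre = 0
```

A page that shows visible line breaks but reports zero for every counter is probably relying on CSS or JavaScript for its layout, which regex-based conversion will not capture.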

2. Memory Management

When processing large documents:

<?php
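// Parse once, then extract section by section and release nodes as you go
function processLargeDocument($html) {
    $dom = str_get_html($html);

    // str_get_html() returns false for empty input or input above MAX_FILE_SIZE
    if (!$dom) {
        return '';
    }

    // Process sections individually to manage memory
    $sections = $dom->find('.section');
    $results = [];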

    foreach ($sections as $section) {
        $text = htmlToTextWithLineBreaks($section->innertext);
        $results[] = $text;

        // Clear section from memory
        $section->clear();
    }

    // Clear the entire DOM
    $dom->clear();

    return implode("\n\n", $results);
}
?>

3. Character Encoding

Ensure proper character encoding handling:

<?php
function extractTextWithEncoding($html, $encoding = 'UTF-8') {
    // Convert to UTF-8 if needed
    if ($encoding !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $encoding);
    }

    $dom = str_get_html($html);

    // str_get_html() returns false when the input cannot be loaded
    if (!$dom) {
        return '';
    }

    $text = htmlToTextWithLineBreaks($dom->innertext);
    $dom->clear();

    return $text;
}
?>
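
The function above has to be told the source encoding. When it is unknown, one possible fallback is to check whether the bytes are already valid UTF-8 and convert only if they are not. This is a sketch under stated assumptions: `ensureUtf8` is a hypothetical helper, the `ISO-8859-1` default is a guess, and it needs the mbstring extension. Prefer the charset declared in the HTTP header or the page's meta tag when you have it.

```php
<?php
// Hypothetical helper: return $html as UTF-8. If the bytes are not valid
// UTF-8, assume they are in $fallback (a guess, not a detection).
function ensureUtf8(string $html, string $fallback = 'ISO-8859-1'): string {
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";              // "café" encoded as ISO-8859-1
echo ensureUtf8($latin1), "\n";   // Prints: café
```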

Troubleshooting Common Issues

Issue 1: Missing Line Breaks

If line breaks aren't preserved:

- Check if the original HTML uses <br> tags or CSS for line breaks
- Verify that your replacement patterns match the HTML structure
- Consider using browser automation tools for CSS-rendered content
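
A frequent cause of the second problem is a `<br>` pattern that is too narrow. The sketch below compares a strict pattern with a more tolerant one on a made-up input. Both variable names and the sample string are illustrative only.

```php
<?php
// Narrow pattern: case-sensitive and with no room for attributes
$narrow = '/<br\s*\/?>/';
// More tolerant pattern: any case, optional attributes, optional slash
$tolerant = '/<br\b[^>]*>/i';

$html = 'one<BR>two<br class="gap" />three<br/>four';

// The narrow pattern only catches the last <br/>, so three words run together
$narrowText = strip_tags(preg_replace($narrow, "\n", $html));
// The tolerant pattern converts all three variants
$tolerantText = strip_tags(preg_replace($tolerant, "\n", $html));

echo $narrowText, "\n---\n", $tolerantText, "\n";
// Prints:
// onetwothree
// four
// ---
// one
// two
// three
// four
```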

Issue 2: Extra Whitespace

Clean up excessive whitespace:

<?php
function cleanWhitespace($text) {
    // Remove trailing spaces from lines
    $text = preg_replace('/[ \t]+$/m', '', $text);

    // Remove leading spaces from lines (note: this also removes intentional indentation)
    $text = preg_replace('/^[ \t]+/m', '', $text);

    // Normalize multiple spaces to single spaces
    $text = preg_replace('/[ \t]+/', ' ', $text);

    // Limit consecutive newlines
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
?>

Issue 3: Special Characters

Handle HTML entities and special characters:

<?php
function decodeHtmlEntities($text) {
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$text = htmlToTextWithLineBreaks($element->innertext);
$text = decodeHtmlEntities($text);
?>
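
One special character deserves a closer look: `&nbsp;` decodes to a non-breaking space (U+00A0), which looks like a space but is not matched by patterns such as `[ \t]+`. A minimal sketch of decoding and then normalizing it follows. `decodeAndNormalizeSpaces` is a hypothetical helper, and replacing non-breaking spaces is an assumption about what your output needs.

```php
<?php
// Hypothetical helper: decode entities, then turn non-breaking spaces
// (U+00A0, the character &nbsp; decodes to) into ordinary spaces.
function decodeAndNormalizeSpaces(string $text): string {
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return str_replace("\u{00A0}", ' ', $text);
}

echo decodeAndNormalizeSpaces('Fish&nbsp;&amp;&nbsp;chips &copy; 2024'), "\n";
// Prints: Fish & chips © 2024
```

Decode after the tags have been stripped, as in the example above. Decoding first would turn `&lt;b&gt;` in the text into a real-looking tag that strip_tags() then removes.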

Performance Optimization

Batch Processing

For better performance when processing multiple elements:

<?php
function batchExtractText($dom, $selectors) {
    $results = [];

    foreach ($selectors as $selector) {
        $elements = $dom->find($selector);
        foreach ($elements as $element) {
            $text = htmlToTextWithLineBreaks($element->innertext);
            // The text is already trimmed; empty() would also discard the string "0"
            if ($text !== '') {
                $results[$selector][] = $text;
            }
        }
    }

    return $results;
}

// Usage
$selectors = ['.article-content', '.sidebar-content', '.footer-content'];
$extractedTexts = batchExtractText($dom, $selectors);
?>

Regular Expression Optimization

For better performance with large documents:

<?php
function optimizedTextExtraction($html) {
    // Pass pattern and replacement arrays to a single preg_replace() call;
    // the patterns are applied in order
    $patterns = [
        '/<br\b[^>]*>/i',
        '/<\/p>/i',
        '/<\/div>/i',
        '/<\/h[1-6]>/i',
    ];

    $replacements = [
        "\n",
        "\n\n",
        "\n",
        "\n\n",
    ];

    $html = preg_replace($patterns, $replacements, $html);
    $text = strip_tags($html);

    // Single cleanup pass
    return preg_replace(['/\n{3,}/', '/[ \t]+/'], ["\n\n", ' '], trim($text));
}
?>

Conclusion

Extracting text content while preserving line breaks requires understanding both the HTML structure and the desired output format. Simple HTML DOM provides the foundation, but combining it with proper text processing techniques ensures you maintain the original content's readability and structure. For dynamic content, consider using headless browser solutions that can render JavaScript-generated content before extraction.

Remember to always test your extraction methods with different HTML structures and consider edge cases like nested elements, mixed content types, and special formatting requirements specific to your use case. The key is to balance preservation of formatting with clean, readable output that serves your specific needs.
