How do I extract text content while preserving line breaks?
When scraping web content, preserving the original line breaks is often necessary to keep the extracted text readable and structured. Simple HTML DOM Parser exposes an element's content through properties such as innertext and plaintext; to get plain text with the line breaks intact, you typically convert break-producing tags into newline characters yourself. The right approach depends on how the HTML is structured and what output format you need.
Understanding HTML Line Breaks
HTML represents line breaks in different ways:
- <br> tags for single line breaks
- <p> tags for paragraph breaks
- Block-level elements like <div> and <h1>-<h6> that naturally create line breaks
- Whitespace characters that may be collapsed by browsers
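The last point matters for extraction: a newline in the HTML source is not a line break on the rendered page. As a minimal sketch (written in Python with only the standard library; the function name collapse_source_whitespace is illustrative and not part of any library), source whitespace can be normalized roughly the way a browser treats normal-flow text before any tag-based newlines are added:

```python
import re


def collapse_source_whitespace(html: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into a single space.

    This approximates how browsers treat whitespace in normal text flow.
    Assumption of this sketch: <pre> blocks and CSS white-space rules
    are ignored.
    """
    return re.sub(r"[ \t\r\n\f]+", " ", html).strip()


source = "<p>One\n   sentence,\n   wrapped in the source.</p>"
print(collapse_source_whitespace(source))
# <p>One sentence, wrapped in the source.</p>
```

After this step, any newline in the final text comes from a tag that was deliberately converted, not from source indentation.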
Basic Text Extraction with Simple HTML DOM
Method 1: Using innertext Property
The innertext property in Simple HTML DOM returns the element's inner HTML, so tags such as <br> and <p> are kept as markup rather than converted into text line breaks:
<?php
require_once 'simple_html_dom.php';
$html = '<div>
<p>First paragraph with text.</p>
<p>Second paragraph<br>with a line break.</p>
<div>Third section in a div.</div>
</div>';
$dom = str_get_html($html);
$content = $dom->find('div', 0)->innertext;
// This preserves HTML tags, including <br> and <p>
echo $content;
?>
Method 2: Converting HTML to Text with Line Breaks
To extract plain text while preserving line breaks, you need to convert HTML elements to text equivalents:
<?php
function htmlToTextWithLineBreaks($html) {
    // Replace <br> tags (any case, with or without attributes) with newlines
    $html = preg_replace('/<br\b[^>]*>/i', "\n", $html);
    // Replace paragraph endings with double newlines and div endings with single newlines
    $html = preg_replace('/<\/p>/i', "\n\n", $html);
    $html = preg_replace('/<\/div>/i', "\n", $html);
    // Replace heading endings with newlines
    $html = preg_replace('/<\/h[1-6]>/i', "\n", $html);
    // Remove all remaining HTML tags
    $text = strip_tags($html);
    // Drop spaces and tabs around newlines (left over from source indentation)
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    // Clean up multiple consecutive newlines
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    // Trim whitespace
    return trim($text);
}
$html = '<div>
<h2>Article Title</h2>
<p>First paragraph of content.</p>
<p>Second paragraph with<br>a line break in the middle.</p>
<div>Additional content in a div.</div>
</div>';
$dom = str_get_html($html);
$element = $dom->find('div', 0);
$text = htmlToTextWithLineBreaks($element->innertext);
echo $text;
/*
Output:
Article Title
First paragraph of content.
Second paragraph with
a line break in the middle.
Additional content in a div.
*/
?>
Advanced Text Extraction Techniques
Preserving Specific Formatting Elements
Sometimes you want to preserve certain formatting while converting others:
<?php
function extractFormattedText($element) {
    $html = $element->innertext;
    // Replace different HTML elements with appropriate text formatting.
    // \b keeps the <b>, <i>, and <em> patterns from matching tags such as
    // <blockquote>, <img>, or <embed>; the s flag lets (.*?) span newlines
    // inserted by the earlier replacements.
    $replacements = [
        '/<br\b[^>]*>/i' => "\n",
        '/<\/p>/i' => "\n\n",
        '/<\/div>/i' => "\n",
        '/<\/h[1-6]>/i' => "\n",
        '/<\/li>/i' => "\n",
        '/<ul\b[^>]*>/i' => "\n",
        '/<\/ul>/i' => "\n",
        '/<ol\b[^>]*>/i' => "\n",
        '/<\/ol>/i' => "\n",
        '/<strong\b[^>]*>(.*?)<\/strong>/is' => '**$1**',
        '/<b\b[^>]*>(.*?)<\/b>/is' => '**$1**',
        '/<em\b[^>]*>(.*?)<\/em>/is' => '*$1*',
        '/<i\b[^>]*>(.*?)<\/i>/is' => '*$1*',
    ];
    foreach ($replacements as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }
    // Remove remaining HTML tags
    $text = strip_tags($html);
    // Clean up whitespace
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Note: \s also matches newlines, so this step collapses blank lines as well
    $text = preg_replace('/\n\s+/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}
$html = '<article>
<h1>Main Title</h1>
<p>This is <strong>important text</strong> with <em>emphasis</em>.</p>
<ul>
<li>First list item</li>
<li>Second list item</li>
</ul>
<p>Final paragraph with<br>a line break.</p>
</article>';
$dom = str_get_html($html);
$article = $dom->find('article', 0);
$formattedText = extractFormattedText($article);
echo $formattedText;
/*
Output:
Main Title
This is **important text** with *emphasis*.
First list item
Second list item
Final paragraph with
a line break.
*/
?>
Working with Multiple Elements
When extracting text from multiple elements while preserving structure:
<?php
function extractMultipleElementsText($dom, $selector) {
$elements = $dom->find($selector);
$textContent = [];
foreach ($elements as $element) {
$text = htmlToTextWithLineBreaks($element->innertext);
if (trim($text) !== '') { // empty() would also discard a section whose text is "0"
$textContent[] = $text;
}
}
return implode("\n\n---\n\n", $textContent);
}
$html = '<div class="content">
<div class="section">
<h3>Section 1</h3>
<p>Content for section one.</p>
</div>
<div class="section">
<h3>Section 2</h3>
<p>Content for section two<br>with a line break.</p>
</div>
</div>';
$dom = str_get_html($html);
$sectionsText = extractMultipleElementsText($dom, '.section');
echo $sectionsText;
?>
Alternative Approaches with Other Tools
Using JavaScript with Puppeteer
For JavaScript-heavy pages, using Puppeteer for dynamic content extraction can be more effective:
const puppeteer = require('puppeteer');
async function extractTextWithLineBreaks(url, selector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.evaluate((sel) => {
      const element = document.querySelector(sel);
      if (!element) return '';
      let text = element.innerHTML;
      // Collapse source whitespace first so that only tag-derived breaks survive
      text = text.replace(/\s+/g, ' ');
      // Replace HTML elements with text equivalents
      text = text.replace(/<br\b[^>]*>/gi, '\n');
      text = text.replace(/<\/p>/gi, '\n\n');
      text = text.replace(/<\/div>/gi, '\n');
      text = text.replace(/<\/h[1-6]>/gi, '\n');
      // Remove remaining HTML tags
      text = text.replace(/<[^>]+>/g, '');
      // Trim spaces around the inserted newlines, then limit blank lines
      text = text.replace(/[ \t]*\n[ \t]*/g, '\n');
      text = text.replace(/\n{3,}/g, '\n\n');
      return text.trim();
    }, selector);
  } finally {
    // Close the browser even if navigation or evaluation fails
    await browser.close();
  }
}
// Usage
extractTextWithLineBreaks('https://example.com', '.article-content')
  .then(text => console.log(text))
  .catch(err => console.error(err));
Using Python with BeautifulSoup
from bs4 import BeautifulSoup
import re
def extract_text_with_breaks(html, selector=None):
    soup = BeautifulSoup(html, 'html.parser')
    if selector:
        element = soup.select_one(selector)
        if not element:
            return ""
    else:
        element = soup
    # Replace line break elements with actual newlines
    for br in element.find_all('br'):
        br.replace_with('\n')
    # Add newlines after block elements
    for tag in element.find_all(['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        tag.append('\n')
    # Extract text
    text = element.get_text()
    # Clean up whitespace
    text = re.sub(r'\n\s+', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()
# Usage
html = '''
<div class="content">
<h2>Title</h2>
<p>First paragraph.</p>
<p>Second paragraph with<br>line break.</p>
</div>
'''
text = extract_text_with_breaks(html, '.content')
print(text)
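If installing BeautifulSoup is not an option, the same idea can be sketched with Python's built-in html.parser module. This is a minimal illustration, not a drop-in replacement: the class name LineBreakExtractor, the helper text_with_breaks, and the BLOCK_TAGS set are assumptions of this sketch, and it emits a single newline per break rather than blank lines between paragraphs.

```python
import re
from html.parser import HTMLParser

# Tags whose closing tag should end the current line (an illustrative choice)
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class LineBreakExtractor(HTMLParser):
    """Collect text, emitting a newline for <br> and at the end of block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <br/> via handle_startendtag
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Source whitespace, including newlines, collapses to one space
        self.parts.append(re.sub(r"\s+", " ", data))


def text_with_breaks(html: str) -> str:
    parser = LineBreakExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{2,}", "\n", text)  # one newline per break
    return text.strip()


print(text_with_breaks("<p>A<br/>B</p><p>C &amp; D</p>"))
# A
# B
# C & D
```

Because the parser reports tags as events, this avoids regular expressions for tag matching, at the cost of a little more code.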
Working with Simple HTML DOM's Built-in Methods
Using plaintext vs innertext Properties
Simple HTML DOM provides different properties for text extraction:
<?php
$html = '<div>
<p>First paragraph.</p>
<p>Second paragraph<br>with line break.</p>
</div>';
$dom = str_get_html($html);
$element = $dom->find('div', 0);
// plaintext removes the HTML tags and returns only the text
echo "Plaintext: " . $element->plaintext . "\n";
// Typical output: "First paragraph. Second paragraph with line break."
// (exact whitespace, including how <br> is rendered, may vary between library versions)
// innertext preserves HTML structure
echo "Innertext: " . $element->innertext . "\n";
// Output: "<p>First paragraph.</p> <p>Second paragraph<br>with line break.</p>"
// Custom method to preserve line breaks as text
function extractWithBreaks($element) {
$html = $element->innertext;
$html = preg_replace('/<br\b[^>]*>/i', "\n", $html);
$html = preg_replace('/<\/p>/i', "\n", $html);
return trim(strip_tags($html));
}
echo "With breaks: " . extractWithBreaks($element) . "\n";
// Output: "First paragraph.\nSecond paragraph\nwith line break."
?>
Handling Nested Structures
When dealing with complex nested HTML:
<?php
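function extractNestedText($element, $preserveStructure = true) {
    $html = $element->innertext;
    if ($preserveStructure) {
        // Blockquotes: prefix every line of the quote with "> "
        // (assumes blockquotes are not nested)
        $html = preg_replace_callback(
            '/<blockquote\b[^>]*>(.*?)<\/blockquote>/is',
            function ($m) {
                $inner = preg_replace('/<br\b[^>]*>/i', "\n", strip_tags($m[1], '<br>'));
                return '> ' . str_replace("\n", "\n> ", trim($inner)) . "\n\n";
            },
            $html
        );
        // Headings: one "#" per heading level (h2 becomes "## ")
        $html = preg_replace_callback(
            '/<h([1-6])\b[^>]*>/i',
            function ($m) {
                return str_repeat('#', (int) $m[1]) . ' ';
            },
            $html
        );
        $replacements = [
            '/<\/h[1-6]>/i' => "\n\n",
            '/<p\b[^>]*>/i' => '',
            '/<\/p>/i' => "\n\n",
            '/<br\b[^>]*>/i' => "\n",
            '/<li\b[^>]*>/i' => '• ',
            '/<\/li>/i' => "\n",
            '/<ul\b[^>]*>/i' => "\n",
            '/<\/ul>/i' => "\n",
            '/<ol\b[^>]*>/i' => "\n",
            '/<\/ol>/i' => "\n",
        ];
    } else {
        // Simple conversion
        $replacements = [
            '/<br\b[^>]*>/i' => "\n",
            '/<\/p>/i' => "\n\n",
            '/<\/div>/i' => "\n",
            '/<\/h[1-6]>/i' => "\n\n",
            '/<\/li>/i' => "\n",
        ];
    }
    foreach ($replacements as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }
    $text = strip_tags($html);
    // Collapse spaces, trim them around newlines, then limit blank lines
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}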
$html = '<article>
<h1>Main Article</h1>
<p>Introduction paragraph.</p>
<h2>Subsection</h2>
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
<blockquote>
<p>This is a quoted text<br>with a line break.</p>
</blockquote>
</article>';
$dom = str_get_html($html);
$article = $dom->find('article', 0);
$structuredText = extractNestedText($article, true);
echo $structuredText;
/*
Output:
# Main Article
Introduction paragraph.
## Subsection
• First item
• Second item
> This is a quoted text
> with a line break.
*/
?>
Best Practices and Considerations
1. Handle Different HTML Structures
Different websites structure their HTML differently. Always inspect the source to understand:
- How line breaks are represented (<br>, <p>, block elements)
- Whether content is dynamically loaded
- If special formatting needs to be preserved
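One quick way to answer the first question is to count the break-producing tags in a sample page. The following is a small sketch in Python using only the standard library; TagCounter and break_tag_profile are hypothetical helper names introduced here for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags are counted too
        self.counts[tag] += 1


def break_tag_profile(html, tags=("br", "p", "div", "li")):
    """Return how many times each break-producing tag occurs."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {tag: counter.counts[tag] for tag in tags}


sample = "<div><p>One<br>two</p><p>Three</p></div>"
print(break_tag_profile(sample))
# {'br': 1, 'p': 2, 'div': 1, 'li': 0}
```

A page dominated by <br> needs different replacement rules from one built out of <p> or <div> blocks, so a profile like this can guide which patterns to prioritize.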
2. Memory Management
When processing large documents:
<?php
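// The whole document is still parsed at once; extracting section by section
// and clearing nodes afterwards keeps peak memory use down
function processLargeDocument($html) {
    $dom = str_get_html($html);
    // str_get_html() returns false when it cannot load the input (for example, an empty string)
    if (!$dom) {
        return '';
    }
    // Process sections individually to manage memory
    $sections = $dom->find('.section');
    $results = [];
    foreach ($sections as $section) {
        $text = htmlToTextWithLineBreaks($section->innertext);
        $results[] = $text;
        // Clear section from memory
        $section->clear();
    }
    // Clear the entire DOM to break circular references
    $dom->clear();
    unset($dom);
    return implode("\n\n", $results);
}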
?>
3. Character Encoding
Ensure proper character encoding handling:
<?php
function extractTextWithEncoding($html, $encoding = 'UTF-8') {
// Convert to UTF-8 if needed
if ($encoding !== 'UTF-8') {
$html = mb_convert_encoding($html, 'UTF-8', $encoding);
}
$dom = str_get_html($html);
// str_get_html() returns false when it cannot load the input
if (!$dom) {
return '';
}
$text = htmlToTextWithLineBreaks($dom->innertext);
$dom->clear();
return $text;
}
?>
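The declared encoding of a page is not always the real one, so it can help to fall back gracefully instead of failing. A hedged sketch of that idea in Python (standard library only; to_utf8_text is an illustrative name, and replacing undecodable bytes is just one possible fallback policy):

```python
def to_utf8_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw bytes using the declared encoding, with a lossy fallback."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown encoding: decode as UTF-8 and mark bad bytes
        # with U+FFFD instead of raising
        return raw.decode("utf-8", errors="replace")


latin1_bytes = "café".encode("iso-8859-1")
print(to_utf8_text(latin1_bytes, "iso-8859-1"))  # café
```

Decoding once, before parsing, means every later step (tag replacement, whitespace cleanup, entity decoding) works on the same text representation.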
Troubleshooting Common Issues
Issue 1: Missing Line Breaks
If line breaks aren't preserved:
- Check if the original HTML uses <br> tags or CSS for line breaks
- Verify that your replacement patterns match the HTML structure
- Consider using browser automation tools for CSS-rendered content
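To check the second point, it helps to run a pattern against the tag variants that actually occur in the wild. A minimal sketch in Python (standard library only; the variant list and the helper name unmatched_variants are illustrative assumptions) comparing a strict pattern with a more tolerant one:

```python
import re

# Tolerant: case-insensitive, allows attributes; \b avoids matching longer tag names
TOLERANT_BR = re.compile(r"<br\b[^>]*>", re.IGNORECASE)
# Strict: case-sensitive, no attributes allowed
STRICT_BR = re.compile(r"<br\s*/?>")

variants = ["<br>", "<br/>", "<br />", "<BR>", '<br class="clear">']


def unmatched_variants(pattern, samples):
    """Return the samples that the pattern fails to match in full."""
    return [s for s in samples if not pattern.fullmatch(s)]


print(unmatched_variants(TOLERANT_BR, variants))
# []
print(unmatched_variants(STRICT_BR, variants))
# ['<BR>', '<br class="clear">']
```

The same check translates directly to PHP: adding the i flag and allowing attributes in the pattern covers uppercase tags and tags such as <br class="clear">.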
Issue 2: Extra Whitespace
Clean up excessive whitespace:
<?php
function cleanWhitespace($text) {
// Remove trailing spaces from lines
$text = preg_replace('/[ \t]+$/m', '', $text);
// Remove leading spaces from lines (note: this also removes intentional indentation)
$text = preg_replace('/^[ \t]+/m', '', $text);
// Normalize multiple spaces to single spaces
$text = preg_replace('/[ \t]+/', ' ', $text);
// Limit consecutive newlines
$text = preg_replace('/\n{3,}/', "\n\n", $text);
return trim($text);
}
?>
Issue 3: Special Characters
Handle HTML entities and special characters:
<?php
function decodeHtmlEntities($text) {
return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
$text = htmlToTextWithLineBreaks($element->innertext);
$text = decodeHtmlEntities($text);
?>
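The order of operations matters here: entities should be decoded after the tags have been removed, as in the snippet above. A short Python sketch (standard library only; strip_tags is a deliberately naive regex helper defined for this illustration, not a library function) shows what goes wrong when the order is reversed:

```python
import html
import re


def strip_tags(fragment: str) -> str:
    """Naively remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", fragment)


fragment = "<p>Use &lt;br&gt; for a line break &amp; keep it short.</p>"

# Correct order: remove real tags first, then decode entities
good = html.unescape(strip_tags(fragment))
# Wrong order: decoding first turns &lt;br&gt; into something that looks
# like a real tag, which is then stripped along with the markup
bad = strip_tags(html.unescape(fragment))

print(good)  # Use <br> for a line break & keep it short.
print(bad)   # Use  for a line break & keep it short.
```

Text that mentions tags literally, such as documentation or code samples, is the typical case where this ordering bug silently deletes content.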
Performance Optimization
Batch Processing
To run several selectors against the same parsed DOM, so the document is parsed only once, and group the results by selector:
<?php
function batchExtractText($dom, $selectors) {
$results = [];
foreach ($selectors as $selector) {
$elements = $dom->find($selector);
foreach ($elements as $element) {
$text = htmlToTextWithLineBreaks($element->innertext);
if (!empty(trim($text))) {
$results[$selector][] = $text;
}
}
}
return $results;
}
// Usage
$selectors = ['.article-content', '.sidebar-content', '.footer-content'];
$extractedTexts = batchExtractText($dom, $selectors);
?>
Regular Expression Optimization
Passing arrays to preg_replace applies several patterns in one call. Each pattern is still applied in turn, so the gain is mainly fewer function calls and more compact code:
<?php
function optimizedTextExtraction($html) {
// One preg_replace call with parallel arrays of patterns and replacements
$patterns = [
'/<br\b[^>]*>/i',
'/<\/p>/i',
'/<\/div>/i',
'/<\/h[1-6]>/i',
];
$replacements = [
"\n",
"\n\n",
"\n",
"\n\n",
];
$html = preg_replace($patterns, $replacements, $html);
$text = strip_tags($html);
// Cleanup in one call: limit blank lines, then collapse spaces and tabs
return preg_replace(['/\n{3,}/', '/[ \t]+/'], ["\n\n", ' '], trim($text));
}
?>
Conclusion
Extracting text content while preserving line breaks requires understanding both the HTML structure and the desired output format. Simple HTML DOM provides the foundation, but combining it with proper text processing techniques ensures you maintain the original content's readability and structure. For dynamic content, consider using headless browser solutions that can render JavaScript-generated content before extraction.
Remember to always test your extraction methods with different HTML structures and consider edge cases like nested elements, mixed content types, and special formatting requirements specific to your use case. The key is to balance preservation of formatting with clean, readable output that serves your specific needs.