Simple HTML DOM is a lightweight PHP library that provides an easy way to manipulate HTML documents. It's particularly popular for web scraping tasks because it offers jQuery-like CSS selectors and a simple API for parsing HTML content.
Installation Methods
Method 1: Composer Installation (Recommended)
The easiest and most reliable way to install Simple HTML DOM is through Composer:
composer require voku/simple_html_dom
Note: The original simple-html-dom/simple-html-dom package is no longer maintained. Use voku/simple_html_dom for the actively maintained fork with PHP 8+ support.
After installation, you can use the library with autoloading:
<?php
require_once 'vendor/autoload.php';
use voku\helper\HtmlDomParser;
// Parse HTML from string
$html = HtmlDomParser::str_get_html('<html><body><h1>Hello World</h1></body></html>');
// Find and output the h1 element
$h1 = $html->findOne('h1');
echo $h1->text(); // Outputs: Hello World
// Parse HTML from URL
$html = HtmlDomParser::file_get_html('https://example.com');
// Find all links
foreach ($html->find('a') as $link) {
    echo $link->href . "\n";
}
Method 2: Manual Installation
For projects not using Composer, you can install manually:
Download the library:
- Visit: https://github.com/voku/simple_html_dom
- Download the latest release or clone the repository
Include the required files:
<?php
// Include the main library file
require_once 'path/to/simple_html_dom/src/voku/helper/HtmlDomParser.php';
// This file alone is not enough: HtmlDomParser relies on other classes in
// src/voku/helper/ and on the packages listed under "require" in composer.json,
// so those files must be included (or autoloaded) as well
use voku\helper\HtmlDomParser;
$html = HtmlDomParser::str_get_html('<div>Hello</div>');
echo $html->find('div')[0]->text();
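Because a manual install has to load every class the parser touches, a small autoloader is usually less fragile than a list of require_once calls. The following is a minimal sketch, assuming the repository was unpacked to the same placeholder path used above and that its classes follow the usual one-class-per-file layout under src/voku/helper/. The helper name vokuClassToFile is hypothetical, and third-party packages from composer.json would still need to be loaded separately.

```php
<?php
// Hypothetical helper: map a class in the voku\helper namespace to a file
// under the unpacked sources, or return null for any other class.
function vokuClassToFile(string $class, string $baseDir): ?string
{
    $prefix = 'voku\\helper\\';
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null; // not this namespace; leave it to other autoloaders
    }
    $relative = substr($class, strlen($prefix));
    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

// Assumed location of the manually downloaded sources (placeholder path).
$baseDir = 'path/to/simple_html_dom/src/voku/helper';

// Load classes on demand instead of listing every file by hand.
spl_autoload_register(function (string $class) use ($baseDir): void {
    $file = vokuClassToFile($class, $baseDir);
    if ($file !== null && is_file($file)) {
        require_once $file;
    }
});
```

With this registered, the first reference to voku\helper\HtmlDomParser triggers the include, provided the file exists at the mapped path.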
Legacy Simple HTML DOM
If you need to use the original Simple HTML DOM library (not recommended for new projects):
# For legacy projects only
composer require simple-html-dom/simple-html-dom
<?php
require_once 'vendor/autoload.php';
// Create DOM object from string
$html = str_get_html('<html><body>Hello!</body></html>');
// Find elements
$body = $html->find('body', 0);
echo $body->innertext;
// Clean up memory
$html->clear();
unset($html);
System Requirements
- PHP: 7.0+ (8.0+ recommended for voku/simple_html_dom)
- Extensions:
  - dom extension (usually enabled by default)
  - libxml extension
  - mbstring extension (recommended)
Check your PHP version and extensions:
php -v
php -m | grep -E "(dom|libxml|mbstring)"
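The same check can be done from inside PHP, which helps when the web server uses a different configuration than the command line. This is a minimal sketch using only built-in functions; the helper name missingExtensions is hypothetical, and the version floor and extension names are the ones listed above.

```php
<?php
// Return the names of the given extensions that are not loaded.
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function (string $ext): bool {
        return !extension_loaded($ext);
    }));
}

$missing = missingExtensions(['dom', 'libxml', 'mbstring']);
$phpOk = version_compare(PHP_VERSION, '7.0.0', '>=');

echo 'PHP ' . PHP_VERSION . ($phpOk ? ' is supported' : ' is too old') . "\n";
echo $missing === []
    ? "All listed extensions are loaded\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```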
Common Usage Examples
Basic HTML Parsing
<?php
use voku\helper\HtmlDomParser;
$html = HtmlDomParser::str_get_html('
<div class="container">
<h1 id="title">Welcome</h1>
<p class="text">This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</div>
');
// Find by ID
$title = $html->findOne('#title');
echo $title->text(); // "Welcome"
// Find by class
$paragraph = $html->findOne('.text');
echo $paragraph->text(); // "This is a paragraph."
// Find multiple elements
$items = $html->find('li');
foreach ($items as $item) {
    echo $item->text() . "\n";
}
Web Scraping Example
<?php
use voku\helper\HtmlDomParser;
// Scrape a website
$html = HtmlDomParser::file_get_html('https://news.ycombinator.com');
// Extract article titles and URLs
$articles = $html->find('.titleline > a');
foreach ($articles as $article) {
    $title = $article->text();
    $url = $article->href;
    echo "Title: {$title}\n";
    echo "URL: {$url}\n\n";
}
Troubleshooting
Common Issues and Solutions
1. Composer Installation Fails
# Clear Composer cache
composer clear-cache
# Update Composer
composer self-update
# Try installing with verbose output
composer require voku/simple_html_dom -v
2. Memory Issues with Large HTML
// Set memory limit for large documents
ini_set('memory_limit', '256M');
// Always clean up
$html->clear();
unset($html);
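Setting memory_limit unconditionally can lower a limit that was already higher. A minimal sketch of raising it only when needed follows; the helper name iniSizeToBytes is hypothetical, the 256M target is taken from the snippet above, and the parsing covers only the K, M and G suffixes.

```php
<?php
// Convert an ini size such as "128M" or "1G" to bytes; -1 means "no limit".
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    $number = (int) $value; // leading digits, e.g. 128 for "128M"
    switch (strtolower(substr($value, -1))) {
        case 'g': return $number * 1024 ** 3;
        case 'm': return $number * 1024 ** 2;
        case 'k': return $number * 1024;
        default:  return $number; // plain byte count or -1
    }
}

// Raise memory_limit only if the current limit is below the target.
$wanted = '256M';
$current = iniSizeToBytes((string) ini_get('memory_limit'));
if ($current !== -1 && $current < iniSizeToBytes($wanted)) {
    ini_set('memory_limit', $wanted);
}
```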
3. SSL Certificate Issues
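// verify_peer and verify_peer_name are SSL context options, so they go under 'ssl', not 'http'
// Disabling verification is insecure; use it only for local debugging
$context = stream_context_create([
    'ssl' => [
        'verify_peer' => false,
        'verify_peer_name' => false,
    ],
]);
// Fetch the page with the context, then parse the resulting string
$content = file_get_contents('https://example.com', false, $context);
if ($content !== false) {
    $html = HtmlDomParser::str_get_html($content);
}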
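Rather than switching verification off, certificate errors can often be resolved by pointing PHP at a CA bundle. The sketch below assumes a bundle at /path/to/cacert.pem (a hypothetical path) and a hypothetical helper name verifiedSslContext; cafile, verify_peer and verify_peer_name are standard PHP SSL context options. The fetch itself is commented out so the snippet runs without network access.

```php
<?php
// Build a stream context that keeps certificate verification enabled
// and trusts the CA bundle at the given path.
function verifiedSslContext(string $caFile)
{
    return stream_context_create([
        'ssl' => [
            'verify_peer'      => true,
            'verify_peer_name' => true,
            'cafile'           => $caFile,
        ],
    ]);
}

$context = verifiedSslContext('/path/to/cacert.pem'); // hypothetical path

// Fetch with the context, then hand the string to the parser:
// $content = file_get_contents('https://example.com', false, $context);
// $html = \voku\helper\HtmlDomParser::str_get_html($content);
```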
4. Character Encoding Problems
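// Convert the input to UTF-8 before parsing
// (the source encoding is assumed to be ISO-8859-1 here; adjust to the page's real encoding)
$htmlString = mb_convert_encoding($htmlString, 'UTF-8', 'ISO-8859-1');
$html = HtmlDomParser::str_get_html($htmlString);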
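When the encoding of incoming pages varies, one option is to normalize everything to UTF-8 before it reaches the parser. This is a minimal sketch using the mbstring extension; the helper name toUtf8 and the ISO-8859-1 fallback are assumptions, and a real scraper would ideally read the charset from the HTTP headers or the page's meta tag.

```php
<?php
// Return the input unchanged if it is already valid UTF-8; otherwise
// convert it from the assumed fallback encoding.
function toUtf8(string $html, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallbackEncoding);
}

// "caf\xE9" is "café" encoded as ISO-8859-1 and is not valid UTF-8.
$normalized = toUtf8("<p>caf\xE9</p>");
echo $normalized . "\n";
```

The normalized string can then be passed to str_get_html as usual.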
Best Practices
- Always check if elements exist before accessing their properties
- Clean up DOM objects to free memory: $html->clear(); unset($html);
- Use appropriate selectors for better performance
- Respect robots.txt and website terms of service
- Implement rate limiting for web scraping
- Handle errors gracefully with try-catch blocks
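<?php
require_once 'vendor/autoload.php';
use voku\helper\HtmlDomParser;
try {
    // Depending on the library version, a failed load may throw instead of
    // returning false; the check below covers the return-value case
    $html = HtmlDomParser::file_get_html('https://example.com');
    if ($html === false) {
        throw new Exception('Failed to load HTML');
    }
    // findOneOrFalse() returns false when nothing matches the selector
    $title = $html->findOneOrFalse('title');
    if ($title !== false) {
        echo $title->text();
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
} finally {
    // Call clear() only if a document object was actually created
    if (isset($html) && $html !== false) {
        $html->clear();
    }
}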
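The rate-limiting advice above can be sketched as a fixed minimum delay between requests. The helper name secondsToWait, the one-second interval and the URLs are all placeholders, and the actual fetch is left as a comment so the snippet runs without network access.

```php
<?php
// Given when the last request was sent and the minimum spacing between
// requests, return how many seconds to wait (never negative).
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

$urls = ['https://example.com/page1', 'https://example.com/page2']; // placeholder URLs
$minInterval = 1.0;   // assumed politeness delay, in seconds
$lastRequestAt = 0.0; // no request sent yet

foreach ($urls as $url) {
    $wait = secondsToWait($lastRequestAt, microtime(true), $minInterval);
    if ($wait > 0) {
        usleep((int) round($wait * 1000000)); // usleep() takes microseconds
    }
    $lastRequestAt = microtime(true);
    // Fetch and parse $url here, e.g. with HtmlDomParser::file_get_html($url).
    echo "Would fetch {$url}\n";
}
```

A fixed delay is the simplest policy; sites that publish a crawl delay in robots.txt may call for a longer interval.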