What is the Simple HTML DOM Parser and how do I use it?
The Simple HTML DOM Parser is a lightweight, fast, and easy-to-use PHP library designed for parsing HTML documents and extracting data from web pages. Originally created by S.C. Chen, this library provides an intuitive DOM manipulation interface similar to jQuery, making it an excellent choice for web scraping tasks in PHP applications.
What is Simple HTML DOM Parser?
Simple HTML DOM Parser is a pure PHP library that builds a DOM tree from HTML content, allowing developers to navigate, search, and manipulate HTML elements using a familiar CSS-like selector syntax. Unlike heavier alternatives, this parser is designed to be lightweight and fast, making it a practical choice for parsing HTML documents or processing multiple pages.
Key Features
- CSS Selector Support: Find elements using CSS-style selectors such as div.class, #id, and a[href]
- Attribute Filters: Match elements by attribute presence, value, prefix, or suffix (e.g. a[href^="http"])
- Memory Efficient: Optimized for handling large HTML documents
- jQuery-like Syntax: Familiar API for developers coming from frontend development
- No External Dependencies: Pure PHP implementation
- HTML Manipulation: Create, modify, and delete HTML elements
- Encoding Support: Handles various character encodings automatically
Installation and Setup
Using Composer (Recommended)
composer require sunra/php-simple-html-dom-parser
Manual Installation
Download the simple_html_dom.php file and include it in your project:
<?php
require_once 'simple_html_dom.php';
Basic Usage Examples
Loading HTML Content
<?php
require_once 'simple_html_dom.php';

// Load HTML from a string
$html = str_get_html('<div class="content">Hello World</div>');

// Load HTML from a file
$html = file_get_html('example.html');

// Load HTML from a URL (requires allow_url_fopen to be enabled)
$html = file_get_html('https://example.com');
Finding Elements with CSS Selectors
<?php
// Find all div elements
$divs = $html->find('div');
// Find element by ID
$element = $html->find('#header', 0); // 0 gets the first match
// Find elements by class
$elements = $html->find('.content');
// Find elements with specific attributes
$links = $html->find('a[href]');
$external_links = $html->find('a[href^="http"]');
// Complex selectors
$nested_elements = $html->find('div.container p.text');
Extracting Data
<?php
// Get element text content
foreach ($html->find('h1') as $header) {
    echo $header->plaintext . "\n";
}

// Get element HTML
foreach ($html->find('div.article') as $article) {
    echo $article->outertext . "\n";
}

// Get attribute values
foreach ($html->find('a') as $link) {
    echo "Link: " . $link->href . " - Text: " . $link->plaintext . "\n";
}

// Get form input values
foreach ($html->find('input[type="text"]') as $input) {
    echo "Input name: " . $input->name . " - Value: " . $input->value . "\n";
}
Advanced Usage Patterns
Scraping a Complete Web Page
<?php
function scrapeProductData($url) {
    $html = file_get_html($url);
    if (!$html) {
        throw new Exception("Failed to load HTML from URL: $url");
    }
    $products = [];
    foreach ($html->find('.product-item') as $product) {
        $title = $product->find('.product-title', 0);
        $price = $product->find('.price', 0);
        $image = $product->find('img', 0);
        $link  = $product->find('a', 0);
        $products[] = [
            'title'       => $title ? trim($title->plaintext) : '',
            'price'       => $price ? trim($price->plaintext) : '',
            'image_url'   => $image ? $image->src : '',
            'product_url' => $link ? $link->href : ''
        ];
    }
    // Clean up memory
    $html->clear();
    unset($html);
    return $products;
}

// Usage
try {
    $products = scrapeProductData('https://example-shop.com/products');
    foreach ($products as $product) {
        echo "Product: {$product['title']} - Price: {$product['price']}\n";
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
Handling Forms and Tables
<?php
// Extract table data
function extractTableData($html, $table_selector = 'table') {
    $table = $html->find($table_selector, 0);
    if (!$table) {
        return [];
    }
    $data = [];
    $headers = [];
    // Get headers
    foreach ($table->find('thead th') as $index => $header) {
        $headers[$index] = trim($header->plaintext);
    }
    // Get rows
    foreach ($table->find('tbody tr') as $row) {
        $row_data = [];
        foreach ($row->find('td') as $index => $cell) {
            $header_key = $headers[$index] ?? "column_$index";
            $row_data[$header_key] = trim($cell->plaintext);
        }
        $data[] = $row_data;
    }
    return $data;
}

// Extract form data
function extractFormFields($html, $form_selector = 'form') {
    $form = $html->find($form_selector, 0);
    if (!$form) {
        return [];
    }
    $fields = [];
    foreach ($form->find('input, select, textarea') as $field) {
        $name = $field->name ?? '';
        $type = $field->type ?? $field->tag;
        $value = $field->value ?? '';
        if ($name) {
            $fields[$name] = [
                'type' => $type,
                'value' => $value,
                'required' => isset($field->required)
            ];
        }
    }
    return $fields;
}
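To see the table helper in action, here is a quick sketch against an inline HTML string; the markup is made up purely for illustration and assumes the library file is on the include path:

```php
<?php
require_once 'simple_html_dom.php';

// Made-up markup to exercise extractTableData() from above
$sample = '<table>
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody><tr><td>Widget</td><td>9.99</td></tr></tbody>
</table>';

$html = str_get_html($sample);
$rows = extractTableData($html);
print_r($rows); // one row: ['Name' => 'Widget', 'Price' => '9.99']
$html->clear();
unset($html);
```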
Error Handling and Memory Management
<?php
function safeHtmlParse($content) {
    try {
        // Raise the memory limit for large documents
        ini_set('memory_limit', '256M');
        $html = str_get_html($content);
        if (!$html) {
            throw new Exception("Failed to parse HTML content");
        }
        // Process your data here
        $result = processHtmlData($html);
        // Always clean up
        $html->clear();
        unset($html);
        return $result;
    } catch (Exception $e) {
        error_log("HTML parsing error: " . $e->getMessage());
        return false;
    }
}

function processHtmlData($html) {
    // Your data extraction logic
    $title = $html->find('title', 0);
    return $title ? $title->plaintext : 'No title found';
}
Working with Different Content Types
Handling AJAX Content
While Simple HTML DOM Parser works with static HTML, for JavaScript-heavy sites, you might need to combine it with other tools. For dynamic content that loads after page load, consider using headless browser solutions for comprehensive scraping.
<?php
// For static HTML with embedded JSON data
function extractJsonData($html) {
    foreach ($html->find('script[type="application/json"]') as $script) {
        $json_data = json_decode($script->innertext, true);
        if ($json_data) {
            return $json_data;
        }
    }
    return null;
}
Processing Multiple Pages
<?php
function scrapeMultiplePages($base_url, $page_count) {
    $all_data = [];
    for ($page = 1; $page <= $page_count; $page++) {
        $url = $base_url . "?page=" . $page;
        echo "Processing page $page...\n";
        $html = file_get_html($url);
        if ($html) {
            $page_data = extractPageData($html);
            $all_data = array_merge($all_data, $page_data);
            // Clean up memory after each page
            $html->clear();
            unset($html);
            // Be respectful - add a delay between requests
            sleep(1);
        }
    }
    return $all_data;
}

function extractPageData($html) {
    $items = [];
    foreach ($html->find('.item') as $item) {
        $title = $item->find('.title', 0);
        $description = $item->find('.description', 0);
        $link = $item->find('a', 0);
        $items[] = [
            'title'       => $title ? trim($title->plaintext) : '',
            'description' => $description ? trim($description->plaintext) : '',
            'link'        => $link ? $link->href : ''
        ];
    }
    return $items;
}
Performance Optimization Tips
Memory Management
<?php
// Always clear HTML objects when done
$html = file_get_html($url);
// ... process data
$html->clear();
unset($html);

// For very large documents, process in chunks
function processLargeHtml($content) {
    // Split content if needed
    $max_size = 1024 * 1024; // 1MB chunks
    if (strlen($content) > $max_size) {
        // processInChunks() is a placeholder for your own chunking logic
        return processInChunks($content, $max_size);
    }
    return str_get_html($content);
}
Efficient Selectors
<?php
// Use specific selectors to reduce processing time
$specific = $html->find('div#content .article-title', 0); // Good
$broad = $html->find('*'); // Avoid - too broad
// Limit results when you only need a few
$first_link = $html->find('a', 0); // Get only first match
$first_five = array_slice($html->find('a'), 0, 5); // Limit results
Common Use Cases and Examples
E-commerce Product Scraping
<?php
class ProductScraper {
    private $base_url;

    public function __construct($base_url) {
        $this->base_url = $base_url;
    }

    public function scrapeProduct($product_url) {
        $html = file_get_html($product_url);
        if (!$html) {
            return null;
        }
        $product = [
            'name'           => $this->extractText($html, '.product-name'),
            'price'          => $this->extractPrice($html, '.price'),
            'description'    => $this->extractText($html, '.product-description'),
            'images'         => $this->extractImages($html, '.product-images img'),
            'specifications' => $this->extractSpecs($html, '.specs-table'),
            'availability'   => $this->extractText($html, '.availability')
        ];
        $html->clear();
        return $product;
    }

    private function extractText($html, $selector) {
        $element = $html->find($selector, 0);
        return $element ? trim($element->plaintext) : '';
    }

    private function extractPrice($html, $selector) {
        $price_text = $this->extractText($html, $selector);
        // Strip everything except digits and the decimal point
        return (float) preg_replace('/[^0-9.]/', '', $price_text);
    }

    private function extractImages($html, $selector) {
        $images = [];
        foreach ($html->find($selector) as $img) {
            if ($img->src) {
                $images[] = $this->resolveUrl($img->src);
            }
        }
        return $images;
    }

    private function extractSpecs($html, $selector) {
        $specs = [];
        $table = $html->find($selector, 0);
        if ($table) {
            foreach ($table->find('tr') as $row) {
                $cells = $row->find('td');
                if (count($cells) >= 2) {
                    $key = trim($cells[0]->plaintext);
                    $value = trim($cells[1]->plaintext);
                    $specs[$key] = $value;
                }
            }
        }
        return $specs;
    }

    private function resolveUrl($url) {
        if (strpos($url, 'http') === 0) {
            return $url;
        }
        return rtrim($this->base_url, '/') . '/' . ltrim($url, '/');
    }
}
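A hypothetical usage of the class above; the shop URL, product path, and CSS classes are placeholders for a real site whose markup matches the selectors assumed in the scraper:

```php
<?php
// Hypothetical usage of ProductScraper; the URL and markup are assumptions.
$scraper = new ProductScraper('https://example-shop.com');
$product = $scraper->scrapeProduct('https://example-shop.com/products/widget-123');
if ($product) {
    echo "Name: " . $product['name'] . "\n";
    echo "Price: " . number_format($product['price'], 2) . "\n";
    echo "Images found: " . count($product['images']) . "\n";
}
```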
Best Practices and Tips
1. Always Handle Errors
<?php
$html = file_get_html($url);
if (!$html) {
throw new Exception("Failed to load HTML from $url");
}
2. Respect Websites
<?php
// Add delays between requests
sleep(1);
// Check robots.txt before scraping
// Set an appropriate User-Agent and pass the context to file_get_html()
$context = stream_context_create([
    'http' => [
        'user_agent' => 'Mozilla/5.0 (compatible; MyBot/1.0)'
    ]
]);
$html = file_get_html($url, false, $context);
3. Clean Up Memory
<?php
// Always clean up after processing
$html->clear();
unset($html);
4. Use Caching for Development
<?php
function getCachedHtml($url, $cache_time = 3600) {
    // Assumes the cache/ directory exists and is writable
    $cache_file = 'cache/' . md5($url) . '.html';
    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $cache_time) {
        return file_get_html($cache_file);
    }
    $html = file_get_html($url);
    if ($html) {
        file_put_contents($cache_file, $html->save());
    }
    return $html;
}
Alternative Solutions
While Simple HTML DOM Parser is excellent for basic HTML parsing, consider these alternatives for specific use cases:
- For JavaScript-heavy sites: Use headless browsers like Puppeteer for handling dynamic content
- For large-scale scraping: Consider a dedicated crawling framework such as Scrapy (Python)
- For XML processing: Use PHP's built-in DOMDocument
- For modern PHP projects: Symfony DomCrawler component
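For comparison, here is roughly the same link-extraction task done with PHP's built-in DOMDocument and DOMXPath, with no third-party dependency (a minimal sketch):

```php
<?php
// Link extraction with PHP's built-in DOM extension - no third-party library.
$source = '<div class="content"><a href="https://example.com">Example</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world markup
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href') . ' - ' . trim($link->textContent) . "\n";
}
// prints: https://example.com - Example
```

The trade-off is verbosity: DOMDocument requires XPath expressions and more boilerplate, while Simple HTML DOM's selector syntax is closer to what frontend developers already know.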
Conclusion
Simple HTML DOM Parser remains one of the most accessible and efficient tools for HTML parsing in PHP. Its jQuery-like syntax makes it easy to learn, while its lightweight nature ensures good performance for most web scraping tasks. Whether you're extracting data from static websites, processing forms, or building automated content aggregators, this library provides the essential tools needed for effective HTML manipulation.
Remember to always follow ethical scraping practices, respect robots.txt files, and implement appropriate delays and error handling in your applications. For more complex scenarios involving dynamic content, consider integrating Simple HTML DOM Parser with headless browser solutions for comprehensive web scraping capabilities.