What is the Simple HTML DOM Parser and how do I use it?

The Simple HTML DOM Parser is a lightweight, fast, and easy-to-use PHP library designed for parsing HTML documents and extracting data from web pages. Originally created by S.C. Chen, this library provides an intuitive DOM manipulation interface similar to jQuery, making it an excellent choice for web scraping tasks in PHP applications.

What is Simple HTML DOM Parser?

Simple HTML DOM Parser is a pure PHP library that builds a DOM tree from HTML content, allowing developers to navigate, search, and manipulate HTML elements using familiar CSS-style selectors. It also tolerates invalid markup, which makes it practical for real-world pages; just remember to clear parsed documents promptly when processing large files or many pages, since the entire DOM tree is held in memory.

Key Features

  • CSS Selector Support: Find elements using CSS selectors like div.class, #id, a[href]
  • Tolerant Parsing: Handles real-world HTML that isn't well-formed
  • Lightweight: Distributed as a single PHP file that is easy to drop into any project
  • jQuery-like Syntax: Familiar API for developers coming from frontend development
  • No External Dependencies: Pure PHP implementation
  • HTML Manipulation: Create, modify, and delete HTML elements
  • Encoding Support: Handles various character encodings automatically

Installation and Setup

Using Composer (Recommended)

composer require sunra/php-simple-html-dom-parser
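When installed via Composer, the library is loaded through the autoloader rather than a direct `require` of `simple_html_dom.php`. A minimal sketch, assuming the sunra fork's namespaced facade class `HtmlDomParser`, which exposes static equivalents of the global helper functions:

```php
<?php
// Load Composer's autoloader (path relative to your project root)
require_once __DIR__ . '/vendor/autoload.php';

use Sunra\PhpSimple\HtmlDomParser;

// Static counterparts of str_get_html() / file_get_html()
$html = HtmlDomParser::str_get_html('<div class="content">Hello World</div>');
echo $html->find('.content', 0)->plaintext;
```

The returned object behaves the same as the one produced by the global `str_get_html()` function shown in the examples below.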

Manual Installation

Download the simple_html_dom.php file and include it in your project:

<?php
require_once 'simple_html_dom.php';

Basic Usage Examples

Loading HTML Content

<?php
require_once 'simple_html_dom.php';

// Load HTML from a string
$html = str_get_html('<div class="content">Hello World</div>');

// Load HTML from a file
$html = file_get_html('example.html');

// Load HTML from a URL (requires allow_url_fopen to be enabled in php.ini)
$html = file_get_html('https://example.com');

Finding Elements with CSS Selectors

<?php
// Find all div elements
$divs = $html->find('div');

// Find element by ID
$element = $html->find('#header', 0); // 0 gets the first match

// Find elements by class
$elements = $html->find('.content');

// Find elements with specific attributes
$links = $html->find('a[href]');
$external_links = $html->find('a[href^="http"]');

// Complex selectors
$nested_elements = $html->find('div.container p.text');

Extracting Data

<?php
// Get element text content
foreach($html->find('h1') as $header) {
    echo $header->plaintext . "\n";
}

// Get element HTML
foreach($html->find('div.article') as $article) {
    echo $article->outertext . "\n";
}

// Get attribute values
foreach($html->find('a') as $link) {
    echo "Link: " . $link->href . " - Text: " . $link->plaintext . "\n";
}

// Get form input values
foreach($html->find('input[type="text"]') as $input) {
    echo "Input name: " . $input->name . " - Value: " . $input->value . "\n";
}

Advanced Usage Patterns

Scraping a Complete Web Page

<?php
function scrapeProductData($url) {
    $html = file_get_html($url);

    if (!$html) {
        throw new Exception("Failed to load HTML from URL: $url");
    }

    $products = [];

    foreach($html->find('.product-item') as $product) {
        $title = $product->find('.product-title', 0);
        $price = $product->find('.price', 0);
        $image = $product->find('img', 0);

        $products[] = [
            'title' => $title ? trim($title->plaintext) : '',
            'price' => $price ? trim($price->plaintext) : '',
            'image_url' => $image ? $image->src : '',
            'product_url' => $product->find('a', 0)->href ?? ''
        ];
    }

    // Clean up memory
    $html->clear();
    unset($html);

    return $products;
}

// Usage
try {
    $products = scrapeProductData('https://example-shop.com/products');
    foreach($products as $product) {
        echo "Product: {$product['title']} - Price: {$product['price']}\n";
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}

Handling Forms and Tables

<?php
// Extract table data
function extractTableData($html, $table_selector = 'table') {
    $table = $html->find($table_selector, 0);

    if (!$table) {
        return [];
    }

    $data = [];
    $headers = [];

    // Get headers
    foreach($table->find('thead th') as $index => $header) {
        $headers[$index] = trim($header->plaintext);
    }

    // Get rows
    foreach($table->find('tbody tr') as $row) {
        $row_data = [];
        foreach($row->find('td') as $index => $cell) {
            $header_key = $headers[$index] ?? "column_$index";
            $row_data[$header_key] = trim($cell->plaintext);
        }
        $data[] = $row_data;
    }

    return $data;
}

// Extract form data
function extractFormFields($html, $form_selector = 'form') {
    $form = $html->find($form_selector, 0);

    if (!$form) {
        return [];
    }

    $fields = [];

    foreach($form->find('input, select, textarea') as $field) {
        $name = $field->name ?? '';
        $type = $field->type ?? $field->tag;
        $value = $field->value ?? '';

        if ($name) {
            $fields[$name] = [
                'type' => $type,
                'value' => $value,
                'required' => isset($field->required)
            ];
        }
    }

    return $fields;
}

Error Handling and Memory Management

<?php
function safeHtmlParse($content) {
    try {
        // Set memory limit for large documents
        ini_set('memory_limit', '256M');

        $html = str_get_html($content);

        if (!$html) {
            throw new Exception("Failed to parse HTML content");
        }

        // Process your data here
        $result = processHtmlData($html);

        // Always clean up
        $html->clear();
        unset($html);

        return $result;

    } catch (Exception $e) {
        error_log("HTML parsing error: " . $e->getMessage());
        return false;
    }
}

function processHtmlData($html) {
    // Your data extraction logic
    return $html->find('title', 0)->plaintext ?? 'No title found';
}

Working with Different Content Types

Handling AJAX Content

While Simple HTML DOM Parser works with static HTML, for JavaScript-heavy sites, you might need to combine it with other tools. For dynamic content that loads after page load, consider using headless browser solutions for comprehensive scraping.

<?php
// For static HTML with embedded JSON data
function extractJsonData($html) {
    foreach($html->find('script[type="application/json"]') as $script) {
        $json_data = json_decode($script->innertext, true);
        if ($json_data) {
            return $json_data;
        }
    }
    return null;
}

Processing Multiple Pages

<?php
function scrapeMultiplePages($base_url, $page_count) {
    $all_data = [];

    for ($page = 1; $page <= $page_count; $page++) {
        $url = $base_url . "?page=" . $page;

        echo "Processing page $page...\n";

        $html = file_get_html($url);

        if ($html) {
            $page_data = extractPageData($html);
            $all_data = array_merge($all_data, $page_data);

            // Clean up memory after each page
            $html->clear();
            unset($html);

            // Be respectful - add delay between requests
            sleep(1);
        }
    }

    return $all_data;
}

function extractPageData($html) {
    $items = [];

    foreach($html->find('.item') as $item) {
        $items[] = [
            'title' => $item->find('.title', 0)->plaintext ?? '',
            'description' => $item->find('.description', 0)->plaintext ?? '',
            'link' => $item->find('a', 0)->href ?? ''
        ];
    }

    return $items;
}

Performance Optimization Tips

Memory Management

<?php
// Always clear HTML objects when done
$html = file_get_html($url);
// ... process data
$html->clear();
unset($html);

// For very large documents, consider processing in chunks
function processLargeHtml($content) {
    $max_size = 1024 * 1024; // 1MB threshold

    if (strlen($content) > $max_size) {
        // processInChunks() is an application-specific helper (not shown).
        // Note: naive byte-splitting can cut through tags, so split on
        // well-defined boundaries such as repeated container elements.
        return processInChunks($content, $max_size);
    }

    return str_get_html($content);
}

Efficient Selectors

<?php
// Use specific selectors to reduce processing time
$specific = $html->find('div#content .article-title', 0); // Good
$broad = $html->find('*'); // Avoid - too broad

// Limit results when you only need a few
$first_link = $html->find('a', 0); // Get only first match
$first_five = array_slice($html->find('a'), 0, 5); // Limit results

Common Use Cases and Examples

E-commerce Product Scraping

<?php
class ProductScraper {
    private $base_url;

    public function __construct($base_url) {
        $this->base_url = $base_url;
    }

    public function scrapeProduct($product_url) {
        $html = file_get_html($product_url);

        if (!$html) {
            return null;
        }

        $product = [
            'name' => $this->extractText($html, '.product-name'),
            'price' => $this->extractPrice($html, '.price'),
            'description' => $this->extractText($html, '.product-description'),
            'images' => $this->extractImages($html, '.product-images img'),
            'specifications' => $this->extractSpecs($html, '.specs-table'),
            'availability' => $this->extractText($html, '.availability')
        ];

        $html->clear();
        return $product;
    }

    private function extractText($html, $selector) {
        $element = $html->find($selector, 0);
        return $element ? trim($element->plaintext) : '';
    }

    private function extractPrice($html, $selector) {
        $price_text = $this->extractText($html, $selector);
        // Clean price text and convert to number
        return (float) preg_replace('/[^0-9.]/', '', $price_text);
    }

    private function extractImages($html, $selector) {
        $images = [];
        foreach($html->find($selector) as $img) {
            if ($img->src) {
                $images[] = $this->resolveUrl($img->src);
            }
        }
        return $images;
    }

    private function extractSpecs($html, $selector) {
        $specs = [];
        $table = $html->find($selector, 0);

        if ($table) {
            foreach($table->find('tr') as $row) {
                $cells = $row->find('td');
                if (count($cells) >= 2) {
                    $key = trim($cells[0]->plaintext);
                    $value = trim($cells[1]->plaintext);
                    $specs[$key] = $value;
                }
            }
        }

        return $specs;
    }

    private function resolveUrl($url) {
        if (strpos($url, 'http') === 0) {
            return $url;
        }
        return rtrim($this->base_url, '/') . '/' . ltrim($url, '/');
    }
}

Best Practices and Tips

1. Always Handle Errors

<?php
$html = file_get_html($url);
if (!$html) {
    throw new Exception("Failed to load HTML from $url");
}

2. Respect Websites

<?php
// Add delays between requests
sleep(1);

// Check robots.txt before scraping, and identify your client with a User-Agent
$context = stream_context_create([
    'http' => [
        'user_agent' => 'Mozilla/5.0 (compatible; MyBot/1.0)'
    ]
]);
// Pass the context to file_get_html() so the header is actually sent
$html = file_get_html($url, false, $context);

3. Clean Up Memory

<?php
// Always clean up after processing
$html->clear();
unset($html);

4. Use Caching for Development

<?php
function getCachedHtml($url, $cache_time = 3600) {
    if (!is_dir('cache')) {
        mkdir('cache', 0755, true); // make sure the cache directory exists
    }
    $cache_file = 'cache/' . md5($url) . '.html';

    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $cache_time) {
        return file_get_html($cache_file);
    }

    $html = file_get_html($url);
    if ($html) {
        file_put_contents($cache_file, $html->save());
    }

    return $html;
}

Alternative Solutions

While Simple HTML DOM Parser is excellent for basic HTML parsing, consider these alternatives for specific use cases:

  • For JavaScript-heavy sites: Use headless browsers like Puppeteer for handling dynamic content
  • For large-scale scraping: Consider a dedicated crawling framework such as Scrapy (Python)
  • For XML processing: Use PHP's built-in DOMDocument
  • For modern PHP projects: Symfony DomCrawler component
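As a point of comparison, the built-in DOMDocument route from the list above looks like this. It is more verbose (XPath expressions instead of CSS selectors) but requires no external library:

```php
<?php
// Same "find elements by class" task using PHP's built-in DOMDocument + DOMXPath
$source = '<div class="content">Hello World</div><div class="other">Skip me</div>';

$doc = new DOMDocument();
@$doc->loadHTML($source); // @ suppresses warnings from imperfect real-world markup

$xpath = new DOMXPath($doc);
// XPath equivalent of the CSS selector div.content
$nodes = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " content ")]');

foreach ($nodes as $node) {
    echo $node->textContent . "\n";
}
```

The class-matching XPath is wordier than `.content`, which is exactly the convenience gap that Simple HTML DOM Parser and Symfony DomCrawler fill.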

Conclusion

Simple HTML DOM Parser remains one of the most accessible and efficient tools for HTML parsing in PHP. Its jQuery-like syntax makes it easy to learn, while its lightweight nature ensures good performance for most web scraping tasks. Whether you're extracting data from static websites, processing forms, or building automated content aggregators, this library provides the essential tools needed for effective HTML manipulation.

Remember to always follow ethical scraping practices, respect robots.txt files, and implement appropriate delays and error handling in your applications. For more complex scenarios involving dynamic content, consider integrating Simple HTML DOM Parser with headless browser solutions for comprehensive web scraping capabilities.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

