How do I use Simple HTML DOM to extract table data from a webpage?

Simple HTML DOM is a PHP library that parses HTML documents and lets you query them with CSS-style selectors. This guide shows how to extract table data from a webpage, from loading the page to exporting the results.

Installation

Using Composer (Recommended)

composer require simplehtmldom/simplehtmldom

require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

Manual Installation

Download simple_html_dom.php and include it:

include_once('simple_html_dom.php');

Basic Table Extraction

1. Load the Webpage

// Option 1: direct URL loading (manual installation)
$html = file_get_html('https://example.com/data-table.html');

// Option 2: HtmlWeb client (Composer version); use one option, not both
$client = new HtmlWeb();
$html = $client->load('https://example.com/data-table.html');

// Check if page loaded successfully
if (!$html) {
    die('Error loading page');
}

2. Find and Extract Table Data

// Find the first table
$table = $html->find('table', 0);

if (!$table) {
    die('No table found');
}

$tableData = [];

// Extract all rows
foreach ($table->find('tr') as $row) {
    $rowData = [];

    // Extract cells (both td and th)
    foreach ($row->find('td, th') as $cell) {
        $rowData[] = trim($cell->plaintext);
    }

    // Only add rows with data
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}

Advanced Table Selection

Target Specific Tables

// By ID
$table = $html->find('table#data-table', 0);

// By class
$table = $html->find('table.product-list', 0);

// By attribute
$table = $html->find('table[data-type=pricing]', 0);

// Multiple criteria
$table = $html->find('div.container table.data-grid', 0);

Handle Multiple Tables

$allTables = $html->find('table');

foreach ($allTables as $index => $table) {
    echo "Processing table " . ($index + 1) . "\n";

    $tableData = [];
    foreach ($table->find('tr') as $row) {
        $rowData = [];
        foreach ($row->find('td, th') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }
        if (!empty($rowData)) {
            $tableData[] = $rowData;
        }
    }

    // Process each table's data (processTableData() stands for your own function)
    processTableData($tableData, $index);
}

Advanced Data Processing

Separate Headers from Data

$table = $html->find('table', 0);
$headers = [];
$rows = [];

foreach ($table->find('tr') as $index => $row) {
    if ($index === 0) {
        // First row as headers
        foreach ($row->find('th, td') as $cell) {
            $headers[] = trim($cell->plaintext);
        }
    } else {
        // Data rows
        $rowData = [];
        foreach ($row->find('td') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }

        // array_combine() needs the same number of keys and values,
        // so skip rows whose cell count differs from the header count
        if (!empty($rowData) && count($rowData) === count($headers)) {
            $rows[] = array_combine($headers, $rowData);
        }
    }
}
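
Rows with merged or missing cells can contain a different number of cells than the header row, and array_combine() requires both arrays to have the same length. As an alternative to skipping such rows, the sketch below pads or trims each row to the header width. The helper name combineRow() and the sample values are assumptions made for illustration; they are not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: pair header names with cell values even when the counts differ.
function combineRow(array $headers, array $cells): array
{
    $width = count($headers);

    // Pad short rows with empty strings, then drop any extra trailing cells.
    $cells = array_slice(array_pad($cells, $width, ''), 0, $width);

    return array_combine($headers, $cells);
}

// Sample data: the row is missing its last cell.
print_r(combineRow(['Name', 'Price', 'Stock'], ['Widget', '9.99']));
```

Padding keeps every row in the result, at the cost of hiding the fact that a cell was missing. Duplicate header names would still collapse into a single key, so check the headers first if the table might repeat them.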

Extract Additional Attributes

$tableData = [];

foreach ($table->find('tr') as $row) {
    $rowData = [];

    foreach ($row->find('td') as $cell) {
        $cellData = [
            'text' => trim($cell->plaintext),
            'html' => $cell->innertext,
            'class' => $cell->class ?? '',
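            // A missing attribute is not returned as null, so test for it explicitly
            'data-value' => $cell->hasAttribute('data-value') ? $cell->getAttribute('data-value') : ''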
        ];

        // Extract links within cells
        $link = $cell->find('a', 0);
        if ($link) {
            $cellData['link'] = $link->href;
        }

        $rowData[] = $cellData;
    }

    $tableData[] = $rowData;
}

Complete Example with Error Handling

<?php
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

function extractTableData($url, $tableSelector = 'table') {
    $client = new HtmlWeb();

    try {
        // Load the webpage
        $html = $client->load($url);

        if (!$html) {
            throw new Exception("Failed to load webpage: $url");
        }

        // Find the table
        $table = $html->find($tableSelector, 0);

        if (!$table) {
            throw new Exception("No table found with selector: $tableSelector");
        }

        $result = [
            'headers' => [],
            'rows' => []
        ];

        $rows = $table->find('tr');

        if (empty($rows)) {
            throw new Exception("No rows found in table");
        }

        foreach ($rows as $index => $row) {
            $rowData = [];

            // Extract cell data
            $cells = $row->find('td, th');
            foreach ($cells as $cell) {
                $rowData[] = trim($cell->plaintext);
            }

            if (!empty($rowData)) {
                if ($index === 0 && $row->find('th')) {
                    // First row with th elements = headers
                    $result['headers'] = $rowData;
                } else {
                    $result['rows'][] = $rowData;
                }
            }
        }

        // Clean up memory
        $html->clear();
        unset($html);

        return $result;

    } catch (Exception $e) {
        error_log("Table extraction error: " . $e->getMessage());
        return false;
    }
}

// Usage example
$url = 'https://example.com/data-table.html';
$data = extractTableData($url, 'table.data-grid');

if ($data) {
    echo "Headers: " . implode(', ', $data['headers']) . "\n";
    echo "Found " . count($data['rows']) . " data rows\n";

    // Display first few rows
    foreach (array_slice($data['rows'], 0, 3) as $row) {
        echo implode(' | ', $row) . "\n";
    }
} else {
    echo "Failed to extract table data\n";
}
?>

Data Export Options

Export to CSV

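function exportToCSV($tableData, $filename) {
    $file = fopen($filename, 'w');

    // fopen() returns false if the file cannot be opened
    if ($file === false) {
        return false;
    }

    foreach ($tableData as $row) {
        fputcsv($file, $row);
    }

    return fclose($file);
}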

// Usage
exportToCSV($tableData, 'extracted_data.csv');
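
When headers and rows are kept separately, as in the complete example above, the header row can be written as the first CSV line. The following is a minimal sketch under that assumption; the helper name exportWithHeaders(), the sample data, and the temporary file path are illustrative only.

```php
<?php
// Hypothetical helper: write a header row followed by the data rows to a CSV file.
function exportWithHeaders(array $headers, array $rows, string $filename): bool
{
    $file = fopen($filename, 'w');
    if ($file === false) {
        return false; // The file could not be opened for writing
    }

    // Delimiter, enclosure and escape character are passed explicitly.
    if (!empty($headers)) {
        fputcsv($file, $headers, ',', '"', '\\');
    }
    foreach ($rows as $row) {
        fputcsv($file, $row, ',', '"', '\\');
    }

    return fclose($file);
}

// Sample data written to the system temp directory.
$target = sys_get_temp_dir() . '/extracted_data.csv';
exportWithHeaders(['Name', 'Price'], [['Widget', '9.99']], $target);
```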

Export to JSON

function exportToJSON($headers, $rows, $filename) {
    $jsonData = [];

    foreach ($rows as $row) {
        if (count($headers) === count($row)) {
            $jsonData[] = array_combine($headers, $row);
        }
    }

    file_put_contents($filename, json_encode($jsonData, JSON_PRETTY_PRINT));
}

Best Practices

Error Handling

// Always check if elements exist
if ($html && $table = $html->find('table', 0)) {
    // Process table
} else {
    // Handle error
}

// Validate data before processing
if (!empty($tableData) && is_array($tableData)) {
    // Process data
}

Memory Management

// For large datasets, process one table at a time
$tables = $html->find('table');
foreach ($tables as $table) {
    processTable($table); // processTable() stands for your own function
}
unset($tables);

// Clean up only after all processing is done
$html->clear();
unset($html);

Rate Limiting

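// Add a fixed delay between requests
sleep(1); // Wait 1 second between requests

// Or enforce a minimum interval between consecutive requests
$minInterval = 2;  // seconds
$lastRequest = 0;  // timestamp of the previous request

foreach ($urls as $url) { // $urls: your list of pages to fetch
    $elapsed = time() - $lastRequest;
    if ($elapsed < $minInterval) {
        sleep($minInterval - $elapsed);
    }
    $lastRequest = time();
    $html = file_get_html($url);
    // ... process $html ...
}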

Common Issues and Solutions

Empty results: Simple HTML DOM does not execute JavaScript, so check whether the table is rendered client-side
Missing data: Verify that the selector matches the intended table
Memory issues: Process large tables in smaller chunks
Special characters: Use html_entity_decode() for proper text extraction
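
For the special-characters case, a small cleanup step can be applied to each cell's text if it still contains HTML entities after extraction. The sketch below uses only PHP built-ins; the helper name cleanCellText() is an assumption for illustration, and whether you need it depends on how your library version returns plaintext.

```php
<?php
// Hypothetical helper: normalize text taken from a table cell.
function cleanCellText(string $text): string
{
    // Decode named and numeric entities (&amp;, &eacute;, &#8364;) as UTF-8.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0, which trim() does not strip; turn it into a plain space.
    $decoded = str_replace("\u{00A0}", ' ', $decoded);

    // Collapse whitespace runs left over from HTML indentation.
    $collapsed = preg_replace('/\s+/u', ' ', $decoded) ?? $decoded;

    return trim($collapsed);
}

echo cleanCellText('Caf&eacute; &amp; Bar&nbsp;') . "\n";
```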

Remember to always respect robots.txt and terms of service when scraping websites. Consider using proper user agents and implement reasonable delays between requests.
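
One way to send an identifying user agent is through a PHP stream context, sketched below. The bot name, contact URL, and helper name buildScraperContext() are placeholders, not values from this guide; the commented lines show how the context might be combined with str_get_html() from the manual installation.

```php
<?php
// Hypothetical helper: build an HTTP stream context with a user agent and a timeout.
function buildScraperContext(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds, // seconds
        ],
    ]);
}

// Placeholder identity: replace with your own bot name and contact page.
$context = buildScraperContext('ExampleTableBot/1.0 (+https://example.com/bot-info)');

// The context can then be passed to PHP's HTTP stream wrapper, for example:
// $body = file_get_contents('https://example.com/data-table.html', false, $context);
// $html = $body !== false ? str_get_html($body) : false;
```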
