How do I use Simple HTML DOM to extract table data from a webpage?

Simple HTML DOM is a PHP library that parses HTML documents and lets you query them with CSS-style selectors. This guide shows how to use it to extract table data from a webpage.

Installation

Using Composer (Recommended)

Install the package from the command line:

composer require simplehtmldom/simplehtmldom

Then load Composer's autoloader in your PHP script:

require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

Manual Installation

Download simple_html_dom.php and include it:

include_once('simple_html_dom.php');

Basic Table Extraction

1. Load the Webpage

// Option A: direct URL loading (manual installation)
$html = file_get_html('https://example.com/data-table.html');

// Option B: HtmlWeb client (Composer version); use one option, not both
$client = new HtmlWeb();
$html = $client->load('https://example.com/data-table.html');

// Check if page loaded successfully
if (!$html) {
    die('Error loading page');
}

2. Find and Extract Table Data

// Find the first table
$table = $html->find('table', 0);

if (!$table) {
    die('No table found');
}

$tableData = [];

// Extract all rows
foreach ($table->find('tr') as $row) {
    $rowData = [];

    // Extract cells (both td and th)
    foreach ($row->find('td, th') as $cell) {
        $rowData[] = trim($cell->plaintext);
    }

    // Only add rows with data
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}

Advanced Table Selection

Target Specific Tables

// By ID
$table = $html->find('table#data-table', 0);

// By class
$table = $html->find('table.product-list', 0);

// By attribute
$table = $html->find('table[data-type=pricing]', 0);

// Multiple criteria
$table = $html->find('div.container table.data-grid', 0);

Handle Multiple Tables

$allTables = $html->find('table');

foreach ($allTables as $index => $table) {
    echo "Processing table " . ($index + 1) . "\n";

    $tableData = [];
    foreach ($table->find('tr') as $row) {
        $rowData = [];
        foreach ($row->find('td, th') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }
        if (!empty($rowData)) {
            $tableData[] = $rowData;
        }
    }

    // Process each table's data (processTableData() is a placeholder for your own function)
    processTableData($tableData, $index);
}

Advanced Data Processing

Separate Headers from Data

$table = $html->find('table', 0);
$headers = [];
$rows = [];

foreach ($table->find('tr') as $index => $row) {
    $rowData = [];

    if ($index === 0) {
        // First row as headers
        foreach ($row->find('th, td') as $cell) {
            $headers[] = trim($cell->plaintext);
        }
    } else {
        // Data rows
        foreach ($row->find('td') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }

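        // array_combine() needs equal lengths; skip rows that do not match the headers
        if (!empty($rowData) && count($rowData) === count($headers)) {
            $rows[] = array_combine($headers, $rowData);
        }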
    }
}
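
Rows whose cell count differs from the header count (for example because of a colspan) cannot be passed to array_combine() as-is, since it requires two arrays of equal length. One option is to pad or truncate each row to the header length instead of discarding it. The following is a minimal sketch of that idea; the helper name combineRow() is hypothetical and not part of Simple HTML DOM, and padding with empty strings is an assumption about what suits your data:

```php
<?php
// Hypothetical helper: map one row of cell values onto header names.
// Short rows are padded with '' and surplus cells are dropped, so
// array_combine() always receives arrays of equal length.
function combineRow(array $headers, array $cells): array
{
    $count = count($headers);
    if ($count === 0) {
        return [];
    }
    $cells = array_slice(array_pad($cells, $count, ''), 0, $count);
    return array_combine($headers, $cells);
}

$headers = ['Name', 'Price', 'Stock'];
print_r(combineRow($headers, ['Widget', '9.99']));               // Stock becomes ''
print_r(combineRow($headers, ['Gadget', '4.50', '7', 'extra'])); // 'extra' is dropped
```

Padding keeps every row in the result, at the cost of hiding the fact that the source row was irregular; if that matters, log such rows before normalizing them.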

Extract Additional Attributes

foreach ($table->find('tr') as $row) {
    $rowData = [];

    foreach ($row->find('td') as $cell) {
        $cellData = [
            'text' => trim($cell->plaintext),
            'html' => $cell->innertext,
            'class' => $cell->class ?? '',
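            // getAttribute() may return false rather than null for a missing attribute, so ?? is not enough
            'data-value' => $cell->hasAttribute('data-value') ? $cell->getAttribute('data-value') : ''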
        ];

        // Extract links within cells
        $link = $cell->find('a', 0);
        if ($link) {
            $cellData['link'] = $link->href;
        }

        $rowData[] = $cellData;
    }

    // Skip header rows, which contain no td cells
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}

Complete Example with Error Handling

<?php
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

function extractTableData($url, $tableSelector = 'table') {
    $client = new HtmlWeb();

    try {
        // Load the webpage
        $html = $client->load($url);

        if (!$html) {
            throw new Exception("Failed to load webpage: $url");
        }

        // Find the table
        $table = $html->find($tableSelector, 0);

        if (!$table) {
            throw new Exception("No table found with selector: $tableSelector");
        }

        $result = [
            'headers' => [],
            'rows' => []
        ];

        $rows = $table->find('tr');

        if (empty($rows)) {
            throw new Exception("No rows found in table");
        }

        foreach ($rows as $index => $row) {
            $rowData = [];

            // Extract cell data
            $cells = $row->find('td, th');
            foreach ($cells as $cell) {
                $rowData[] = trim($cell->plaintext);
            }

            if (!empty($rowData)) {
                if ($index === 0 && $row->find('th')) {
                    // First row with th elements = headers
                    $result['headers'] = $rowData;
                } else {
                    $result['rows'][] = $rowData;
                }
            }
        }

        // Clean up memory
        $html->clear();
        unset($html);

        return $result;

    } catch (Exception $e) {
        error_log("Table extraction error: " . $e->getMessage());
        return false;
    }
}

// Usage example
$url = 'https://example.com/data-table.html';
$data = extractTableData($url, 'table.data-grid');

if ($data) {
    echo "Headers: " . implode(', ', $data['headers']) . "\n";
    echo "Found " . count($data['rows']) . " data rows\n";

    // Display first few rows
    foreach (array_slice($data['rows'], 0, 3) as $row) {
        echo implode(' | ', $row) . "\n";
    }
} else {
    echo "Failed to extract table data\n";
}
?>

Data Export Options

Export to CSV

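function exportToCSV($tableData, $filename) {
    $file = fopen($filename, 'w');

    // fopen() returns false if the file cannot be opened
    if ($file === false) {
        return false;
    }

    foreach ($tableData as $row) {
        fputcsv($file, $row);
    }

    fclose($file);
    return true;
}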

// Usage
exportToCSV($tableData, 'extracted_data.csv');
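
When headers and data rows are kept in separate arrays, the header line has to be written before the rows. The sketch below shows one way to do that; it builds the CSV in memory so the result can be inspected before it is saved. The function name rowsToCsv() is hypothetical, and the sample values are made up for illustration:

```php
<?php
// Hypothetical helper: build CSV text with a header line followed by data rows.
function rowsToCsv(array $headers, array $rows): string
{
    $stream = fopen('php://temp', 'r+');
    // Separator, enclosure, and escape are passed explicitly (PHP's long-standing defaults).
    fputcsv($stream, $headers, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($stream, $row, ',', '"', '\\');
    }
    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);
    return $csv;
}

// Fields containing the separator are quoted automatically by fputcsv().
echo rowsToCsv(['Name', 'Price'], [['Widget', '9.99'], ['Gadget, large', '4.50']]);
```

The returned string can be written to disk with file_put_contents() once it looks right.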

Export to JSON

function exportToJSON($headers, $rows, $filename) {
    $jsonData = [];

    foreach ($rows as $row) {
        if (count($headers) === count($row)) {
            $jsonData[] = array_combine($headers, $row);
        }
    }

    file_put_contents($filename, json_encode($jsonData, JSON_PRETTY_PRINT));
}
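
Note that json_encode() returns false when it cannot encode its input, for example when scraped text is not valid UTF-8, and file_put_contents() would then write an empty file. A minimal sketch of detecting that case, assuming PHP 7.3 or later for JSON_THROW_ON_ERROR; the function name encodeRowsAsJson() is hypothetical:

```php
<?php
// Hypothetical helper: encode rows as JSON and fail loudly on bad input.
// JSON_THROW_ON_ERROR requires PHP 7.3 or later.
function encodeRowsAsJson(array $rows): string
{
    return json_encode(
        $rows,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

try {
    echo encodeRowsAsJson([['Name' => 'Café', 'Price' => '9.99']]), "\n";
    // "\xB1" is not valid UTF-8, so this call throws a JsonException.
    encodeRowsAsJson([['Name' => "\xB1"]]);
} catch (JsonException $e) {
    echo 'Encoding failed: ', $e->getMessage(), "\n";
}
```

Throwing keeps a failed export from silently overwriting a good file with an empty one.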

Best Practices

Error Handling

// Always check if elements exist
if ($html && $table = $html->find('table', 0)) {
    // Process table
} else {
    // Handle error
}

// Validate data before processing
if (!empty($tableData) && is_array($tableData)) {
    // Process data
}

Memory Management

// For large pages, process one table at a time
$tables = $html->find('table');
foreach ($tables as $table) {
    processTable($table); // placeholder for your own function
    // Drop the reference to the processed table
    unset($table);
}

// Clean up once all processing is done; $html cannot be used after this
$html->clear();
unset($html);

Rate Limiting

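// Simple approach: wait 1 second between requests
sleep(1);

// Or enforce a minimum interval of 2 seconds between requests
// ($urls is your list of pages; $client is an HtmlWeb instance)
$lastRequest = 0;
foreach ($urls as $url) {
    $elapsed = time() - $lastRequest;
    if ($elapsed < 2) {
        sleep(2 - $elapsed);
    }
    $lastRequest = time();
    $html = $client->load($url);
}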

Common Issues and Solutions

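  • Empty results: Simple HTML DOM does not execute JavaScript, so a table rendered client-side will not be in the fetched HTML
  • Missing data: Verify that the selector matches the intended table
  • Memory issues: Process large tables in smaller chunks and call clear() when done
  • Special characters: Use html_entity_decode() to convert entities such as &amp; into characters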
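
For the special-characters case, cell text can be normalized in one place before it is stored. The following is a sketch only: the helper name cleanCellText() is hypothetical, it assumes the input is UTF-8, and whether entities remain in plaintext at all may depend on the library version you use:

```php
<?php
// Hypothetical helper: normalize text taken from $cell->plaintext.
function cleanCellText(string $text): string
{
    // Turn entities such as &amp; or &nbsp; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace non-breaking spaces (U+00A0) with ordinary spaces.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo cleanCellText("Fish &amp; Chips&nbsp;&nbsp; &pound;5"), "\n"; // Fish & Chips £5
```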

Always respect robots.txt and the site's terms of service when scraping. Send a descriptive user agent and add reasonable delays between requests.
