How do I use Simple HTML DOM to extract table data from a webpage?

Simple HTML DOM is a PHP library for parsing HTML documents and querying them with CSS-style selectors. This guide shows how to use it to extract table data from a webpage.

Installation

Using Composer (Recommended)

# Run in your project directory
composer require simplehtmldom/simplehtmldom

// Then, in your PHP script
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

Manual Installation

Download simple_html_dom.php and include it:

include_once('simple_html_dom.php');

Basic Table Extraction

1. Load the Webpage

// Option 1: direct URL loading (manual installation)
$html = file_get_html('https://example.com/data-table.html');

// Option 2: HtmlWeb client (Composer version); use this instead of Option 1
$client = new HtmlWeb();
$html = $client->load('https://example.com/data-table.html');

// Check that the page loaded successfully before using $html
if (!$html) {
    die('Error loading page');
}

2. Find and Extract Table Data

// Find the first table
$table = $html->find('table', 0);

if (!$table) {
    die('No table found');
}

$tableData = [];

// Extract all rows
foreach ($table->find('tr') as $row) {
    $rowData = [];

    // Extract cells (both td and th)
    foreach ($row->find('td, th') as $cell) {
        $rowData[] = trim($cell->plaintext);
    }

    // Only add rows with data
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}

Advanced Table Selection

Target Specific Tables

// By ID
$table = $html->find('table#data-table', 0);

// By class
$table = $html->find('table.product-list', 0);

// By attribute
$table = $html->find('table[data-type=pricing]', 0);

// Multiple criteria
$table = $html->find('div.container table.data-grid', 0);

Handle Multiple Tables

$allTables = $html->find('table');

foreach ($allTables as $index => $table) {
    echo "Processing table " . ($index + 1) . "\n";

    $tableData = [];
    foreach ($table->find('tr') as $row) {
        $rowData = [];
        foreach ($row->find('td, th') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }
        if (!empty($rowData)) {
            $tableData[] = $rowData;
        }
    }

    // Process each table's data
    processTableData($tableData, $index);
}
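
The loop above calls processTableData(), which this guide does not define. A minimal sketch follows, assuming the function only needs to summarize each table; the body and the summary keys are illustrative and not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: the guide calls processTableData() but never defines it.
// This version returns a small summary so each table can be inspected first.
function processTableData(array $tableData, int $index): array
{
    // Number of cells in each row
    $columnCounts = array_map('count', $tableData);

    return [
        'table'       => $index + 1, // 1-based table number
        'rows'        => count($tableData),
        'max_columns' => $columnCounts ? max($columnCounts) : 0,
        // Rows with differing cell counts often indicate colspan/rowspan usage
        'ragged'      => count(array_unique($columnCounts)) > 1,
    ];
}

// Usage with literal data standing in for scraped rows
$summary = processTableData([['Name', 'Price'], ['Widget', '9.99'], ['Gadget']], 0);
print_r($summary);
```

A "ragged" result is a hint that the rows may not line up with the header row, which matters before pairing headers with values.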

Advanced Data Processing

Separate Headers from Data

$table = $html->find('table', 0);
$headers = [];
$rows = [];

foreach ($table->find('tr') as $index => $row) {
    $rowData = [];

    if ($index === 0) {
        // First row as headers
        foreach ($row->find('th, td') as $cell) {
            $headers[] = trim($cell->plaintext);
        }
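    } else {
        // Data rows (include th so row-header cells stay aligned with the headers)
        foreach ($row->find('td, th') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }

        // array_combine() fails when the counts differ (e.g. rows using colspan)
        if (!empty($rowData) && count($rowData) === count($headers)) {
            $rows[] = array_combine($headers, $rowData);
        }
    }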
}
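
Rows whose cell count differs from the header count cannot be passed to array_combine() directly. One possible way to keep such rows is to pad or truncate them first. The sketch below assumes that behavior is acceptable for your data; combineWithHeaders() is a hypothetical name, not a library function.

```php
<?php
// Hypothetical helper: pair header names with cell values even when counts differ.
function combineWithHeaders(array $headers, array $cells): array
{
    $count = count($headers);
    // Pad short rows with empty strings, drop extra cells from long rows
    $cells = array_slice(array_pad($cells, $count, ''), 0, $count);

    return $count > 0 ? array_combine($headers, $cells) : [];
}

$headers = ['Name', 'Price', 'Stock'];
$row = combineWithHeaders($headers, ['Widget', '9.99']);
print_r($row); // 'Stock' is filled with an empty string
```

Note that duplicate header names collapse into a single key, so this approach suits tables with unique column titles.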

Extract Additional Attributes

foreach ($table->find('tr') as $row) {
    $rowData = [];

    foreach ($row->find('td') as $cell) {
        $cellData = [
            'text' => trim($cell->plaintext),
            'html' => $cell->innertext,
            'class' => $cell->class ?? '',
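            // getAttribute() does not return null for a missing attribute, so check first
            'data-value' => $cell->hasAttribute('data-value') ? $cell->getAttribute('data-value') : ''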
        ];

        // Extract links within cells
        $link = $cell->find('a', 0);
        if ($link) {
            $cellData['link'] = $link->href;
        }

        $rowData[] = $cellData;
    }

    $tableData[] = $rowData;
}

Complete Example with Error Handling

<?php
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

function extractTableData($url, $tableSelector = 'table') {
    $client = new HtmlWeb();

    try {
        // Load the webpage
        $html = $client->load($url);

        if (!$html) {
            throw new Exception("Failed to load webpage: $url");
        }

        // Find the table
        $table = $html->find($tableSelector, 0);

        if (!$table) {
            throw new Exception("No table found with selector: $tableSelector");
        }

        $result = [
            'headers' => [],
            'rows' => []
        ];

        $rows = $table->find('tr');

        if (empty($rows)) {
            throw new Exception("No rows found in table");
        }

        foreach ($rows as $index => $row) {
            $rowData = [];

            // Extract cell data
            $cells = $row->find('td, th');
            foreach ($cells as $cell) {
                $rowData[] = trim($cell->plaintext);
            }

            if (!empty($rowData)) {
                if ($index === 0 && $row->find('th')) {
                    // First row with th elements = headers
                    $result['headers'] = $rowData;
                } else {
                    $result['rows'][] = $rowData;
                }
            }
        }

        // Clean up memory
        $html->clear();
        unset($html);

        return $result;

    } catch (Exception $e) {
        error_log("Table extraction error: " . $e->getMessage());
        return false;
    }
}

// Usage example
$url = 'https://example.com/data-table.html';
$data = extractTableData($url, 'table.data-grid');

if ($data) {
    echo "Headers: " . implode(', ', $data['headers']) . "\n";
    echo "Found " . count($data['rows']) . " data rows\n";

    // Display first few rows
    foreach (array_slice($data['rows'], 0, 3) as $row) {
        echo implode(' | ', $row) . "\n";
    }
} else {
    echo "Failed to extract table data\n";
}
?>

Data Export Options

Export to CSV

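// Expects $tableData as an array of flat rows (arrays of strings)
function exportToCSV($tableData, $filename) {
    $file = fopen($filename, 'w');
    if ($file === false) {
        return false; // Could not open the file for writing
    }

    foreach ($tableData as $row) {
        fputcsv($file, $row);
    }

    return fclose($file);
}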

// Usage
exportToCSV($tableData, 'extracted_data.csv');
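
When headers are stored separately from the rows, the CSV usually needs a header line first. The following is a sketch of one way to do that; exportToCSVWithHeaders() is a hypothetical variant, and the temporary file is used only for the demonstration.

```php
<?php
// Hypothetical variant of exportToCSV(): writes a header row first and reports failure.
function exportToCSVWithHeaders(array $headers, array $rows, string $filename): bool
{
    $file = @fopen($filename, 'w');
    if ($file === false) {
        return false; // e.g. directory missing or not writable
    }

    // Separator, enclosure and escape are passed explicitly for predictable output
    fputcsv($file, $headers, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($file, $row, ',', '"', '\\');
    }

    return fclose($file);
}

// Demo: write to a temporary file and print the result
$path = tempnam(sys_get_temp_dir(), 'csv');
$ok = exportToCSVWithHeaders(
    ['Name', 'Price'],
    [['Widget', '9.99'], ['Gadget, large', '19.50']],
    $path
);
echo file_get_contents($path);
```

fputcsv() quotes fields that contain the separator, so a cell such as "Gadget, large" stays in one column.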

Export to JSON

function exportToJSON($headers, $rows, $filename) {
    $jsonData = [];

    foreach ($rows as $row) {
        if (count($headers) === count($row)) {
            $jsonData[] = array_combine($headers, $row);
        }
    }

    file_put_contents($filename, json_encode($jsonData, JSON_PRETTY_PRINT));
}
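
exportToJSON() silently drops rows whose cell count differs from the header count. If you want to know how many rows were dropped, a variant along these lines may help. rowsToJson() and its return shape are assumptions made for illustration.

```php
<?php
// Hypothetical variant of exportToJSON(): returns the JSON string and a skipped-row count.
function rowsToJson(array $headers, array $rows): array
{
    $records = [];
    $skipped = 0;

    foreach ($rows as $row) {
        if (count($headers) === count($row)) {
            $records[] = array_combine($headers, $row);
        } else {
            $skipped++; // Row does not match the header count
        }
    }

    // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException instead of returning false;
    // JSON_UNESCAPED_UNICODE keeps characters such as "é" readable in the output
    $json = json_encode($records, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);

    return ['json' => $json, 'skipped' => $skipped];
}

$result = rowsToJson(['Name', 'Price'], [['Café latte', '3.50'], ['Incomplete']]);
echo $result['json'], "\nSkipped rows: ", $result['skipped'], "\n";
```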

Best Practices

Error Handling

// Always check if elements exist
if ($html && $table = $html->find('table', 0)) {
    // Process table
} else {
    // Handle error
}

// Validate data before processing
if (!empty($tableData) && is_array($tableData)) {
    // Process data
}

Memory Management

// For large pages, process one table at a time
$tables = $html->find('table');
foreach ($tables as $table) {
    processTable($table); // Your own processing function
}
unset($tables);

// Clean up only after all processing is done;
// $html cannot be used after clear() and unset()
$html->clear();
unset($html);

Rate Limiting

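// Add a fixed delay between requests
sleep(1); // Wait 1 second

// Or enforce a minimum interval of 2 seconds between requests
$lastRequest = 0; // Timestamp of the previous request
foreach ($urls as $url) { // $urls: your list of pages to fetch
    $elapsed = time() - $lastRequest;
    if ($elapsed < 2) {
        sleep(2 - $elapsed);
    }
    $lastRequest = time();
    // ... load and process $url ...
}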
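
The same idea can be wrapped in a reusable function with sub-second precision. This is a sketch under the assumption that all requests run in a single PHP process; throttle() is a hypothetical helper and the URLs are placeholders.

```php
<?php
// Hypothetical helper: block until at least $minInterval seconds
// have passed since the previous call.
function throttle(float $minInterval): void
{
    static $lastCall = null; // Timestamp of the previous call, kept between calls

    if ($lastCall !== null) {
        $elapsed = microtime(true) - $lastCall;
        if ($elapsed < $minInterval) {
            // usleep() takes microseconds
            usleep((int) round(($minInterval - $elapsed) * 1000000));
        }
    }
    $lastCall = microtime(true);
}

// Usage: the first call returns immediately, later calls wait as needed
$start = microtime(true);
$urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3'];
foreach ($urls as $url) {
    throttle(0.2);
    // $html = $client->load($url); // fetch and process the page here
}
$totalWait = microtime(true) - $start;
```

Because the timestamp is taken after any wait, time spent processing a page counts toward the interval, so slow pages do not add extra delay.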

Common Issues and Solutions

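  • Empty results: The table may be rendered by JavaScript, which Simple HTML DOM does not execute; check the raw HTML source for the table markup
  • Missing data: Verify that the selector matches the intended table
  • Memory issues: Process large tables in smaller chunks and call clear() when done
  • Special characters: Use html_entity_decode() on the extracted text to convert HTML entities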
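
For the special-characters case, a small cleanup function can be applied to each cell's text. The sketch below assumes UTF-8 pages; cleanCellText() is a hypothetical helper built on PHP's html_entity_decode().

```php
<?php
// Hypothetical helper: decode HTML entities and normalize whitespace in cell text.
function cleanCellText(string $text): string
{
    // Decode named and numeric entities (e.g. &amp;, &euro;, &#39;) as UTF-8
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace non-breaking spaces (U+00A0) with regular spaces
    $decoded = str_replace("\u{00A0}", ' ', $decoded);
    // Collapse runs of whitespace; fall back to the decoded text if the regex fails
    $collapsed = preg_replace('/\s+/u', ' ', $decoded) ?? $decoded;

    return trim($collapsed);
}

echo cleanCellText("  Caf&eacute;&nbsp;&amp;  Bar \n &euro;5 "), "\n";
```

In the extraction loops, this would replace the bare trim() call, for example cleanCellText($cell->plaintext).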

Always respect robots.txt and the site's terms of service when scraping. Set a descriptive user agent and add reasonable delays between requests.
