Extracting data from complex HTML tables in PHP requires careful handling of table structures, including rowspans, colspans, nested tables, and irregular patterns. This guide provides comprehensive techniques using PHP's built-in DOM manipulation classes.
Key Challenges in Complex Table Extraction
Complex HTML tables often include:
- Rowspan and colspan attributes that merge cells
- Multi-level headers with nested structure
- Irregular cell patterns and missing cells
- Nested tables within cells
- Mixed content types (text, links, images)
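For illustration, here is a small made-up fragment that combines several of these patterns: a rowspan, a colspan, and a table nested inside a cell.

<table>
    <tr><th rowspan="2">Region</th><th colspan="2">Revenue</th></tr>
    <tr><th>2023</th><th>2024</th></tr>
    <tr>
        <td>EMEA</td>
        <td>1.2M</td>
        <td><table><tr><td>Q1: 0.4M</td></tr><tr><td>Q2: 0.8M</td></tr></table></td>
    </tr>
</table>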
Method 1: Basic DOMDocument Approach
Simple Table Extraction
<?php
function extractSimpleTable($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // Suppress HTML parsing warnings
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query('//table//tr');

    $data = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('.//td | .//th', $row);
        $rowData = [];
        foreach ($cells as $cell) {
            $rowData[] = trim($cell->textContent);
        }
        if (!empty($rowData)) {
            $data[] = $rowData;
        }
    }
    return $data;
}

// Example usage
$html = '<table>
    <tr><th>Name</th><th>Age</th><th>City</th></tr>
    <tr><td>John</td><td>25</td><td>New York</td></tr>
    <tr><td>Jane</td><td>30</td><td>London</td></tr>
</table>';

$result = extractSimpleTable($html);
print_r($result);
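For the sample markup above, print_r($result) produces a nested array of plain strings, one inner array per row (condensed for readability):

// Array
// (
//     [0] => Array ( [0] => Name [1] => Age  [2] => City )
//     [1] => Array ( [0] => John [1] => 25   [2] => New York )
//     [2] => Array ( [0] => Jane [1] => 30   [2] => London )
// )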
Method 2: Advanced Rowspan/Colspan Handling
Comprehensive Table Parser Class
<?php
class ComplexTableParser {
    private $doc;
    private $xpath;
    private $cellMatrix = [];
    private $maxCols = 0;

    public function __construct() {
        $this->doc = new DOMDocument();
        libxml_use_internal_errors(true);
    }

    public function parseTable($html, $tableIndex = 0) {
        $this->doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        $this->xpath = new DOMXPath($this->doc);

        $tables = $this->xpath->query('//table');
        if ($tables->length <= $tableIndex) {
            throw new Exception("Table index {$tableIndex} not found");
        }

        $table = $tables->item($tableIndex);
        return $this->extractTableData($table);
    }

    private function extractTableData($table) {
        $rows = $this->xpath->query('.//tr', $table);
        $this->cellMatrix = [];
        $maxCols = 0;

        foreach ($rows as $rowIndex => $row) {
            $cells = $this->xpath->query('.//td | .//th', $row);
            $colIndex = 0;

            foreach ($cells as $cell) {
                // Skip positions already occupied by a rowspan/colspan from an earlier cell.
                // Spanned positions hold null, so array_key_exists() is required here;
                // isset() would treat them as free and misplace the cell.
                while (isset($this->cellMatrix[$rowIndex])
                    && array_key_exists($colIndex, $this->cellMatrix[$rowIndex])) {
                    $colIndex++;
                }

                $rowspan = (int)$cell->getAttribute('rowspan') ?: 1;
                $colspan = (int)$cell->getAttribute('colspan') ?: 1;
                $content = $this->extractCellContent($cell);

                // Fill the matrix for this cell and every position its spans cover
                for ($r = $rowIndex; $r < $rowIndex + $rowspan; $r++) {
                    for ($c = $colIndex; $c < $colIndex + $colspan; $c++) {
                        $this->cellMatrix[$r][$c] = ($r === $rowIndex && $c === $colIndex)
                            ? $content : null;
                        $maxCols = max($maxCols, $c + 1);
                    }
                }
                $colIndex += $colspan;
            }
        }

        $this->maxCols = $maxCols;
        return $this->normalizeMatrix($maxCols);
    }

    private function extractCellContent($cell) {
        // DOMElement has no innerHTML property, so rebuild the cell's inner markup
        // by serializing its child nodes.
        $innerHtml = '';
        foreach ($cell->childNodes as $child) {
            $innerHtml .= $this->doc->saveHTML($child);
        }

        $content = [
            'text' => trim($cell->textContent),
            'html' => $innerHtml,
            'attributes' => []
        ];

        // Extract common attributes
        foreach (['class', 'id', 'data-value'] as $attr) {
            if ($cell->hasAttribute($attr)) {
                $content['attributes'][$attr] = $cell->getAttribute($attr);
            }
        }

        // Extract links
        $links = $this->xpath->query('.//a', $cell);
        if ($links->length > 0) {
            $content['links'] = [];
            foreach ($links as $link) {
                $content['links'][] = [
                    'text' => trim($link->textContent),
                    'href' => $link->getAttribute('href')
                ];
            }
        }

        return $content;
    }

    private function normalizeMatrix($maxCols) {
        $normalized = [];
        foreach ($this->cellMatrix as $rowIndex => $row) {
            $normalizedRow = [];
            for ($colIndex = 0; $colIndex < $maxCols; $colIndex++) {
                $normalizedRow[] = $row[$colIndex] ?? null;
            }
            $normalized[] = $normalizedRow;
        }
        return $normalized;
    }

    public function getHeaders() {
        if (empty($this->cellMatrix)) {
            return [];
        }
        $headers = [];
        foreach ($this->cellMatrix[0] as $cell) {
            $headers[] = $cell ? $cell['text'] : '';
        }
        return $headers;
    }

    public function toAssociativeArray() {
        $data = $this->normalizeMatrix($this->maxCols);
        $headers = $this->getHeaders();
        $result = [];

        for ($i = 1; $i < count($data); $i++) { // Skip the first header row
            $row = [];
            foreach ($headers as $index => $header) {
                $row[$header] = $data[$i][$index]['text'] ?? '';
            }
            $result[] = $row;
        }
        return $result;
    }
}
// Example usage with complex table
$complexHtml = '
<table>
    <thead>
        <tr>
            <th rowspan="2">Product</th>
            <th colspan="3">Sales Data</th>
            <th rowspan="2">Total</th>
        </tr>
        <tr>
            <th>Q1</th>
            <th>Q2</th>
            <th>Q3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Laptop</td>
            <td>150</td>
            <td>200</td>
            <td>180</td>
            <td>530</td>
        </tr>
        <tr>
            <td>Mouse</td>
            <td>300</td>
            <td>250</td>
            <td>400</td>
            <td>950</td>
        </tr>
    </tbody>
</table>';
$parser = new ComplexTableParser();
$tableData = $parser->parseTable($complexHtml);
$associativeData = $parser->toAssociativeArray();
echo "Raw data:\n";
print_r($tableData);
echo "\nAssociative data:\n";
print_r($associativeData);
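With the spans resolved, the matrix for this table looks roughly like the sketch below (null marks positions covered by a rowspan or colspan; real cells are content arrays, shown here only by their text values). Note that toAssociativeArray() assumes a single header row, so for a two-row header like this one you would typically merge the header rows yourself before mapping data rows onto them.

// Row 0: 'Product'  'Sales Data'  null   null   'Total'
// Row 1: null       'Q1'          'Q2'   'Q3'   null
// Row 2: 'Laptop'   '150'         '200'  '180'  '530'
// Row 3: 'Mouse'    '300'         '250'  '400'  '950'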
Method 3: Using Simple HTML DOM Parser (Alternative)
For more complex scenarios, you might want to use the Simple HTML DOM Parser library:
<?php
// First install: composer require simplehtmldom/simplehtmldom
use simplehtmldom\HtmlDocument;

require __DIR__ . '/vendor/autoload.php';

function extractWithSimpleHtmlDom($html) {
    $htmlDom = new HtmlDocument();
    $htmlDom->load($html);

    $tables = $htmlDom->find('table');
    $result = [];

    foreach ($tables as $tableIndex => $table) {
        $tableData = [];
        $rows = $table->find('tr');

        foreach ($rows as $row) {
            $cells = $row->find('td, th');
            $rowData = [];
            foreach ($cells as $cell) {
                $rowData[] = [
                    'text' => trim($cell->plaintext),
                    'html' => $cell->innertext,
                    'rowspan' => $cell->getAttribute('rowspan') ?: 1,
                    'colspan' => $cell->getAttribute('colspan') ?: 1
                ];
            }
            if (!empty($rowData)) {
                $tableData[] = $rowData;
            }
        }
        $result[] = $tableData;
    }
    return $result;
}
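A minimal usage sketch for the function above, continuing the same script; the sample markup and the index path into the result are purely illustrative assumptions:

$html = '<table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>Alice</td><td>42</td></tr>
</table>';

$tables = extractWithSimpleHtmlDom($html);
echo $tables[0][1][0]['text']; // first table, second row, first cell -> "Alice"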
Real-World Example: E-commerce Product Table
<?php
function scrapeProductTable($url) {
    // Fetch HTML content
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ProductScraper/1.0)');

    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $curlError = curl_error($ch);
    curl_close($ch);

    if ($html === false) {
        throw new Exception("cURL error: {$curlError}");
    }
    if ($httpCode !== 200) {
        throw new Exception("HTTP Error: {$httpCode}");
    }

    $parser = new ComplexTableParser();
    try {
        $parser->parseTable($html);
        return $parser->toAssociativeArray();
    } catch (Exception $e) {
        error_log("Table parsing error: " . $e->getMessage());
        return [];
    }
}

// Usage
try {
    $products = scrapeProductTable('https://example.com/products');
    foreach ($products as $product) {
        echo "Product: {$product['Name']}, Price: {$product['Price']}\n";
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
Error Handling and Best Practices
Robust Error Handling
<?php
function safeTableExtraction($html, $options = []) {
    $defaultOptions = [
        'encoding' => 'UTF-8',
        'suppress_errors' => true,
        'max_rows' => 1000
    ];
    $options = array_merge($defaultOptions, $options);

    try {
        $doc = new DOMDocument();
        if ($options['suppress_errors']) {
            libxml_use_internal_errors(true);
        }

        // Convert encoding if needed
        if ($options['encoding'] !== 'UTF-8') {
            $html = mb_convert_encoding($html, 'UTF-8', $options['encoding']);
        }

        $success = $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        if (!$success) {
            throw new Exception('Failed to parse HTML');
        }

        $xpath = new DOMXPath($doc);
        $tables = $xpath->query('//table');
        if ($tables->length === 0) {
            return ['error' => 'No tables found'];
        }

        $results = [];
        foreach ($tables as $index => $table) {
            $rows = $xpath->query('.//tr', $table);
            if ($rows->length > $options['max_rows']) {
                continue; // Skip extremely large tables
            }

            $tableData = [];
            foreach ($rows as $row) {
                $cells = $xpath->query('.//td | .//th', $row);
                $rowData = [];
                foreach ($cells as $cell) {
                    $rowData[] = [
                        'content' => trim($cell->textContent),
                        'tag' => $cell->nodeName,
                        'attributes' => []
                    ];
                }
                if (!empty($rowData)) {
                    $tableData[] = $rowData;
                }
            }

            $results[] = [
                'index' => $index,
                'rows' => count($tableData),
                'data' => $tableData
            ];
        }
        return $results;
    } catch (Exception $e) {
        return ['error' => $e->getMessage()];
    } finally {
        if ($options['suppress_errors']) {
            libxml_clear_errors();
        }
    }
}
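A quick usage sketch for safeTableExtraction(), using the option keys defined above and assuming $html already holds the fetched markup:

$result = safeTableExtraction($html, ['max_rows' => 500]);

if (isset($result['error'])) {
    echo "Extraction failed: {$result['error']}\n";
} else {
    foreach ($result as $table) {
        echo "Table {$table['index']}: {$table['rows']} row(s) extracted\n";
    }
}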
Performance Optimization Tips
- Use libxml parser flags (e.g. LIBXML_HTML_NOIMPLIED, LIBXML_HTML_NODEFDTD) to reduce parsing overhead
- Limit row processing for very large tables
- Cache parsed results when processing multiple similar tables (a minimal sketch follows this list)
- Prefer targeted XPath queries over walking the entire DOM tree node by node
- Consider streaming parsers for extremely large HTML files
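Here is a minimal in-memory caching sketch for the third point, keying the parsed result on a hash of the raw HTML; the extractTableCached() name and the static array are assumptions for illustration, and a persistent store such as APCu or Redis would replace the array in practice:

function extractTableCached($html) {
    static $cache = [];

    $key = md5($html);
    if (!isset($cache[$key])) {
        // Only parse markup we have not seen before in this process
        $cache[$key] = extractSimpleTable($html); // from Method 1
    }
    return $cache[$key];
}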
Common Pitfalls and Solutions
- Malformed HTML: use libxml_use_internal_errors(true) to handle parser errors gracefully
- Memory limits: process large tables in chunks or use streaming approaches
- Character encoding: convert input to UTF-8 (e.g. with mb_convert_encoding()) before loading it into DOMDocument
- Nested tables: use specific XPath queries to target the correct table level (see the sketch after this list)
- Empty cells: check for null values and handle them appropriately in your data structure
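A short sketch of XPath expressions for the nested-table pitfall, assuming an existing DOMXPath instance named $xpath and a cell node $cell:

// Select only top-level tables, skipping any table nested inside another table
$topLevelTables = $xpath->query('//table[not(ancestor::table)]');

// Within a single cell, select the rows of a table nested inside that cell
$nestedRows = $xpath->query('.//table//tr', $cell);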
This comprehensive approach handles most complex table structures you'll encounter in web scraping scenarios. Always test your extraction logic with real-world examples and implement proper error handling for production use.