How do I extract data from complex HTML tables using PHP?

Extracting data from complex HTML tables in PHP requires careful handling of table structures, including rowspans, colspans, nested tables, and irregular patterns. This guide provides comprehensive techniques using PHP's built-in DOM manipulation classes.

Key Challenges in Complex Table Extraction

Complex HTML tables often include:

  • Rowspan and colspan attributes that merge cells
  • Multi-level headers with nested structure
  • Irregular cell patterns and missing cells
  • Nested tables within cells
  • Mixed content types (text, links, images)

Method 1: Basic DOMDocument Approach

Simple Table Extraction

<?php
function extractSimpleTable($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // Suppress HTML parsing warnings
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query('//table//tr');

    $data = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('.//td | .//th', $row);
        $rowData = [];

        foreach ($cells as $cell) {
            $rowData[] = trim($cell->textContent);
        }

        if (!empty($rowData)) {
            $data[] = $rowData;
        }
    }

    return $data;
}

// Example usage
$html = '<table>
    <tr><th>Name</th><th>Age</th><th>City</th></tr>
    <tr><td>John</td><td>25</td><td>New York</td></tr>
    <tr><td>Jane</td><td>30</td><td>London</td></tr>
</table>';

$result = extractSimpleTable($html);
print_r($result);
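
The descendant queries used above (//table//tr and .//td) also match rows and cells that belong to tables nested inside a cell. The following is a minimal sketch of one way to restrict extraction to the outer table by using the child axis; the function name extractOuterTableRows is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: returns the rows of the first top-level table only,
// without emitting separate rows for tables nested inside its cells.
function extractOuterTableRows(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // A table with no <table> ancestor is a top-level table.
    $table = $xpath->query('//table[not(ancestor::table)]')->item(0);
    if ($table === null) {
        return [];
    }

    // Rows are direct children of the table or of its thead/tbody/tfoot.
    $rows = $xpath->query('./tr | ./thead/tr | ./tbody/tr | ./tfoot/tr', $table);

    $data = [];
    foreach ($rows as $row) {
        $rowData = [];
        // Child axis only: cells of a nested table are not children of this row.
        foreach ($xpath->query('./td | ./th', $row) as $cell) {
            // textContent still includes the text of any nested table in the cell.
            $rowData[] = trim($cell->textContent);
        }
        $data[] = $rowData;
    }

    return $data;
}
```

The nested table's text still appears inside the outer cell's textContent; it just no longer produces extra rows of its own.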

Method 2: Advanced Rowspan/Colspan Handling

Comprehensive Table Parser Class

<?php
class ComplexTableParser {
    private $doc;
    private $xpath;
    private $cellMatrix = [];

    public function __construct() {
        $this->doc = new DOMDocument();
        libxml_use_internal_errors(true);
    }

    public function parseTable($html, $tableIndex = 0) {
        $this->doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        $this->xpath = new DOMXPath($this->doc);

        $tables = $this->xpath->query('//table');
        if ($tables->length <= $tableIndex) {
            throw new Exception("Table index {$tableIndex} not found");
        }

        $table = $tables->item($tableIndex);
        return $this->extractTableData($table);
    }

    private function extractTableData($table) {
        $rows = $this->xpath->query('.//tr', $table);
        $this->cellMatrix = [];
        $maxCols = 0;

        foreach ($rows as $rowIndex => $row) {
            $cells = $this->xpath->query('.//td | .//th', $row);
            $colIndex = 0;

            foreach ($cells as $cell) {
                // Skip cells already occupied by an earlier rowspan/colspan.
                // isset() would miss them because spanned slots are stored as null.
                while (array_key_exists($colIndex, $this->cellMatrix[$rowIndex] ?? [])) {
                    $colIndex++;
                }

                $rowspan = (int)$cell->getAttribute('rowspan') ?: 1;
                $colspan = (int)$cell->getAttribute('colspan') ?: 1;
                $content = $this->extractCellContent($cell);

                // Fill the matrix for this cell and its spans
                for ($r = $rowIndex; $r < $rowIndex + $rowspan; $r++) {
                    for ($c = $colIndex; $c < $colIndex + $colspan; $c++) {
                        $this->cellMatrix[$r][$c] = ($r === $rowIndex && $c === $colIndex) 
                            ? $content : null;
                        $maxCols = max($maxCols, $c + 1);
                    }
                }

                $colIndex += $colspan;
            }
        }

        return $this->normalizeMatrix($maxCols);
    }

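    private function extractCellContent($cell) {
        // DOMElement has no innerHTML property, so serialize the child nodes instead
        $innerHtml = '';
        foreach ($cell->childNodes as $child) {
            $innerHtml .= $this->doc->saveHTML($child);
        }

        $content = [
            'text' => trim($cell->textContent),
            'html' => $innerHtml,
            'attributes' => []
        ];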

        // Extract common attributes
        foreach (['class', 'id', 'data-value'] as $attr) {
            if ($cell->hasAttribute($attr)) {
                $content['attributes'][$attr] = $cell->getAttribute($attr);
            }
        }

        // Extract links
        $links = $this->xpath->query('.//a', $cell);
        if ($links->length > 0) {
            $content['links'] = [];
            foreach ($links as $link) {
                $content['links'][] = [
                    'text' => trim($link->textContent),
                    'href' => $link->getAttribute('href')
                ];
            }
        }

        return $content;
    }

    private function normalizeMatrix($maxCols) {
        $normalized = [];

        foreach ($this->cellMatrix as $rowIndex => $row) {
            $normalizedRow = [];
            for ($colIndex = 0; $colIndex < $maxCols; $colIndex++) {
                $normalizedRow[] = $row[$colIndex] ?? null;
            }
            $normalized[] = $normalizedRow;
        }

        return $normalized;
    }

    public function getHeaders() {
        if (empty($this->cellMatrix)) {
            return [];
        }

        $headers = [];
        foreach ($this->cellMatrix[0] as $cell) {
            $headers[] = $cell ? $cell['text'] : '';
        }

        return $headers;
    }

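    // Note: only the first row is treated as the header, so multi-row headers
    // (like the rowspan/colspan example below) need their labels merged first.
    public function toAssociativeArray() {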
        $data = $this->normalizeMatrix(count($this->cellMatrix[0] ?? []));
        $headers = $this->getHeaders();

        $result = [];
        for ($i = 1; $i < count($data); $i++) { // Skip header row
            $row = [];
            foreach ($headers as $index => $header) {
                $row[$header] = $data[$i][$index]['text'] ?? '';
            }
            $result[] = $row;
        }

        return $result;
    }
}

// Example usage with complex table
$complexHtml = '
<table>
    <thead>
        <tr>
            <th rowspan="2">Product</th>
            <th colspan="3">Sales Data</th>
            <th rowspan="2">Total</th>
        </tr>
        <tr>
            <th>Q1</th>
            <th>Q2</th>
            <th>Q3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Laptop</td>
            <td>150</td>
            <td>200</td>
            <td>180</td>
            <td>530</td>
        </tr>
        <tr>
            <td>Mouse</td>
            <td>300</td>
            <td>250</td>
            <td>400</td>
            <td>950</td>
        </tr>
    </tbody>
</table>';

$parser = new ComplexTableParser();
$tableData = $parser->parseTable($complexHtml);
$associativeData = $parser->toAssociativeArray();

echo "Raw data:\n";
print_r($tableData);

echo "\nAssociative data:\n";
print_r($associativeData);
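
Multi-level headers like the one in the example above need one label per column before rows can be keyed by header. The following is a minimal sketch of one way to do that, assuming the header rows live in a thead element; the function name flattenTableHeaders and the ' / ' separator are illustrative choices, not part of the parser class.

```php
<?php
// Hypothetical helper: builds one label per column from a multi-row <thead>.
function flattenTableHeaders(string $html, string $separator = ' / '): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $grid = [];

    foreach ($xpath->query('(//table)[1]/thead/tr') as $r => $row) {
        $c = 0;
        foreach ($xpath->query('./th | ./td', $row) as $cell) {
            // Skip slots already claimed by a rowspan from an earlier row.
            while (isset($grid[$r][$c])) {
                $c++;
            }
            $rowspan = max(1, (int) $cell->getAttribute('rowspan'));
            $colspan = max(1, (int) $cell->getAttribute('colspan'));
            $text = trim($cell->textContent);

            // Copy the text into every slot the cell covers.
            for ($i = $r; $i < $r + $rowspan; $i++) {
                for ($j = $c; $j < $c + $colspan; $j++) {
                    $grid[$i][$j] = $text;
                }
            }
            $c += $colspan;
        }
    }

    if ($grid === []) {
        return [];
    }

    $width = max(array_map(fn(array $row) => max(array_keys($row)) + 1, $grid));
    $labels = [];
    for ($col = 0; $col < $width; $col++) {
        $parts = [];
        foreach ($grid as $row) {
            $text = $row[$col] ?? '';
            // Drop blanks and the repeats produced by rowspan.
            if ($text !== '' && !in_array($text, $parts, true)) {
                $parts[] = $text;
            }
        }
        $labels[] = implode($separator, $parts);
    }

    return $labels;
}
```

The resulting labels could then serve as array keys for the body rows in place of the first-row headers.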

Method 3: Using Simple HTML DOM Parser (Alternative)

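If you prefer CSS-style selectors to XPath, the Simple HTML DOM Parser library is an alternative. The function below records each cell's rowspan and colspan values but does not expand them into a grid: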

<?php
// First install: composer require simplehtmldom/simplehtmldom

use simplehtmldom\HtmlDocument;

function extractWithSimpleHtmlDom($html) {
    $htmlDom = new HtmlDocument();
    $htmlDom->load($html);

    $tables = $htmlDom->find('table');
    $result = [];

    foreach ($tables as $tableIndex => $table) {
        $tableData = [];
        $rows = $table->find('tr');

        foreach ($rows as $row) {
            $cells = $row->find('td, th');
            $rowData = [];

            foreach ($cells as $cell) {
                $rowData[] = [
                    'text' => trim($cell->plaintext),
                    'html' => $cell->innertext,
                    'rowspan' => $cell->getAttribute('rowspan') ?: 1,
                    'colspan' => $cell->getAttribute('colspan') ?: 1
                ];
            }

            if (!empty($rowData)) {
                $tableData[] = $rowData;
            }
        }

        $result[] = $tableData;
    }

    return $result;
}

Real-World Example: E-commerce Product Table

<?php
function scrapeProductTable($url) {
    // Fetch HTML content
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ProductScraper/1.0)');

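    $html = curl_exec($ch);
    $curlError = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // curl_exec() returns false on transport failures (DNS, timeout, TLS)
    if ($html === false) {
        throw new Exception("cURL error: {$curlError}");
    }

    if ($httpCode !== 200) {
        throw new Exception("HTTP Error: {$httpCode}");
    }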

    $parser = new ComplexTableParser();

    try {
        $parser->parseTable($html); // Fills the parser's internal cell matrix
        return $parser->toAssociativeArray();
    } catch (Exception $e) {
        error_log("Table parsing error: " . $e->getMessage());
        return [];
    }
}

// Usage
try {
    $products = scrapeProductTable('https://example.com/products');

    // Assumes the table's header row contains 'Name' and 'Price' columns
    foreach ($products as $product) {
        echo "Product: {$product['Name']}, Price: {$product['Price']}\n";
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}

Error Handling and Best Practices

Robust Error Handling

<?php
function safeTableExtraction($html, $options = []) {
    $defaultOptions = [
        'encoding' => 'UTF-8',
        'suppress_errors' => true,
        'max_rows' => 1000
    ];

    $options = array_merge($defaultOptions, $options);

    try {
        $doc = new DOMDocument();

        if ($options['suppress_errors']) {
            libxml_use_internal_errors(true);
        }

        // Convert encoding if needed
        if ($options['encoding'] !== 'UTF-8') {
            $html = mb_convert_encoding($html, 'UTF-8', $options['encoding']);
        }

        $success = $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        if (!$success) {
            throw new Exception('Failed to parse HTML');
        }

        $xpath = new DOMXPath($doc);
        $tables = $xpath->query('//table');

        if ($tables->length === 0) {
            return ['error' => 'No tables found'];
        }

        $results = [];
        foreach ($tables as $index => $table) {
            $rows = $xpath->query('.//tr', $table);

            if ($rows->length > $options['max_rows']) {
                continue; // Skip extremely large tables
            }

            $tableData = [];
            foreach ($rows as $row) {
                $cells = $xpath->query('.//td | .//th', $row);
                $rowData = [];

                foreach ($cells as $cell) {
                    $rowData[] = [
                        'content' => trim($cell->textContent),
                        'tag' => $cell->nodeName,
                        'attributes' => []
                    ];
                }

                if (!empty($rowData)) {
                    $tableData[] = $rowData;
                }
            }

            $results[] = [
                'index' => $index,
                'rows' => count($tableData),
                'data' => $tableData
            ];
        }

        return $results;

    } catch (Throwable $e) { // Also catches the ValueError PHP 8 throws for empty input
        return ['error' => $e->getMessage()];
    } finally {
        if ($options['suppress_errors']) {
            libxml_clear_errors();
        }
    }
}
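
Suppressing libxml warnings hides them, but they can still be collected for logging. The following is a minimal sketch using libxml_get_errors(); the function name loadHtmlWithDiagnostics and the shape of the returned array are assumptions made for illustration.

```php
<?php
// Hypothetical helper: parses HTML and returns the parser's warnings
// alongside the document instead of discarding them.
function loadHtmlWithDiagnostics(string $html): array
{
    $doc = new DOMDocument();

    // PHP 8 throws a ValueError for empty input, so handle it up front.
    if ($html === '') {
        return ['loaded' => false, 'document' => $doc, 'messages' => ['empty input']];
    }

    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $loaded = $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError carrying level, code, line and message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Clear the buffer and restore the caller's error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['loaded' => $loaded, 'document' => $doc, 'messages' => $messages];
}
```

The messages can be passed to error_log() when a page parses but yields unexpected table data.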

Performance Optimization Tips

  1. Use libxml flags to improve parsing performance
  2. Limit row processing for very large tables
  3. Cache parsed results when processing multiple similar tables
  4. Use targeted XPath queries (for example, scoped to a single table) rather than scanning every node; XPath still requires the full DOM to be loaded
  5. Consider streaming parsers for extremely large HTML files
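
As an illustration of limiting row processing, the sketch below caps the number of rows returned by pushing the limit into the XPath expression. The function name extractFirstRows is hypothetical, and the document is still parsed in full; only the extraction work is bounded.

```php
<?php
// Hypothetical helper: returns at most $maxRows rows from the first table.
function extractFirstRows(string $html, int $maxRows = 100): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // The outer parentheses make position() count rows across the whole
    // table rather than within each parent element.
    $query = sprintf('((//table)[1]//tr)[position() <= %d]', max(0, $maxRows));

    $data = [];
    foreach ($xpath->query($query) as $row) {
        $rowData = [];
        foreach ($xpath->query('./td | ./th', $row) as $cell) {
            $rowData[] = trim($cell->textContent);
        }
        $data[] = $rowData;
    }

    return $data;
}
```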

Common Pitfalls and Solutions

  • Malformed HTML: Use libxml_use_internal_errors(true) to handle errors gracefully
  • Memory limits: Process large tables in chunks or use streaming approaches
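  • Character encoding: DOMDocument::loadHTML() has no encoding argument and may misread UTF-8 input as ISO-8859-1 when the markup declares no charset, so convert or declare the encoding before loading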
  • Nested tables: Use specific XPath queries to target the correct table level
  • Empty cells: Check for null values and handle them appropriately in your data structure
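
For the character encoding pitfall, one common workaround is to convert every non-ASCII character to a numeric entity before parsing, so the parser's encoding guess no longer matters. This is a minimal sketch that assumes the input is valid UTF-8 and that the mbstring extension is available; the function name loadUtf8Html is hypothetical.

```php
<?php
// Hypothetical helper: loads UTF-8 HTML without relying on a charset declaration.
function loadUtf8Html(string $html): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric entity (&#NNN;).
    // Assumes $html is valid UTF-8.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: text read back from the DOM is UTF-8.
$doc = loadUtf8Html('<table><tr><td>Zürich</td><td>東京</td></tr></table>');
foreach ($doc->getElementsByTagName('td') as $cell) {
    echo $cell->textContent, "\n";
}
```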

This comprehensive approach handles most complex table structures you'll encounter in web scraping scenarios. Always test your extraction logic with real-world examples and implement proper error handling for production use.
