Extracting data from complex HTML tables in PHP requires careful handling of table structures, including rowspans, colspans, nested tables, and irregular patterns. This guide provides comprehensive techniques using PHP's built-in DOM manipulation classes.
Key Challenges in Complex Table Extraction
Complex HTML tables often include:
- Rowspan and colspan attributes that merge cells
- Multi-level headers with nested structure
- Irregular cell patterns and missing cells
- Nested tables within cells
- Mixed content types (text, links, images)
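For illustration, here is a small made-up fragment that combines several of these patterns: a rowspan, a colspan, and a table nested inside a cell.

<table>
    <tr><th rowspan="2">Region</th><th colspan="2">Revenue</th></tr>
    <tr><th>2023</th><th>2024</th></tr>
    <tr>
        <td>EMEA</td>
        <td>1.2M</td>
        <td><table><tr><td>Q1: 0.4M</td></tr><tr><td>Q2: 0.8M</td></tr></table></td>
    </tr>
</table>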
Method 1: Basic DOMDocument Approach
Simple Table Extraction
<?php
function extractSimpleTable($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // Suppress HTML parsing warnings
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query('//table//tr');

    $data = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('.//td | .//th', $row);
        $rowData = [];
        foreach ($cells as $cell) {
            $rowData[] = trim($cell->textContent);
        }
        if (!empty($rowData)) {
            $data[] = $rowData;
        }
    }
    return $data;
}

// Example usage
$html = '<table>
    <tr><th>Name</th><th>Age</th><th>City</th></tr>
    <tr><td>John</td><td>25</td><td>New York</td></tr>
    <tr><td>Jane</td><td>30</td><td>London</td></tr>
</table>';

$result = extractSimpleTable($html);
print_r($result);
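For the sample markup above, print_r($result) produces a nested array of plain strings, one inner array per row (condensed for readability):

// Array
// (
//     [0] => Array ( [0] => Name [1] => Age  [2] => City )
//     [1] => Array ( [0] => John [1] => 25   [2] => New York )
//     [2] => Array ( [0] => Jane [1] => 30   [2] => London )
// )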
Method 2: Advanced Rowspan/Colspan Handling
Comprehensive Table Parser Class
<?php
class ComplexTableParser {
    private $doc;
    private $xpath;
    private $cellMatrix = [];
    private $maxCols = 0;

    public function __construct() {
        $this->doc = new DOMDocument();
        libxml_use_internal_errors(true);
    }

    public function parseTable($html, $tableIndex = 0) {
        $this->doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        $this->xpath = new DOMXPath($this->doc);

        $tables = $this->xpath->query('//table');
        if ($tables->length <= $tableIndex) {
            throw new Exception("Table index {$tableIndex} not found");
        }

        $table = $tables->item($tableIndex);
        return $this->extractTableData($table);
    }

    private function extractTableData($table) {
        $rows = $this->xpath->query('.//tr', $table);
        $this->cellMatrix = [];
        $maxCols = 0;

        foreach ($rows as $rowIndex => $row) {
            $cells = $this->xpath->query('.//td | .//th', $row);
            $colIndex = 0;

            foreach ($cells as $cell) {
                // Skip positions already occupied by a rowspan/colspan from an earlier cell.
                // Spanned positions hold null, so array_key_exists() is required here;
                // isset() would treat them as free and misplace the cell.
                while (isset($this->cellMatrix[$rowIndex])
                    && array_key_exists($colIndex, $this->cellMatrix[$rowIndex])) {
                    $colIndex++;
                }

                $rowspan = (int)$cell->getAttribute('rowspan') ?: 1;
                $colspan = (int)$cell->getAttribute('colspan') ?: 1;
                $content = $this->extractCellContent($cell);

                // Fill the matrix for this cell and every position its spans cover
                for ($r = $rowIndex; $r < $rowIndex + $rowspan; $r++) {
                    for ($c = $colIndex; $c < $colIndex + $colspan; $c++) {
                        $this->cellMatrix[$r][$c] = ($r === $rowIndex && $c === $colIndex)
                            ? $content : null;
                        $maxCols = max($maxCols, $c + 1);
                    }
                }
                $colIndex += $colspan;
            }
        }

        $this->maxCols = $maxCols;
        return $this->normalizeMatrix($maxCols);
    }

    private function extractCellContent($cell) {
        // DOMElement has no innerHTML property, so rebuild the cell's inner markup
        // by serializing its child nodes.
        $innerHtml = '';
        foreach ($cell->childNodes as $child) {
            $innerHtml .= $this->doc->saveHTML($child);
        }

        $content = [
            'text' => trim($cell->textContent),
            'html' => $innerHtml,
            'attributes' => []
        ];

        // Extract common attributes
        foreach (['class', 'id', 'data-value'] as $attr) {
            if ($cell->hasAttribute($attr)) {
                $content['attributes'][$attr] = $cell->getAttribute($attr);
            }
        }

        // Extract links
        $links = $this->xpath->query('.//a', $cell);
        if ($links->length > 0) {
            $content['links'] = [];
            foreach ($links as $link) {
                $content['links'][] = [
                    'text' => trim($link->textContent),
                    'href' => $link->getAttribute('href')
                ];
            }
        }

        return $content;
    }

    private function normalizeMatrix($maxCols) {
        $normalized = [];
        foreach ($this->cellMatrix as $rowIndex => $row) {
            $normalizedRow = [];
            for ($colIndex = 0; $colIndex < $maxCols; $colIndex++) {
                $normalizedRow[] = $row[$colIndex] ?? null;
            }
            $normalized[] = $normalizedRow;
        }
        return $normalized;
    }

    public function getHeaders() {
        if (empty($this->cellMatrix)) {
            return [];
        }
        $headers = [];
        foreach ($this->cellMatrix[0] as $cell) {
            $headers[] = $cell ? $cell['text'] : '';
        }
        return $headers;
    }

    public function toAssociativeArray() {
        $data = $this->normalizeMatrix($this->maxCols);
        $headers = $this->getHeaders();
        $result = [];

        for ($i = 1; $i < count($data); $i++) { // Skip the first header row
            $row = [];
            foreach ($headers as $index => $header) {
                $row[$header] = $data[$i][$index]['text'] ?? '';
            }
            $result[] = $row;
        }
        return $result;
    }
}
// Example usage with complex table
$complexHtml = '
<table>
    <thead>
        <tr>
            <th rowspan="2">Product</th>
            <th colspan="3">Sales Data</th>
            <th rowspan="2">Total</th>
        </tr>
        <tr>
            <th>Q1</th>
            <th>Q2</th>
            <th>Q3</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Laptop</td>
            <td>150</td>
            <td>200</td>
            <td>180</td>
            <td>530</td>
        </tr>
        <tr>
            <td>Mouse</td>
            <td>300</td>
            <td>250</td>
            <td>400</td>
            <td>950</td>
        </tr>
    </tbody>
</table>';
$parser = new ComplexTableParser();
$tableData = $parser->parseTable($complexHtml);
$associativeData = $parser->toAssociativeArray();
echo "Raw data:\n";
print_r($tableData);
echo "\nAssociative data:\n";
print_r($associativeData);
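With the spans resolved, the matrix for this table looks roughly like the sketch below (null marks positions covered by a rowspan or colspan; real cells are content arrays, shown here only by their text values). Note that toAssociativeArray() assumes a single header row, so for a two-row header like this one you would typically merge the header rows yourself before mapping data rows onto them.

// Row 0: 'Product'  'Sales Data'  null   null   'Total'
// Row 1: null       'Q1'          'Q2'   'Q3'   null
// Row 2: 'Laptop'   '150'         '200'  '180'  '530'
// Row 3: 'Mouse'    '300'         '250'  '400'  '950'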
Method 3: Using Simple HTML DOM Parser (Alternative)
For more complex scenarios, you might want to use the Simple HTML DOM Parser library:
<?php
// First install: composer require simplehtmldom/simplehtmldom
use simplehtmldom\HtmlDocument;

require __DIR__ . '/vendor/autoload.php';

function extractWithSimpleHtmlDom($html) {
    $htmlDom = new HtmlDocument();
    $htmlDom->load($html);

    $tables = $htmlDom->find('table');
    $result = [];

    foreach ($tables as $tableIndex => $table) {
        $tableData = [];
        $rows = $table->find('tr');

        foreach ($rows as $row) {
            $cells = $row->find('td, th');
            $rowData = [];
            foreach ($cells as $cell) {
                $rowData[] = [
                    'text' => trim($cell->plaintext),
                    'html' => $cell->innertext,
                    'rowspan' => $cell->getAttribute('rowspan') ?: 1,
                    'colspan' => $cell->getAttribute('colspan') ?: 1
                ];
            }
            if (!empty($rowData)) {
                $tableData[] = $rowData;
            }
        }
        $result[] = $tableData;
    }
    return $result;
}
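A minimal usage sketch for the function above, continuing the same script; the sample markup and the index path into the result are purely illustrative assumptions:

$html = '<table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>Alice</td><td>42</td></tr>
</table>';

$tables = extractWithSimpleHtmlDom($html);
echo $tables[0][1][0]['text']; // first table, second row, first cell -> "Alice"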
Real-World Example: E-commerce Product Table
<?php
function scrapeProductTable($url) {
    // Fetch HTML content
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ProductScraper/1.0)');

    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $curlError = curl_error($ch);
    curl_close($ch);

    if ($html === false) {
        throw new Exception("cURL error: {$curlError}");
    }
    if ($httpCode !== 200) {
        throw new Exception("HTTP Error: {$httpCode}");
    }

    $parser = new ComplexTableParser();
    try {
        $parser->parseTable($html);
        return $parser->toAssociativeArray();
    } catch (Exception $e) {
        error_log("Table parsing error: " . $e->getMessage());
        return [];
    }
}

// Usage
try {
    $products = scrapeProductTable('https://example.com/products');
    foreach ($products as $product) {
        echo "Product: {$product['Name']}, Price: {$product['Price']}\n";
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
Error Handling and Best Practices
Robust Error Handling
<?php
function safeTableExtraction($html, $options = []) {
    $defaultOptions = [
        'encoding' => 'UTF-8',
        'suppress_errors' => true,
        'max_rows' => 1000
    ];
    $options = array_merge($defaultOptions, $options);

    try {
        $doc = new DOMDocument();
        if ($options['suppress_errors']) {
            libxml_use_internal_errors(true);
        }

        // Convert encoding if needed
        if ($options['encoding'] !== 'UTF-8') {
            $html = mb_convert_encoding($html, 'UTF-8', $options['encoding']);
        }

        $success = $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        if (!$success) {
            throw new Exception('Failed to parse HTML');
        }

        $xpath = new DOMXPath($doc);
        $tables = $xpath->query('//table');
        if ($tables->length === 0) {
            return ['error' => 'No tables found'];
        }

        $results = [];
        foreach ($tables as $index => $table) {
            $rows = $xpath->query('.//tr', $table);
            if ($rows->length > $options['max_rows']) {
                continue; // Skip extremely large tables
            }

            $tableData = [];
            foreach ($rows as $row) {
                $cells = $xpath->query('.//td | .//th', $row);
                $rowData = [];
                foreach ($cells as $cell) {
                    $rowData[] = [
                        'content' => trim($cell->textContent),
                        'tag' => $cell->nodeName,
                        'attributes' => []
                    ];
                }
                if (!empty($rowData)) {
                    $tableData[] = $rowData;
                }
            }

            $results[] = [
                'index' => $index,
                'rows' => count($tableData),
                'data' => $tableData
            ];
        }
        return $results;
    } catch (Exception $e) {
        return ['error' => $e->getMessage()];
    } finally {
        if ($options['suppress_errors']) {
            libxml_clear_errors();
        }
    }
}
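A quick usage sketch for safeTableExtraction(), using the option keys defined above and assuming $html already holds the fetched markup:

$result = safeTableExtraction($html, ['max_rows' => 500]);

if (isset($result['error'])) {
    echo "Extraction failed: {$result['error']}\n";
} else {
    foreach ($result as $table) {
        echo "Table {$table['index']}: {$table['rows']} row(s) extracted\n";
    }
}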
Performance Optimization Tips
- Use libxml parser flags (e.g. LIBXML_HTML_NOIMPLIED, LIBXML_HTML_NODEFDTD) to reduce parsing overhead
- Limit row processing for very large tables
- Cache parsed results when processing multiple similar tables (a minimal sketch follows this list)
- Prefer targeted XPath queries over walking the entire DOM tree node by node
- Consider streaming parsers for extremely large HTML files
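Here is a minimal in-memory caching sketch for the third point, keying the parsed result on a hash of the raw HTML; the extractTableCached() name and the static array are assumptions for illustration, and a persistent store such as APCu or Redis would replace the array in practice:

function extractTableCached($html) {
    static $cache = [];

    $key = md5($html);
    if (!isset($cache[$key])) {
        // Only parse markup we have not seen before in this process
        $cache[$key] = extractSimpleTable($html); // from Method 1
    }
    return $cache[$key];
}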
Common Pitfalls and Solutions
- Malformed HTML: use libxml_use_internal_errors(true) to handle parser errors gracefully
- Memory limits: process large tables in chunks or use streaming approaches
- Character encoding: convert input to UTF-8 (e.g. with mb_convert_encoding()) before loading it into DOMDocument
- Nested tables: use specific XPath queries to target the correct table level (see the sketch after this list)
- Empty cells: check for null values and handle them appropriately in your data structure
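A short sketch of XPath expressions for the nested-table pitfall, assuming an existing DOMXPath instance named $xpath and a cell node $cell:

// Select only top-level tables, skipping any table nested inside another table
$topLevelTables = $xpath->query('//table[not(ancestor::table)]');

// Within a single cell, select the rows of a table nested inside that cell
$nestedRows = $xpath->query('.//table//tr', $cell);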
This comprehensive approach handles most complex table structures you'll encounter in web scraping scenarios. Always test your extraction logic with real-world examples and implement proper error handling for production use.