Simple HTML DOM is a PHP library for parsing HTML documents and extracting data with CSS-style selectors. This guide shows how to extract table data from web pages.
Installation
Using Composer (Recommended)
composer require simplehtmldom/simplehtmldom
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;
Manual Installation
Download simple_html_dom.php and include it:
include_once('simple_html_dom.php');
Basic Table Extraction
1. Load the Webpage
// Option A: direct URL loading (manual installation)
$html = file_get_html('https://example.com/data-table.html');
// Option B: HtmlWeb client (Composer version); use one option, not both
$client = new HtmlWeb();
$html = $client->load('https://example.com/data-table.html');
// Check if page loaded successfully
if (!$html) {
    die('Error loading page');
}
2. Find and Extract Table Data
// Find the first table
$table = $html->find('table', 0);
if (!$table) {
    die('No table found');
}
$tableData = [];
// Extract all rows
foreach ($table->find('tr') as $row) {
    $rowData = [];
    // Extract cells (both td and th)
    foreach ($row->find('td, th') as $cell) {
        $rowData[] = trim($cell->plaintext);
    }
    // Only add rows with data
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}
Advanced Table Selection
Target Specific Tables
// By ID
$table = $html->find('table#data-table', 0);
// By class
$table = $html->find('table.product-list', 0);
// By attribute
$table = $html->find('table[data-type=pricing]', 0);
// Multiple criteria
$table = $html->find('div.container table.data-grid', 0);
Handle Multiple Tables
$allTables = $html->find('table');
foreach ($allTables as $index => $table) {
    echo "Processing table " . ($index + 1) . "\n";
    $tableData = [];
    foreach ($table->find('tr') as $row) {
        $rowData = [];
        foreach ($row->find('td, th') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }
        if (!empty($rowData)) {
            $tableData[] = $rowData;
        }
    }
    // Process each table's data (processTableData() is a placeholder for your own handler)
    processTableData($tableData, $index);
}
Advanced Data Processing
Separate Headers from Data
$table = $html->find('table', 0);
if (!$table) {
    die('No table found');
}
$headers = [];
$rows = [];
foreach ($table->find('tr') as $index => $row) {
    $rowData = [];
    if ($index === 0) {
        // First row as headers
        foreach ($row->find('th, td') as $cell) {
            $headers[] = trim($cell->plaintext);
        }
    } else {
        // Data rows
        foreach ($row->find('td') as $cell) {
            $rowData[] = trim($cell->plaintext);
        }
        // array_combine() needs equal lengths, so skip rows that do not match the headers
        if (!empty($rowData) && count($rowData) === count($headers)) {
            $rows[] = array_combine($headers, $rowData);
        }
    }
}
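`array_combine()` requires both arrays to have the same number of elements, so rows with merged or missing cells need special handling. One option is to normalize each row to the header width first. The following is a minimal sketch, assuming that padding short rows with empty strings and dropping surplus cells is acceptable for your data. `combineRow()` is a hypothetical helper, not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: map a row of cell values onto header names,
// padding short rows with '' and dropping cells beyond the header count.
function combineRow(array $headers, array $rowData): array
{
    $width = count($headers);
    if ($width === 0) {
        return $rowData; // no headers known: return the row unchanged
    }
    $rowData = array_slice($rowData, 0, $width); // drop surplus cells
    $rowData = array_pad($rowData, $width, '');  // fill in missing cells
    return array_combine($headers, $rowData);
}

$headers = ['Name', 'Price', 'Stock'];
$combined = combineRow($headers, ['Widget', '9.99']);
print_r($combined);
```

Note that duplicate header names would still collapse into a single key, so this only addresses length mismatches.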
Extract Additional Attributes
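$tableData = [];
foreach ($table->find('tr') as $row) {
    $rowData = [];
    foreach ($row->find('td') as $cell) {
        $cellData = [
            'text' => trim($cell->plaintext),
            'html' => $cell->innertext,
            'class' => $cell->class ?? '',
            // getAttribute() does not return null for a missing attribute, so check first
            'data-value' => $cell->hasAttribute('data-value') ? $cell->getAttribute('data-value') : ''
        ];
        // Extract links within cells
        $link = $cell->find('a', 0);
        if ($link) {
            $cellData['link'] = $link->href;
        }
        $rowData[] = $cellData;
    }
    // Skip rows without td cells (for example, a header row)
    if (!empty($rowData)) {
        $tableData[] = $rowData;
    }
}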
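The nested cell arrays built above cannot be passed directly to functions that expect flat rows of strings, such as `fputcsv()`. A small sketch of flattening them with `array_column()`, assuming every cell array carries the key being extracted; `flattenRows()` is an illustrative name and the sample data is made up.

```php
<?php
// Illustrative helper: reduce rows of cell-detail arrays to rows of plain values.
function flattenRows(array $tableData, string $key = 'text'): array
{
    $flat = [];
    foreach ($tableData as $rowData) {
        // array_column() picks the same key out of every cell array in the row
        $flat[] = array_column($rowData, $key);
    }
    return $flat;
}

// Made-up sample in the shape produced by the loop above
$tableData = [
    [
        ['text' => 'Widget', 'html' => '<a href="/w">Widget</a>', 'class' => 'name', 'data-value' => '', 'link' => '/w'],
        ['text' => '$9.99', 'html' => '$9.99', 'class' => 'price', 'data-value' => '9.99'],
    ],
];
$flat = flattenRows($tableData);
```

`array_column()` silently skips cells that lack the key, so extracting an optional key such as `link` this way can shift columns.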
Complete Example with Error Handling
<?php
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;
function extractTableData($url, $tableSelector = 'table') {
    $client = new HtmlWeb();
    $html = null;
    try {
        // Load the webpage
        $html = $client->load($url);
        if (!$html) {
            throw new Exception("Failed to load webpage: $url");
        }
        // Find the table
        $table = $html->find($tableSelector, 0);
        if (!$table) {
            throw new Exception("No table found with selector: $tableSelector");
        }
        $result = [
            'headers' => [],
            'rows' => []
        ];
        $rows = $table->find('tr');
        if (empty($rows)) {
            throw new Exception("No rows found in table");
        }
        foreach ($rows as $index => $row) {
            $rowData = [];
            // Extract cell data
            $cells = $row->find('td, th');
            foreach ($cells as $cell) {
                $rowData[] = trim($cell->plaintext);
            }
            if (!empty($rowData)) {
                if ($index === 0 && $row->find('th')) {
                    // First row with th elements = headers
                    $result['headers'] = $rowData;
                } else {
                    $result['rows'][] = $rowData;
                }
            }
        }
        return $result;
    } catch (Exception $e) {
        error_log("Table extraction error: " . $e->getMessage());
        return false;
    } finally {
        // Clean up memory on both the success and the error path
        if ($html) {
            $html->clear();
            unset($html);
        }
    }
}
// Usage example
$url = 'https://example.com/data-table.html';
$data = extractTableData($url, 'table.data-grid');
if ($data) {
    echo "Headers: " . implode(', ', $data['headers']) . "\n";
    echo "Found " . count($data['rows']) . " data rows\n";
    // Display first few rows
    foreach (array_slice($data['rows'], 0, 3) as $row) {
        echo implode(' | ', $row) . "\n";
    }
} else {
    echo "Failed to extract table data\n";
}
?>
Data Export Options
Export to CSV
function exportToCSV($tableData, $filename) {
    $file = fopen($filename, 'w');
    if ($file === false) {
        return false; // Could not open the file for writing
    }
    foreach ($tableData as $row) {
        fputcsv($file, $row);
    }
    fclose($file);
    return true;
}
// Usage
exportToCSV($tableData, 'extracted_data.csv');
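When headers and rows are kept separate, as in the complete example, the header row can be written first. The sketch below assumes that layout; `exportToCSVWithHeaders()` is a hypothetical name, the sample values are made up, and the explicit `fputcsv()` arguments simply spell out the function's long-standing defaults.

```php
<?php
// Hypothetical helper: write an optional header row followed by the data rows.
// Returns the number of lines written; throws if the file cannot be opened.
function exportToCSVWithHeaders(array $headers, array $rows, string $filename): int
{
    $file = fopen($filename, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $filename for writing");
    }
    if (!empty($headers)) {
        array_unshift($rows, $headers); // header row goes first
    }
    $written = 0;
    foreach ($rows as $row) {
        // Separator, enclosure and escape character passed explicitly
        if (fputcsv($file, $row, ',', '"', '\\') !== false) {
            $written++;
        }
    }
    fclose($file);
    return $written;
}

// Write to a temporary file, read it back, then remove it
$path = tempnam(sys_get_temp_dir(), 'csv');
$count = exportToCSVWithHeaders(
    ['Name', 'Price'],
    [['Widget', '9.99'], ['Gadget, large', '19.50']],
    $path
);
$contents = file_get_contents($path);
unlink($path);
```

Fields that contain the separator, such as `Gadget, large`, are quoted by `fputcsv()` automatically.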
Export to JSON
function exportToJSON($headers, $rows, $filename) {
    $jsonData = [];
    foreach ($rows as $row) {
        // Rows whose cell count differs from the header count are skipped
        if (count($headers) === count($row)) {
            $jsonData[] = array_combine($headers, $row);
        }
    }
    file_put_contents($filename, json_encode($jsonData, JSON_PRETTY_PRINT));
}
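`json_encode()` returns `false` on failure, for example when a cell contains invalid UTF-8, and writing that result produces an empty file. A stricter variant might look like the sketch below, assuming PHP 7.3 or later for `JSON_THROW_ON_ERROR`; `encodeTable()` is an illustrative name and the sample rows are made up.

```php
<?php
// Illustrative helper: encode rows as JSON objects keyed by header name.
// Throws JsonException instead of silently returning false.
function encodeTable(array $headers, array $rows): string
{
    $jsonData = [];
    foreach ($rows as $row) {
        if (count($headers) === count($row)) {
            $jsonData[] = array_combine($headers, $row);
        }
    }
    return json_encode(
        $jsonData,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

// The second row has too few cells and is skipped
$json = encodeTable(['Name', 'Price'], [['Café', '9.99'], ['incomplete row']]);
$decoded = json_decode($json, true);
```

`JSON_UNESCAPED_UNICODE` keeps characters such as `é` readable in the output instead of writing `\u00e9`.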
Best Practices
Error Handling
// Always check if elements exist
if ($html && $table = $html->find('table', 0)) {
    // Process table
} else {
    // Handle error
}
// Validate data before processing
if (!empty($tableData) && is_array($tableData)) {
    // Process data
}
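Validation can go one step further than checking for an empty array. The sketch below reports rows whose cell count differs from the first row, which often points to `colspan` cells or a nested table. `findIrregularRows()` is a hypothetical helper and the sample data is made up.

```php
<?php
// Hypothetical helper: return the indexes of rows whose cell count
// differs from that of the first row.
function findIrregularRows(array $tableData): array
{
    if (empty($tableData)) {
        return [];
    }
    $first = reset($tableData);
    $expected = is_array($first) ? count($first) : 0;
    $irregular = [];
    foreach ($tableData as $index => $row) {
        if (!is_array($row) || count($row) !== $expected) {
            $irregular[] = $index;
        }
    }
    return $irregular;
}

$tableData = [
    ['Name', 'Price', 'Stock'],
    ['Widget', '9.99', '12'],
    ['Gadget', '19.50'], // one cell short
];
$bad = findIrregularRows($tableData);
```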
Memory Management
// For large pages, process one table at a time
$tables = $html->find('table');
foreach ($tables as $table) {
    processTable($table); // placeholder for your own handler
}
unset($tables, $table);
// Clean up only after all processing is done
$html->clear();
unset($html);
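One way to process extracted rows in chunks is `array_chunk()`, which hands a fixed number of rows at a time to whatever writes or stores them. This is a sketch that assumes the rows are already in a plain array; the `processInChunks()` name, the batch size, and the callback are illustrative.

```php
<?php
// Illustrative helper: pass rows to a callback in fixed-size batches.
// Returns the number of batches processed.
function processInChunks(array $tableData, int $chunkSize, callable $handler): int
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }
    $batches = 0;
    foreach (array_chunk($tableData, $chunkSize) as $chunk) {
        $handler($chunk); // e.g. write to a file or insert into a database
        $batches++;
    }
    return $batches;
}

$rows = [['a'], ['b'], ['c'], ['d'], ['e']];
$sizes = [];
$batches = processInChunks($rows, 2, function (array $chunk) use (&$sizes) {
    $sizes[] = count($chunk); // record each batch size for illustration
});
```

Chunking limits the size of the downstream work; it does not shrink the parsed document itself, which is released only by `clear()`.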
Rate Limiting
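// Add delays between requests
sleep(1); // Wait 1 second between requests
// Or enforce a minimum interval across a list of URLs ($urls is a placeholder)
$minInterval = 2; // seconds
$lastRequest = 0; // time of the previous request; 0 means none yet
foreach ($urls as $url) {
    $elapsed = time() - $lastRequest;
    if ($elapsed < $minInterval) {
        sleep($minInterval - $elapsed);
    }
    $lastRequest = time();
    // Load and process $url here
}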
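The same idea can be wrapped in a small reusable object with sub-second precision. The sketch below assumes PHP 7.4 or later for typed properties; `RequestThrottle` is a hypothetical class, and the 0.2-second interval is chosen only to keep the demonstration short.

```php
<?php
// Hypothetical helper: enforce a minimum interval between successive calls.
final class RequestThrottle
{
    private float $minInterval;          // seconds
    private ?float $lastRequest = null;  // null until the first call

    public function __construct(float $minIntervalSeconds)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    // Sleep just long enough to honor the interval, then record the time.
    // Returns the number of seconds this call asked to wait.
    public function wait(): float
    {
        $waited = 0.0;
        if ($this->lastRequest !== null) {
            $remaining = $this->minInterval - (microtime(true) - $this->lastRequest);
            if ($remaining > 0) {
                usleep((int) round($remaining * 1000000));
                $waited = $remaining;
            }
        }
        $this->lastRequest = microtime(true);
        return $waited;
    }
}

$throttle = new RequestThrottle(0.2);
$start = microtime(true);
$first = $throttle->wait();   // no previous request, so no delay
$second = $throttle->wait();  // sleeps for roughly the remaining interval
$elapsed = microtime(true) - $start;
```

In a scraper, `wait()` would be called immediately before each page load.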
Common Issues and Solutions
- Empty results: Check whether the table is rendered by JavaScript; Simple HTML DOM parses the raw HTML and does not execute scripts
- Missing data: Verify the correct table selector
- Memory issues: Process large tables in smaller chunks
- Special characters: Use html_entity_decode() for proper text extraction
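As an illustration of the special-characters point: depending on the library version, cell text may still contain character references, and a decoded `&nbsp;` becomes a non-breaking space (U+00A0) that `trim()` does not strip by default. This is a minimal sketch assuming UTF-8 input; `cleanCellText()` is an illustrative helper.

```php
<?php
// Illustrative helper: decode character references and normalize whitespace.
function cleanCellText(string $text): string
{
    // Decode named and numeric references such as &amp;, &eacute; and &#8364;
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Turn non-breaking spaces (U+00A0) into regular spaces
    $decoded = str_replace("\u{00A0}", ' ', $decoded);
    // Collapse runs of whitespace; fall back to the decoded text if the regex fails
    $collapsed = preg_replace('/\s+/u', ' ', $decoded);
    return trim($collapsed ?? $decoded);
}

$clean = cleanCellText('&nbsp;Caf&eacute; &amp; Bar &#8364;5&nbsp; ');
echo $clean . "\n";
```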
Always respect robots.txt and the site's terms of service when scraping. Send a descriptive user agent and leave reasonable delays between requests.
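One way to send an identifying user agent with PHP's built-in HTTP stream wrapper is a stream context. The sketch below only builds the context and makes no request. The user-agent string and contact URL are placeholders, and `buildScraperContext()` is a hypothetical name.

```php
<?php
// Hypothetical helper: build a stream context with a user agent and timeout.
function buildScraperContext(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,      // sent as the User-Agent header
            'timeout'    => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

// Placeholder identity: replace with your own bot name and contact page
$context = buildScraperContext('ExampleTableBot/1.0 (+https://example.com/bot-info)');
$options = stream_context_get_options($context);

// Usage (not executed here): $raw = file_get_contents($url, false, $context);
```

The fetched markup can then be parsed from a string instead of a URL, which keeps the HTTP settings under your control.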