How do I extract data from complex HTML tables using PHP?

Extracting data from complex HTML tables in PHP is tricky when the table uses rowspan or colspan attributes, nested tables, or irregular cell patterns, because a cell's position in the markup no longer matches its position in the visual grid. The usual approach is to parse the HTML into a DOM, walk the rows and cells, and track which grid positions each cell occupies. Here's a step-by-step guide to extracting data from complex HTML tables in PHP:

Steps to Extract Data from Complex HTML Tables

  1. Load the HTML Document: Use PHP's DOMDocument class to load the HTML content.

  2. Parse the Table Structure: Create a DOMXPath instance to query and traverse the DOM more easily.

  3. Handle Rows and Cells: Iterate over the rows and cells, taking into account any rowspan or colspan attributes.

  4. Extract and Store Data: As you navigate through the cells, extract the text or HTML content and store it in a structured format like an array or an object.
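
Steps 1 and 2 might look roughly like the following. This is a minimal sketch under a few assumptions: the helper name loadTableRows and the sample markup are made up for illustration, and only the first table in the document is considered. It uses libxml_use_internal_errors() so that parser warnings about imperfect markup are collected rather than printed.

```php
<?php

/**
 * Hypothetical helper: parse an HTML string and return the rows of its
 * first table, together with any libxml parse warnings.
 */
function loadTableRows(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $warnings = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match rows directly under the table or inside thead/tbody/tfoot.
    $rows = $xpath->query(
        '(//table)[1]/tr | (//table)[1]/*[self::thead or self::tbody or self::tfoot]/tr'
    );

    return ['rows' => iterator_to_array($rows), 'warnings' => $warnings];
}

// Sample markup (an assumption for this sketch).
$result = loadTableRows(
    '<table><thead><tr><th>A</th></tr></thead><tbody><tr><td>1</td></tr></tbody></table>'
);

echo count($result['rows']), " rows\n";
```

Returning the warnings alongside the rows lets the caller decide whether badly formed markup should be logged or ignored.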

Example Code

Below is an example PHP script that demonstrates how to extract data from a complex HTML table:

<?php

$html = <<<EOD
<table>
    <tr>
        <th rowspan="2">Name</th>
        <th colspan="2">Contact</th>
    </tr>
    <tr>
        <th>Email</th>
        <th>Phone</th>
    </tr>
    <tr>
        <td>John Doe</td>
        <td>john.doe@example.com</td>
        <td>123-456-7890</td>
    </tr>
    <!-- Add more rows as needed -->
</table>
EOD;

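// Load the HTML content into a DOMDocument.
// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();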

// Create DOMXPath instance
$xpath = new DOMXPath($doc);

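// Query the rows of the first table, whether they sit directly under <table>
// or inside <thead>, <tbody>, or <tfoot>
$rows = $xpath->query('(//table)[1]/tr | (//table)[1]/*[self::thead or self::tbody or self::tfoot]/tr');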

// Initialize an array to store the extracted data
$data = [];

// Track the current row and the grid positions already occupied by a rowspan or colspan
$currentRow = 0;
$colspanIndex = [];

foreach ($rows as $row) {
    // Initialize an array to store row data
    $rowData = [];

    // Get all cells in the current row (th or td)
    $cells = $xpath->query('th|td', $row);

    $currentColumn = 0;
    foreach ($cells as $cell) {
        // Read the span attributes as integers; a missing or invalid value counts as 1
        $rowspan = max(1, (int) $cell->getAttribute('rowspan'));
        $colspan = max(1, (int) $cell->getAttribute('colspan'));

        // Skip grid positions already occupied by a span from an earlier cell or row
        while (isset($colspanIndex[$currentRow][$currentColumn])) {
            $currentColumn++;
        }

        // Get the cell content
        $cellContent = trim($cell->textContent);

        for ($r = $currentRow; $r < $currentRow + $rowspan; $r++) {
            for ($c = $currentColumn; $c < $currentColumn + $colspan; $c++) {
                $colspanIndex[$r][$c] = true;
                // Set the cell content for the first cell of a rowspan/colspan only
                if ($r == $currentRow && $c == $currentColumn) {
                    $rowData[$c] = $cellContent;
                }
            }
        }

        $currentColumn += $colspan;
    }

    // Add the row data to the main data array
    $data[] = $rowData;
    $currentRow++;
}

// Output the extracted data
print_r($data);

Explanation

  • The $html variable contains the HTML content of the table you want to scrape. In a real-world scenario, you might load this from a webpage using cURL or file_get_contents.
  • The DOMDocument instance ($doc) is used to load the HTML, and the DOMXPath instance ($xpath) is used to navigate the DOM.
  • We query for all the rows in the table and then iterate over each row and its cells.
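  • We handle rowspan and colspan by recording every grid position a cell covers and, when placing the next cell, skipping positions already covered by an earlier span. Each cell's text is stored once, at the top-left position of its span, so rows that fall under a rowspan have gaps in their array keys.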
  • We accumulate the data in an array, which can then be used for further processing or storage.
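
If a dense grid is easier to work with than the sparse rows described above, one option is to repeat a spanning cell's text in every position it covers. The following is a minimal sketch of that variation, not part of the original script: the function name tableToGrid is hypothetical, rowspan="0" is treated like 1 for simplicity, and spans that run past the last row are cut off.

```php
<?php

/**
 * Hypothetical helper: expand a table into a dense grid in which every
 * position covered by a rowspan/colspan repeats the spanning cell's text.
 */
function tableToGrid(DOMElement $table): array
{
    $xpath = new DOMXPath($table->ownerDocument);
    $rows = $xpath->query('./tr | ./thead/tr | ./tbody/tr | ./tfoot/tr', $table);

    $grid = [];
    $r = 0;
    foreach ($rows as $row) {
        $grid[$r] = $grid[$r] ?? [];
        $col = 0;
        foreach ($xpath->query('./th | ./td', $row) as $cell) {
            // Skip positions already filled by a span from an earlier cell or row.
            while (isset($grid[$r][$col])) {
                $col++;
            }
            $rowspan = max(1, (int) $cell->getAttribute('rowspan'));
            $colspan = max(1, (int) $cell->getAttribute('colspan'));
            $text = trim($cell->textContent);

            // Copy the text into every position the cell covers.
            for ($dr = 0; $dr < $rowspan; $dr++) {
                for ($dc = 0; $dc < $colspan; $dc++) {
                    $grid[$r + $dr][$col + $dc] = $text;
                }
            }
            $col += $colspan;
        }
        ksort($grid[$r]); // keep columns in left-to-right order
        $r++;
    }

    // Drop rows created only by a rowspan that ran past the last <tr>.
    return array_slice($grid, 0, $r);
}

$html = <<<EOD
<table>
    <tr><th rowspan="2">Name</th><th colspan="2">Contact</th></tr>
    <tr><th>Email</th><th>Phone</th></tr>
    <tr><td>John Doe</td><td>john.doe@example.com</td><td>123-456-7890</td></tr>
</table>
EOD;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$grid = tableToGrid($doc->getElementsByTagName('table')->item(0));
print_r($grid);
```

With the sample table, the second header row becomes Name, Email, Phone, which makes it straightforward to use as column keys for the data rows.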

Tips for Handling Complex Tables

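  • Nested Tables: If the table contains nested tables, run your XPath queries relative to a specific table node (for example ./tr rather than //tr) so that rows and cells of inner tables are not mixed into the outer table's data.
  • Irregular Patterns: For tables with irregular cell patterns, such as rows with missing cells or spans that run past the last row, you may need additional logic to handle exceptions or to correctly align the data in your output structure.
  • HTML Entity Decoding: DOMDocument decodes HTML entities while parsing, so textContent already returns decoded text. html_entity_decode() is only needed when you work with raw HTML strings or when the source content is double-encoded.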
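
The nested-table tip can be illustrated with a small sketch. The markup below is invented for this example, and the approach shown (relative XPath queries plus a count of table ancestors) is one possible way to separate outer and inner content, assuming the table of interest is a top-level one.

```php
<?php

// Sample markup with a table nested inside a cell (an assumption for this sketch).
$html = <<<EOD
<table>
    <tr>
        <td>Outer A</td>
        <td>Outer B
            <table><tr><td>Inner 1</td><td>Inner 2</td></tr></table>
        </td>
    </tr>
</table>
EOD;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Top-level tables only: tables that have no <table> ancestor.
$outer = $xpath->query('//table[not(ancestor::table)]')->item(0);

// Relative queries match only the outer table's own rows, not the inner table's.
$rows = $xpath->query('./tr | ./tbody/tr', $outer);

$cells = [];
foreach ($rows as $row) {
    foreach ($xpath->query('./td | ./th', $row) as $cell) {
        // Text nodes with exactly one <table> ancestor belong to the outer
        // table; text inside the nested table has two and is skipped.
        $text = '';
        foreach ($xpath->query('.//text()[count(ancestor::table) = 1]', $cell) as $node) {
            $text .= $node->nodeValue;
        }
        $cells[] = trim($text);
    }
}

// Nested tables can then be processed separately, starting from the outer table.
$nestedCount = $xpath->query('.//table', $outer)->length;

print_r($cells);
```

Reading textContent directly on the second cell would also include "Inner 1" and "Inner 2", which is why the sketch filters text nodes by their number of table ancestors.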

Remember that web scraping should be done responsibly and in compliance with the terms of service or robots.txt of the website you are scraping.
