How do I scrape data from paginated content?
Scraping data from paginated content is one of the most common challenges in web scraping. When websites split large datasets across multiple pages, you need to systematically navigate through each page to collect all available data. This guide will show you how to effectively scrape paginated content using Simple HTML DOM in PHP, along with alternative approaches for different scenarios.
Understanding Pagination Patterns
Before diving into code, it's crucial to understand the different types of pagination you'll encounter (a quick detection sketch follows this list):
1. Numbered Pagination
The most common type with numbered links (1, 2, 3, ..., Next)
2. Next/Previous Pagination
Simple navigation with only "Next" and "Previous" buttons
3. Load More Pagination
Pages that load additional content dynamically when clicking a "Load More" button
4. Infinite Scroll Pagination
Content loads automatically as you scroll down the page
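If you're not sure which pattern a site uses, a quick probe of the page markup can narrow it down. The sketch below is a heuristic, not a standard: selectors such as a.next and button.load-more are assumptions you should adapt to the markup you actually see.
<?php
require_once('simple_html_dom.php');

// Heuristic sketch: guess the pagination pattern from common markers.
// All selectors here are assumptions; adjust them per site.
function detectPaginationType($html) {
    if ($html->find('a[rel="next"]', 0) || $html->find('.pagination a', 0)) {
        return 'numbered';       // numbered links and/or rel="next" hints
    }
    if ($html->find('a.next', 0) || $html->find('a.prev', 0)) {
        return 'next-previous';  // plain Next/Previous navigation
    }
    if ($html->find('button.load-more', 0)) {
        return 'load-more';      // content is fetched via AJAX on click
    }
    return 'unknown';            // often infinite scroll (JavaScript-driven)
}
?>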
Basic Pagination Scraping with Simple HTML DOM
Here's a fundamental approach to scraping paginated content using PHP and Simple HTML DOM:
<?php
require_once('simple_html_dom.php');
function scrapePaginatedData($baseUrl, $maxPages = 10) {
    $allData = [];
    $currentPage = 1;

    while ($currentPage <= $maxPages) {
        // Construct the URL for the current page
        $url = $baseUrl . "?page=" . $currentPage;

        // Create DOM object
        $html = file_get_html($url);
        if (!$html) {
            echo "Failed to load page: $url\n";
            break;
        }

        // Extract data from current page
        $pageData = extractDataFromPage($html);

        // If no data found, we might have reached the end
        if (empty($pageData)) {
            echo "No data found on page $currentPage. Stopping.\n";
            break;
        }

        // Add data to our collection
        $allData = array_merge($allData, $pageData);
        echo "Scraped page $currentPage: " . count($pageData) . " items\n";

        // Clean up memory
        $html->clear();
        unset($html);

        // Add delay to be respectful to the server
        sleep(1);
        $currentPage++;
    }

    return $allData;
}
function extractDataFromPage($html) {
    $data = [];

    // Example: Extract product information
    foreach ($html->find('.product-item') as $item) {
        // Guard each lookup: find() returns null when nothing matches,
        // so reading ->plaintext directly would raise a PHP warning
        $nameEl  = $item->find('.product-name', 0);
        $priceEl = $item->find('.product-price', 0);
        $linkEl  = $item->find('a', 0);

        $product = [
            'name'  => $nameEl ? trim($nameEl->plaintext) : '',
            'price' => $priceEl ? trim($priceEl->plaintext) : '',
            'url'   => $linkEl ? $linkEl->href : ''
        ];

        if (!empty($product['name'])) {
            $data[] = $product;
        }
    }

    return $data;
}
// Usage
$baseUrl = "https://example-shop.com/products";
$scrapedData = scrapePaginatedData($baseUrl, 50);
echo "Total items scraped: " . count($scrapedData) . "\n";
?>
Advanced Pagination Detection
For more robust scraping, implement automatic pagination detection:
<?php
function scrapeWithAutoPagination($baseUrl) {
    $allData = [];
    $currentUrl = $baseUrl;
    $visitedUrls = [];

    while ($currentUrl && !in_array($currentUrl, $visitedUrls)) {
        $visitedUrls[] = $currentUrl;
        echo "Scraping: $currentUrl\n";

        $html = file_get_html($currentUrl);
        if (!$html) break;

        // Extract data from current page
        $pageData = extractDataFromPage($html);
        $allData = array_merge($allData, $pageData);

        // Find next page URL
        $nextUrl = findNextPageUrl($html, $currentUrl);

        $html->clear();
        unset($html);

        $currentUrl = $nextUrl;
        sleep(1); // Rate limiting
    }

    return $allData;
}

function findNextPageUrl($html, $currentUrl) {
    // Method 1: Look for "Next" button
    $nextLink = $html->find('a.next', 0);
    if ($nextLink && $nextLink->href) {
        return makeAbsoluteUrl($nextLink->href, $currentUrl);
    }

    // Method 2: Look for numbered pagination
    $paginationLinks = $html->find('.pagination a');
    foreach ($paginationLinks as $link) {
        if (stripos($link->plaintext, 'next') !== false) {
            return makeAbsoluteUrl($link->href, $currentUrl);
        }
    }

    // Method 3: Look for rel="next" attribute
    $nextRel = $html->find('a[rel="next"]', 0);
    if ($nextRel && $nextRel->href) {
        return makeAbsoluteUrl($nextRel->href, $currentUrl);
    }

    return null;
}
function makeAbsoluteUrl($relativeUrl, $baseUrl) {
    // Already absolute? Return it unchanged
    if (filter_var($relativeUrl, FILTER_VALIDATE_URL)) {
        return $relativeUrl;
    }

    $base = parse_url($baseUrl);
    $scheme = $base['scheme'] ?? 'https';
    $host = $base['host'] ?? '';

    // Root-relative URL (starts with "/")
    if (strpos($relativeUrl, '/') === 0) {
        return $scheme . '://' . $host . $relativeUrl;
    }

    // Document-relative URL: resolve against the base path, trimming
    // trailing separators so we don't emit a double slash
    $path = rtrim(dirname($base['path'] ?? '/'), '/\\') . '/';
    return $scheme . '://' . $host . $path . $relativeUrl;
}
?>
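A couple of quick checks illustrate how the helper resolves both root-relative and document-relative links (the URLs are purely illustrative):
<?php
// Illustrative inputs for makeAbsoluteUrl()
echo makeAbsoluteUrl('/page/2', 'https://example.com/products?page=1') . "\n";
// -> https://example.com/page/2

echo makeAbsoluteUrl('page2.html', 'https://example.com/products/index.html') . "\n";
// -> https://example.com/products/page2.html
?>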
Handling Different Pagination Patterns
Query Parameter Pagination
Many sites use query parameters for pagination:
<?php
function scrapeQueryParamPagination($baseUrl) {
    $allData = [];
    $page = 1;
    $hasNextPage = true;

    while ($hasNextPage) {
        $url = $baseUrl . "?page=" . $page;
        $html = file_get_html($url);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (empty($pageData)) {
            $hasNextPage = false;
        } else {
            $allData = array_merge($allData, $pageData);
            echo "Page $page: " . count($pageData) . " items\n";
        }

        $html->clear();
        unset($html);
        $page++;
        sleep(1);
    }

    return $allData;
}
?>
Offset-Based Pagination
Some APIs and websites use offset/limit parameters:
<?php
function scrapeOffsetPagination($baseUrl, $limit = 20) {
    $allData = [];
    $offset = 0;
    $hasMoreData = true;

    while ($hasMoreData) {
        $url = $baseUrl . "?limit=" . $limit . "&offset=" . $offset;
        $html = file_get_html($url);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (count($pageData) < $limit) {
            $hasMoreData = false;
        }

        if (!empty($pageData)) {
            $allData = array_merge($allData, $pageData);
            echo "Offset $offset: " . count($pageData) . " items\n";
        } else {
            $hasMoreData = false;
        }

        $html->clear();
        unset($html);
        $offset += $limit;
        sleep(1);
    }

    return $allData;
}
?>
Advanced Techniques and Best Practices
Using cURL with Cookies for Session Management
For websites that require session management:
<?php
function scrapeWithSession($baseUrl) {
    $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');
    $allData = [];
    $page = 1;

    while (true) {
        $url = $baseUrl . "?page=" . $page;

        // Use cURL with cookie support
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Web Scraper)');
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // Don't hang forever on a slow page

        $htmlContent = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpCode !== 200 || !$htmlContent) {
            break;
        }

        $html = str_get_html($htmlContent);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (empty($pageData)) {
            break;
        }

        $allData = array_merge($allData, $pageData);

        $html->clear();
        unset($html);
        $page++;
        sleep(2); // Longer delay for respectful scraping
    }

    unlink($cookieFile); // Clean up
    return $allData;
}
?>
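Usage mirrors the earlier helpers; the temporary cookie jar keeps the session alive across page requests (the URL is hypothetical):
<?php
$memberData = scrapeWithSession("https://example-shop.com/members/listings");
echo "Scraped " . count($memberData) . " items with a persistent session\n";
?>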
Error Handling and Retry Logic
Implement robust error handling for production scraping:
<?php
function scrapeWithRetry($url, $maxRetries = 3) {
    $retries = 0;

    while ($retries < $maxRetries) {
        try {
            $html = file_get_html($url);
            if ($html) {
                return $html;
            }
            throw new Exception("Failed to load HTML from: $url");
        } catch (Exception $e) {
            $retries++;
            echo "Attempt $retries failed: " . $e->getMessage() . "\n";
            if ($retries < $maxRetries) {
                echo "Retrying in " . ($retries * 2) . " seconds...\n";
                sleep($retries * 2); // Linear backoff: 2s, then 4s, then 6s
            }
        }
    }

    return false;
}
?>
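The helper slots into any of the loops above; a minimal sketch with a hypothetical URL:
<?php
// Sketch: fetch one page with retries, then reuse extractDataFromPage()
$html = scrapeWithRetry("https://example-shop.com/products?page=1", 3);
if ($html) {
    $items = extractDataFromPage($html);
    echo "Extracted " . count($items) . " items\n";
    $html->clear();
    unset($html);
}
?>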
Handling JavaScript-Heavy Pagination
While Simple HTML DOM cannot execute JavaScript, you can still work with JavaScript-heavy sites by understanding their API endpoints:
<?php
function scrapeAjaxPagination($baseApiUrl) {
    $allData = [];
    $page = 1;

    while (true) {
        $apiUrl = $baseApiUrl . "?page=" . $page . "&format=json";
        $jsonData = file_get_contents($apiUrl);
        if (!$jsonData) break;

        $data = json_decode($jsonData, true);
        if (empty($data['items'])) {
            break;
        }

        $allData = array_merge($allData, $data['items']);
        echo "API Page $page: " . count($data['items']) . " items\n";

        // Check if there are more pages
        if (!isset($data['has_next']) || !$data['has_next']) {
            break;
        }

        $page++;
        sleep(1);
    }

    return $allData;
}
?>
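Some AJAX endpoints only answer requests that look like they came from the page's own JavaScript. A common, though not universal, convention is the X-Requested-With header; here is a sketch of fetching a single JSON page that way (the endpoint and the items field are assumptions):
<?php
// Sketch: fetch one JSON page with AJAX-style request headers
$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => "X-Requested-With: XMLHttpRequest\r\n" .
                    "Accept: application/json\r\n"
    ]
]);
$json = file_get_contents("https://example.com/api/products?page=1", false, $context);
$data = $json ? json_decode($json, true) : null;
echo $data ? count($data['items'] ?? []) . " items\n" : "Request failed\n";
?>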
For complex JavaScript-rendered pagination, consider a headless browser tool such as Puppeteer, which can render dynamic content and navigate between pages programmatically.
Performance Optimization Tips
1. Memory Management
// Always clear DOM objects to prevent memory leaks
$html->clear();
unset($html);
2. Batch Processing
function processPagesInBatches($urls, $batchSize = 10) {
    $batches = array_chunk($urls, $batchSize);
    $allData = [];

    foreach ($batches as $batch) {
        $batchData = [];

        foreach ($batch as $url) {
            $html = file_get_html($url);
            if ($html) {
                $batchData = array_merge($batchData, extractDataFromPage($html));
                $html->clear();
                unset($html);
            }
        }

        $allData = array_merge($allData, $batchData);

        // Process batch data (save to database, etc.)
        processBatchData($batchData);
        sleep(2); // Rest between batches
    }

    return $allData;
}
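The processBatchData() call above is left as a hook; one minimal sketch appends each batch to a JSON Lines file (the filename is arbitrary):
// Minimal sketch of the processBatchData() hook referenced above
function processBatchData($batchData) {
    foreach ($batchData as $item) {
        file_put_contents('scraped_items.jsonl', json_encode($item) . "\n", FILE_APPEND);
    }
}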
3. Rate Limiting
class RateLimiter {
    private $requests = 0;
    private $startTime;
    private $maxRequests;
    private $timeWindow;

    public function __construct($maxRequests = 60, $timeWindow = 60) {
        $this->maxRequests = $maxRequests;
        $this->timeWindow = $timeWindow;
        $this->startTime = time();
    }

    public function throttle() {
        $this->requests++;

        if ($this->requests >= $this->maxRequests) {
            $elapsed = time() - $this->startTime;
            if ($elapsed < $this->timeWindow) {
                $sleepTime = $this->timeWindow - $elapsed;
                echo "Rate limit reached. Sleeping for $sleepTime seconds...\n";
                sleep($sleepTime);
            }

            // Reset counter
            $this->requests = 0;
            $this->startTime = time();
        }
    }
}
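Wiring the limiter into a scraping loop is straightforward; a sketch capping traffic at 30 requests per 60-second window (the URL is hypothetical):
// Usage sketch: at most 30 requests per 60-second window
$limiter = new RateLimiter(30, 60);
for ($page = 1; $page <= 20; $page++) {
    $limiter->throttle();
    $html = file_get_html("https://example-shop.com/products?page=" . $page);
    if ($html) {
        // ... extract data as in the earlier examples ...
        $html->clear();
        unset($html);
    }
}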
JavaScript Pagination with API Endpoints
Many modern websites use AJAX calls for pagination. You can find these endpoints with your browser's developer tools and call them directly:
# Use browser developer tools to find API endpoints
# Network tab -> Filter by XHR/Fetch -> Look for pagination requests
# Example URL: https://api.example.com/products?page=2&limit=20
<?php
function scrapeRestApiPagination($apiEndpoint, $headers = []) {
    $allData = [];
    $page = 1;
    $hasMore = true;

    // Default headers for API requests (Content-Type is omitted:
    // it is meaningless on a body-less GET request)
    $defaultHeaders = [
        'User-Agent: Mozilla/5.0 (compatible; API Scraper)',
        'Accept: application/json'
    ];
    $headers = array_merge($defaultHeaders, $headers);

    while ($hasMore) {
        $url = $apiEndpoint . "?page=" . $page . "&limit=50";
        $context = stream_context_create([
            'http' => [
                'method' => 'GET',
                'header' => implode("\r\n", $headers)
            ]
        ]);

        $response = file_get_contents($url, false, $context);
        if (!$response) {
            echo "Failed to fetch page $page\n";
            break;
        }

        $data = json_decode($response, true);
        if (empty($data['results'])) {
            $hasMore = false;
        } else {
            $allData = array_merge($allData, $data['results']);
            echo "Fetched page $page: " . count($data['results']) . " items\n";

            // Check if there are more pages based on API response
            if (isset($data['has_next'])) {
                $hasMore = $data['has_next'];
            } elseif (isset($data['total_pages'])) {
                $hasMore = $page < $data['total_pages'];
            } else {
                // Fallback: assume no more data if results < limit
                $hasMore = count($data['results']) >= 50;
            }
        }

        $page++;
        usleep(500000); // 0.5 second delay
    }

    return $allData;
}
?>
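If the API requires authentication, pass the extra headers through the second parameter (the endpoint and bearer token below are placeholders):
<?php
$items = scrapeRestApiPagination('https://api.example.com/products', [
    'Authorization: Bearer YOUR_TOKEN_HERE'
]);
echo "Fetched " . count($items) . " items total\n";
?>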
Handling Different Content Types
Scraping Table-Based Pagination
For data presented in tables across multiple pages:
<?php
function scrapeTablePagination($baseUrl) {
    $allRows = [];
    $page = 1;

    while (true) {
        $url = $baseUrl . "?page=" . $page;
        $html = file_get_html($url);
        if (!$html) break;

        $table = $html->find('table.data-table', 0);
        if (!$table) break;

        $rows = $table->find('tr');
        $pageData = [];

        foreach ($rows as $index => $row) {
            // Skip header row
            if ($index === 0) continue;

            $cells = $row->find('td');
            if (count($cells) >= 3) {
                $pageData[] = [
                    'id' => trim($cells[0]->plaintext),
                    'name' => trim($cells[1]->plaintext),
                    'value' => trim($cells[2]->plaintext)
                ];
            }
        }

        if (empty($pageData)) {
            break;
        }

        $allRows = array_merge($allRows, $pageData);
        echo "Page $page: " . count($pageData) . " rows\n";

        $html->clear();
        unset($html);
        $page++;
        sleep(1);
    }

    return $allRows;
}
?>
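Because each row is already a flat array, persisting the result as CSV takes only a few lines (the source URL and output filename are illustrative):
<?php
// Sketch: write the scraped rows to a CSV file
$rows = scrapeTablePagination("https://example.com/report");
$fp = fopen('table_data.csv', 'w');
fputcsv($fp, ['id', 'name', 'value']); // header row
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
echo "Wrote " . count($rows) . " rows to table_data.csv\n";
?>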
Monitoring and Debugging
Logging Progress and Errors
<?php
class PaginationScraper {
    private $logFile;

    public function __construct($logFile = 'scraping.log') {
        $this->logFile = $logFile;
    }

    private function log($message) {
        $timestamp = date('Y-m-d H:i:s');
        file_put_contents($this->logFile, "[$timestamp] $message\n", FILE_APPEND);
    }

    public function scrapeWithLogging($baseUrl, $maxPages = 100) {
        $this->log("Starting pagination scraping for: $baseUrl");

        $allData = [];
        $successfulPages = 0;
        $failedPages = 0;

        for ($page = 1; $page <= $maxPages; $page++) {
            $url = $baseUrl . "?page=" . $page;

            try {
                $html = file_get_html($url);
                if (!$html) {
                    throw new Exception("Failed to load HTML");
                }

                $pageData = extractDataFromPage($html);
                if (empty($pageData)) {
                    $this->log("No data found on page $page. Ending scraping.");
                    break;
                }

                $allData = array_merge($allData, $pageData);
                $successfulPages++;
                $this->log("Successfully scraped page $page: " . count($pageData) . " items");

                $html->clear();
                unset($html);
            } catch (Exception $e) {
                $failedPages++;
                $this->log("Error on page $page: " . $e->getMessage());
                if ($failedPages > 5) {
                    $this->log("Too many failures. Stopping scraping.");
                    break;
                }
            }

            sleep(1);
        }

        $this->log("Scraping completed. Total items: " . count($allData) .
            ", Successful pages: $successfulPages, Failed pages: $failedPages");

        return $allData;
    }
}
?>
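Typical usage (the URL and page cap are illustrative):
<?php
$scraper = new PaginationScraper('scraping.log');
$data = $scraper->scrapeWithLogging("https://example-shop.com/products", 100);
echo "Collected " . count($data) . " items; see scraping.log for details\n";
?>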
Conclusion
Scraping paginated content requires careful planning and robust implementation. While Simple HTML DOM is excellent for server-side rendered pagination, remember that modern websites often use JavaScript for dynamic content loading. In such cases, you might need to complement Simple HTML DOM with API endpoint analysis or use more advanced tools like Puppeteer for handling browser sessions.
Key takeaways for successful pagination scraping:
- Identify the pagination pattern before writing your scraper
- Implement proper error handling and retry logic
- Use rate limiting to be respectful to target servers
- Manage memory efficiently by clearing DOM objects
- Consider the website's robots.txt and terms of service
- Monitor your scraping performance and adjust accordingly
By following these practices and adapting the code examples to your specific use case, you'll be able to efficiently scrape data from paginated websites while maintaining good performance and reliability.