How do I handle pagination when scraping multiple pages with PHP?

Handling pagination is one of the most common challenges in web scraping. When websites split content across multiple pages, you need robust strategies to navigate through all pages systematically. This comprehensive guide covers various pagination patterns and how to handle them effectively using PHP.

Understanding Common Pagination Patterns

Before diving into code, it's essential to understand the different types of pagination you'll encounter:

  1. Numbered pagination - Links to specific page numbers (1, 2, 3...)
  2. Next/Previous pagination - Simple forward/backward navigation
  3. Load more buttons - JavaScript-triggered content loading
  4. Infinite scroll - Automatic loading as user scrolls
  5. URL parameter pagination - Pages identified by query parameters
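Several of these patterns differ mainly in how the page index appears in the URL. As a quick illustration (the URL shapes and parameter names below are hypothetical examples, not taken from any particular site), one small helper can cover the common variants:

```php
<?php
// Illustrative URL shapes for the pagination patterns above.
// The paths and parameter names are hypothetical examples.
function buildPageUrl($base, $style, $page, $perPage = 20) {
    switch ($style) {
        case 'query':   // URL parameter pagination: ?page=3
            return $base . '?page=' . $page;
        case 'path':    // numbered pagination as a path segment: /page/3/
            return rtrim($base, '/') . '/page/' . $page . '/';
        case 'offset':  // offset-based pagination: ?limit=20&offset=40
            return $base . '?limit=' . $perPage . '&offset=' . (($page - 1) * $perPage);
        default:
            throw new InvalidArgumentException("Unknown pagination style: $style");
    }
}

echo buildPageUrl('https://example.com/products', 'query', 3), "\n";
echo buildPageUrl('https://example.com/products', 'path', 3), "\n";
echo buildPageUrl('https://example.com/products', 'offset', 3), "\n";
```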

Basic Pagination Handling with cURL and DOMDocument

Here's a fundamental approach using PHP's built-in functions:

<?php
class PaginationScraper {
    private $baseUrl;
    private $currentPage = 1;
    private $maxPages = 100; // Safety limit

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
    }

    public function scrapeAllPages() {
        $allData = [];

        while ($this->currentPage <= $this->maxPages) {
            $url = $this->buildPageUrl($this->currentPage);
            $html = $this->fetchPage($url);

            if (!$html) {
                break;
            }

            $pageData = $this->extractData($html);

            // Check if page has content
            if (empty($pageData)) {
                break;
            }

            $allData = array_merge($allData, $pageData);

            // Check if next page exists
            if (!$this->hasNextPage($html)) {
                break;
            }

            $this->currentPage++;

            // Be respectful - add delay
            usleep(500000); // 0.5 second delay
        }

        return $allData;
    }

    private function fetchPage($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);

        $html = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        return ($httpCode === 200) ? $html : false;
    }

    private function buildPageUrl($pageNumber) {
        return $this->baseUrl . "?page=" . $pageNumber;
    }

    private function extractData($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $nodes = $xpath->query('//div[@class="item"]');

        foreach ($nodes as $node) {
            $title = $xpath->query('.//h2', $node)->item(0);
            $description = $xpath->query('.//p[@class="description"]', $node)->item(0);

            if ($title && $description) {
                $items[] = [
                    'title' => trim($title->textContent),
                    'description' => trim($description->textContent)
                ];
            }
        }

        return $items;
    }

    private function hasNextPage($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Look for "Next" button or link
        $nextLink = $xpath->query('//a[contains(@class, "next") or contains(text(), "Next")]');

        return $nextLink->length > 0;
    }
}

// Usage
$scraper = new PaginationScraper('https://example.com/products');
$allProducts = $scraper->scrapeAllPages();

foreach ($allProducts as $product) {
    echo "Title: " . $product['title'] . "\n";
    echo "Description: " . $product['description'] . "\n\n";
}
?>
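The buildPageUrl() method above assumes a ?page=N query string. If the target site uses path segments instead (for example /page/2/, a common but site-specific convention), only that method needs to change. A minimal standalone variant, assuming that path shape:

```php
<?php
// Variant of buildPageUrl() for path-segment pagination such as
// https://example.com/products/page/2/ (the path shape is an assumed example).
function buildPathPageUrl($baseUrl, $pageNumber) {
    // Page 1 is often served at the bare base URL, without a /page/1/ segment
    if ($pageNumber === 1) {
        return $baseUrl;
    }
    return rtrim($baseUrl, '/') . '/page/' . $pageNumber . '/';
}

echo buildPathPageUrl('https://example.com/products', 1), "\n";
echo buildPathPageUrl('https://example.com/products', 2), "\n";
```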

Advanced Pagination with Guzzle HTTP

For more sophisticated HTTP handling, use Guzzle:

<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

class AdvancedPaginationScraper {
    private $client;
    private $baseUrl;

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
        $this->client = new Client([
            'timeout' => 30,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            ]
        ]);
    }

    public function scrapePaginatedContent($maxPages = 50) {
        $allData = [];

        for ($page = 1; $page <= $maxPages; $page++) {
            try {
                $response = $this->client->get($this->baseUrl, [
                    'query' => ['page' => $page]
                ]);

                if ($response->getStatusCode() !== 200) {
                    break;
                }

                $html = $response->getBody()->getContents();
                $pageData = $this->parsePageContent($html);

                if (empty($pageData)) {
                    break; // No more content
                }

                $allData = array_merge($allData, $pageData);

                // Check pagination metadata
                if (!$this->shouldContinue($html, $page)) {
                    break;
                }

                // Rate limiting
                sleep(1);

            } catch (RequestException $e) {
                echo "Error fetching page $page: " . $e->getMessage() . "\n";
                break;
            }
        }

        return $allData;
    }

    private function parsePageContent($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $articleNodes = $xpath->query('//article[@class="post"]');

        foreach ($articleNodes as $article) {
            $titleNode = $xpath->query('.//h1 | .//h2', $article)->item(0);
            $contentNode = $xpath->query('.//div[@class="content"]', $article)->item(0);
            $dateNode = $xpath->query('.//time[@datetime]', $article)->item(0);

            if ($titleNode) {
                $items[] = [
                    'title' => trim($titleNode->textContent),
                    'content' => $contentNode ? trim($contentNode->textContent) : '',
                    'date' => $dateNode ? $dateNode->getAttribute('datetime') : null,
                    'url' => $this->extractUrl($article, $xpath)
                ];
            }
        }

        return $items;
    }

    private function extractUrl($articleNode, $xpath) {
        $linkNode = $xpath->query('.//a[@href]', $articleNode)->item(0);
        return $linkNode ? $linkNode->getAttribute('href') : null;
    }

    private function shouldContinue($html, $currentPage) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Method 1: Check for "Next" button
        $nextButton = $xpath->query('//a[contains(@class, "next") and not(contains(@class, "disabled"))]');
        if ($nextButton->length === 0) {
            return false;
        }

        // Method 2: Check pagination info
        $paginationInfo = $xpath->query('//span[@class="pagination-info"]')->item(0);
        if ($paginationInfo) {
            $text = $paginationInfo->textContent;
            // Parse "Page 5 of 10" format
            if (preg_match('/Page (\d+) of (\d+)/', $text, $matches)) {
                return (int)$matches[1] < (int)$matches[2];
            }
        }

        return true;
    }
}
?>
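Besides class names like "next", many paginated sites expose the next page through a rel="next" attribute, either as a <link> element in the head or on the "Next" anchor itself. Following that attribute is often more stable than matching CSS classes, which change with every redesign. A standalone sketch:

```php
<?php
// Extract the next-page URL from rel="next", which many paginated sites
// expose as <link rel="next"> in the head or on the "Next" anchor itself.
// Returns null when no next page is advertised.
function findNextPageUrl($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $node = $xpath->query('//link[@rel="next"] | //a[@rel="next"]')->item(0);
    return $node ? $node->getAttribute('href') : null;
}

$html = '<html><head><link rel="next" href="/products?page=2"></head><body></body></html>';
echo findNextPageUrl($html); // /products?page=2
```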

Handling Different Pagination Patterns

URL Parameter Pagination

Many sites use URL parameters for pagination:

<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class UrlParameterPagination {
    private $baseUrl;
    private $client;

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
        // Disable Guzzle's 4xx/5xx exceptions so the 404 check below
        // can inspect the status code directly
        $this->client = new Client(['http_errors' => false]);
    }

    public function scrapeByParameters($paramName = 'page', $startPage = 1) {
        $allData = [];
        $page = $startPage;

        while (true) {
            $url = $this->baseUrl . "?" . $paramName . "=" . $page;

            try {
                $response = $this->client->get($url);
                $html = $response->getBody()->getContents();

                // Check if page exists (some sites return 404, others return empty content)
                if ($response->getStatusCode() === 404) {
                    break;
                }

                $data = $this->extractItems($html);

                if (empty($data)) {
                    break;
                }

                $allData = array_merge($allData, $data);
                $page++;

                // Optional: Check for explicit pagination end markers
                if ($this->isLastPage($html)) {
                    break;
                }

                usleep(750000); // 0.75 second delay

            } catch (Exception $e) {
                echo "Error on page $page: " . $e->getMessage() . "\n";
                break;
            }
        }

        return $allData;
    }

    private function extractItems($html) {
        // Implementation depends on site structure
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $nodes = $xpath->query('//div[@class="search-result"]');

        foreach ($nodes as $node) {
            $items[] = [
                'title' => $this->getNodeText($xpath, './/h3', $node),
                'price' => $this->getNodeText($xpath, './/span[@class="price"]', $node),
                'link' => $this->getNodeAttribute($xpath, './/a[@href]', $node, 'href')
            ];
        }

        return $items;
    }

    private function getNodeText($xpath, $query, $context) {
        $node = $xpath->query($query, $context)->item(0);
        return $node ? trim($node->textContent) : null;
    }

    private function getNodeAttribute($xpath, $query, $context, $attribute) {
        $node = $xpath->query($query, $context)->item(0);
        return $node ? $node->getAttribute($attribute) : null;
    }

    private function isLastPage($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Look for indicators of last page
        $lastPageIndicators = [
            '//div[contains(@class, "no-more-results")]',
            '//span[contains(text(), "End of results")]',
            '//a[@class="next disabled"]'
        ];

        foreach ($lastPageIndicators as $indicator) {
            if ($xpath->query($indicator)->length > 0) {
                return true;
            }
        }

        return false;
    }
}
?>
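For numbered pagination (pattern 1 above), an alternative to probing page after page is to read the highest page number from the pagination links on the first page and then loop deterministically. A standalone sketch, assuming an illustrative div.pagination markup:

```php
<?php
// For numbered pagination (1, 2, 3 ... N), the total page count can often
// be read from the pagination links up front, giving the loop a known end.
// The div.pagination structure below is an assumed example.
function detectLastPageNumber($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $last = 1;
    foreach ($xpath->query('//div[@class="pagination"]//a') as $link) {
        $text = trim($link->textContent);
        if (ctype_digit($text)) {          // ignore "Next", "Prev" etc.
            $last = max($last, (int)$text);
        }
    }
    return $last;
}

$html = '<div class="pagination"><a href="?page=1">1</a>'
      . '<a href="?page=2">2</a><a href="?page=9">9</a>'
      . '<a href="?page=2">Next</a></div>';
echo detectLastPageNumber($html); // 9
```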

JSON API Pagination

For sites that use AJAX/JSON for pagination:

<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class JsonApiPagination {
    private $apiEndpoint;
    private $client;

    public function __construct($apiEndpoint) {
        $this->apiEndpoint = $apiEndpoint;
        $this->client = new Client();
    }

    public function scrapeJsonPagination($itemsPerPage = 20) {
        $allItems = [];
        $offset = 0;

        while (true) {
            $response = $this->client->get($this->apiEndpoint, [
                'query' => [
                    'limit' => $itemsPerPage,
                    'offset' => $offset
                ],
                'headers' => [
                    'Accept' => 'application/json',
                    'X-Requested-With' => 'XMLHttpRequest'
                ]
            ]);

            $data = json_decode($response->getBody()->getContents(), true);

            if (empty($data['items'])) {
                break;
            }

            $allItems = array_merge($allItems, $data['items']);

            // Check if we've reached the end
            if (count($data['items']) < $itemsPerPage) {
                break;
            }

            // Check for pagination metadata
            if (isset($data['has_more']) && !$data['has_more']) {
                break;
            }

            $offset += $itemsPerPage;
            usleep(500000); // Rate limiting
        }

        return $allItems;
    }
}
?>
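Some JSON APIs skip offset arithmetic entirely and return the next request in the response body, either as a cursor token or a ready-made URL. The field names below (items, next) are assumptions about the response shape, and the fetcher is injected as a callable so the loop itself stays testable without a network:

```php
<?php
// Link-following variant of JSON pagination: the API returns the URL of
// the next page in its response. The 'items' and 'next' field names are
// assumed examples; adapt them to the actual response shape.
// $fetchJson is injected (e.g. a Guzzle GET + json_decode) so the loop
// can be exercised without real HTTP requests.
function collectAllPages($startUrl, $fetchJson, $maxPages = 100) {
    $all = [];
    $url = $startUrl;

    for ($i = 0; $i < $maxPages && $url !== null; $i++) {
        $data = $fetchJson($url);
        $all = array_merge($all, isset($data['items']) ? $data['items'] : []);
        $url = isset($data['next']) ? $data['next'] : null; // null = last page
    }

    return $all;
}

// Simulated three-page API for illustration:
$pages = [
    '/api?page=1' => ['items' => [1, 2], 'next' => '/api?page=2'],
    '/api?page=2' => ['items' => [3],    'next' => '/api?page=3'],
    '/api?page=3' => ['items' => [4],    'next' => null],
];
$result = collectAllPages('/api?page=1', function ($u) use ($pages) {
    return $pages[$u];
});
print_r($result); // [1, 2, 3, 4]
```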

Best Practices and Error Handling

Robust Error Handling

<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class RobustPaginationScraper {
    private $client;
    private $maxRetries = 3;
    private $retryDelay = 2; // seconds

    public function __construct() {
        $this->client = new Client();
    }

    private function fetchWithRetry($url, $attempt = 1) {
        try {
            $response = $this->client->get($url, [
                'timeout' => 30,
                'connect_timeout' => 10
            ]);

            return $response->getBody()->getContents();

        } catch (Exception $e) {
            if ($attempt < $this->maxRetries) {
                echo "Attempt $attempt failed, retrying in {$this->retryDelay} seconds...\n";
                sleep($this->retryDelay);
                return $this->fetchWithRetry($url, $attempt + 1);
            }

            throw $e;
        }
    }

    private function validatePageContent($html) {
        // Check for common error indicators
        $errorIndicators = [
            'blocked',
            'rate limit',
            'too many requests',
            'service unavailable'
        ];

        $lowercaseHtml = strtolower($html);

        foreach ($errorIndicators as $indicator) {
            if (strpos($lowercaseHtml, $indicator) !== false) {
                throw new Exception("Page content indicates error: $indicator");
            }
        }

        // Check for minimum content length
        if (strlen($html) < 1000) {
            throw new Exception("Page content too short, possible error page");
        }

        return true;
    }
}
?>

Performance Optimization Tips

  1. Implement intelligent delays: Use exponential backoff for rate limiting
  2. Use connection pooling: Reuse HTTP connections when possible
  3. Cache parsed DOM objects: Avoid re-parsing the same content
  4. Parallel processing: For large sites, consider using tools like ReactPHP or Swoole
  5. Memory management: Process pages in batches to avoid memory exhaustion
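Tip 1 deserves a concrete sketch. With exponential backoff the delay doubles after each consecutive failure up to a cap, so the scraper backs off quickly from a struggling server without stalling forever:

```php
<?php
// Exponential backoff, as suggested in tip 1: the delay doubles on every
// consecutive failure and is capped so retries never wait unbounded time.
function backoffDelaySeconds($attempt, $base = 1, $cap = 60) {
    // attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, 4 -> 8s, ... capped at $cap
    $delay = $base * (2 ** ($attempt - 1));
    return (int)min($delay, $cap);
}

// Usage inside a retry loop; random jitter helps avoid synchronized retries:
// sleep(backoffDelaySeconds($attempt) + random_int(0, 1));
echo backoffDelaySeconds(1), ' ', backoffDelaySeconds(4), ' ', backoffDelaySeconds(10); // 1 8 60
```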

Handling JavaScript-Heavy Pagination

For sites that rely heavily on JavaScript for pagination, plain HTTP fetching returns pages whose content only appears after scripts run, so you might need to integrate with a headless browser. While this guide focuses on PHP-native solutions, you can also handle dynamic content that loads after page navigation using tools like Puppeteer.
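One way to drive a real browser without leaving PHP is the chrome-php/chrome package. The sketch below is a minimal illustration, not a production setup; it assumes `composer require chrome-php/chrome` and a local Chrome/Chromium installation:

```php
<?php
require_once 'vendor/autoload.php';

use HeadlessChromium\BrowserFactory;

// Minimal sketch using the chrome-php/chrome package (assumes the package
// is installed via Composer and Chrome is available on this machine).
$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser(['headless' => true]);

try {
    $page = $browser->createPage();
    $page->navigate('https://example.com/products?page=2')->waitForNavigation();

    // The fully rendered HTML, after JavaScript-driven pagination has run;
    // hand this to the DOMDocument/XPath extraction code shown earlier
    $html = $page->getHtml();
} finally {
    $browser->close();
}
```

From here the scraping logic is identical to the static-HTML examples: only the fetching step changes.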

Conclusion

Effective pagination handling in PHP requires understanding the specific pagination pattern used by your target website and implementing robust error handling and rate limiting. The examples provided cover the most common scenarios you'll encounter. Remember to always respect robots.txt files and implement appropriate delays to avoid overwhelming the target servers.

For complex scenarios involving JavaScript-rendered content, consider combining PHP scraping with headless browser solutions or explore advanced authentication techniques when dealing with protected content.

The key to successful pagination scraping is patience, robust error handling, and respectful rate limiting. Start with simple approaches and gradually add complexity as needed for your specific use case.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
