How do I handle pagination when scraping multiple pages with PHP?
Handling pagination is one of the most common challenges in web scraping: when a site splits content across multiple pages, you need a systematic way to walk through all of them. This guide covers the most common pagination patterns and how to handle each one in PHP.
Understanding Common Pagination Patterns
Before diving into code, it's essential to understand the different types of pagination you'll encounter:
- Numbered pagination - Links to specific page numbers (1, 2, 3...)
- Next/Previous pagination - Simple forward/backward navigation
- Load more buttons - JavaScript-triggered content loading
- Infinite scroll - Automatic loading as user scrolls
- URL parameter pagination - Pages identified by query parameters
Basic Pagination Handling with cURL and DOMDocument
Here's a fundamental approach using PHP's built-in functions:
<?php
class PaginationScraper {
    private $baseUrl;
    private $currentPage = 1;
    private $maxPages = 100; // Safety limit

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
    }

    public function scrapeAllPages() {
        $allData = [];

        while ($this->currentPage <= $this->maxPages) {
            $url = $this->buildPageUrl($this->currentPage);
            $html = $this->fetchPage($url);

            if (!$html) {
                break;
            }

            $pageData = $this->extractData($html);

            // Stop when a page comes back empty
            if (empty($pageData)) {
                break;
            }

            $allData = array_merge($allData, $pageData);

            // Stop when there is no next page
            if (!$this->hasNextPage($html)) {
                break;
            }

            $this->currentPage++;

            // Be respectful - add a delay between requests
            usleep(500000); // 0.5 second delay
        }

        return $allData;
    }

    private function fetchPage($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);

        $html = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        return ($httpCode === 200) ? $html : false;
    }

    private function buildPageUrl($pageNumber) {
        return $this->baseUrl . "?page=" . $pageNumber;
    }

    private function extractData($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $nodes = $xpath->query('//div[@class="item"]');

        foreach ($nodes as $node) {
            $title = $xpath->query('.//h2', $node)->item(0);
            $description = $xpath->query('.//p[@class="description"]', $node)->item(0);

            if ($title && $description) {
                $items[] = [
                    'title' => trim($title->textContent),
                    'description' => trim($description->textContent)
                ];
            }
        }

        return $items;
    }

    private function hasNextPage($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Look for a "Next" button or link
        $nextLink = $xpath->query('//a[contains(@class, "next") or contains(text(), "Next")]');
        return $nextLink->length > 0;
    }
}

// Usage
$scraper = new PaginationScraper('https://example.com/products');
$allProducts = $scraper->scrapeAllPages();

foreach ($allProducts as $product) {
    echo "Title: " . $product['title'] . "\n";
    echo "Description: " . $product['description'] . "\n\n";
}
?>
Advanced Pagination with Guzzle HTTP
For more sophisticated HTTP handling, use Guzzle:
<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

class AdvancedPaginationScraper {
    private $client;
    private $baseUrl;
    private $currentPage = 1;

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
        $this->client = new Client([
            'timeout' => 30,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            ]
        ]);
    }

    public function scrapePaginatedContent($maxPages = 50) {
        $allData = [];

        for ($page = 1; $page <= $maxPages; $page++) {
            try {
                $response = $this->client->get($this->baseUrl, [
                    'query' => ['page' => $page]
                ]);

                if ($response->getStatusCode() !== 200) {
                    break;
                }

                $html = $response->getBody()->getContents();
                $pageData = $this->parsePageContent($html);

                if (empty($pageData)) {
                    break; // No more content
                }

                $allData = array_merge($allData, $pageData);

                // Check pagination metadata
                if (!$this->shouldContinue($html, $page)) {
                    break;
                }

                // Rate limiting
                sleep(1);
            } catch (RequestException $e) {
                echo "Error fetching page $page: " . $e->getMessage() . "\n";
                break;
            }
        }

        return $allData;
    }

    private function parsePageContent($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $articleNodes = $xpath->query('//article[@class="post"]');

        foreach ($articleNodes as $article) {
            $titleNode = $xpath->query('.//h1 | .//h2', $article)->item(0);
            $contentNode = $xpath->query('.//div[@class="content"]', $article)->item(0);
            $dateNode = $xpath->query('.//time[@datetime]', $article)->item(0);

            if ($titleNode) {
                $items[] = [
                    'title' => trim($titleNode->textContent),
                    'content' => $contentNode ? trim($contentNode->textContent) : '',
                    'date' => $dateNode ? $dateNode->getAttribute('datetime') : null,
                    'url' => $this->extractUrl($article, $xpath)
                ];
            }
        }

        return $items;
    }

    private function extractUrl($articleNode, $xpath) {
        $linkNode = $xpath->query('.//a[@href]', $articleNode)->item(0);
        return $linkNode ? $linkNode->getAttribute('href') : null;
    }

    private function shouldContinue($html, $currentPage) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Method 1: Check for an enabled "Next" button
        $nextButton = $xpath->query('//a[contains(@class, "next") and not(contains(@class, "disabled"))]');
        if ($nextButton->length === 0) {
            return false;
        }

        // Method 2: Check pagination info
        $paginationInfo = $xpath->query('//span[@class="pagination-info"]')->item(0);
        if ($paginationInfo) {
            $text = $paginationInfo->textContent;
            // Parse "Page 5 of 10" format
            if (preg_match('/Page (\d+) of (\d+)/', $text, $matches)) {
                return (int)$matches[1] < (int)$matches[2];
            }
        }

        return true;
    }
}
?>
Handling Different Pagination Patterns
URL Parameter Pagination
Many sites use URL parameters for pagination:
<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class UrlParameterPagination {
    private $baseUrl;
    private $client;

    public function __construct($baseUrl) {
        $this->baseUrl = $baseUrl;
        // Disable Guzzle's exceptions on 4xx/5xx responses so we can
        // inspect the status code (e.g. a 404 past the last page) ourselves
        $this->client = new Client(['http_errors' => false]);
    }

    public function scrapeByParameters($paramName = 'page', $startPage = 1) {
        $allData = [];
        $page = $startPage;

        while (true) {
            $url = $this->baseUrl . "?" . $paramName . "=" . $page;

            try {
                $response = $this->client->get($url);

                // Check if the page exists (some sites return 404 past the
                // last page, others return a 200 with empty content)
                if ($response->getStatusCode() === 404) {
                    break;
                }

                $html = $response->getBody()->getContents();
                $data = $this->extractItems($html);

                if (empty($data)) {
                    break;
                }

                $allData = array_merge($allData, $data);
                $page++;

                // Optional: Check for explicit pagination end markers
                if ($this->isLastPage($html)) {
                    break;
                }

                usleep(750000); // 0.75 second delay
            } catch (Exception $e) {
                echo "Error on page $page: " . $e->getMessage() . "\n";
                break;
            }
        }

        return $allData;
    }

    private function extractItems($html) {
        // Implementation depends on site structure
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        $items = [];
        $nodes = $xpath->query('//div[@class="search-result"]');

        foreach ($nodes as $node) {
            $items[] = [
                'title' => $this->getNodeText($xpath, './/h3', $node),
                'price' => $this->getNodeText($xpath, './/span[@class="price"]', $node),
                'link' => $this->getNodeAttribute($xpath, './/a[@href]', $node, 'href')
            ];
        }

        return $items;
    }

    private function getNodeText($xpath, $query, $context) {
        $node = $xpath->query($query, $context)->item(0);
        return $node ? trim($node->textContent) : null;
    }

    private function getNodeAttribute($xpath, $query, $context, $attribute) {
        $node = $xpath->query($query, $context)->item(0);
        return $node ? $node->getAttribute($attribute) : null;
    }

    private function isLastPage($html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Look for indicators of the last page
        $lastPageIndicators = [
            '//div[contains(@class, "no-more-results")]',
            '//span[contains(text(), "End of results")]',
            '//a[@class="next disabled"]'
        ];

        foreach ($lastPageIndicators as $indicator) {
            if ($xpath->query($indicator)->length > 0) {
                return true;
            }
        }

        return false;
    }
}
?>
JSON API Pagination
For sites that use AJAX/JSON for pagination:
<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class JsonApiPagination {
    private $apiEndpoint;
    private $client;

    public function __construct($apiEndpoint) {
        $this->apiEndpoint = $apiEndpoint;
        $this->client = new Client();
    }

    public function scrapeJsonPagination($itemsPerPage = 20) {
        $allItems = [];
        $offset = 0;

        while (true) {
            $response = $this->client->get($this->apiEndpoint, [
                'query' => [
                    'limit' => $itemsPerPage,
                    'offset' => $offset
                ],
                'headers' => [
                    'Accept' => 'application/json',
                    'X-Requested-With' => 'XMLHttpRequest'
                ]
            ]);

            $data = json_decode($response->getBody()->getContents(), true);

            if (empty($data['items'])) {
                break;
            }

            $allItems = array_merge($allItems, $data['items']);

            // A short page means we've reached the end
            if (count($data['items']) < $itemsPerPage) {
                break;
            }

            // Check for pagination metadata
            if (isset($data['has_more']) && !$data['has_more']) {
                break;
            }

            $offset += $itemsPerPage;
            usleep(500000); // Rate limiting
        }

        return $allItems;
    }
}
?>
Best Practices and Error Handling
Robust Error Handling
<?php
require_once 'vendor/autoload.php';

use GuzzleHttp\Client;

class RobustPaginationScraper {
    private $client;
    private $maxRetries = 3;
    private $retryDelay = 2; // seconds

    public function __construct() {
        $this->client = new Client();
    }

    private function fetchWithRetry($url, $attempt = 1) {
        try {
            $response = $this->client->get($url, [
                'timeout' => 30,
                'connect_timeout' => 10
            ]);
            return $response->getBody()->getContents();
        } catch (Exception $e) {
            if ($attempt < $this->maxRetries) {
                echo "Attempt $attempt failed, retrying in {$this->retryDelay} seconds...\n";
                sleep($this->retryDelay);
                return $this->fetchWithRetry($url, $attempt + 1);
            }
            throw $e;
        }
    }

    private function validatePageContent($html) {
        // Check for common error indicators
        $errorIndicators = [
            'blocked',
            'rate limit',
            'too many requests',
            'service unavailable'
        ];

        $lowercaseHtml = strtolower($html);
        foreach ($errorIndicators as $indicator) {
            if (strpos($lowercaseHtml, $indicator) !== false) {
                throw new Exception("Page content indicates error: $indicator");
            }
        }

        // Check for a minimum content length
        if (strlen($html) < 1000) {
            throw new Exception("Page content too short, possible error page");
        }

        return true;
    }
}
?>
Performance Optimization Tips
- Implement intelligent delays: Use exponential backoff for rate limiting
- Use connection pooling: Reuse HTTP connections when possible
- Cache parsed DOM objects: Avoid re-parsing the same content
- Parallel processing: For large sites, consider using tools like ReactPHP or Swoole
- Memory management: Process pages in batches to avoid memory exhaustion
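The exponential backoff mentioned in the first tip can be sketched as a small helper. The base delay, cap, and jitter fraction below are illustrative choices, not defaults from any library:

```php
<?php
// Exponential backoff with jitter: the wait doubles after each failed
// attempt, capped so a long outage doesn't produce hour-long sleeps.
function backoffDelay(int $attempt, float $base = 1.0, float $cap = 30.0): float
{
    // 1st attempt waits $base seconds, 2nd waits 2*$base, 3rd 4*$base, ...
    $delay = min($cap, $base * (2 ** ($attempt - 1)));

    // Up to 25% random jitter so parallel workers don't retry in lockstep
    $jitter = $delay * (mt_rand(0, 250) / 1000);

    return $delay + $jitter;
}

// Usage inside a retry loop:
// usleep((int) (backoffDelay($attempt) * 1000000));
```

Capping the delay and adding jitter are standard refinements; tune both to the tolerance of the site you are scraping.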
Handling JavaScript-Heavy Pagination
For sites that rely heavily on JavaScript for pagination, you may need to integrate a headless browser. While this guide focuses on PHP-native solutions, tools like Puppeteer can render dynamic content that only loads after page navigation.
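One common pattern is to shell out from PHP to a small Node script that renders the page with Puppeteer and prints the final HTML to stdout. The script name scrape-page.js below is hypothetical; this is a sketch of the bridge, not a ready-made integration:

```php
<?php
// Sketch of bridging PHP to a headless browser. Assumes a hypothetical
// Node script "scrape-page.js" that renders the given URL with Puppeteer
// and prints the resulting HTML (e.g. via page.content()) to stdout.
function buildRenderCommand(string $url, string $script = 'scrape-page.js'): string
{
    // escapeshellarg() prevents shell injection via crafted URLs
    return 'node ' . escapeshellarg($script) . ' ' . escapeshellarg($url);
}

function fetchRenderedHtml(string $url): ?string
{
    $html = shell_exec(buildRenderCommand($url));
    return (is_string($html) && $html !== '') ? $html : null;
}
```

The returned HTML can then be fed to the same DOMDocument/DOMXPath extraction code used elsewhere in this guide.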
Conclusion
Effective pagination handling in PHP requires understanding the specific pagination pattern used by your target website and implementing robust error handling and rate limiting. The examples provided cover the most common scenarios you'll encounter. Remember to always respect robots.txt files and implement appropriate delays to avoid overwhelming the target servers.
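The robots.txt check mentioned above can be sketched minimally, assuming only plain Disallow: rules under User-agent: * (a real parser also handles Allow:, wildcards, and per-agent groups):

```php
<?php
// Minimal robots.txt check: returns false if $path falls under a
// Disallow: prefix in the "User-agent: *" group. A sketch only.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $applies = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            // Track whether the current group applies to all agents
            $applies = trim(substr($line, 11)) === '*';
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // An empty Disallow means "allow everything"
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }
    return true;
}
```

Fetch https://target-site/robots.txt once at startup and consult this check before queueing each paginated URL.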
For complex scenarios involving JavaScript-rendered content, consider combining PHP scraping with headless browser solutions or explore advanced authentication techniques when dealing with protected content.
The key to successful pagination scraping is patience, robust error handling, and respectful rate limiting. Start with simple approaches and gradually add complexity as needed for your specific use case.