How do I scrape data from paginated content?
Scraping data from paginated content is one of the most common challenges in web scraping. When websites split large datasets across multiple pages, you need to systematically navigate through each page to collect all available data. This guide will show you how to effectively scrape paginated content using Simple HTML DOM in PHP, along with alternative approaches for different scenarios.
Understanding Pagination Patterns
Before diving into code, it's crucial to understand the different types of pagination you'll encounter (a quick detection sketch follows this list):
1. Numbered Pagination
The most common type with numbered links (1, 2, 3, ..., Next)
2. Next/Previous Pagination
Simple navigation with only "Next" and "Previous" buttons
3. Load More Pagination
Pages that load additional content dynamically when clicking a "Load More" button
4. Infinite Scroll Pagination
Content loads automatically as you scroll down the page
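If you're not sure which pattern a site uses, a quick probe of the page markup can narrow it down. The sketch below is a heuristic, not a standard: selectors such as a.next and button.load-more are assumptions you should adapt to the markup you actually see.
<?php
require_once('simple_html_dom.php');

// Heuristic sketch: guess the pagination pattern from common markers.
// All selectors here are assumptions; adjust them per site.
function detectPaginationType($html) {
    if ($html->find('a[rel="next"]', 0) || $html->find('.pagination a', 0)) {
        return 'numbered';       // numbered links and/or rel="next" hints
    }
    if ($html->find('a.next', 0) || $html->find('a.prev', 0)) {
        return 'next-previous';  // plain Next/Previous navigation
    }
    if ($html->find('button.load-more', 0)) {
        return 'load-more';      // content is fetched via AJAX on click
    }
    return 'unknown';            // often infinite scroll (JavaScript-driven)
}
?>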
Basic Pagination Scraping with Simple HTML DOM
Here's a fundamental approach to scraping paginated content using PHP and Simple HTML DOM:
<?php
require_once('simple_html_dom.php');
function scrapePaginatedData($baseUrl, $maxPages = 10) {
    $allData = [];
    $currentPage = 1;

    while ($currentPage <= $maxPages) {
        // Construct the URL for the current page
        $url = $baseUrl . "?page=" . $currentPage;

        // Create DOM object
        $html = file_get_html($url);
        if (!$html) {
            echo "Failed to load page: $url\n";
            break;
        }

        // Extract data from current page
        $pageData = extractDataFromPage($html);

        // If no data found, we might have reached the end
        if (empty($pageData)) {
            echo "No data found on page $currentPage. Stopping.\n";
            break;
        }

        // Add data to our collection
        $allData = array_merge($allData, $pageData);
        echo "Scraped page $currentPage: " . count($pageData) . " items\n";

        // Clean up memory
        $html->clear();
        unset($html);

        // Add delay to be respectful to the server
        sleep(1);
        $currentPage++;
    }

    return $allData;
}
function extractDataFromPage($html) {
    $data = [];

    // Example: Extract product information
    foreach ($html->find('.product-item') as $item) {
        // Guard each lookup: find() returns null when nothing matches,
        // so reading ->plaintext directly would raise a PHP warning
        $nameEl  = $item->find('.product-name', 0);
        $priceEl = $item->find('.product-price', 0);
        $linkEl  = $item->find('a', 0);

        $product = [
            'name'  => $nameEl ? trim($nameEl->plaintext) : '',
            'price' => $priceEl ? trim($priceEl->plaintext) : '',
            'url'   => $linkEl ? $linkEl->href : ''
        ];

        if (!empty($product['name'])) {
            $data[] = $product;
        }
    }

    return $data;
}
// Usage
$baseUrl = "https://example-shop.com/products";
$scrapedData = scrapePaginatedData($baseUrl, 50);
echo "Total items scraped: " . count($scrapedData) . "\n";
?>
Advanced Pagination Detection
For more robust scraping, implement automatic pagination detection:
<?php
function scrapeWithAutoPagination($baseUrl) {
    $allData = [];
    $currentUrl = $baseUrl;
    $visitedUrls = [];

    while ($currentUrl && !in_array($currentUrl, $visitedUrls)) {
        $visitedUrls[] = $currentUrl;
        echo "Scraping: $currentUrl\n";

        $html = file_get_html($currentUrl);
        if (!$html) break;

        // Extract data from current page
        $pageData = extractDataFromPage($html);
        $allData = array_merge($allData, $pageData);

        // Find next page URL
        $nextUrl = findNextPageUrl($html, $currentUrl);

        $html->clear();
        unset($html);

        $currentUrl = $nextUrl;
        sleep(1); // Rate limiting
    }

    return $allData;
}

function findNextPageUrl($html, $currentUrl) {
    // Method 1: Look for "Next" button
    $nextLink = $html->find('a.next', 0);
    if ($nextLink && $nextLink->href) {
        return makeAbsoluteUrl($nextLink->href, $currentUrl);
    }

    // Method 2: Look for numbered pagination
    $paginationLinks = $html->find('.pagination a');
    foreach ($paginationLinks as $link) {
        if (stripos($link->plaintext, 'next') !== false) {
            return makeAbsoluteUrl($link->href, $currentUrl);
        }
    }

    // Method 3: Look for rel="next" attribute
    $nextRel = $html->find('a[rel="next"]', 0);
    if ($nextRel && $nextRel->href) {
        return makeAbsoluteUrl($nextRel->href, $currentUrl);
    }

    return null;
}
function makeAbsoluteUrl($relativeUrl, $baseUrl) {
    // Already absolute? Return it unchanged
    if (filter_var($relativeUrl, FILTER_VALIDATE_URL)) {
        return $relativeUrl;
    }

    $base = parse_url($baseUrl);
    $scheme = $base['scheme'] ?? 'https';
    $host = $base['host'] ?? '';

    // Root-relative URL (starts with "/")
    if (strpos($relativeUrl, '/') === 0) {
        return $scheme . '://' . $host . $relativeUrl;
    }

    // Document-relative URL: resolve against the base path, trimming
    // trailing separators so we don't emit a double slash
    $path = rtrim(dirname($base['path'] ?? '/'), '/\\') . '/';
    return $scheme . '://' . $host . $path . $relativeUrl;
}
?>
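A couple of quick checks illustrate how the helper resolves both root-relative and document-relative links (the URLs are purely illustrative):
<?php
// Illustrative inputs for makeAbsoluteUrl()
echo makeAbsoluteUrl('/page/2', 'https://example.com/products?page=1') . "\n";
// -> https://example.com/page/2

echo makeAbsoluteUrl('page2.html', 'https://example.com/products/index.html') . "\n";
// -> https://example.com/products/page2.html
?>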
Handling Different Pagination Patterns
Query Parameter Pagination
Many sites use query parameters for pagination:
<?php
function scrapeQueryParamPagination($baseUrl) {
    $allData = [];
    $page = 1;
    $hasNextPage = true;

    while ($hasNextPage) {
        $url = $baseUrl . "?page=" . $page;
        $html = file_get_html($url);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (empty($pageData)) {
            $hasNextPage = false;
        } else {
            $allData = array_merge($allData, $pageData);
            echo "Page $page: " . count($pageData) . " items\n";
        }

        $html->clear();
        unset($html);
        $page++;
        sleep(1);
    }

    return $allData;
}
?>
Offset-Based Pagination
Some APIs and websites use offset/limit parameters:
<?php
function scrapeOffsetPagination($baseUrl, $limit = 20) {
    $allData = [];
    $offset = 0;
    $hasMoreData = true;

    while ($hasMoreData) {
        $url = $baseUrl . "?limit=" . $limit . "&offset=" . $offset;
        $html = file_get_html($url);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (count($pageData) < $limit) {
            $hasMoreData = false;
        }

        if (!empty($pageData)) {
            $allData = array_merge($allData, $pageData);
            echo "Offset $offset: " . count($pageData) . " items\n";
        } else {
            $hasMoreData = false;
        }

        $html->clear();
        unset($html);
        $offset += $limit;
        sleep(1);
    }

    return $allData;
}
?>
Advanced Techniques and Best Practices
Using cURL with Cookies for Session Management
For websites that require session management:
<?php
function scrapeWithSession($baseUrl) {
    $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');
    $allData = [];
    $page = 1;

    while (true) {
        $url = $baseUrl . "?page=" . $page;

        // Use cURL with cookie support
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Web Scraper)');
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // Don't hang forever on a slow page

        $htmlContent = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpCode !== 200 || !$htmlContent) {
            break;
        }

        $html = str_get_html($htmlContent);
        if (!$html) break;

        $pageData = extractDataFromPage($html);
        if (empty($pageData)) {
            break;
        }

        $allData = array_merge($allData, $pageData);

        $html->clear();
        unset($html);
        $page++;
        sleep(2); // Longer delay for respectful scraping
    }

    unlink($cookieFile); // Clean up
    return $allData;
}
?>
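Usage mirrors the earlier helpers; the temporary cookie jar keeps the session alive across page requests (the URL is hypothetical):
<?php
$memberData = scrapeWithSession("https://example-shop.com/members/listings");
echo "Scraped " . count($memberData) . " items with a persistent session\n";
?>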
Error Handling and Retry Logic
Implement robust error handling for production scraping:
<?php
function scrapeWithRetry($url, $maxRetries = 3) {
    $retries = 0;

    while ($retries < $maxRetries) {
        try {
            $html = file_get_html($url);
            if ($html) {
                return $html;
            }
            throw new Exception("Failed to load HTML from: $url");
        } catch (Exception $e) {
            $retries++;
            echo "Attempt $retries failed: " . $e->getMessage() . "\n";
            if ($retries < $maxRetries) {
                echo "Retrying in " . ($retries * 2) . " seconds...\n";
                sleep($retries * 2); // Linear backoff: 2s, then 4s, then 6s
            }
        }
    }

    return false;
}
?>
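The helper slots into any of the loops above; a minimal sketch with a hypothetical URL:
<?php
// Sketch: fetch one page with retries, then reuse extractDataFromPage()
$html = scrapeWithRetry("https://example-shop.com/products?page=1", 3);
if ($html) {
    $items = extractDataFromPage($html);
    echo "Extracted " . count($items) . " items\n";
    $html->clear();
    unset($html);
}
?>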
Handling JavaScript-Heavy Pagination
While Simple HTML DOM cannot execute JavaScript, you can still work with JavaScript-heavy sites by understanding their API endpoints:
<?php
function scrapeAjaxPagination($baseApiUrl) {
    $allData = [];
    $page = 1;

    while (true) {
        $apiUrl = $baseApiUrl . "?page=" . $page . "&format=json";
        $jsonData = file_get_contents($apiUrl);
        if (!$jsonData) break;

        $data = json_decode($jsonData, true);
        if (empty($data['items'])) {
            break;
        }

        $allData = array_merge($allData, $data['items']);
        echo "API Page $page: " . count($data['items']) . " items\n";

        // Check if there are more pages
        if (!isset($data['has_next']) || !$data['has_next']) {
            break;
        }

        $page++;
        sleep(1);
    }

    return $allData;
}
?>
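Some AJAX endpoints only answer requests that look like they came from the page's own JavaScript. A common, though not universal, convention is the X-Requested-With header; here is a sketch of fetching a single JSON page that way (the endpoint and the items field are assumptions):
<?php
// Sketch: fetch one JSON page with AJAX-style request headers
$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => "X-Requested-With: XMLHttpRequest\r\n" .
                    "Accept: application/json\r\n"
    ]
]);
$json = file_get_contents("https://example.com/api/products?page=1", false, $context);
$data = $json ? json_decode($json, true) : null;
echo $data ? count($data['items'] ?? []) . " items\n" : "Request failed\n";
?>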
For complex JavaScript-rendered pagination, consider a headless browser tool such as Puppeteer, which can render dynamic content and navigate between pages programmatically.
Performance Optimization Tips
1. Memory Management
// Always clear DOM objects to prevent memory leaks
$html->clear();
unset($html);
2. Batch Processing
function processPagesInBatches($urls, $batchSize = 10) {
    $batches = array_chunk($urls, $batchSize);
    $allData = [];

    foreach ($batches as $batch) {
        $batchData = [];

        foreach ($batch as $url) {
            $html = file_get_html($url);
            if ($html) {
                $batchData = array_merge($batchData, extractDataFromPage($html));
                $html->clear();
                unset($html);
            }
        }

        $allData = array_merge($allData, $batchData);

        // Process batch data (save to database, etc.)
        processBatchData($batchData);
        sleep(2); // Rest between batches
    }

    return $allData;
}
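The processBatchData() call above is left as a hook; one minimal sketch appends each batch to a JSON Lines file (the filename is arbitrary):
// Minimal sketch of the processBatchData() hook referenced above
function processBatchData($batchData) {
    foreach ($batchData as $item) {
        file_put_contents('scraped_items.jsonl', json_encode($item) . "\n", FILE_APPEND);
    }
}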
3. Rate Limiting
class RateLimiter {
    private $requests = 0;
    private $startTime;
    private $maxRequests;
    private $timeWindow;

    public function __construct($maxRequests = 60, $timeWindow = 60) {
        $this->maxRequests = $maxRequests;
        $this->timeWindow = $timeWindow;
        $this->startTime = time();
    }

    public function throttle() {
        $this->requests++;

        if ($this->requests >= $this->maxRequests) {
            $elapsed = time() - $this->startTime;
            if ($elapsed < $this->timeWindow) {
                $sleepTime = $this->timeWindow - $elapsed;
                echo "Rate limit reached. Sleeping for $sleepTime seconds...\n";
                sleep($sleepTime);
            }

            // Reset counter
            $this->requests = 0;
            $this->startTime = time();
        }
    }
}
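Wiring the limiter into a scraping loop is straightforward; a sketch capping traffic at 30 requests per 60-second window (the URL is hypothetical):
// Usage sketch: at most 30 requests per 60-second window
$limiter = new RateLimiter(30, 60);
for ($page = 1; $page <= 20; $page++) {
    $limiter->throttle();
    $html = file_get_html("https://example-shop.com/products?page=" . $page);
    if ($html) {
        // ... extract data as in the earlier examples ...
        $html->clear();
        unset($html);
    }
}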
JavaScript Pagination with API Endpoints
Many modern websites use AJAX calls for pagination. You can find these endpoints with your browser's developer tools and call them directly:
# Use browser developer tools to find API endpoints
# Network tab -> Filter by XHR/Fetch -> Look for pagination requests
# Example URL: https://api.example.com/products?page=2&limit=20
<?php
function scrapeRestApiPagination($apiEndpoint, $headers = []) {
    $allData = [];
    $page = 1;
    $hasMore = true;

    // Default headers for API requests (Content-Type is omitted:
    // it is meaningless on a body-less GET request)
    $defaultHeaders = [
        'User-Agent: Mozilla/5.0 (compatible; API Scraper)',
        'Accept: application/json'
    ];
    $headers = array_merge($defaultHeaders, $headers);

    while ($hasMore) {
        $url = $apiEndpoint . "?page=" . $page . "&limit=50";
        $context = stream_context_create([
            'http' => [
                'method' => 'GET',
                'header' => implode("\r\n", $headers)
            ]
        ]);

        $response = file_get_contents($url, false, $context);
        if (!$response) {
            echo "Failed to fetch page $page\n";
            break;
        }

        $data = json_decode($response, true);
        if (empty($data['results'])) {
            $hasMore = false;
        } else {
            $allData = array_merge($allData, $data['results']);
            echo "Fetched page $page: " . count($data['results']) . " items\n";

            // Check if there are more pages based on API response
            if (isset($data['has_next'])) {
                $hasMore = $data['has_next'];
            } elseif (isset($data['total_pages'])) {
                $hasMore = $page < $data['total_pages'];
            } else {
                // Fallback: assume no more data if results < limit
                $hasMore = count($data['results']) >= 50;
            }
        }

        $page++;
        usleep(500000); // 0.5 second delay
    }

    return $allData;
}
?>
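If the API requires authentication, pass the extra headers through the second parameter (the endpoint and bearer token below are placeholders):
<?php
$items = scrapeRestApiPagination('https://api.example.com/products', [
    'Authorization: Bearer YOUR_TOKEN_HERE'
]);
echo "Fetched " . count($items) . " items total\n";
?>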
Handling Different Content Types
Scraping Table-Based Pagination
For data presented in tables across multiple pages:
<?php
function scrapeTablePagination($baseUrl) {
    $allRows = [];
    $page = 1;

    while (true) {
        $url = $baseUrl . "?page=" . $page;
        $html = file_get_html($url);
        if (!$html) break;

        $table = $html->find('table.data-table', 0);
        if (!$table) break;

        $rows = $table->find('tr');
        $pageData = [];

        foreach ($rows as $index => $row) {
            // Skip header row
            if ($index === 0) continue;

            $cells = $row->find('td');
            if (count($cells) >= 3) {
                $pageData[] = [
                    'id' => trim($cells[0]->plaintext),
                    'name' => trim($cells[1]->plaintext),
                    'value' => trim($cells[2]->plaintext)
                ];
            }
        }

        if (empty($pageData)) {
            break;
        }

        $allRows = array_merge($allRows, $pageData);
        echo "Page $page: " . count($pageData) . " rows\n";

        $html->clear();
        unset($html);
        $page++;
        sleep(1);
    }

    return $allRows;
}
?>
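Because each row is already a flat array, persisting the result as CSV takes only a few lines (the source URL and output filename are illustrative):
<?php
// Sketch: write the scraped rows to a CSV file
$rows = scrapeTablePagination("https://example.com/report");
$fp = fopen('table_data.csv', 'w');
fputcsv($fp, ['id', 'name', 'value']); // header row
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
echo "Wrote " . count($rows) . " rows to table_data.csv\n";
?>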
Monitoring and Debugging
Logging Progress and Errors
<?php
class PaginationScraper {
    private $logFile;

    public function __construct($logFile = 'scraping.log') {
        $this->logFile = $logFile;
    }

    private function log($message) {
        $timestamp = date('Y-m-d H:i:s');
        file_put_contents($this->logFile, "[$timestamp] $message\n", FILE_APPEND);
    }

    public function scrapeWithLogging($baseUrl, $maxPages = 100) {
        $this->log("Starting pagination scraping for: $baseUrl");

        $allData = [];
        $successfulPages = 0;
        $failedPages = 0;

        for ($page = 1; $page <= $maxPages; $page++) {
            $url = $baseUrl . "?page=" . $page;

            try {
                $html = file_get_html($url);
                if (!$html) {
                    throw new Exception("Failed to load HTML");
                }

                $pageData = extractDataFromPage($html);
                if (empty($pageData)) {
                    $this->log("No data found on page $page. Ending scraping.");
                    break;
                }

                $allData = array_merge($allData, $pageData);
                $successfulPages++;
                $this->log("Successfully scraped page $page: " . count($pageData) . " items");

                $html->clear();
                unset($html);
            } catch (Exception $e) {
                $failedPages++;
                $this->log("Error on page $page: " . $e->getMessage());
                if ($failedPages > 5) {
                    $this->log("Too many failures. Stopping scraping.");
                    break;
                }
            }

            sleep(1);
        }

        $this->log("Scraping completed. Total items: " . count($allData) .
            ", Successful pages: $successfulPages, Failed pages: $failedPages");

        return $allData;
    }
}
?>
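Typical usage (the URL and page cap are illustrative):
<?php
$scraper = new PaginationScraper('scraping.log');
$data = $scraper->scrapeWithLogging("https://example-shop.com/products", 100);
echo "Collected " . count($data) . " items; see scraping.log for details\n";
?>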
Conclusion
Scraping paginated content requires careful planning and robust implementation. While Simple HTML DOM is excellent for server-side rendered pagination, remember that modern websites often use JavaScript for dynamic content loading. In such cases, you might need to complement Simple HTML DOM with API endpoint analysis or use more advanced tools like Puppeteer for handling browser sessions.
Key takeaways for successful pagination scraping:
- Identify the pagination pattern before writing your scraper
- Implement proper error handling and retry logic
- Use rate limiting to be respectful to target servers
- Manage memory efficiently by clearing DOM objects
- Consider the website's robots.txt and terms of service
- Monitor your scraping performance and adjust accordingly
By following these practices and adapting the code examples to your specific use case, you'll be able to efficiently scrape data from paginated websites while maintaining good performance and reliability.