How do I handle pagination when scraping multiple pages in Java?
Handling pagination is one of the most common challenges in web scraping, especially when a large dataset is spread across multiple pages. This guide covers the common pagination patterns and how to handle each of them in Java using JSoup, the JDK's built-in HttpClient, and Selenium.
Understanding Pagination Types
Before diving into implementation, it's important to understand the different types of pagination you might encounter:
1. URL-based Pagination
The simplest form where page numbers are part of the URL:
- https://example.com/products?page=1
- https://example.com/products?offset=0&limit=20
- https://example.com/products/page/2
2. Button-based Pagination
Pages with "Next" buttons or numbered page links that require clicking.
3. Infinite Scroll
Content loads dynamically as the user scrolls down the page.
4. Load More Buttons
A button that loads additional content when clicked.
URL-based Pagination with JSoup and HttpClient
Here's a comprehensive example using JSoup for HTML parsing and HttpClient for making requests:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
public class PaginationScraper {
private final HttpClient httpClient;
private final String baseUrl;
private final int maxRetries = 3;
private final Duration requestTimeout = Duration.ofSeconds(30);
public PaginationScraper(String baseUrl) {
this.baseUrl = baseUrl;
this.httpClient = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.build();
}
public List<String> scrapeAllPages() {
List<String> allData = new ArrayList<>();
int currentPage = 1;
boolean hasMorePages = true;
while (hasMorePages) {
try {
String url = buildPageUrl(currentPage);
Document page = fetchPage(url);
if (page == null) {
System.err.println("Failed to fetch page " + currentPage);
break;
}
// Extract data from current page
List<String> pageData = extractDataFromPage(page);
allData.addAll(pageData);
// Check if there are more pages
hasMorePages = hasNextPage(page, currentPage);
currentPage++;
// Rate limiting - be respectful to the server
Thread.sleep(1000);
System.out.println("Scraped page " + (currentPage - 1) +
", found " + pageData.size() + " items");
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
} catch (Exception e) {
System.err.println("Error processing page " + currentPage + ": " +
e.getMessage());
break;
}
}
return allData;
}
private String buildPageUrl(int pageNumber) {
// Adapt this based on the site's URL structure
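// Other common patterns (hypothetical examples):
//   return String.format("%s/page/%d", baseUrl, pageNumber);                       // path-based
//   return String.format("%s?offset=%d&limit=20", baseUrl, (pageNumber - 1) * 20); // offset-based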
return String.format("%s?page=%d", baseUrl, pageNumber);
}
private Document fetchPage(String url) {
for (int attempt = 1; attempt <= maxRetries; attempt++) {
try {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(url))
.timeout(requestTimeout)
.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
if (response.statusCode() == 200) {
return Jsoup.parse(response.body());
} else if (response.statusCode() == 404) {
// Likely reached end of pagination
return null;
} else {
System.err.println("HTTP " + response.statusCode() + " for " + url);
}
} catch (Exception e) {
System.err.println("Attempt " + attempt + " failed for " + url +
": " + e.getMessage());
if (attempt < maxRetries) {
try {
Thread.sleep(2000L * attempt); // Back off longer with each retry
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
return null;
}
}
}
}
return null;
}
private List<String> extractDataFromPage(Document page) {
List<String> data = new ArrayList<>();
// Adapt these selectors based on the site's structure
Elements items = page.select(".product-item");
for (Element item : items) {
String title = item.select(".product-title").text();
String price = item.select(".product-price").text();
String link = item.select("a").attr("href");
data.add(String.format("Title: %s, Price: %s, Link: %s",
title, price, link));
}
return data;
}
private boolean hasNextPage(Document page, int currentPage) {
// Method 1: Check for "Next" button
Elements nextButton = page.select("a.next-page, .pagination-next");
if (!nextButton.isEmpty()) {
return true;
}
// Method 2: A page with fewer items than expected is usually the last page
Elements items = page.select(".product-item");
int expectedItemsPerPage = 20; // Adjust based on the site
if (items.size() < expectedItemsPerPage) {
return false;
}
// Method 3: Check pagination numbers
Elements pageNumbers = page.select(".pagination a");
for (Element pageLink : pageNumbers) {
try {
int pageNum = Integer.parseInt(pageLink.text().trim());
if (pageNum > currentPage) {
return true;
}
} catch (NumberFormatException ignored) {
// Not a page number
}
}
return false;
}
}
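As a quick sanity check, here is a minimal usage sketch for the class above; the listing URL is a placeholder, and the CSS selectors in extractDataFromPage() would need to match the real site:
import java.util.List;
public class PaginationScraperExample {
    public static void main(String[] args) {
        // Hypothetical listing URL; buildPageUrl() appends ?page=N to it
        PaginationScraper scraper = new PaginationScraper("https://example.com/products");
        List<String> results = scraper.scrapeAllPages();
        System.out.println("Scraped " + results.size() + " items in total");
        results.stream().limit(5).forEach(System.out::println);
    }
}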
Advanced Pagination with Offset-based URLs
Many APIs and websites use offset-based pagination, where each request asks for a slice of results via offset and limit parameters. The example below reuses the fetchPage() and extractDataFromPage() helpers from the previous class:
public class OffsetPaginationScraper {
private final HttpClient httpClient;
private final String baseUrl;
private final int itemsPerPage;
public OffsetPaginationScraper(String baseUrl, int itemsPerPage) {
this.baseUrl = baseUrl;
this.itemsPerPage = itemsPerPage;
this.httpClient = HttpClient.newHttpClient();
}
public List<String> scrapeWithOffset() {
List<String> allData = new ArrayList<>();
int offset = 0;
boolean hasMoreData = true;
while (hasMoreData) {
String url = String.format("%s?offset=%d&limit=%d",
baseUrl, offset, itemsPerPage);
try {
Document page = fetchPage(url);
if (page == null) break;
List<String> pageData = extractDataFromPage(page);
if (pageData.isEmpty() || pageData.size() < itemsPerPage) {
hasMoreData = false;
}
allData.addAll(pageData);
offset += itemsPerPage;
// Rate limiting
Thread.sleep(1000);
} catch (Exception e) {
System.err.println("Error at offset " + offset + ": " + e.getMessage());
break;
}
}
return allData;
}
}
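Usage is a one-liner; the endpoint URL and page size below are placeholders:
OffsetPaginationScraper scraper = new OffsetPaginationScraper("https://example.com/products", 20);
List<String> items = scraper.scrapeWithOffset();
System.out.println("Fetched " + items.size() + " items");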
Handling Dynamic Pagination with Selenium
For sites with JavaScript-based pagination, you'll need a browser automation tool like Selenium. This approach is particularly useful when dealing with complex single-page applications or infinite scroll mechanisms:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.JavascriptExecutor;
import java.time.Duration;
import java.util.List;
import java.util.ArrayList;
public class DynamicPaginationScraper {
private WebDriver driver;
private WebDriverWait wait;
public DynamicPaginationScraper() {
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless"); // Run in background
options.addArguments("--no-sandbox");
options.addArguments("--disable-dev-shm-usage");
options.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");
this.driver = new ChromeDriver(options);
this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
}
public List<String> scrapeWithSelenium(String startUrl) {
List<String> allData = new ArrayList<>();
try {
driver.get(startUrl);
boolean hasMorePages = true;
int pageCount = 1;
while (hasMorePages) {
// Wait for content to load
wait.until(ExpectedConditions.presenceOfElementLocated(
By.className("product-item")));
// Extract data from current page
List<WebElement> items = driver.findElements(By.className("product-item"));
for (WebElement item : items) {
try {
String title = item.findElement(By.className("product-title")).getText();
String price = item.findElement(By.className("product-price")).getText();
allData.add(String.format("Page %d - Title: %s, Price: %s",
pageCount, title, price));
} catch (Exception e) {
// Handle missing elements gracefully
System.err.println("Error extracting item data: " + e.getMessage());
}
}
// Try to find and click next button
hasMorePages = navigateToNextPage();
pageCount++;
// Add delay between page loads
Thread.sleep(2000);
System.out.println("Completed page " + (pageCount - 1));
}
} catch (Exception e) {
System.err.println("Error during scraping: " + e.getMessage());
} finally {
driver.quit();
}
return allData;
}
private boolean navigateToNextPage() {
try {
// Method 1: Look for next button
WebElement nextButton = driver.findElement(
By.cssSelector("a.next-page, .pagination-next, button[aria-label='Next']"));
if (nextButton.isEnabled()) {
// Scroll to button if needed
((JavascriptExecutor) driver).executeScript(
"arguments[0].scrollIntoView(true);", nextButton);
Thread.sleep(500);
nextButton.click();
// Wait for new content to load
Thread.sleep(2000);
return true;
}
} catch (Exception e) {
// Next button not found or not clickable
System.out.println("No more pages available");
}
return false;
}
// Handle infinite scroll pagination
public List<String> scrapeInfiniteScroll(String url) {
List<String> allData = new ArrayList<>();
try {
driver.get(url);
JavascriptExecutor js = (JavascriptExecutor) driver;
long lastHeight = (Long) js.executeScript("return document.body.scrollHeight");
while (true) {
// Extract current data
List<WebElement> items = driver.findElements(By.className("product-item"));
for (WebElement item : items) {
String itemText = item.getText();
if (!allData.contains(itemText)) { // Avoid duplicates
allData.add(itemText);
}
}
// Scroll to bottom
js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
// Wait for new content to load
Thread.sleep(3000);
// Check if more content loaded
long newHeight = (Long) js.executeScript("return document.body.scrollHeight");
if (newHeight == lastHeight) {
break; // No more content
}
lastHeight = newHeight;
}
} catch (Exception e) {
System.err.println("Error during infinite scroll: " + e.getMessage());
} finally {
driver.quit();
}
return allData;
}
}
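Both methods call driver.quit() in their finally blocks, so each DynamicPaginationScraper instance is good for a single run. A minimal usage sketch, assuming a compatible ChromeDriver is available (recent Selenium versions can resolve it automatically via Selenium Manager) and that the URL is a placeholder:
DynamicPaginationScraper scraper = new DynamicPaginationScraper();
List<String> products = scraper.scrapeWithSelenium("https://example.com/products");
products.forEach(System.out::println);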
Handling Load More Buttons
Some sites use "Load More" buttons instead of traditional pagination:
public class LoadMoreScraper {
private WebDriver driver;
private WebDriverWait wait;
public LoadMoreScraper() {
// Initialize the browser here; without this, driver and wait would be null
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
options.addArguments("--no-sandbox");
this.driver = new ChromeDriver(options);
this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
}
public List<String> scrapeWithLoadMore(String url) {
List<String> allData = new ArrayList<>();
try {
driver.get(url);
while (true) {
// Extract current items
List<WebElement> items = driver.findElements(By.className("product-item"));
int currentItemCount = items.size();
for (WebElement item : items) {
String itemText = item.getText();
if (!allData.contains(itemText)) {
allData.add(itemText);
}
}
// Look for load more button
try {
WebElement loadMoreButton = wait.until(
ExpectedConditions.elementToBeClickable(
By.cssSelector("button.load-more, .load-more-btn")));
loadMoreButton.click();
// Wait for new items to load
wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(
By.className("product-item"), currentItemCount));
Thread.sleep(1000); // Additional buffer
} catch (Exception e) {
// No more load button or timeout
break;
}
}
} catch (Exception e) {
System.err.println("Error during load more scraping: " + e.getMessage());
} finally {
if (driver != null) {
driver.quit();
}
}
return allData;
}
}
Best Practices and Error Handling
1. Implement Robust Error Handling
public class RobustPaginationScraper {
private static final int MAX_CONSECUTIVE_FAILURES = 3;
private int consecutiveFailures = 0;
private boolean shouldContinueScraping() {
return consecutiveFailures < MAX_CONSECUTIVE_FAILURES;
}
private void handleSuccess() {
consecutiveFailures = 0;
}
private void handleFailure() {
consecutiveFailures++;
System.err.println("Consecutive failures: " + consecutiveFailures);
}
}
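Here is a hypothetical method that wires these counters into the paging loop; it assumes RobustPaginationScraper also has the buildPageUrl(), fetchPage(), extractDataFromPage(), and hasNextPage() helpers shown in the first example:
public List<String> scrapeResiliently() {
    List<String> allData = new ArrayList<>();
    int currentPage = 1;
    boolean hasMorePages = true;
    while (hasMorePages && shouldContinueScraping()) {
        Document page = fetchPage(buildPageUrl(currentPage));
        if (page == null) {
            handleFailure();   // count the miss, keep going until the threshold is hit
        } else {
            handleSuccess();   // any successful fetch resets the counter
            allData.addAll(extractDataFromPage(page));
            hasMorePages = hasNextPage(page, currentPage);
        }
        currentPage++;
    }
    return allData;
}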
2. Respect Rate Limits
Always implement delays between requests and respect the website's robots.txt file:
private void respectRateLimit() throws InterruptedException {
// Randomize delays to appear more human-like
int delay = 1000 + (int) (Math.random() * 2000); // 1-3 seconds
Thread.sleep(delay);
}
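For robots.txt, a deliberately simplified check is to fetch it once and look for Disallow rules that apply to the path you plan to scrape. This sketch ignores user-agent groups and wildcards; a production crawler should use a proper robots.txt parser:
// Simplified robots.txt check: returns true if the path matches any Disallow rule.
private boolean isDisallowed(String siteRoot, String path) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(siteRoot + "/robots.txt"))
            .GET()
            .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
        return false; // no robots.txt found; proceed with normal politeness
    }
    for (String line : response.body().split("\n")) {
        line = line.trim();
        if (line.toLowerCase().startsWith("disallow:")) {
            String rule = line.substring("disallow:".length()).trim();
            if (!rule.isEmpty() && path.startsWith(rule)) {
                return true;
            }
        }
    }
    return false;
}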
3. Use Connection Pooling
For high-volume scraping, reuse a single HttpClient instance: it keeps connections alive and reuses them across requests automatically. If you send requests asynchronously, you can also supply a custom executor to control the thread pool:
HttpClient httpClient = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.executor(Executors.newFixedThreadPool(5))
.build();
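The executor above comes into play when requests are issued asynchronously. Here is a small sketch that downloads several page URLs in parallel with sendAsync, building on the httpClient defined above (it assumes the usual java.net.http, java.util.concurrent, and java.util.stream imports, and the URLs are placeholders; you still need per-host rate limiting):
List<String> pageUrls = List.of(
        "https://example.com/products?page=1",
        "https://example.com/products?page=2",
        "https://example.com/products?page=3");

List<CompletableFuture<String>> futures = pageUrls.stream()
        .map(url -> HttpRequest.newBuilder().uri(URI.create(url)).GET().build())
        .map(req -> httpClient.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body))
        .collect(Collectors.toList());

// Block until all pages have been downloaded
List<String> bodies = futures.stream()
        .map(CompletableFuture::join)
        .collect(Collectors.toList());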
Conclusion
Handling pagination in Java requires understanding the specific pagination pattern used by the target website and choosing the appropriate scraping approach. For simple URL-based pagination, JSoup with HttpClient is efficient and lightweight. For complex JavaScript-heavy sites, Selenium provides the necessary browser automation capabilities.
Remember to always implement proper error handling, respect rate limits, and follow ethical scraping practices. Consider using browser automation tools for complex scenarios where traditional HTTP clients fall short.
The key to successful pagination handling is to start simple with URL-based approaches and progressively move to more sophisticated techniques like browser automation only when necessary. Always test your pagination logic thoroughly with edge cases like empty pages, network failures, and unexpected page structures.