How do I handle pagination when scraping multiple pages in Java?

Handling pagination is one of the most common challenges in web scraping, especially when dealing with large datasets spread across multiple pages. This comprehensive guide covers various pagination patterns and how to handle them effectively in Java using popular libraries like JSoup, HttpClient, and Selenium.

Understanding Pagination Types

Before diving into implementation, it's important to understand the different types of pagination you might encounter:

1. URL-based Pagination

The simplest form, where page numbers are part of the URL:

- https://example.com/products?page=1
- https://example.com/products?offset=0&limit=20
- https://example.com/products/page/2

2. Button-based Pagination

Pages with "Next" buttons or numbered page links that require clicking.

3. Infinite Scroll

Content loads dynamically as the user scrolls down the page.

4. Load More Buttons

A button that loads additional content when clicked.

URL-based Pagination with JSoup and HttpClient

Here's a complete example using JSoup for HTML parsing and Java's built-in HttpClient for making requests:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class PaginationScraper {
    private final HttpClient httpClient;
    private final String baseUrl;
    private final int maxRetries = 3;
    private final Duration requestTimeout = Duration.ofSeconds(30);

    public PaginationScraper(String baseUrl) {
        this.baseUrl = baseUrl;
        this.httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }

    public List<String> scrapeAllPages() {
        List<String> allData = new ArrayList<>();
        int currentPage = 1;
        boolean hasMorePages = true;

        while (hasMorePages) {
            try {
                String url = buildPageUrl(currentPage);
                Document page = fetchPage(url);

                if (page == null) {
                    // Either a fetch failure or a 404, which usually signals the end of pagination
                    System.out.println("No page returned for page " + currentPage + ", stopping");
                    break;
                }

                // Extract data from current page
                List<String> pageData = extractDataFromPage(page);
                allData.addAll(pageData);

                // Check if there are more pages
                hasMorePages = hasNextPage(page, currentPage);
                currentPage++;

                // Rate limiting - be respectful to the server
                Thread.sleep(1000);

                System.out.println("Scraped page " + (currentPage - 1) + 
                                 ", found " + pageData.size() + " items");

            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                System.err.println("Error processing page " + currentPage + ": " + 
                                 e.getMessage());
                break;
            }
        }

        return allData;
    }

    private String buildPageUrl(int pageNumber) {
        // Adapt this based on the site's URL structure
        return String.format("%s?page=%d", baseUrl, pageNumber);
    }

    private Document fetchPage(String url) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .timeout(requestTimeout)
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                    .GET()
                    .build();

                HttpResponse<String> response = httpClient.send(request, 
                    HttpResponse.BodyHandlers.ofString());

                if (response.statusCode() == 200) {
                    return Jsoup.parse(response.body(), url); // pass the base URI so relative links can be resolved
                } else if (response.statusCode() == 404) {
                    // Likely reached end of pagination
                    return null;
                } else {
                    System.err.println("HTTP " + response.statusCode() + " for " + url);
                }

            } catch (Exception e) {
                System.err.println("Attempt " + attempt + " failed for " + url + 
                                 ": " + e.getMessage());
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(2000 * attempt); // Exponential backoff
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            }
        }
        return null;
    }

    private List<String> extractDataFromPage(Document page) {
        List<String> data = new ArrayList<>();

        // Adapt these selectors based on the site's structure
        Elements items = page.select(".product-item");

        for (Element item : items) {
            String title = item.select(".product-title").text();
            String price = item.select(".product-price").text();
            String link = item.select("a").attr("abs:href"); // absolute URL, resolved against the base URI

            data.add(String.format("Title: %s, Price: %s, Link: %s", 
                                  title, price, link));
        }

        return data;
    }

    private boolean hasNextPage(Document page, int currentPage) {
        // Method 1: Check for "Next" button
        Elements nextButton = page.select("a.next-page, .pagination-next");
        if (!nextButton.isEmpty()) {
            return true;
        }

        // Method 2: Check if current page has minimum expected items
        Elements items = page.select(".product-item");
        int expectedItemsPerPage = 20; // Adjust based on site
        if (items.size() < expectedItemsPerPage) {
            return false;
        }

        // Method 3: Check pagination numbers
        Elements pageNumbers = page.select(".pagination a");
        for (Element pageLink : pageNumbers) {
            try {
                int pageNum = Integer.parseInt(pageLink.text().trim());
                if (pageNum > currentPage) {
                    return true;
                }
            } catch (NumberFormatException ignored) {
                // Not a page number
            }
        }

        return false;
    }
}
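
To tie this together, here's a minimal usage sketch; the base URL is a placeholder, and the CSS selectors inside extractDataFromPage() must be adapted to the real site:

import java.util.List;

public class PaginationScraperExample {
    public static void main(String[] args) {
        // Hypothetical listing URL - replace with the page you actually want to scrape
        PaginationScraper scraper = new PaginationScraper("https://example.com/products");

        List<String> results = scraper.scrapeAllPages();
        System.out.println("Total items scraped: " + results.size());
        results.forEach(System.out::println);
    }
}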

Advanced Pagination with Offset-based URLs

Many APIs and websites use offset-based pagination instead of page numbers. Here's how to handle it, reusing the fetchPage() and extractDataFromPage() helpers from the previous example:

public class OffsetPaginationScraper {
    private final HttpClient httpClient;
    private final String baseUrl;
    private final int itemsPerPage;

    public OffsetPaginationScraper(String baseUrl, int itemsPerPage) {
        this.baseUrl = baseUrl;
        this.itemsPerPage = itemsPerPage;
        this.httpClient = HttpClient.newHttpClient();
    }

    public List<String> scrapeWithOffset() {
        List<String> allData = new ArrayList<>();
        int offset = 0;
        boolean hasMoreData = true;

        while (hasMoreData) {
            String url = String.format("%s?offset=%d&limit=%d", 
                                     baseUrl, offset, itemsPerPage);

            try {
                Document page = fetchPage(url);
                if (page == null) break;

                List<String> pageData = extractDataFromPage(page);

                if (pageData.isEmpty() || pageData.size() < itemsPerPage) {
                    hasMoreData = false;
                }

                allData.addAll(pageData);
                offset += itemsPerPage;

                // Rate limiting
                Thread.sleep(1000);

            } catch (Exception e) {
                System.err.println("Error at offset " + offset + ": " + e.getMessage());
                break;
            }
        }

        return allData;
    }
}

Handling Dynamic Pagination with Selenium

For sites with JavaScript-based pagination, you'll need a browser automation tool like Selenium. This approach is particularly useful when dealing with complex single-page applications or infinite scroll mechanisms:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.JavascriptExecutor;
import java.time.Duration;
import java.util.List;
import java.util.ArrayList;

public class DynamicPaginationScraper {
    private WebDriver driver;
    private WebDriverWait wait;

    public DynamicPaginationScraper() {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // Run in background
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        options.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");

        this.driver = new ChromeDriver(options);
        this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    public List<String> scrapeWithSelenium(String startUrl) {
        List<String> allData = new ArrayList<>();

        try {
            driver.get(startUrl);

            boolean hasMorePages = true;
            int pageCount = 1;

            while (hasMorePages) {
                // Wait for content to load
                wait.until(ExpectedConditions.presenceOfElementLocated(
                    By.className("product-item")));

                // Extract data from current page
                List<WebElement> items = driver.findElements(By.className("product-item"));

                for (WebElement item : items) {
                    try {
                        String title = item.findElement(By.className("product-title")).getText();
                        String price = item.findElement(By.className("product-price")).getText();
                        allData.add(String.format("Page %d - Title: %s, Price: %s", 
                                                 pageCount, title, price));
                    } catch (Exception e) {
                        // Handle missing elements gracefully
                        System.err.println("Error extracting item data: " + e.getMessage());
                    }
                }

                // Try to find and click next button
                hasMorePages = navigateToNextPage();
                pageCount++;

                // Add delay between page loads
                Thread.sleep(2000);

                System.out.println("Completed page " + (pageCount - 1));
            }

        } catch (Exception e) {
            System.err.println("Error during scraping: " + e.getMessage());
        } finally {
            driver.quit();
        }

        return allData;
    }

    private boolean navigateToNextPage() {
        try {
            // Method 1: Look for next button
            WebElement nextButton = driver.findElement(
                By.cssSelector("a.next-page, .pagination-next, button[aria-label='Next']"));

            if (nextButton.isEnabled()) {
                // Scroll to button if needed
                ((JavascriptExecutor) driver).executeScript(
                    "arguments[0].scrollIntoView(true);", nextButton);

                Thread.sleep(500);
                nextButton.click();

                // Wait for new content to load
                Thread.sleep(2000);
                return true;
            }

        } catch (Exception e) {
            // Next button not found or not clickable
            System.out.println("No more pages available");
        }

        return false;
    }

    // Handle infinite scroll pagination
    public List<String> scrapeInfiniteScroll(String url) {
        List<String> allData = new ArrayList<>();

        try {
            driver.get(url);

            JavascriptExecutor js = (JavascriptExecutor) driver;
            long lastHeight = (Long) js.executeScript("return document.body.scrollHeight");

            while (true) {
                // Extract current data
                List<WebElement> items = driver.findElements(By.className("product-item"));

                for (WebElement item : items) {
                    String itemText = item.getText();
                    if (!allData.contains(itemText)) { // Avoid duplicates
                        allData.add(itemText);
                    }
                }

                // Scroll to bottom
                js.executeScript("window.scrollTo(0, document.body.scrollHeight);");

                // Wait for new content to load
                Thread.sleep(3000);

                // Check if more content loaded
                long newHeight = (Long) js.executeScript("return document.body.scrollHeight");
                if (newHeight == lastHeight) {
                    break; // No more content
                }
                lastHeight = newHeight;
            }

        } catch (Exception e) {
            System.err.println("Error during infinite scroll: " + e.getMessage());
        } finally {
            driver.quit();
        }

        return allData;
    }
}
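
A minimal usage sketch, assuming the .product-item selectors above match the target pages; both URLs are placeholders. Note that each scrape needs its own instance, because the methods call driver.quit() when they finish:

import java.util.List;

public class DynamicScraperExample {
    public static void main(String[] args) {
        // Button-based pagination (hypothetical URL)
        List<String> paged = new DynamicPaginationScraper()
            .scrapeWithSelenium("https://example.com/products");

        // Infinite scroll (hypothetical URL)
        List<String> scrolled = new DynamicPaginationScraper()
            .scrapeInfiniteScroll("https://example.com/feed");

        System.out.println("Paginated items: " + paged.size());
        System.out.println("Infinite-scroll items: " + scrolled.size());
    }
}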

Handling Load More Buttons

Some sites use "Load More" buttons instead of traditional pagination. The scraper below reuses the same Selenium imports and browser setup as the previous example:

public class LoadMoreScraper {
    private WebDriver driver;
    private WebDriverWait wait;

    public LoadMoreScraper() {
        // Initialize the browser the same way as DynamicPaginationScraper
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        this.driver = new ChromeDriver(options);
        this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    public List<String> scrapeWithLoadMore(String url) {
        List<String> allData = new ArrayList<>();

        try {
            driver.get(url);

            while (true) {
                // Extract current items
                List<WebElement> items = driver.findElements(By.className("product-item"));
                int currentItemCount = items.size();

                for (WebElement item : items) {
                    String itemText = item.getText();
                    if (!allData.contains(itemText)) {
                        allData.add(itemText);
                    }
                }

                // Look for load more button
                try {
                    WebElement loadMoreButton = wait.until(
                        ExpectedConditions.elementToBeClickable(
                            By.cssSelector("button.load-more, .load-more-btn")));

                    loadMoreButton.click();

                    // Wait for new items to load
                    wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(
                        By.className("product-item"), currentItemCount));

                    Thread.sleep(1000); // Additional buffer

                } catch (Exception e) {
                    // No more load button or timeout
                    break;
                }
            }

        } catch (Exception e) {
            System.err.println("Error during load more scraping: " + e.getMessage());
        } finally {
            if (driver != null) {
                driver.quit();
            }
        }

        return allData;
    }
}

Best Practices and Error Handling

1. Implement Robust Error Handling

public class RobustPaginationScraper {
    private static final int MAX_CONSECUTIVE_FAILURES = 3;
    private int consecutiveFailures = 0;

    private boolean shouldContinueScraping() {
        return consecutiveFailures < MAX_CONSECUTIVE_FAILURES;
    }

    private void handleSuccess() {
        consecutiveFailures = 0;
    }

    private void handleFailure() {
        consecutiveFailures++;
        System.err.println("Consecutive failures: " + consecutiveFailures);
    }
}
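
These counters only pay off inside the fetch loop. Here's one way they could be wired in, as an additional method on the same class; scrapePage() is a hypothetical stand-in for whatever fetch-and-extract logic you use:

    public List<String> scrapeAllPagesRobustly() {
        List<String> allData = new ArrayList<>();
        int currentPage = 1;

        while (shouldContinueScraping()) {
            List<String> pageData = scrapePage(currentPage); // hypothetical fetch-and-extract method

            if (pageData == null) {
                handleFailure();          // count the failure, stop only after MAX_CONSECUTIVE_FAILURES
            } else if (pageData.isEmpty()) {
                break;                    // an empty page usually means the end of pagination
            } else {
                handleSuccess();          // any successful page resets the counter
                allData.addAll(pageData);
            }
            currentPage++;
        }

        return allData;
    }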

2. Respect Rate Limits

Always implement delays between requests and respect the website's robots.txt file:

private void respectRateLimit() throws InterruptedException {
    // Randomize delays to appear more human-like
    int delay = 1000 + (int) (Math.random() * 2000); // 1-3 seconds
    Thread.sleep(delay);
}
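
The delay above covers pacing; for robots.txt, a minimal approach is to download the file once and skip any path it disallows. This is only a rough sketch that handles simple Disallow rules under User-agent: * (not a full robots.txt parser), and it assumes the class already has an httpClient field like the earlier examples:

private final Set<String> disallowedPaths = new HashSet<>();

private void loadRobotsTxt(String siteRoot) {
    try {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(siteRoot + "/robots.txt"))
            .GET()
            .build();
        HttpResponse<String> response = httpClient.send(request,
            HttpResponse.BodyHandlers.ofString());

        boolean appliesToUs = false;
        for (String line : response.body().split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                appliesToUs = trimmed.substring(11).trim().equals("*");
            } else if (appliesToUs && trimmed.toLowerCase().startsWith("disallow:")) {
                String path = trimmed.substring(9).trim();
                if (!path.isEmpty()) {
                    disallowedPaths.add(path);
                }
            }
        }
    } catch (Exception e) {
        System.err.println("Could not load robots.txt: " + e.getMessage());
    }
}

private boolean isAllowed(String path) {
    // Check a candidate path (e.g. "/products?page=2") against the collected Disallow rules
    return disallowedPaths.stream().noneMatch(path::startsWith);
}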

3. Use Connection Pooling

For high-volume scraping, reuse a single HttpClient instance (it keeps connections alive and pools them internally) and give it a fixed-size executor so concurrent requests stay bounded:

HttpClient httpClient = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(10))
    .executor(Executors.newFixedThreadPool(5))
    .build();
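
With an executor configured, the same client can fetch several pages concurrently through sendAsync(). A small sketch, assuming the page URLs can be built up front (for example, when the last page number is visible in the pagination bar); keep the concurrency low and combine it with the delays above so the server isn't overwhelmed:

List<CompletableFuture<String>> futures = new ArrayList<>();
for (int page = 1; page <= 10; page++) { // 10 is an arbitrary example page count
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://example.com/products?page=" + page)) // hypothetical URL
        .GET()
        .build();
    futures.add(httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
        .thenApply(HttpResponse::body));
}

// Block until each page is available, then parse it with JSoup
for (CompletableFuture<String> future : futures) {
    Document page = Jsoup.parse(future.join());
    // ... extract data from each page
}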

Conclusion

Handling pagination in Java requires understanding the specific pagination pattern used by the target website and choosing the appropriate scraping approach. For simple URL-based pagination, JSoup with HttpClient is efficient and lightweight. For complex JavaScript-heavy sites, Selenium provides the necessary browser automation capabilities.

Remember to always implement proper error handling, respect rate limits, and follow ethical scraping practices. Consider using browser automation tools for complex scenarios where traditional HTTP clients fall short.

The key to successful pagination handling is to start simple with URL-based approaches and progressively move to more sophisticated techniques like browser automation only when necessary. Always test your pagination logic thoroughly with edge cases like empty pages, network failures, and unexpected page structures.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
