How do I handle rate limiting and implement delays between requests with jsoup?

Rate limiting is a crucial aspect of responsible web scraping that helps prevent server overload and reduces the risk of being blocked by target websites. When using jsoup for web scraping, implementing proper delays and rate limiting strategies ensures your scraper operates ethically and sustainably.

Understanding Rate Limiting

Rate limiting controls the frequency of requests sent to a server within a specific time period. Most websites implement rate limiting to:

  • Prevent server overload and maintain performance
  • Protect against denial-of-service attacks
  • Ensure fair resource usage among users
  • Maintain service quality for legitimate users

When scraping with jsoup, exceeding rate limits can result in:

  • HTTP 429 (Too Many Requests) errors
  • IP address blocking
  • CAPTCHA challenges
  • Temporary or permanent access restrictions
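If you want to detect a 429 directly rather than letting jsoup throw, you can ask jsoup to return error responses and inspect the status code yourself. A minimal sketch, assuming the server sends Retry-After as a number of seconds (the URL is a placeholder):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;

public class RateLimitCheck {
    public static void main(String[] args) throws IOException, InterruptedException {
        Connection.Response response = Jsoup.connect("https://example.com/")
            .ignoreHttpErrors(true) // return 4xx/5xx responses instead of throwing
            .execute();

        if (response.statusCode() == 429) {
            // Retry-After may also be an HTTP date; this sketch assumes a value in seconds
            String retryAfter = response.header("Retry-After");
            long waitMs = (retryAfter != null) ? Long.parseLong(retryAfter) * 1000 : 5000;
            System.out.println("Rate limited, waiting " + waitMs + "ms");
            Thread.sleep(waitMs);
        }
    }
}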

Basic Delay Implementation

The simplest approach to rate limiting with jsoup is implementing fixed delays between requests using Thread.sleep():

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;
import java.util.Arrays;

public class BasicRateLimitedScraper {
    private static final int DELAY_MS = 2000; // 2 seconds between requests

    public void scrapeUrls(List<String> urls) {
        for (String url : urls) {
            try {
                // Fetch the page
                Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .timeout(10000)
                    .get();

                // Process the document
                processDocument(doc, url);

                // Implement delay (except for the last URL)
                if (!url.equals(urls.get(urls.size() - 1))) {
                    Thread.sleep(DELAY_MS);
                }

            } catch (IOException e) {
                System.err.println("Error fetching " + url + ": " + e.getMessage());
            } catch (InterruptedException e) {
                System.err.println("Sleep interrupted: " + e.getMessage());
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    private void processDocument(Document doc, String url) {
        System.out.println("Processing: " + url);
        System.out.println("Title: " + doc.title());
        // Add your scraping logic here
    }
}

Advanced Rate Limiting with Token Bucket

For more sophisticated rate limiting, implement a token bucket algorithm that allows burst requests while maintaining an average rate:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TokenBucketRateLimiter {
    private final AtomicInteger tokens;
    private final int maxTokens;
    private final int refillRate;
    private final ScheduledExecutorService scheduler;

    public TokenBucketRateLimiter(int maxTokens, int refillRate) {
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
        this.tokens = new AtomicInteger(maxTokens);
        this.scheduler = Executors.newScheduledThreadPool(1);

        // Refill tokens at specified rate
        scheduler.scheduleAtFixedRate(this::refillTokens, 1, 1, TimeUnit.SECONDS);
    }

    private void refillTokens() {
        tokens.updateAndGet(current -> Math.min(maxTokens, current + refillRate));
    }

    public boolean tryAcquire() {
        // getAndUpdate returns the previous count; a positive value means a token was consumed
        return tokens.getAndUpdate(current -> current > 0 ? current - 1 : current) > 0;
    }

    public void acquire() throws InterruptedException {
        while (!tryAcquire()) {
            Thread.sleep(100); // Check every 100ms
        }
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}

Using the token bucket with jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;

public class RateLimitedJsoupScraper {
    private final TokenBucketRateLimiter rateLimiter;

    public RateLimitedJsoupScraper() {
        // Allow 10 requests initially, refill 1 token per second
        this.rateLimiter = new TokenBucketRateLimiter(10, 1);
    }

    public Document fetchDocument(String url) throws IOException, InterruptedException {
        // Wait for available token
        rateLimiter.acquire();

        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
            .timeout(10000)
            .get();
    }

    public void scrapeMultiplePages(List<String> urls) {
        for (String url : urls) {
            try {
                Document doc = fetchDocument(url);
                processDocument(doc, url);
            } catch (Exception e) {
                System.err.println("Error processing " + url + ": " + e.getMessage());
            }
        }

        rateLimiter.shutdown();
    }

    private void processDocument(Document doc, String url) {
        System.out.println("Processed " + url + ": " + doc.title());
    }
}

Exponential Backoff for Error Handling

Implement exponential backoff to handle rate limiting errors gracefully:

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Random;

public class ExponentialBackoffScraper {
    private static final int MAX_RETRIES = 3;
    private static final int BASE_DELAY_MS = 1000;
    private final Random random = new Random();

    public Document fetchWithBackoff(String url) throws IOException {
        int attempt = 0;

        while (attempt < MAX_RETRIES) {
            try {
                return Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .timeout(10000)
                    .get();

            } catch (HttpStatusException e) {
                if (e.getStatusCode() == 429 || e.getStatusCode() >= 500) {
                    attempt++;
                    if (attempt >= MAX_RETRIES) {
                        throw new IOException("Max retries exceeded for " + url, e);
                    }

                    // Calculate exponential backoff with jitter
                    int delay = (int) (BASE_DELAY_MS * Math.pow(2, attempt)) + 
                               random.nextInt(1000);

                    System.out.println("Rate limited. Waiting " + delay + "ms before retry...");

                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted during backoff", ie);
                    }
                } else {
                    throw e;
                }
            }
        }

        throw new IOException("Failed to fetch after " + MAX_RETRIES + " attempts");
    }
}

Respecting robots.txt

Always check and respect the robots.txt file to understand crawling guidelines:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class RobotsTxtParser {
    private Map<String, Integer> crawlDelays = new HashMap<>();

    public void parseRobotsTxt(String baseUrl) {
        try {
            URL robotsUrl = new URL(baseUrl + "/robots.txt");
            // try-with-resources ensures the reader is closed even if reading fails
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(robotsUrl.openStream()))) {

                String line;
                String currentUserAgent = null;

                while ((line = reader.readLine()) != null) {
                    line = line.trim().toLowerCase();

                    if (line.startsWith("user-agent:")) {
                        currentUserAgent = line.substring(11).trim();
                    } else if (line.startsWith("crawl-delay:") && currentUserAgent != null) {
                        try {
                            int delay = Integer.parseInt(line.substring(12).trim());
                            crawlDelays.put(currentUserAgent, delay * 1000); // Convert to milliseconds
                        } catch (NumberFormatException e) {
                            // Invalid delay format, ignore
                        }
                    }
                }
            }
        } catch (Exception e) {
            System.err.println("Could not parse robots.txt: " + e.getMessage());
        }
    }

    public int getCrawlDelay(String userAgent) {
        return crawlDelays.getOrDefault(userAgent.toLowerCase(), 1000); // Default 1 second
    }
}
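A short usage sketch tying the parser to a jsoup fetch loop; the class name, URLs, and user agent are placeholders, and getCrawlDelay falls back to one second if robots.txt declared no delay:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;

public class PoliteCrawler {
    public void crawl(List<String> urls) throws IOException, InterruptedException {
        RobotsTxtParser robots = new RobotsTxtParser();
        robots.parseRobotsTxt("https://example.com");

        int delayMs = robots.getCrawlDelay("*"); // delay declared for all user agents, else 1000 ms

        for (String url : urls) {
            Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                .timeout(10000)
                .get();
            System.out.println(doc.title());

            Thread.sleep(delayMs); // honor the advertised crawl-delay between requests
        }
    }
}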

Adaptive Rate Limiting

Implement adaptive rate limiting that adjusts based on server responses:

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class AdaptiveRateLimiter {
    private volatile int currentDelay = 1000; // Start with 1 second
    private final int minDelay = 500;
    private final int maxDelay = 30000;
    private final double increaseMultiplier = 1.5;
    private final double decreaseMultiplier = 0.9;

    public Document fetchAdaptively(String url) throws IOException, InterruptedException {
        while (true) {
            try {
                // Apply current delay
                Thread.sleep(currentDelay);

                long startTime = System.currentTimeMillis();
                Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .timeout(10000)
                    .get();

                long responseTime = System.currentTimeMillis() - startTime;

                // Adjust delay based on response time
                if (responseTime < 500) {
                    // Fast response, can decrease delay
                    currentDelay = Math.max(minDelay, 
                        (int) (currentDelay * decreaseMultiplier));
                } else if (responseTime > 2000) {
                    // Slow response, increase delay
                    currentDelay = Math.min(maxDelay, 
                        (int) (currentDelay * increaseMultiplier));
                }

                return doc;

            } catch (HttpStatusException e) {
                if (e.getStatusCode() == 429) {
                    // Rate limited, increase delay significantly
                    currentDelay = Math.min(maxDelay, 
                        (int) (currentDelay * increaseMultiplier * 2));
                    System.out.println("Rate limited. Increasing delay to " + currentDelay + "ms");
                    continue;
                } else {
                    throw e;
                }
            }
        }
    }
}

Concurrent Scraping with Rate Limiting

For high-volume scraping, use a thread pool with rate limiting:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.*;

public class ConcurrentRateLimitedScraper {
    private final ExecutorService executor;
    private final Semaphore semaphore;
    private final ScheduledExecutorService rateLimitScheduler;

    public ConcurrentRateLimitedScraper(int maxConcurrentRequests, int requestsPerSecond) {
        this.executor = Executors.newFixedThreadPool(maxConcurrentRequests);
        this.semaphore = new Semaphore(requestsPerSecond);
        this.rateLimitScheduler = Executors.newScheduledThreadPool(1);

        // Top the semaphore back up to the per-second budget
        rateLimitScheduler.scheduleAtFixedRate(() -> {
            int deficit = requestsPerSecond - semaphore.availablePermits();
            if (deficit > 0) {
                semaphore.release(deficit);
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    public CompletableFuture<Document> scrapeAsync(String url) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                semaphore.acquire(); // Wait for rate limit permit

                return Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .timeout(10000)
                    .get();

            } catch (Exception e) {
                throw new RuntimeException("Failed to scrape " + url, e);
            }
        }, executor);
    }

    public void shutdown() {
        executor.shutdown();
        rateLimitScheduler.shutdown();
    }
}
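A brief usage sketch for the scraper above (the URLs and pool sizes are placeholders); scrapeAsync returns futures you can join once all requests are queued:

import org.jsoup.nodes.Document;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ConcurrentScraperDemo {
    public static void main(String[] args) {
        // Up to 4 requests in flight, budget of 2 requests per second
        ConcurrentRateLimitedScraper scraper = new ConcurrentRateLimitedScraper(4, 2);
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");

        List<CompletableFuture<Document>> futures = urls.stream()
            .map(scraper::scrapeAsync)
            .collect(Collectors.toList());

        // Block until each request completes and print the page titles
        futures.forEach(future -> System.out.println(future.join().title()));

        scraper.shutdown();
    }
}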

Configuration-Based Rate Limiting

Create a configurable rate limiter for different websites:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class ConfigurableRateLimiter {
    private final Map<String, SiteConfig> siteConfigs;

    public static class SiteConfig {
        public final int delayMs;
        public final int maxConcurrent;
        public final boolean respectRobotsTxt;

        public SiteConfig(int delayMs, int maxConcurrent, boolean respectRobotsTxt) {
            this.delayMs = delayMs;
            this.maxConcurrent = maxConcurrent;
            this.respectRobotsTxt = respectRobotsTxt;
        }
    }

    public ConfigurableRateLimiter() {
        this.siteConfigs = new HashMap<>();

        // Configure different sites
        siteConfigs.put("example.com", new SiteConfig(2000, 1, true));
        siteConfigs.put("api.github.com", new SiteConfig(1000, 2, false));
        siteConfigs.put("default", new SiteConfig(3000, 1, true));
    }

    public Document fetchWithConfig(String url) throws IOException, InterruptedException {
        String domain = extractDomain(url);
        SiteConfig config = siteConfigs.getOrDefault(domain, siteConfigs.get("default"));

        // Apply configured delay
        Thread.sleep(config.delayMs);

        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
            .timeout(10000)
            .get();
    }

    private String extractDomain(String url) {
        try {
            return new URL(url).getHost().toLowerCase();
        } catch (Exception e) {
            return "default";
        }
    }
}

Best Practices for Rate Limiting

  1. Start Conservative: Begin with longer delays and gradually optimize based on server responses.

  2. Monitor Response Times: Track server response times to detect when you're approaching limits.

  3. Handle Errors Gracefully: Always implement proper error handling for rate limit responses.

  4. Reuse Connection Settings: jsoup does not pool HTTP connections itself, but defining sensible defaults once and applying them to every request reduces overhead and keeps behavior consistent:

// Connection settings applied to each request
Connection connection = Jsoup.connect(url)
    .maxBodySize(0) // Unlimited body size
    .timeout(30000) // 30 second timeout
    .followRedirects(true)
    .ignoreHttpErrors(false);
  5. Implement Circuit Breakers: Stop making requests temporarily if too many failures occur; a minimal sketch follows.
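A minimal circuit breaker sketch; the failure threshold and cool-off period are illustrative values, not recommendations:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class SimpleCircuitBreaker {
    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_DURATION_MS = 60_000; // pause requests for one minute

    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public synchronized Document fetch(String url) throws IOException {
        // While the breaker is open, fail fast instead of hammering a struggling server
        if (consecutiveFailures >= FAILURE_THRESHOLD) {
            if (System.currentTimeMillis() - openedAt < OPEN_DURATION_MS) {
                throw new IOException("Circuit open, skipping " + url);
            }
            consecutiveFailures = 0; // half-open: allow one trial request
        }

        try {
            Document doc = Jsoup.connect(url).timeout(10000).get();
            consecutiveFailures = 0; // success closes the breaker
            return doc;
        } catch (IOException e) {
            if (++consecutiveFailures >= FAILURE_THRESHOLD) {
                openedAt = System.currentTimeMillis(); // trip the breaker
            }
            throw e;
        }
    }
}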

Real-World Example: E-commerce Scraper

Here's a practical example that combines multiple rate limiting strategies:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EcommerceScraper {
    private final TokenBucketRateLimiter rateLimiter;
    private final RobotsTxtParser robotsParser;
    private final Map<String, Long> lastRequestTime = new ConcurrentHashMap<>();

    public EcommerceScraper() {
        this.rateLimiter = new TokenBucketRateLimiter(5, 1); // 5 requests burst, 1 per second
        this.robotsParser = new RobotsTxtParser();
    }

    public List<Product> scrapeProducts(List<String> productUrls) {
        List<Product> products = new ArrayList<>();

        for (String url : productUrls) {
            try {
                // Respect per-domain delays
                enforcePerDomainDelay(url);

                // Wait for rate limiter token
                rateLimiter.acquire();

                Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; EcommerceScraper/1.0)")
                    .timeout(15000)
                    .get();

                Product product = extractProductInfo(doc);
                if (product != null) {
                    products.add(product);
                }

            } catch (Exception e) {
                System.err.println("Failed to scrape " + url + ": " + e.getMessage());
            }
        }

        rateLimiter.shutdown();
        return products;
    }

    private void enforcePerDomainDelay(String url) throws InterruptedException {
        String domain = extractDomain(url);
        Long lastRequest = lastRequestTime.get(domain);

        if (lastRequest != null) {
            long timeSinceLastRequest = System.currentTimeMillis() - lastRequest;
            long minDelay = robotsParser.getCrawlDelay("*");

            if (timeSinceLastRequest < minDelay) {
                Thread.sleep(minDelay - timeSinceLastRequest);
            }
        }

        lastRequestTime.put(domain, System.currentTimeMillis());
    }

    private Product extractProductInfo(Document doc) {
        // Extract product information from the document
        String name = doc.select("h1.product-title").text();
        String price = doc.select(".price").text();

        if (!name.isEmpty() && !price.isEmpty()) {
            return new Product(name, price);
        }

        return null;
    }

    private String extractDomain(String url) {
        try {
            return new URL(url).getHost().toLowerCase();
        } catch (Exception e) {
            return "unknown";
        }
    }

    static class Product {
        final String name;
        final String price;

        Product(String name, String price) {
            this.name = name;
            this.price = price;
        }
    }
}

Conclusion

Proper rate limiting with jsoup requires a combination of delays, error handling, and adaptive strategies. While jsoup doesn't have built-in rate limiting, implementing these patterns ensures your web scraping remains respectful and sustainable. For more complex scenarios involving JavaScript-heavy sites, consider browser automation tools such as Puppeteer, which provide their own timeout handling and more sophisticated rate limiting capabilities.

Remember that rate limiting is not just about avoiding blocks—it's about being a good citizen of the web and ensuring your scraping activities don't negatively impact the services you're accessing.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
