How can I prevent my scraper from being blocked while using jsoup?

Websites employ a variety of anti-bot measures to detect and block automated requests, and scrapers built with jsoup are no exception. To build a robust scraper that avoids detection, you need to make your requests look like ordinary browser traffic. This guide covers proven techniques for preventing blocks while maintaining ethical scraping practices.

Essential Anti-Detection Techniques

1. User-Agent Rotation

Websites commonly check User-Agent strings to identify bots. Rotate between different realistic User-Agents to avoid detection:

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class UserAgentRotator {
    private static final List<String> USER_AGENTS = Arrays.asList(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15"
    );

    private static final Random random = new Random();

    public static String getRandomUserAgent() {
        return USER_AGENTS.get(random.nextInt(USER_AGENTS.size()));
    }
}

// Usage
String url = "https://example.com";
Document doc = Jsoup.connect(url)
                    .userAgent(UserAgentRotator.getRandomUserAgent())
                    .get();

2. Complete Browser Headers

Set comprehensive headers that match real browser requests:

public Document fetchWithBrowserHeaders(String url) throws IOException {
    return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8")
                .header("Accept-Language", "en-US,en;q=0.9")
                .header("Accept-Encoding", "gzip, deflate, br")
                .header("Connection", "keep-alive")
                .header("Upgrade-Insecure-Requests", "1")
                .header("Sec-Fetch-Dest", "document")
                .header("Sec-Fetch-Mode", "navigate")
                .header("Sec-Fetch-Site", "none")
                .header("Cache-Control", "max-age=0")
                .referrer("https://www.google.com")
                .get();
}

3. Session Management with Cookies

Maintain session state by properly handling cookies across requests:

public class SessionManager {
    private Map<String, String> cookies = new HashMap<>();

    public Document fetchWithSession(String url) throws IOException {
        Connection.Response response = Jsoup.connect(url)
                                           .userAgent(UserAgentRotator.getRandomUserAgent())
                                           .cookies(cookies)
                                           .method(Connection.Method.GET)
                                           .execute();

        // Update cookies from response
        cookies.putAll(response.cookies());

        return response.parse();
    }

    public void clearSession() {
        cookies.clear();
    }
}
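
For example, reusing a single SessionManager instance carries cookies set on the first request (such as a session ID) into later requests on the same site. The URLs below are placeholders:

// Usage (placeholder URLs): one SessionManager per site keeps cookies consistent
SessionManager session = new SessionManager();
Document home = session.fetchWithSession("https://example.com");             // sets session cookies
Document listing = session.fetchWithSession("https://example.com/products"); // reuses those cookies
session.clearSession(); // start fresh before switching sites or identities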

4. Smart Rate Limiting

Implement human-like delays with randomization:

public class SmartScraper {
    private static final Random random = new Random();

    public void scrapeMultiplePages(List<String> urls) throws IOException, InterruptedException {
        for (String url : urls) {
            Document doc = Jsoup.connect(url)
                               .userAgent(UserAgentRotator.getRandomUserAgent())
                               .get();

            // Process document
            processDocument(doc);

            // Random delay between 1-3 seconds
            int delay = 1000 + random.nextInt(2000);
            Thread.sleep(delay);
        }
    }

    private void processDocument(Document doc) {
        // Your scraping logic here
    }
}
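
If you prefer a steady request budget over random sleeps, a token-bucket limiter works well. Here is a minimal sketch assuming Google Guava is on the classpath; RateLimiter.create(0.5) caps throughput at one request every two seconds:

import com.google.common.util.concurrent.RateLimiter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ThrottledScraper {
    // Token bucket: 0.5 permits per second = at most one request every two seconds
    private final RateLimiter limiter = RateLimiter.create(0.5);

    public Document fetchThrottled(String url) throws IOException {
        limiter.acquire(); // blocks until a permit is available
        return Jsoup.connect(url)
                    .userAgent(UserAgentRotator.getRandomUserAgent())
                    .get();
    }
}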

5. Proxy Rotation

Rotate IP addresses using proxy servers:

public class ProxyRotator {
    private List<ProxyInfo> proxies;
    private int currentIndex = 0;

    public static class ProxyInfo {
        String host;
        int port;

        public ProxyInfo(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Supply the proxy pool up front so the rotator has proxies to cycle through
    public ProxyRotator(List<ProxyInfo> proxies) {
        this.proxies = proxies;
    }

    public Document fetchWithProxy(String url) throws IOException {
        ProxyInfo proxyInfo = getNextProxy();
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyInfo.host, proxyInfo.port));

        return Jsoup.connect(url)
                    .proxy(proxy)
                    .userAgent(UserAgentRotator.getRandomUserAgent())
                    .timeout(10000)
                    .get();
    }

    private ProxyInfo getNextProxy() {
        ProxyInfo proxy = proxies.get(currentIndex);
        currentIndex = (currentIndex + 1) % proxies.size();
        return proxy;
    }
}
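
A brief usage sketch, passing your pool to the constructor above (the proxy addresses are placeholders):

// Usage (placeholder proxy addresses; substitute your own pool)
ProxyRotator rotator = new ProxyRotator(Arrays.asList(
    new ProxyRotator.ProxyInfo("proxy1.example.com", 8080),
    new ProxyRotator.ProxyInfo("proxy2.example.com", 8080)
));
Document doc = rotator.fetchWithProxy("https://example.com");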

Advanced Anti-Detection Strategies

6. Connection Timeout and Retry Logic

Handle network issues gracefully with proper timeouts and retries:

public Document fetchWithRetry(String url, int maxRetries) throws IOException {
    for (int i = 0; i < maxRetries; i++) {
        try {
            return Jsoup.connect(url)
                        .userAgent(UserAgentRotator.getRandomUserAgent())
                        .timeout(15000)
                        .followRedirects(true)
                        .ignoreHttpErrors(false)
                        .get();
        } catch (IOException e) {
            if (i == maxRetries - 1) {
                throw e;
            }

            try {
                // Exponential backoff
                Thread.sleep((long) Math.pow(2, i) * 1000);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted during retry", ie);
            }
        }
    }
    return null;
}

7. Handle JavaScript-Heavy Sites

For sites that heavily rely on JavaScript, consider hybrid approaches:

// For light JavaScript detection
public boolean requiresJavaScript(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();

    // Check for common JavaScript loading indicators
    Elements scripts = doc.select("script");
    Elements noscript = doc.select("noscript");

    return scripts.size() > 5 || !noscript.isEmpty() || 
           doc.text().contains("Please enable JavaScript");
}

// Alternative: Use headless browser for JS-heavy sites
// Consider Selenium WebDriver or HtmlUnit for such cases
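
One hybrid approach is to render the page in a headless browser and hand the resulting HTML to jsoup for parsing. The sketch below assumes HtmlUnit is on the classpath (package names differ between HtmlUnit 2.x and 3.x):

// Requires HtmlUnit (org.htmlunit.* in 3.x, com.gargoylesoftware.htmlunit.* in 2.x)
public Document fetchRendered(String url) throws IOException {
    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setCssEnabled(false);             // skip CSS for speed
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = webClient.getPage(url);                  // fetches the page and runs its JavaScript
        webClient.waitForBackgroundJavaScript(5000);             // give async scripts time to finish
        return Jsoup.parse(page.asXml(), url);                   // parse the rendered HTML with jsoup
    }
}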

Ethical Scraping Practices

8. Respect Robots.txt

Always check and respect the robots.txt file:

public class RobotsTxtChecker {
    public boolean isAllowed(String baseUrl, String path, String userAgent) {
        try {
            String robotsUrl = baseUrl + "/robots.txt";
            Document robotsDoc = Jsoup.connect(robotsUrl)
                                     .ignoreContentType(true)
                                     .get();

            String robotsContent = robotsDoc.text();

            // Simplified check: ignores User-agent groups (the userAgent parameter is unused here);
            // consider a dedicated robots.txt parser for complex rules
            return !robotsContent.toLowerCase().contains("disallow: " + path.toLowerCase());
        } catch (IOException e) {
            // If robots.txt is not accessible, proceed with caution
            return true;
        }
    }
}

9. Monitor Server Response

Watch for signs that you're being detected:

public void monitorResponse(Connection.Response response) {
    int statusCode = response.statusCode();

    if (statusCode == 403) {
        System.out.println("Forbidden: Possible bot detection");
    } else if (statusCode == 429) {
        System.out.println("Rate limited: Reduce request frequency");
    } else if (statusCode == 503) {
        System.out.println("Service unavailable: Server might be overloaded");
    }

    // Check for CAPTCHA indicators in response
    String body = response.body();
    if (body.toLowerCase().contains("captcha") || body.toLowerCase().contains("verify")) {
        System.out.println("CAPTCHA detected: Human verification required");
    }
}
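
Since monitorResponse takes a Connection.Response, fetch with execute() instead of get() when you want to inspect the status and body before parsing. A brief usage sketch:

// Usage: execute() exposes the status code and raw body before parsing
Connection.Response response = Jsoup.connect("https://example.com")
                                    .userAgent(UserAgentRotator.getRandomUserAgent())
                                    .ignoreHttpErrors(true)  // return 403/429/503 responses instead of throwing
                                    .execute();
monitorResponse(response);
Document doc = response.parse();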

Complete Example

Here's how the techniques above fit together in a single scraper:

public class EthicalScraper {
    private SessionManager sessionManager = new SessionManager();
    private ProxyRotator proxyRotator = new ProxyRotator();
    private RobotsTxtChecker robotsChecker = new RobotsTxtChecker();

    public Document scrape(String url) throws IOException, InterruptedException {
        // Check robots.txt
        URI uri = URI.create(url);
        String baseUrl = uri.getScheme() + "://" + uri.getHost();
        if (!robotsChecker.isAllowed(baseUrl, uri.getPath(), "jsoup-scraper")) {
            throw new IllegalArgumentException("Scraping not allowed by robots.txt");
        }

        // Implement delay
        Thread.sleep(1000 + new Random().nextInt(2000));

        // Fetch with full browser simulation
        return fetchWithRetry(url, 3);
    }

    private Document fetchWithRetry(String url, int maxRetries) throws IOException {
        // Implementation from previous example
        return null;
    }
}
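
A brief usage sketch (the URL is a placeholder):

// Usage (placeholder URL)
EthicalScraper scraper = new EthicalScraper();
Document doc = scraper.scrape("https://example.com/products");
System.out.println("Scraped: " + doc.title());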

Best Practices Summary

  1. Always be respectful: Don't overload servers with too many concurrent requests
  2. Follow legal guidelines: Respect terms of service and applicable laws
  3. Use reasonable delays: Mimic human browsing patterns
  4. Monitor your impact: Watch server responses and adjust accordingly
  5. Consider alternatives: For heavily protected sites, consider official APIs
  6. Stay updated: Keep your User-Agent strings and techniques current

Remember that anti-detection techniques are constantly evolving, and websites may still detect sophisticated scrapers. The key is to be respectful, ethical, and adaptive in your approach while following all applicable laws and terms of service.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
