How do I use jsoup with a proxy?

Routing jsoup requests through a proxy can help you work around IP-based restrictions, spread requests to stay under rate limits, and keep your own address out of server logs while scraping. jsoup lets you configure a proxy on each connection, and it also honors the JVM-wide proxy system properties.

Quick Start: Direct Connection Proxy

The simplest approach is to configure the proxy directly on the Jsoup connection:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String url = "https://example.com";
Document doc = Jsoup.connect(url)
    .proxy("proxy.example.com", 8080)
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    .timeout(10000)
    .get();

System.out.println(doc.title());

Method 1: System Properties (Global Configuration)

Set the standard JVM proxy properties to route every java.net HTTP and HTTPS connection in the process, not only jsoup requests, through the proxy:

// HTTP proxy configuration
System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("https.proxyHost", "proxy.example.com");
System.setProperty("https.proxyPort", "8080");

// Optional: bypass the proxy for certain hosts (this one property covers both HTTP and HTTPS)
System.setProperty("http.nonProxyHosts", "localhost|127.*|[::1]");

// Use Jsoup normally - proxy will be used automatically
Document doc = Jsoup.connect("https://example.com").get();
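To confirm that the properties are being picked up, you can ask the JVM's default ProxySelector which proxy it would choose for a given URL. The following is a minimal sketch that uses only java.net; the class and method names (ProxyPropertyCheck, proxiesFor, describe) are illustrative and not part of jsoup.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

// Hypothetical helper: reports which proxy the JVM would pick for a URL.
final class ProxyPropertyCheck {

    /** Returns the proxies the JVM's default selector would try for the URL. */
    static List<Proxy> proxiesFor(String url) {
        return ProxySelector.getDefault().select(URI.create(url));
    }

    /** Formats a proxy as "TYPE host:port", or "DIRECT" when no proxy applies. */
    static String describe(Proxy proxy) {
        if (proxy.type() == Proxy.Type.DIRECT) {
            return "DIRECT";
        }
        InetSocketAddress address = (InetSocketAddress) proxy.address();
        return proxy.type() + " " + address.getHostString() + ":" + address.getPort();
    }
}
```

After setting the properties shown above, ProxyPropertyCheck.describe(ProxyPropertyCheck.proxiesFor("https://example.com/").get(0)) should name your proxy, while a host matched by http.nonProxyHosts should come back as DIRECT.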

Method 2: Per-Connection Configuration

Configure proxy settings for individual connections:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class ProxyExample {
    public static void main(String[] args) {
        try {
            Connection connection = Jsoup.connect("https://httpbin.org/ip")
                .proxy("proxy.example.com", 8080)
                .userAgent("Mozilla/5.0 (compatible; JavaBot/1.0)")
                .timeout(15000)
                .ignoreContentType(true);

            Document doc = connection.get();
            System.out.println("Response: " + doc.text());
        } catch (IOException e) {
            System.err.println("Failed to connect through proxy: " + e.getMessage());
        }
    }
}

Proxy Authentication

Basic Authentication

For proxies requiring username and password authentication:

import java.io.IOException;
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AuthenticatedProxyExample {
    public static void setupProxyAuth(String username, String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(username, password.toCharArray());
                }
                return null;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Setup authentication
        setupProxyAuth("proxy_user", "proxy_password");

        // Configure and use proxy
        Document doc = Jsoup.connect("https://example.com")
            .proxy("authenticated-proxy.example.com", 8080)
            .get();

        System.out.println(doc.title());
    }
}
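One caveat applies when the target URL is HTTPS: since JDK 8u111, the JDK's HTTP client by default refuses Basic authentication for proxy CONNECT tunnels, controlled by the jdk.http.auth.tunneling.disabledSchemes networking property. The sketch below clears that list in code. The class and method names are illustrative, and it assumes the call happens before the JVM's HTTP classes are first used; passing -Djdk.http.auth.tunneling.disabledSchemes= on the command line is the more dependable route.

```java
// Hypothetical helper: re-enables Basic proxy authentication for HTTPS tunnels.
final class TunnelAuthConfig {

    static final String DISABLED_SCHEMES = "jdk.http.auth.tunneling.disabledSchemes";

    /**
     * Clears the list of schemes that are disabled for CONNECT tunnels.
     * Assumption: this runs at the very start of main, before any HTTP request.
     */
    static void allowBasicAuthForTunneling() {
        System.setProperty(DISABLED_SCHEMES, "");
    }
}
```

Keep in mind that Basic credentials travel to the proxy unencrypted, which is why the JDK disables them for tunnels unless you opt in.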

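Alternative: Credentials in System Properties

The JDK's built-in HTTP client does not read http.proxyUser or http.proxyPassword on its own. These names are only a convention, so they take effect only when your code (for example, a custom Authenticator) reads them and supplies the credentials:

System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("http.proxyUser", "your_username");         // convention only, not read by the JDK
System.setProperty("http.proxyPassword", "your_password");     // convention only, not read by the JDK

// For HTTPS
System.setProperty("https.proxyHost", "proxy.example.com");
System.setProperty("https.proxyPort", "8080");
System.setProperty("https.proxyUser", "your_username");        // convention only, not read by the JDK
System.setProperty("https.proxyPassword", "your_password");    // convention only, not read by the JDK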
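Since the JDK does not consume the proxyUser and proxyPassword properties by itself, something has to read them and hand them to the HTTP client. A minimal sketch of such an Authenticator follows; the class name PropertyProxyAuthenticator is hypothetical, and the fallback from the https.* properties to the http.* ones is an assumption about how you want the lookup to behave.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Hypothetical Authenticator that serves proxy credentials from system properties.
final class PropertyProxyAuthenticator extends Authenticator {

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Never hand proxy credentials to an origin server.
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        // Prefer the https.* pair for HTTPS requests, then fall back to http.*.
        String user = null;
        String password = null;
        if ("https".equalsIgnoreCase(getRequestingProtocol())) {
            user = System.getProperty("https.proxyUser");
            password = System.getProperty("https.proxyPassword");
        }
        if (user == null || password == null) {
            user = System.getProperty("http.proxyUser");
            password = System.getProperty("http.proxyPassword");
        }
        if (user == null || password == null) {
            return null; // no credentials configured
        }
        return new PasswordAuthentication(user, password.toCharArray());
    }

    /** Registers this authenticator as the JVM-wide default. */
    static void install() {
        Authenticator.setDefault(new PropertyProxyAuthenticator());
    }
}
```

Call PropertyProxyAuthenticator.install() once at startup, after the properties are set and before the first request.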

SOCKS Proxy Configuration

For SOCKS4/SOCKS5 proxies:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Method 1: System properties (apply to every socket the JVM opens)
System.setProperty("socksProxyHost", "socks-proxy.example.com");
System.setProperty("socksProxyPort", "1080");
System.setProperty("socksProxyVersion", "5"); // or "4"

// Method 2: Pass a java.net.Proxy to a single connection
public class SocksProxyExample {
    public static Document connectViaSocks(String url, String proxyHost, int proxyPort)
            throws IOException {
        // Connection.proxy(Proxy) accepts any java.net.Proxy, including SOCKS,
        // so no JVM-wide system properties are needed for this request
        Proxy socksProxy = new Proxy(Proxy.Type.SOCKS,
            new InetSocketAddress(proxyHost, proxyPort));

        return Jsoup.connect(url)
            .proxy(socksProxy)
            .get();
    }
}

Error Handling and Retry Logic

Implement robust error handling when using proxies:

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.ConnectException;

public class RobustProxyClient {
    private static final int MAX_RETRIES = 3;
    private static final int TIMEOUT = 10000;

    public static Document fetchWithRetry(String url, String proxyHost, int proxyPort) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return Jsoup.connect(url)
                    .proxy(proxyHost, proxyPort)
                    .timeout(TIMEOUT)
                    .userAgent("Mozilla/5.0 (compatible; JavaBot/1.0)")
                    .get();

            } catch (HttpStatusException e) {
                System.err.println("HTTP " + e.getStatusCode() + " on attempt " + attempt);
                if (e.getStatusCode() == 407) {
                    throw new RuntimeException("Proxy authentication required", e);
                }
            } catch (SocketTimeoutException e) {
                System.err.println("Timeout on attempt " + attempt);
            } catch (ConnectException e) {
                System.err.println("Connection failed on attempt " + attempt + ": " + e.getMessage());
            } catch (IOException e) {
                System.err.println("IO error on attempt " + attempt + ": " + e.getMessage());
            }

            if (attempt < MAX_RETRIES) {
                try {
                    Thread.sleep(1000L << attempt); // Exponential backoff: 2 s, then 4 s
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new RuntimeException("Failed to fetch after " + MAX_RETRIES + " attempts");
    }
}
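The retry loop above waits between attempts. If you raise the retry count, an uncapped delay can grow very large, so one option is to compute the wait from a small helper that doubles the delay on each retry and stops at a ceiling. This is a sketch under assumed names (Backoff, delayMillis); the base and cap values are yours to choose.

```java
// Hypothetical helper: capped exponential backoff schedule.
final class Backoff {

    /**
     * Delay before retrying after the given 1-based attempt:
     * baseMillis, 2 * baseMillis, 4 * baseMillis, ... never exceeding maxMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long delay = baseMillis;
        // Double once per earlier attempt, stopping as soon as the cap is reached.
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

Inside the loop this would replace the fixed expression, for example Thread.sleep(Backoff.delayMillis(attempt, 2000, 30000)).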

Testing Proxy Configuration

Verify your proxy setup with a simple test:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyTester {
    public static void testProxy(String proxyHost, int proxyPort) {
        try {
            // Test without proxy
            // httpbin.org/ip returns JSON, which jsoup rejects unless ignoreContentType is set
            Document directDoc = Jsoup.connect("https://httpbin.org/ip")
                .ignoreContentType(true)
                .get();
            System.out.println("Direct IP: " + directDoc.text());

            // Test with proxy
            Document proxyDoc = Jsoup.connect("https://httpbin.org/ip")
                .proxy(proxyHost, proxyPort)
                .ignoreContentType(true)
                .get();
            System.out.println("Proxy IP: " + proxyDoc.text());

        } catch (IOException e) {
            System.err.println("Proxy test failed: " + e.getMessage());
        }
    }
}
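Before comparing IP addresses, it can be useful to check that the proxy port accepts TCP connections at all, which separates "proxy is down" from "proxy rejected the request". The following sketch uses only java.net; ProxyReachability and isReachable are illustrative names, and a successful connect only shows that something is listening, not that it is a working proxy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

// Hypothetical helper: plain TCP reachability probe for a proxy endpoint.
final class ProxyReachability {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        // Proxy.NO_PROXY keeps the probe direct even if SOCKS system properties are set.
        try (Socket socket = new Socket(Proxy.NO_PROXY)) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }
}
```

A call such as ProxyReachability.isReachable(proxyHost, proxyPort, 3000) could run at the top of testProxy so the method can report an unreachable proxy without waiting for a full HTTP timeout.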

Multiple Proxy Support

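Pick a proxy at random for each request to spread traffic across several proxies:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;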

public class ProxyRotator {
    private final List<ProxyInfo> proxies;
    private final Random random = new Random();

    public ProxyRotator(List<ProxyInfo> proxies) {
        this.proxies = proxies;
    }

    public Document fetch(String url) throws IOException {
        ProxyInfo proxy = proxies.get(random.nextInt(proxies.size()));

        return Jsoup.connect(url)
            .proxy(proxy.host, proxy.port)
            .timeout(10000)
            .get();
    }

    static class ProxyInfo {
        final String host;
        final int port;

        ProxyInfo(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Usage example
    public static void main(String[] args) throws IOException {
        List<ProxyInfo> proxies = Arrays.asList(
            new ProxyInfo("proxy1.example.com", 8080),
            new ProxyInfo("proxy2.example.com", 8080),
            new ProxyInfo("proxy3.example.com", 8080)
        );

        ProxyRotator rotator = new ProxyRotator(proxies);
        Document doc = rotator.fetch("https://example.com");
        System.out.println(doc.title());
    }
}
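If you would rather cycle through the proxies in a fixed order than pick one at random, a round-robin cursor is one alternative. The sketch below is a generic, thread-safe version built on AtomicInteger; the RoundRobin class is hypothetical and is not part of jsoup or of the ProxyRotator above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out list items in order, wrapping around at the end.
final class RoundRobin<T> {

    private final List<T> items;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobin(List<T> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("at least one item is required");
        }
        this.items = new ArrayList<>(items); // defensive copy
    }

    /** Returns the next item; safe to call from multiple threads. */
    T next() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(cursor.getAndIncrement(), items.size());
        return items.get(index);
    }
}
```

With a RoundRobin<ProxyInfo> in place of the Random field, fetch would call next() and pass the returned host and port to Jsoup.connect(url).proxy(...).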

Best Practices

  1. Always set timeouts when using proxies to avoid hanging connections
  2. Implement retry logic for handling temporary proxy failures
  3. Rotate User-Agent headers to avoid detection
  4. Test proxy connectivity before using in production
  5. Handle authentication errors (HTTP 407) appropriately
  6. Request HTTPS URLs where possible, so the proxy relays an encrypted tunnel and cannot read the traffic
  7. Monitor proxy performance and switch if response times degrade

Security Considerations

  • Avoid logging proxy credentials in application logs
  • Use encrypted connections (HTTPS) when transmitting sensitive data
  • Keep TLS certificate validation enabled so that a malicious proxy cannot mount a man-in-the-middle attack
  • Rotate proxy credentials regularly
  • Monitor for proxy abuse that could compromise your application

Remember to replace placeholder values (proxy.example.com, your_username, etc.) with your actual proxy server details.
