How do I handle redirects when scraping with jsoup?

Jsoup is a popular Java library for working with real-world HTML. When scraping web pages with jsoup, you'll often encounter HTTP redirects. By default, jsoup follows them automatically, up to an internal limit of 20 redirects (not 10, as is sometimes assumed); past that limit the request fails with an IOException.

Understanding how to properly handle redirects is crucial for robust web scraping applications. You might need to:

  1. Detect if a redirect has occurred
  2. Capture the final URL after redirects
  3. Control the maximum number of redirects
  4. Disable automatic redirects for manual handling
  5. Handle different redirect status codes (301, 302, 303, 307, 308)

Default Redirect Behavior

By default, jsoup automatically follows redirects:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DefaultRedirectBehavior {
    public static void main(String[] args) {
        try {
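            // Jsoup automatically follows redirects by default.
            // httpbin's /redirect/n chain ends on a JSON page, which jsoup rejects
            // by default, so this demo redirects to an HTML page instead.
            Document doc = Jsoup.connect("https://httpbin.org/redirect-to?url=https://example.com").get();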
            System.out.println("Title: " + doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

1. Detecting Redirects and Capturing Final URL

To check if redirects occurred, compare the original URL with the final URL:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class DetectRedirects {
    public static void main(String[] args) {
        try {
            String originalUrl = "https://httpbin.org/redirect-to?url=https://example.com";

            // Create connection and execute
            Connection connection = Jsoup.connect(originalUrl);
            Connection.Response response = connection.execute();
            Document doc = response.parse();

            String finalUrl = response.url().toString();

            if (!originalUrl.equals(finalUrl)) {
                System.out.println("Redirect detected!");
                System.out.println("Original URL: " + originalUrl);
                System.out.println("Final URL: " + finalUrl);
                // This is the status of the final response, not of the redirect itself
                System.out.println("Final status code: " + response.statusCode());
            } else {
                System.out.println("No redirect occurred");
            }

            // Process the document
            System.out.println("Page title: " + doc.title());

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
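
Plain string equality can report a redirect when the two URLs differ only cosmetically, for example in host case, an explicit default port, or a missing trailing slash on the root path. The following is a minimal sketch of a more forgiving comparison. The class and method names (UrlComparison, wasRedirected, normalize) are hypothetical helpers built on java.net.URI; they are not part of jsoup, and the normalization rules are an assumption about what counts as "the same URL".

```java
import java.net.URI;
import java.util.Locale;

class UrlComparison {
    /** Returns true when the two URLs still differ after light normalization. */
    static boolean wasRedirected(String requestedUrl, String finalUrl) {
        return !normalize(requestedUrl).equals(normalize(finalUrl));
    }

    /** Lower-cases scheme and host, drops default ports and fragments, treats an empty path as "/". */
    static String normalize(String url) {
        URI uri = URI.create(url).normalize();
        String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
        // getHost() can be null for unusual authorities; such URLs compare by the remaining parts only
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1; // default port: equivalent to no port at all
        }
        String path = uri.getRawPath() == null || uri.getRawPath().isEmpty() ? "/" : uri.getRawPath();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + (port == -1 ? "" : ":" + port) + path + query;
    }
}
```

In the DetectRedirects example, the check could then become wasRedirected(originalUrl, response.url().toString()). Whether a change such as http to https should count as a redirect is a policy decision; this sketch treats it as one.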

2. Controlling Maximum Redirects

Jsoup's Connection API has no setting for the maximum number of redirects, so to enforce your own limit, disable automatic redirects and follow them in a loop:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class LimitRedirects {
    private static final int MAX_REDIRECTS = 5;

    public static Document getWithRedirectLimit(String url, int maxRedirects) throws IOException {
        String currentUrl = url;
        int redirectCount = 0;

        while (redirectCount <= maxRedirects) {
            Connection.Response response = Jsoup.connect(currentUrl)
                    .followRedirects(false)
                    .execute();

            int statusCode = response.statusCode();

            // Check if it's a redirect status code
            if (statusCode >= 300 && statusCode < 400) {
                if (redirectCount >= maxRedirects) {
                    throw new IOException("Too many redirects. Limit: " + maxRedirects);
                }

                String location = response.header("Location");
                if (location == null) {
                    throw new IOException("Redirect without Location header");
                }

                // Resolve relative Location values (e.g. "/path" or "page.html")
                // against the URL of the response, keeping scheme, host, and port
                location = new java.net.URL(response.url(), location).toString();

                currentUrl = location;
                redirectCount++;
                System.out.println("Redirect " + redirectCount + ": " + currentUrl);

            } else if (statusCode == 200) {
                // Success - parse and return document
                return response.parse();
            } else {
                throw new IOException("HTTP error: " + statusCode);
            }
        }

        throw new IOException("Maximum redirects exceeded");
    }

    public static void main(String[] args) {
        try {
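            // httpbin's /redirect/n chain ends on a JSON page, which jsoup rejects
            // by default, so this demo redirects to an HTML page instead
            Document doc = getWithRedirectLimit(
                    "https://httpbin.org/redirect-to?url=https://example.com", MAX_REDIRECTS);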
            System.out.println("Successfully loaded page: " + doc.title());
        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
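
A Location header may hold an absolute URL, a path starting with "/", a path relative to the current document, or a scheme-relative "//host/path" value. As a small offline illustration of how those forms resolve, here is a sketch using java.net.URI. The LocationResolver class is a hypothetical helper, not a jsoup API, and it assumes the Location value is a syntactically valid URI reference (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;

class LocationResolver {
    /** Resolves a Location header value against the URL of the response that sent it. */
    static String resolve(String responseUrl, String location) {
        return URI.create(responseUrl).resolve(location.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "https://example.com:8443/a/b";
        // Print how each form of Location resolves against the same base URL
        for (String location : new String[] {"/login", "c", "../z", "//cdn.example.net/x", "http://other.test/y"}) {
            System.out.println(location + " -> " + resolve(base, location));
        }
    }
}
```

Note that the port of the base URL is preserved for path-only values, which a hand-built "protocol + host + path" string would lose.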

3. Disabling Automatic Redirects

Call followRedirects(false) to receive the 3xx response itself and decide how to handle it:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class ManualRedirectHandling {
    public static void main(String[] args) {
        try {
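            // /redirect/1 on httpbin leads to a JSON page, which jsoup rejects by
            // default, so this demo redirects to an HTML page instead
            String url = "https://httpbin.org/redirect-to?url=https://example.com";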

            Connection.Response response = Jsoup.connect(url)
                    .followRedirects(false)
                    .execute();

            int statusCode = response.statusCode();
            System.out.println("Status Code: " + statusCode);

            if (isRedirect(statusCode)) {
                String location = response.header("Location");
                System.out.println("Redirect Location: " + location);

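                // Decide whether to follow the redirect
                if (shouldFollowRedirect(location)) {
                    System.out.println("Following redirect...");
                    // Location may be relative, so resolve it against the response URL
                    String target = new java.net.URL(response.url(), location).toString();
                    Document doc = Jsoup.connect(target).get();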
                    System.out.println("Final page title: " + doc.title());
                } else {
                    System.out.println("Choosing not to follow redirect");
                }

            } else if (statusCode == 200) {
                Document doc = response.parse();
                System.out.println("Page title: " + doc.title());
            }

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }

    private static boolean isRedirect(int statusCode) {
        return statusCode == 301 || statusCode == 302 || 
               statusCode == 303 || statusCode == 307 || statusCode == 308;
    }

    private static boolean shouldFollowRedirect(String location) {
        // Custom logic to decide whether to follow the redirect.
        // This placeholder only rejects one blocklisted host; a stricter
        // policy would compare the target's host with the original host.
        return location != null && !location.contains("malicious-site.com");
    }
}
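
The comment in shouldFollowRedirect mentions following only same-domain redirects. One possible way to express that policy is sketched below, using java.net.URI to compare hosts. The SameHostPolicy class and its two-argument shouldFollowRedirect method are hypothetical, and "same domain" is assumed here to mean an exact, case-insensitive host match (so a redirect from example.com to www.example.com would be rejected).

```java
import java.net.URI;

class SameHostPolicy {
    /** Follows a redirect only when the resolved target stays on the originally requested host. */
    static boolean shouldFollowRedirect(String requestedUrl, String location) {
        if (location == null || location.trim().isEmpty()) {
            return false; // nothing to follow
        }
        try {
            URI base = URI.create(requestedUrl);
            // Resolve first so that relative values such as "/get" inherit the base host
            URI target = base.resolve(location.trim());
            String baseHost = base.getHost();
            return baseHost != null && baseHost.equalsIgnoreCase(target.getHost());
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL or Location value
        }
    }
}
```

Comparing parsed hosts avoids a pitfall of substring checks: a host such as httpbin.org.malicious-site.com contains "httpbin.org" but is a different host.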

4. Handling Different Redirect Types

Each redirect status code says something different about whether the move is permanent and whether the request method should be kept:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;

public class RedirectTypeHandler {
    public static void handleRedirect(String url) throws IOException {
        Connection.Response response = Jsoup.connect(url)
                .followRedirects(false)
                .execute();

        int statusCode = response.statusCode();
        String location = response.header("Location");

        switch (statusCode) {
            case 301:
                System.out.println("301 Moved Permanently - Update bookmarks");
                break;
            case 302:
                System.out.println("302 Found - Temporary redirect");
                break;
            case 303:
                System.out.println("303 See Other - Use GET for next request");
                break;
            case 307:
                System.out.println("307 Temporary Redirect - Keep same method");
                break;
            case 308:
                System.out.println("308 Permanent Redirect - Keep same method");
                break;
            default:
                if (statusCode >= 300 && statusCode < 400) {
                    System.out.println("Other redirect: " + statusCode);
                }
        }

        if (location != null) {
            System.out.println("Redirect to: " + location);
        }
    }
}
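
When following redirects manually for requests other than GET, the status code also determines which HTTP method the next request should use. The sketch below encodes the rules as commonly applied by HTTP clients (303 switches to GET, 307 and 308 keep the method, 301 and 302 are typically rewritten from POST to GET). The RedirectMethod class and nextMethod function are hypothetical names, and this is not a description of how jsoup itself behaves internally.

```java
class RedirectMethod {
    /**
     * Returns the HTTP method to use for the request that follows a redirect.
     * Method names are expected in upper case, e.g. "GET" or "POST".
     */
    static String nextMethod(int statusCode, String method) {
        switch (statusCode) {
            case 303:
                // See Other: retrieve the target with GET (a HEAD request stays HEAD)
                return method.equals("HEAD") ? "HEAD" : "GET";
            case 301:
            case 302:
                // Clients commonly rewrite POST to GET here; other methods are kept
                return method.equals("POST") ? "GET" : method;
            case 307:
            case 308:
                // The method (and body) must be preserved
                return method;
            default:
                throw new IllegalArgumentException("Not a handled redirect status: " + statusCode);
        }
    }
}
```

For plain GET scraping the method never changes, so this only matters when the original request was a form submission or another non-GET call.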

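5. Best Practices for Redirect Handling

The following example combines a redirect limit, loop detection, relative URL resolution, a custom user agent, and a timeout: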

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class RobustRedirectHandler {
    private static final int MAX_REDIRECTS = 10;
    private static final Set<String> visitedUrls = new HashSet<>();

    public static Document safeGet(String initialUrl) throws IOException {
        visitedUrls.clear();
        return followRedirects(initialUrl, 0);
    }

    private static Document followRedirects(String url, int redirectCount) throws IOException {
        if (redirectCount > MAX_REDIRECTS) {
            throw new IOException("Too many redirects");
        }

        // Prevent infinite redirect loops
        if (visitedUrls.contains(url)) {
            throw new IOException("Redirect loop detected");
        }
        visitedUrls.add(url);

        Connection.Response response = Jsoup.connect(url)
                .followRedirects(false)
                .userAgent("Mozilla/5.0 (compatible; Bot/1.0)")
                .timeout(10000)
                .execute();

        int statusCode = response.statusCode();

        if (statusCode >= 300 && statusCode < 400) {
            String location = response.header("Location");
            if (location == null) {
                throw new IOException("Redirect without Location header");
            }

            // Resolve relative URLs
            URL baseUrl = new URL(url);
            URL redirectUrl = new URL(baseUrl, location);

            System.out.println("Redirecting from " + url + " to " + redirectUrl);
            return followRedirects(redirectUrl.toString(), redirectCount + 1);

        } else if (statusCode == 200) {
            return response.parse();
        } else {
            throw new IOException("HTTP " + statusCode + ": " + response.statusMessage());
        }
    }

    public static void main(String[] args) {
        try {
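            // httpbin's /redirect/n chain ends on a JSON page, which jsoup rejects
            // by default, so this demo redirects to an HTML page instead
            Document doc = safeGet("https://httpbin.org/redirect-to?url=https://example.com");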
            System.out.println("Successfully loaded: " + doc.title());
        } catch (IOException e) {
            System.err.println("Failed: " + e.getMessage());
        }
    }
}
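
The loop detection and redirect limit above can be exercised without any network access. The sketch below is an offline simulation, not jsoup code: the RedirectChain class is hypothetical, and a Map from URL to redirect target stands in for real HTTP responses. It is written iteratively and keeps its visited set local to each call, which is one way to avoid sharing a static set between concurrent callers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RedirectChain {
    /**
     * Follows simulated redirects from start and returns every URL visited, in order.
     * A URL with no entry in the map is treated as the final (non-redirecting) page.
     */
    static List<String> follow(String start, Map<String, String> redirects, int maxRedirects) {
        List<String> chain = new ArrayList<>();
        Set<String> visited = new HashSet<>(); // local to this call
        String current = start;
        while (true) {
            if (!visited.add(current)) {
                throw new IllegalStateException("Redirect loop detected at " + current);
            }
            chain.add(current);
            String next = redirects.get(current);
            if (next == null) {
                return chain; // reached a page that does not redirect
            }
            // chain.size() - 1 redirects were followed so far; one more must stay within the limit
            if (chain.size() > maxRedirects) {
                throw new IllegalStateException("Too many redirects (limit " + maxRedirects + ")");
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("https://a.test/", "https://b.test/");
        redirects.put("https://b.test/", "https://c.test/");
        System.out.println(follow("https://a.test/", redirects, 5));
    }
}
```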

Key Takeaways

  • Default behavior: Jsoup follows up to 20 redirects automatically
  • Detection: Compare original and final URLs to detect redirects
  • Manual control: Use followRedirects(false) for custom handling
  • Loop prevention: Track visited URLs to prevent infinite loops
  • URL resolution: Handle relative redirect URLs properly
  • Error handling: Account for missing Location headers and various status codes

Always respect robots.txt files and website terms of service when implementing redirect handling in your web scraping applications.
