What are the Common HTTP Status Codes I Should Handle in Java Web Scraping?

When building web scraping applications in Java, understanding and properly handling HTTP status codes is crucial for creating robust and reliable scrapers. HTTP status codes provide valuable information about the success or failure of your requests, allowing you to implement appropriate error handling, retry logic, and graceful degradation strategies.

Understanding HTTP Status Code Categories

HTTP status codes are three-digit numbers grouped into five categories:

  • 1xx (Informational): Request received, continuing process
  • 2xx (Success): Request successfully received, understood, and accepted
  • 3xx (Redirection): Further action must be taken to complete the request
  • 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
  • 5xx (Server Error): Server failed to fulfill a valid request
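In practice, the first digit alone often tells a scraper what to do next. Here is a minimal sketch of that idea (the Action enum and the mapping are illustrative, not part of any library):

// Illustrative only: map a status-code family to a typical scraping decision.
public class StatusFamily {
    public enum Action { PARSE_BODY, FOLLOW_REDIRECT, GIVE_UP, RETRY_LATER, IGNORE }

    public static Action actionFor(int statusCode) {
        switch (statusCode / 100) {                   // first digit = category
            case 2:  return Action.PARSE_BODY;        // success
            case 3:  return Action.FOLLOW_REDIRECT;   // redirection
            case 4:  return Action.GIVE_UP;           // client error: fix the request first
            case 5:  return Action.RETRY_LATER;       // server error: often temporary
            default: return Action.IGNORE;            // 1xx or non-standard codes
        }
    }
}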

Essential Success Status Codes

200 OK

The most common success status code indicates that the request was successful and the server returned the requested data.

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import java.time.Duration;

public class WebScraper {
    private final HttpClient client;

    public WebScraper() {
        this.client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }

    public String scrapeContent(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .timeout(Duration.ofSeconds(30))
            .build();

        HttpResponse<String> response = client.send(request, 
            HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            return response.body();
        } else {
            throw new RuntimeException("Unexpected status code: " + response.statusCode());
        }
    }
}
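
A quick usage sketch (https://example.com is just a placeholder URL):

public static void main(String[] args) throws Exception {
    WebScraper scraper = new WebScraper();
    String html = scraper.scrapeContent("https://example.com");
    System.out.println("Fetched " + html.length() + " characters");
}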

204 No Content

Indicates the request succeeded but there is no response body to return. Common for tracking endpoints and form submissions that acknowledge receipt without returning data.

public boolean submitForm(String url, String formData) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(url))
        .header("Content-Type", "application/x-www-form-urlencoded")
        .POST(HttpRequest.BodyPublishers.ofString(formData))
        .build();

    HttpResponse<String> response = client.send(request, 
        HttpResponse.BodyHandlers.ofString());

    return response.statusCode() == 200 || response.statusCode() == 204;
}

Critical Redirection Status Codes

301 Moved Permanently & 302 Found

These redirects are common and in most cases should simply be followed. Note that java.net.http.HttpClient does not follow redirects by default (the default policy is NEVER), so you need to enable it explicitly with followRedirects(HttpClient.Redirect.NORMAL).

public class RedirectAwareScraper {
    private final HttpClient client;

    public RedirectAwareScraper() {
        this.client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }

    public ScrapingResult scrapeWithRedirectTracking(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .build();

        HttpResponse<String> response = client.send(request, 
            HttpResponse.BodyHandlers.ofString());

        return new ScrapingResult(
            response.body(),
            response.statusCode(),
            response.uri().toString() // Final URL after redirects
        );
    }

    static class ScrapingResult {
        private final String content;
        private final int statusCode;
        private final String finalUrl;

        public ScrapingResult(String content, int statusCode, String finalUrl) {
            this.content = content;
            this.statusCode = statusCode;
            this.finalUrl = finalUrl;
        }

        // Getters...
    }
}

Client Error Status Codes to Handle

400 Bad Request

Indicates malformed request syntax. Often caused by invalid parameters or headers.

import java.io.IOException;
import java.util.Map;

public class ErrorHandlingScraper {
    private final HttpClient client = HttpClient.newHttpClient();

    public String scrapeWithErrorHandling(String url, Map<String, String> headers) {
        try {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                .uri(URI.create(url));

            // Add custom headers
            headers.forEach(requestBuilder::header);

            HttpRequest request = requestBuilder.build();
            HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

            switch (response.statusCode()) {
                case 200:
                    return response.body();
                case 400:
                    throw new IllegalArgumentException("Bad request - check URL and parameters: " + url);
                case 401:
                    throw new SecurityException("Authentication required for: " + url);
                case 403:
                    throw new SecurityException("Access forbidden for: " + url);
                case 404:
                    // ResourceNotFoundException is a custom exception defined elsewhere in your project
                    throw new ResourceNotFoundException("Resource not found: " + url);
                default:
                    throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
            }
        } catch (IOException | InterruptedException e) {
            // Wrap only transport-level failures; the status-code exceptions above propagate unchanged
            throw new RuntimeException("Failed to scrape: " + url, e);
        }
    }
}

401 Unauthorized & 403 Forbidden

These indicate authentication or authorization issues that require different handling strategies.

public class AuthenticatedScraper {
    private final HttpClient client = HttpClient.newHttpClient();
    private String authToken;

    public String scrapeProtectedResource(String url) throws Exception {
        return scrapeProtectedResource(url, false);
    }

    private String scrapeProtectedResource(String url, boolean alreadyRetried) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Authorization", "Bearer " + authToken)
            .header("User-Agent", "JavaScraper/1.0")
            .build();

        HttpResponse<String> response = client.send(request,
            HttpResponse.BodyHandlers.ofString());

        switch (response.statusCode()) {
            case 200:
                return response.body();
            case 401:
                if (alreadyRetried) {
                    throw new SecurityException("Authentication failed even after token refresh");
                }
                // Token might be expired: refresh it and retry exactly once
                refreshAuthToken();
                return scrapeProtectedResource(url, true);
            case 403:
                throw new SecurityException("Access denied - insufficient permissions");
            default:
                throw new RuntimeException("Unexpected status: " + response.statusCode());
        }
    }

    private void refreshAuthToken() {
        // Implementation for token refresh
    }
}

404 Not Found

One of the most common errors in web scraping, indicating the requested resource doesn't exist.

public Optional<String> scrapeOptionalContent(String url) {
    try {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .build();

        HttpResponse<String> response = client.send(request, 
            HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            return Optional.of(response.body());
        } else if (response.statusCode() == 404) {
            System.out.println("Resource not found: " + url);
            return Optional.empty();
        } else {
            throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
        }
    } catch (Exception e) {
        System.err.println("Error scraping " + url + ": " + e.getMessage());
        return Optional.empty();
    }
}

429 Too Many Requests

This status means the server is rate limiting you. Handle it with exponential backoff and retry logic, honoring the Retry-After header when the server provides one.

import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

public class RateLimitAwareScraper {
    private final HttpClient client = HttpClient.newHttpClient();

    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;

    public String scrapeWithRateLimit(String url) throws Exception {
        return scrapeWithRetry(url, 0);
    }

    private String scrapeWithRetry(String url, int retryCount) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .build();

        HttpResponse<String> response = client.send(request, 
            HttpResponse.BodyHandlers.ofString());

        switch (response.statusCode()) {
            case 200:
                return response.body();
            case 429:
                if (retryCount < MAX_RETRIES) {
                    long delay = calculateBackoffDelay(retryCount, response);
                    Thread.sleep(delay);
                    return scrapeWithRetry(url, retryCount + 1);
                } else {
                    throw new RuntimeException("Rate limit exceeded after " + MAX_RETRIES + " retries");
                }
            default:
                throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
        }
    }

    private long calculateBackoffDelay(int retryCount, HttpResponse<String> response) {
        // Check for the Retry-After header (assumed here to be in seconds; an
        // HTTP-date value would fail to parse and fall through to exponential backoff)
        Optional<String> retryAfter = response.headers().firstValue("Retry-After");
        if (retryAfter.isPresent()) {
            try {
                return Long.parseLong(retryAfter.get()) * 1000; // Convert to milliseconds
            } catch (NumberFormatException ignored) {}
        }

        // Exponential backoff with jitter
        long delay = BASE_DELAY_MS * (1L << retryCount);
        long jitter = ThreadLocalRandom.current().nextLong(delay / 4);
        return delay + jitter;
    }
}

Server Error Status Codes

500 Internal Server Error & 502 Bad Gateway

These server-side errors often indicate temporary issues that may resolve with retry attempts.

import java.io.IOException;

public class RobustScraper {
    private final HttpClient client = HttpClient.newHttpClient();
    private static final int MAX_SERVER_ERROR_RETRIES = 2;

    public String scrapeWithServerErrorHandling(String url) throws Exception {
        Exception lastException = null;

        for (int attempt = 0; attempt <= MAX_SERVER_ERROR_RETRIES; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .timeout(Duration.ofSeconds(30))
                    .build();

                HttpResponse<String> response = client.send(request, 
                    HttpResponse.BodyHandlers.ofString());

                switch (response.statusCode()) {
                    case 200:
                        return response.body();
                    case 500:
                    case 502:
                    case 503:
                    case 504:
                        if (attempt < MAX_SERVER_ERROR_RETRIES) {
                            Thread.sleep(2000 * (attempt + 1)); // Progressive delay
                            continue;
                        }
                        throw new RuntimeException("Server error " + response.statusCode() + 
                            " persisted after " + MAX_SERVER_ERROR_RETRIES + " retries");
                    default:
                        throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
                }
            } catch (IOException | InterruptedException e) {
                // Retry only transport-level failures; status-code errors thrown above propagate immediately
                lastException = e;
                if (attempt < MAX_SERVER_ERROR_RETRIES) {
                    Thread.sleep(1000 * (attempt + 1));
                }
            }
        }

        throw new RuntimeException("Failed to scrape after " + MAX_SERVER_ERROR_RETRIES + " attempts", lastException);
    }
}

Comprehensive Status Code Handler

Here's a complete example that handles all major status codes:

public class ComprehensiveWebScraper {
    // Assumes an SLF4J logger (org.slf4j.Logger / LoggerFactory) is on the classpath
    private static final Logger logger = LoggerFactory.getLogger(ComprehensiveWebScraper.class);

    private final HttpClient client = HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NORMAL)
        .connectTimeout(Duration.ofSeconds(10))
        .build();

    public enum ScrapingResult {
        SUCCESS, NOT_FOUND, RATE_LIMITED, SERVER_ERROR, CLIENT_ERROR, NETWORK_ERROR
    }

    public class ScrapingResponse {
        private final ScrapingResult result;
        private final String content;
        private final int statusCode;
        private final String error;

        // Constructor and getters...
    }

    public ScrapingResponse scrape(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (compatible; JavaScraper/1.0)")
                .timeout(Duration.ofSeconds(30))
                .build();

            HttpResponse<String> response = client.send(request, 
                HttpResponse.BodyHandlers.ofString());

            return handleResponse(response, url);

        } catch (Exception e) {
            logger.error("Network error scraping {}: {}", url, e.getMessage());
            return new ScrapingResponse(ScrapingResult.NETWORK_ERROR, null, -1, e.getMessage());
        }
    }

    private ScrapingResponse handleResponse(HttpResponse<String> response, String url) {
        int statusCode = response.statusCode();

        // Success codes
        if (statusCode >= 200 && statusCode < 300) {
            return new ScrapingResponse(ScrapingResult.SUCCESS, response.body(), statusCode, null);
        }

        // Client errors
        if (statusCode >= 400 && statusCode < 500) {
            switch (statusCode) {
                case 404:
                    return new ScrapingResponse(ScrapingResult.NOT_FOUND, null, statusCode, "Resource not found");
                case 429:
                    return new ScrapingResponse(ScrapingResult.RATE_LIMITED, null, statusCode, "Rate limit exceeded");
                default:
                    return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode, 
                        "Client error: " + statusCode);
            }
        }

        // Server errors
        if (statusCode >= 500) {
            return new ScrapingResponse(ScrapingResult.SERVER_ERROR, null, statusCode, 
                "Server error: " + statusCode);
        }

        // Redirects (if not handled automatically)
        if (statusCode >= 300 && statusCode < 400) {
            String location = response.headers().firstValue("Location").orElse("Unknown");
            return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode, 
                "Redirect to: " + location);
        }

        return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode, 
            "Unexpected status code: " + statusCode);
    }
}

Best Practices for Status Code Handling

1. Implement Proper Logging

private void logResponse(String url, int statusCode, String method) {
    if (statusCode >= 200 && statusCode < 300) {
        logger.info("Successfully {} {}: HTTP {}", method, url, statusCode);
    } else if (statusCode >= 400 && statusCode < 500) {
        logger.warn("Client error {} {}: HTTP {}", method, url, statusCode);
    } else if (statusCode >= 500) {
        logger.error("Server error {} {}: HTTP {}", method, url, statusCode);
    }
}

2. Use Circuit Breaker Pattern

For production applications, implement circuit breaker patterns to handle repeated failures gracefully and avoid overwhelming failing services.
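
Below is a minimal hand-rolled sketch of the idea; in production you would more likely reach for a library such as Resilience4j, and the thresholds here are illustrative assumptions. Check allowRequest() before sending and record the outcome afterwards:

// Minimal circuit breaker sketch: stop calling a host for a cool-down period
// after too many consecutive failures. Thresholds are illustrative, not prescriptive.
public class SimpleCircuitBreaker {
    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_DURATION_MS = 60_000;

    private int consecutiveFailures = 0;
    private long openUntil = 0;

    public synchronized boolean allowRequest() {
        // Requests are allowed once the cool-down window has passed
        return System.currentTimeMillis() >= openUntil;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= FAILURE_THRESHOLD) {
            openUntil = System.currentTimeMillis() + OPEN_DURATION_MS; // open the circuit
            consecutiveFailures = 0;
        }
    }
}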

3. Monitor and Alert

Set up monitoring for different status codes to identify patterns and issues:

// Example metrics collection (metricsCollector, Tags and sanitizeUrl are placeholders
// for whatever metrics library you use, e.g. a Micrometer registry wrapper)
public void recordStatusCode(int statusCode, String url) {
    String category = getStatusCategory(statusCode);
    metricsCollector.increment("scraping.status." + category, 
        Tags.of("url", sanitizeUrl(url)));
}

private String getStatusCategory(int statusCode) {
    if (statusCode >= 200 && statusCode < 300) return "success";
    if (statusCode >= 300 && statusCode < 400) return "redirect";
    if (statusCode >= 400 && statusCode < 500) return "client_error";
    if (statusCode >= 500) return "server_error";
    return "unknown";
}

Conclusion

Proper HTTP status code handling is essential for building reliable Java web scraping applications. By implementing comprehensive error handling, retry logic, and monitoring, you can create scrapers that gracefully handle various scenarios and provide valuable feedback about their operation. Remember to always respect rate limits, implement appropriate delays, and follow the target website's robots.txt and terms of service.

For complex scraping scenarios involving JavaScript-heavy sites, you might need to consider browser automation tools alongside HTTP client libraries. Always test your error handling thoroughly and monitor your scrapers in production to ensure they perform reliably across different conditions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
