What are the Common HTTP Status Codes I Should Handle in Java Web Scraping?
When building web scraping applications in Java, understanding and properly handling HTTP status codes is crucial for creating robust and reliable scrapers. HTTP status codes provide valuable information about the success or failure of your requests, allowing you to implement appropriate error handling, retry logic, and graceful degradation strategies.
Understanding HTTP Status Code Categories
HTTP status codes are three-digit numbers grouped into five categories:
- 1xx (Informational): Request received, continuing process
- 2xx (Success): Request successfully received, understood, and accepted
- 3xx (Redirection): Further action must be taken to complete the request
- 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
- 5xx (Server Error): Server failed to fulfill a valid request
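Because the category is encoded in the first digit, you can classify any status code with integer division; a minimal helper:

```java
// Classify a status code by its first digit (e.g. 404 / 100 == 4 -> client error).
static String categoryOf(int statusCode) {
    switch (statusCode / 100) {
        case 1: return "Informational";
        case 2: return "Success";
        case 3: return "Redirection";
        case 4: return "Client Error";
        case 5: return "Server Error";
        default: return "Unknown";
    }
}
```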
Essential Success Status Codes
200 OK
The most common success status code indicates that the request was successful and the server returned the requested data.
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class WebScraper {
    private final HttpClient client;

    public WebScraper() {
        this.client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    public String scrapeContent(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            return response.body();
        } else {
            throw new RuntimeException("Unexpected status code: " + response.statusCode());
        }
    }
}
```
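Usage is straightforward (the target URL here is purely illustrative):

```java
WebScraper scraper = new WebScraper();
String html = scraper.scrapeContent("https://example.com"); // illustrative target
```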
204 No Content
Indicates a successful request with no response body. Common with form submissions and tracking endpoints.
```java
public boolean submitForm(String url, String formData) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formData))
            .build();

    // Reuses the HttpClient field shown in the previous example.
    HttpResponse<String> response = client.send(request,
            HttpResponse.BodyHandlers.ofString());

    // Treat both 200 and 204 as success; a 204 simply carries no body.
    return response.statusCode() == 200 || response.statusCode() == 204;
}
```
Critical Redirection Status Codes
301 Moved Permanently & 302 Found
These redirects are common and can usually be followed automatically. Note that Java's HttpClient does not follow redirects by default (the default policy is Redirect.NEVER), so enable redirect handling explicitly; Redirect.NORMAL follows redirects except from HTTPS to HTTP URLs.
```java
public class RedirectAwareScraper {
    private final HttpClient client;

    public RedirectAwareScraper() {
        this.client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    public ScrapingResult scrapeWithRedirectTracking(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        return new ScrapingResult(
                response.body(),
                response.statusCode(),
                response.uri().toString() // Final URL after redirects
        );
    }

    static class ScrapingResult {
        private final String content;
        private final int statusCode;
        private final String finalUrl;

        public ScrapingResult(String content, int statusCode, String finalUrl) {
            this.content = content;
            this.statusCode = statusCode;
            this.finalUrl = finalUrl;
        }

        // Getters...
    }
}
```
Client Error Status Codes to Handle
400 Bad Request
Indicates malformed request syntax. Often caused by invalid parameters or headers.
```java
import java.io.IOException;
import java.util.Map;

public class ErrorHandlingScraper {
    private final HttpClient client = HttpClient.newHttpClient();

    public String scrapeWithErrorHandling(String url, Map<String, String> headers) {
        try {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                    .uri(URI.create(url));

            // Add custom headers
            headers.forEach(requestBuilder::header);

            HttpRequest request = requestBuilder.build();
            HttpResponse<String> response = client.send(request,
                    HttpResponse.BodyHandlers.ofString());

            switch (response.statusCode()) {
                case 200:
                    return response.body();
                case 400:
                    throw new IllegalArgumentException("Bad request - check URL and parameters: " + url);
                case 401:
                    throw new SecurityException("Authentication required for: " + url);
                case 403:
                    throw new SecurityException("Access forbidden for: " + url);
                case 404:
                    // Custom unchecked exception, assumed to be defined elsewhere in your project.
                    throw new ResourceNotFoundException("Resource not found: " + url);
                default:
                    throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
            }
        } catch (IOException | InterruptedException e) {
            // Catch only transport-level failures so the status-specific
            // exceptions thrown above propagate unwrapped.
            throw new RuntimeException("Failed to scrape: " + url, e);
        }
    }
}
```
401 Unauthorized & 403 Forbidden
These indicate authentication or authorization issues that require different handling strategies.
```java
public class AuthenticatedScraper {
    private final HttpClient client = HttpClient.newHttpClient();
    private String authToken;

    public String scrapeProtectedResource(String url) throws Exception {
        return scrapeProtectedResource(url, true);
    }

    // The allowRetry flag bounds the refresh-and-retry to a single attempt,
    // preventing unbounded recursion if the server keeps returning 401.
    private String scrapeProtectedResource(String url, boolean allowRetry) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Bearer " + authToken)
                .header("User-Agent", "JavaScraper/1.0")
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        switch (response.statusCode()) {
            case 200:
                return response.body();
            case 401:
                if (allowRetry) {
                    // Token might be expired: refresh it and retry exactly once
                    refreshAuthToken();
                    return scrapeProtectedResource(url, false);
                }
                throw new SecurityException("Authentication failed even after token refresh");
            case 403:
                throw new SecurityException("Access denied - insufficient permissions");
            default:
                throw new RuntimeException("Unexpected status: " + response.statusCode());
        }
    }

    private void refreshAuthToken() {
        // Implementation for token refresh
    }
}
```
404 Not Found
One of the most common errors in web scraping, indicating the requested resource doesn't exist.
```java
public Optional<String> scrapeOptionalContent(String url) {
    try {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            return Optional.of(response.body());
        } else if (response.statusCode() == 404) {
            System.out.println("Resource not found: " + url);
            return Optional.empty();
        } else {
            throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
        }
    } catch (Exception e) {
        System.err.println("Error scraping " + url + ": " + e.getMessage());
        return Optional.empty();
    }
}
```
429 Too Many Requests
This status signals that you are sending requests too quickly. Handle it with retry logic and exponential backoff, honoring the Retry-After header when the server provides one.
```java
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

public class RateLimitAwareScraper {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;

    private final HttpClient client = HttpClient.newHttpClient();

    public String scrapeWithRateLimit(String url) throws Exception {
        return scrapeWithRetry(url, 0);
    }

    private String scrapeWithRetry(String url, int retryCount) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        switch (response.statusCode()) {
            case 200:
                return response.body();
            case 429:
                if (retryCount < MAX_RETRIES) {
                    long delay = calculateBackoffDelay(retryCount, response);
                    Thread.sleep(delay);
                    return scrapeWithRetry(url, retryCount + 1);
                } else {
                    throw new RuntimeException("Rate limit exceeded after " + MAX_RETRIES + " retries");
                }
            default:
                throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
        }
    }

    private long calculateBackoffDelay(int retryCount, HttpResponse<String> response) {
        // Honor the Retry-After header if present. Note that Retry-After may also
        // be an HTTP-date rather than seconds; that case falls through to backoff here.
        Optional<String> retryAfter = response.headers().firstValue("Retry-After");
        if (retryAfter.isPresent()) {
            try {
                return Long.parseLong(retryAfter.get()) * 1000; // Convert seconds to milliseconds
            } catch (NumberFormatException ignored) {}
        }

        // Exponential backoff with jitter
        long delay = BASE_DELAY_MS * (1L << retryCount);
        long jitter = ThreadLocalRandom.current().nextLong(delay / 4);
        return delay + jitter;
    }
}
```
Server Error Status Codes
500 Internal Server Error & 502 Bad Gateway
These server-side errors often indicate temporary issues that may resolve with retry attempts.
```java
import java.io.IOException;

public class RobustScraper {
    private static final int MAX_SERVER_ERROR_RETRIES = 2;

    private final HttpClient client = HttpClient.newHttpClient();

    public String scrapeWithServerErrorHandling(String url) throws Exception {
        Exception lastException = null;

        for (int attempt = 0; attempt <= MAX_SERVER_ERROR_RETRIES; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(url))
                        .timeout(Duration.ofSeconds(30))
                        .build();

                HttpResponse<String> response = client.send(request,
                        HttpResponse.BodyHandlers.ofString());

                switch (response.statusCode()) {
                    case 200:
                        return response.body();
                    case 500:
                    case 502:
                    case 503:
                    case 504:
                        if (attempt < MAX_SERVER_ERROR_RETRIES) {
                            Thread.sleep(2000 * (attempt + 1)); // Progressive delay
                            continue; // retry the loop
                        }
                        throw new RuntimeException("Server error " + response.statusCode() +
                                " persisted after " + MAX_SERVER_ERROR_RETRIES + " retries");
                    default:
                        throw new RuntimeException("HTTP " + response.statusCode() + " for: " + url);
                }
            } catch (IOException | InterruptedException e) {
                // Retry only on transport failures; the status-based exceptions
                // above propagate immediately instead of triggering retries.
                lastException = e;
                if (attempt < MAX_SERVER_ERROR_RETRIES) {
                    Thread.sleep(1000 * (attempt + 1));
                }
            }
        }
        throw new RuntimeException("Failed to scrape after " +
                (MAX_SERVER_ERROR_RETRIES + 1) + " attempts", lastException);
    }
}
```
Comprehensive Status Code Handler
Here's a complete example that handles all major status codes:
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ComprehensiveWebScraper {
    private final HttpClient client = HttpClient.newHttpClient();
    private static final Logger logger = LoggerFactory.getLogger(ComprehensiveWebScraper.class);

    public enum ScrapingResult {
        SUCCESS, NOT_FOUND, RATE_LIMITED, SERVER_ERROR, CLIENT_ERROR, NETWORK_ERROR
    }

    public static class ScrapingResponse {
        private final ScrapingResult result;
        private final String content;
        private final int statusCode;
        private final String error;

        // Constructor and getters...
    }

    public ScrapingResponse scrape(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("User-Agent", "Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .timeout(Duration.ofSeconds(30))
                    .build();

            HttpResponse<String> response = client.send(request,
                    HttpResponse.BodyHandlers.ofString());
            return handleResponse(response, url);
        } catch (Exception e) {
            logger.error("Network error scraping {}: {}", url, e.getMessage());
            return new ScrapingResponse(ScrapingResult.NETWORK_ERROR, null, -1, e.getMessage());
        }
    }

    private ScrapingResponse handleResponse(HttpResponse<String> response, String url) {
        int statusCode = response.statusCode();

        // Success codes
        if (statusCode >= 200 && statusCode < 300) {
            return new ScrapingResponse(ScrapingResult.SUCCESS, response.body(), statusCode, null);
        }

        // Client errors
        if (statusCode >= 400 && statusCode < 500) {
            switch (statusCode) {
                case 404:
                    return new ScrapingResponse(ScrapingResult.NOT_FOUND, null, statusCode, "Resource not found");
                case 429:
                    return new ScrapingResponse(ScrapingResult.RATE_LIMITED, null, statusCode, "Rate limit exceeded");
                default:
                    return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode,
                            "Client error: " + statusCode);
            }
        }

        // Server errors
        if (statusCode >= 500) {
            return new ScrapingResponse(ScrapingResult.SERVER_ERROR, null, statusCode,
                    "Server error: " + statusCode);
        }

        // Redirects (only reached if the client is not following them automatically)
        if (statusCode >= 300 && statusCode < 400) {
            String location = response.headers().firstValue("Location").orElse("Unknown");
            return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode,
                    "Redirect to: " + location);
        }

        return new ScrapingResponse(ScrapingResult.CLIENT_ERROR, null, statusCode,
                "Unexpected status code: " + statusCode);
    }
}
```
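Calling code can then branch on the result enum instead of catching exceptions. A usage sketch, assuming the elided ScrapingResponse constructor plus getResult() and getContent() getters:

```java
// Hypothetical usage; getResult() and getContent() assume the elided getters.
ComprehensiveWebScraper scraper = new ComprehensiveWebScraper();
ComprehensiveWebScraper.ScrapingResponse res = scraper.scrape("https://example.com/page"); // illustrative URL

switch (res.getResult()) {
    case SUCCESS:
        // process res.getContent()
        break;
    case RATE_LIMITED:
        // back off and reschedule the URL
        break;
    case NOT_FOUND:
        // skip this URL permanently
        break;
    default:
        // log and decide whether to retry
        break;
}
```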
Best Practices for Status Code Handling
1. Implement Proper Logging
```java
private void logResponse(String url, int statusCode, String method) {
    if (statusCode >= 200 && statusCode < 300) {
        logger.info("Successfully {} {}: HTTP {}", method, url, statusCode);
    } else if (statusCode >= 400 && statusCode < 500) {
        logger.warn("Client error {} {}: HTTP {}", method, url, statusCode);
    } else if (statusCode >= 500) {
        logger.error("Server error {} {}: HTTP {}", method, url, statusCode);
    }
}
```
2. Use Circuit Breaker Pattern
For production applications, implement circuit breaker patterns to handle repeated failures gracefully and avoid overwhelming failing services.
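A minimal, illustrative sketch of the pattern follows; the threshold and cool-down values are assumptions, and production code would typically use a library such as Resilience4j instead:

```java
// Minimal circuit-breaker sketch. FAILURE_THRESHOLD and OPEN_DURATION_MS are
// illustrative assumptions, not recommended values.
public class SimpleCircuitBreaker {
    private static final int FAILURE_THRESHOLD = 5;      // trip after 5 consecutive failures
    private static final long OPEN_DURATION_MS = 30_000; // stay open for 30 seconds

    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public synchronized boolean allowRequest() {
        if (consecutiveFailures < FAILURE_THRESHOLD) {
            return true; // closed: requests proceed normally
        }
        if (System.currentTimeMillis() - openedAt >= OPEN_DURATION_MS) {
            consecutiveFailures = FAILURE_THRESHOLD - 1; // half-open: allow one probe request
            return true;
        }
        return false; // open: fail fast instead of hitting the struggling server
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0; // close the breaker again
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= FAILURE_THRESHOLD) {
            openedAt = System.currentTimeMillis(); // (re)open the breaker
        }
    }
}
```

Wrap each request in allowRequest()/recordSuccess()/recordFailure() calls, keyed per host, so a failing site is skipped until its cool-down expires.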
3. Monitor and Alert
Set up monitoring for different status codes to identify patterns and issues:
```java
// Example metrics collection; metricsCollector and sanitizeUrl are placeholders
// for your own metrics facade (e.g. a thin wrapper around Micrometer).
public void recordStatusCode(int statusCode, String url) {
    String category = getStatusCategory(statusCode);
    metricsCollector.increment("scraping.status." + category,
            Tags.of("url", sanitizeUrl(url)));
}

private String getStatusCategory(int statusCode) {
    if (statusCode >= 200 && statusCode < 300) return "success";
    if (statusCode >= 300 && statusCode < 400) return "redirect";
    if (statusCode >= 400 && statusCode < 500) return "client_error";
    if (statusCode >= 500) return "server_error";
    return "unknown";
}
```
Conclusion
Proper HTTP status code handling is essential for building reliable Java web scraping applications. By implementing comprehensive error handling, retry logic, and monitoring, you can create scrapers that gracefully handle various scenarios and provide valuable feedback about their operation. Remember to always respect rate limits, implement appropriate delays, and follow the target website's robots.txt and terms of service.
For complex scraping scenarios involving JavaScript-heavy sites, you might need to consider browser automation tools alongside HTTP client libraries. Always test your error handling thoroughly and monitor your scrapers in production to ensure they perform reliably across different conditions.