How can I scrape data from REST APIs using Java?
Scraping data from REST APIs in Java involves making HTTP requests to API endpoints and processing the JSON or XML responses. Unlike traditional web scraping that parses HTML content, API scraping provides structured data that's easier to work with. This guide covers the most effective approaches using Java's built-in HTTP client and popular third-party libraries.
Understanding REST API Scraping vs Web Scraping
REST API scraping differs significantly from traditional web scraping. While web scraping extracts data from HTML pages, API scraping retrieves structured data directly from endpoints. This approach is more reliable, efficient, and less likely to break when websites change their frontend design.
Core HTTP Clients for Java API Scraping
1. Java 11+ HttpClient (Recommended)
Java's built-in HttpClient, introduced in Java 11, provides a modern, asynchronous approach to HTTP requests:
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import java.time.Duration;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
public class ApiScraper {
private final HttpClient httpClient;
private final ObjectMapper objectMapper;
public ApiScraper() {
this.httpClient = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.followRedirects(HttpClient.Redirect.NORMAL)
.build();
this.objectMapper = new ObjectMapper();
}
public JsonNode fetchApiData(String apiUrl) throws Exception {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(apiUrl))
.header("Accept", "application/json")
.header("User-Agent", "Java-ApiScraper/1.0")
.timeout(Duration.ofSeconds(30))
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
if (response.statusCode() == 200) {
return objectMapper.readTree(response.body());
} else {
throw new RuntimeException("API request failed: " + response.statusCode());
}
}
}
2. OkHttp Library
OkHttp provides excellent performance and features for HTTP requests:
import okhttp3.*;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
public class OkHttpApiScraper {
private final OkHttpClient client;
private final ObjectMapper objectMapper;
public OkHttpApiScraper() {
this.client = new OkHttpClient.Builder()
.connectTimeout(10, TimeUnit.SECONDS)
.readTimeout(30, TimeUnit.SECONDS)
.addInterceptor(new RetryInterceptor()) // custom retry interceptor, sketched below
.build();
this.objectMapper = new ObjectMapper();
}
public JsonNode scrapeEndpoint(String url) throws IOException {
Request request = new Request.Builder()
.url(url)
.addHeader("Accept", "application/json")
.addHeader("User-Agent", "OkHttp-Scraper/1.0")
.build();
try (Response response = client.newCall(request).execute()) {
if (!response.isSuccessful()) {
throw new IOException("Request failed: " + response.code());
}
return objectMapper.readTree(response.body().string());
}
}
}
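RetryInterceptor is not part of OkHttp; it stands for a custom application interceptor you supply yourself. A minimal sketch that retries failed calls up to three times (the retry count and strategy are assumptions, tune them to your API):
import okhttp3.Interceptor;
import okhttp3.Response;
import java.io.IOException;

public class RetryInterceptor implements Interceptor {
    private static final int MAX_RETRIES = 3;

    @Override
    public Response intercept(Chain chain) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                Response response = chain.proceed(chain.request());
                if (response.isSuccessful() || attempt == MAX_RETRIES) {
                    return response;
                }
                response.close(); // release the previous body before retrying
            } catch (IOException e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw lastFailure != null ? lastFailure
                : new IOException("Request failed after " + MAX_RETRIES + " retries");
    }
}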
Handling Authentication
API Key Authentication
Most APIs require authentication. Many accept the key as a bearer token, while others expect a custom header such as X-API-Key; check the provider's documentation. Here's how to send an API key as a bearer token:
public class AuthenticatedApiScraper {
private final HttpClient httpClient;
private final String apiKey;
public AuthenticatedApiScraper(String apiKey) {
this.httpClient = HttpClient.newHttpClient();
this.apiKey = apiKey;
}
public JsonNode fetchWithApiKey(String endpoint) throws Exception {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(endpoint))
.header("Authorization", "Bearer " + apiKey)
.header("Accept", "application/json")
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
return new ObjectMapper().readTree(response.body());
}
}
OAuth 2.0 Authentication
For OAuth 2.0 protected APIs:
import java.util.Base64;
public class OAuth2ApiScraper {
private String accessToken;
private final HttpClient httpClient;
public OAuth2ApiScraper() {
this.httpClient = HttpClient.newHttpClient();
}
public void authenticate(String clientId, String clientSecret, String tokenUrl)
throws Exception {
String credentials = Base64.getEncoder()
.encodeToString((clientId + ":" + clientSecret).getBytes());
HttpRequest tokenRequest = HttpRequest.newBuilder()
.uri(URI.create(tokenUrl))
.header("Authorization", "Basic " + credentials)
.header("Content-Type", "application/x-www-form-urlencoded")
.POST(HttpRequest.BodyPublishers.ofString("grant_type=client_credentials"))
.build();
HttpResponse<String> response = httpClient.send(tokenRequest,
HttpResponse.BodyHandlers.ofString());
JsonNode tokenResponse = new ObjectMapper().readTree(response.body());
this.accessToken = tokenResponse.get("access_token").asText();
}
public JsonNode fetchProtectedResource(String endpoint) throws Exception {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(endpoint))
.header("Authorization", "Bearer " + accessToken)
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
return new ObjectMapper().readTree(response.body());
}
}
Implementing Rate Limiting and Retry Logic
Professional API scraping requires proper rate limiting and error handling:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.Semaphore;
import java.util.concurrent.CompletableFuture;
import java.time.Duration;
public class RateLimitedApiScraper {
private final HttpClient httpClient;
private final Semaphore rateLimiter;
private final int maxRetries = 3;
public RateLimitedApiScraper(int requestsPerSecond) {
this.httpClient = HttpClient.newHttpClient();
this.rateLimiter = new Semaphore(requestsPerSecond);
}
public JsonNode fetchWithRateLimit(String url) throws Exception {
rateLimiter.acquire();
try {
return executeWithRetry(url, 0);
} finally {
// Release permit after 1 second
CompletableFuture.delayedExecutor(1, TimeUnit.SECONDS)
.execute(rateLimiter::release);
}
}
private JsonNode executeWithRetry(String url, int attempt) throws Exception {
try {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(url))
.header("Accept", "application/json")
.timeout(Duration.ofSeconds(30))
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
if (response.statusCode() == 429) { // Rate limited
if (attempt < maxRetries) {
Thread.sleep((1L << attempt) * 1000); // Exponential backoff
return executeWithRetry(url, attempt + 1);
}
throw new RuntimeException("Rate limit exceeded after retries");
}
return new ObjectMapper().readTree(response.body());
} catch (Exception e) {
if (attempt < maxRetries) {
Thread.sleep(2000);
return executeWithRetry(url, attempt + 1);
}
throw e;
}
}
}
Handling Pagination
Many APIs implement pagination to limit response sizes. Here's how to handle common pagination patterns:
Offset-based Pagination
import java.util.List;
import java.util.ArrayList;
public class PaginatedApiScraper {
private final HttpClient httpClient = HttpClient.newHttpClient();
private final ObjectMapper objectMapper = new ObjectMapper();
public List<JsonNode> scrapeAllPages(String baseUrl, int pageSize) throws Exception {
List<JsonNode> allData = new ArrayList<>();
int offset = 0;
boolean hasMoreData = true;
while (hasMoreData) {
String url = String.format("%s?limit=%d&offset=%d", baseUrl, pageSize, offset);
JsonNode response = fetchPage(url);
JsonNode data = response.get("data");
if (data != null && data.isArray() && data.size() > 0) {
for (JsonNode item : data) {
allData.add(item);
}
offset += pageSize;
} else {
hasMoreData = false;
}
// Respect rate limits
Thread.sleep(1000);
}
return allData;
}
private JsonNode fetchPage(String url) throws Exception {
HttpRequest request = HttpRequest.newBuilder()
.uri(URI.create(url))
.header("Accept", "application/json")
.GET()
.build();
HttpResponse<String> response = httpClient.send(request,
HttpResponse.BodyHandlers.ofString());
return objectMapper.readTree(response.body());
}
}
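Cursor-based pagination is also common: each response carries a token pointing at the next page. A minimal sketch for the same class, assuming the response exposes a nextCursor field (the field and parameter names vary between APIs):
public List<JsonNode> scrapeAllCursorPages(String baseUrl) throws Exception {
    List<JsonNode> allData = new ArrayList<>();
    String cursor = null;
    do {
        // Only append the cursor parameter once the API has returned one
        String url = (cursor == null) ? baseUrl : baseUrl + "?cursor=" + cursor;
        JsonNode response = fetchPage(url);
        JsonNode data = response.get("data");
        if (data != null && data.isArray()) {
            data.forEach(allData::add);
        }
        // "nextCursor" is an assumed field name; check the API documentation
        JsonNode next = response.get("nextCursor");
        cursor = (next == null || next.isNull()) ? null : next.asText();
        Thread.sleep(1000); // respect rate limits between pages
    } while (cursor != null);
    return allData;
}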
Processing and Storing Data
Convert API responses to Java objects for easier processing:
import com.fasterxml.jackson.databind.DeserializationFeature;
// Define data models
public class Product {
private String id;
private String name;
private double price;
private String category;
// Constructors, getters, setters
public Product() {}
public String getId() { return id; }
public void setId(String id) { this.id = id; }
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public double getPrice() { return price; }
public void setPrice(double price) { this.price = price; }
public String getCategory() { return category; }
public void setCategory(String category) { this.category = category; }
}
public class ApiDataProcessor {
private final ObjectMapper objectMapper;
public ApiDataProcessor() {
this.objectMapper = new ObjectMapper();
objectMapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
}
public List<Product> processProductData(JsonNode apiResponse) throws Exception {
List<Product> products = new ArrayList<>();
JsonNode items = apiResponse.get("items");
if (items != null && items.isArray()) {
for (JsonNode item : items) {
Product product = objectMapper.treeToValue(item, Product.class);
products.add(product);
}
}
return products;
}
public void saveToDatabase(List<Product> products) {
// Database saving logic using JPA, JDBC, etc.
products.forEach(product -> {
// Save individual product
System.out.println("Saving: " + product.getName());
});
}
}
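The saveToDatabase method above is a placeholder. A minimal JDBC sketch, assuming a products table with id, name, price, and category columns; the connection URL and credentials are placeholders for your own database:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ProductRepository {
    public void saveAll(List<Product> products) throws SQLException {
        String sql = "INSERT INTO products (id, name, price, category) VALUES (?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/scraper", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (Product product : products) {
                stmt.setString(1, product.getId());
                stmt.setString(2, product.getName());
                stmt.setDouble(3, product.getPrice());
                stmt.setString(4, product.getCategory());
                stmt.addBatch(); // queue the row
            }
            stmt.executeBatch(); // write all queued rows in a single round trip
        }
    }
}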
Advanced Configuration with Spring WebClient
For Spring-based applications, WebClient provides excellent reactive support:
import org.springframework.web.reactive.function.client.WebClient;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.HttpStatus;
import reactor.core.publisher.Mono;
import reactor.core.publisher.Flux;
import com.fasterxml.jackson.databind.JsonNode;
import java.time.Duration;
import org.springframework.stereotype.Component;
@Component
public class SpringApiScraper {
private final WebClient webClient;
public SpringApiScraper() {
this.webClient = WebClient.builder()
.baseUrl("https://api.example.com")
.defaultHeader(HttpHeaders.USER_AGENT, "Spring-WebClient/1.0")
.defaultHeader(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
.codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(1024 * 1024))
.build();
}
public Mono<JsonNode> fetchAsync(String endpoint) {
return webClient.get()
.uri(endpoint)
.retrieve()
.onStatus(status -> status.isError(), response -> {
return Mono.error(new RuntimeException("API Error: " + response.statusCode()));
})
.bodyToMono(JsonNode.class)
.timeout(Duration.ofSeconds(30));
}
public Flux<Product> streamProducts(String endpoint) {
return webClient.get()
.uri(endpoint)
.retrieve()
.bodyToFlux(Product.class)
.onErrorResume(throwable -> {
// Handle errors gracefully
return Flux.empty();
});
}
}
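A short usage sketch; the /products/123 path is illustrative, and block() is only acceptable in simple batch-style code, while a reactive application would compose the Mono instead:
SpringApiScraper scraper = new SpringApiScraper();

// Non-blocking: subscribe and handle the payload when it arrives
scraper.fetchAsync("/products/123")
        .subscribe(json -> System.out.println("Received: " + json));

// Blocking variant for simple batch-style code
JsonNode result = scraper.fetchAsync("/products/123").block();
System.out.println(result);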
Best Practices for Production API Scraping
1. Connection Pooling and Resource Management
import java.util.concurrent.Executors;
public class ProductionApiScraper {
private final HttpClient httpClient;
public ProductionApiScraper() {
this.httpClient = HttpClient.newBuilder()
.connectTimeout(Duration.ofSeconds(10))
.followRedirects(HttpClient.Redirect.NORMAL)
.executor(Executors.newFixedThreadPool(10))
.build();
}
// Always close resources properly (HttpClient implements AutoCloseable only from Java 21, hence the runtime check)
public void shutdown() {
if (httpClient instanceof AutoCloseable) {
try {
((AutoCloseable) httpClient).close();
} catch (Exception e) {
// Log error
}
}
}
}
2. Comprehensive Error Handling
public enum ApiErrorType {
RATE_LIMIT_EXCEEDED,
AUTHENTICATION_FAILED,
NETWORK_ERROR,
PARSE_ERROR,
UNKNOWN_ERROR
}
public class ApiScrapingException extends Exception {
private final ApiErrorType errorType;
private final int statusCode;
public ApiScrapingException(ApiErrorType errorType, String message, int statusCode) {
super(message);
this.errorType = errorType;
this.statusCode = statusCode;
}
public ApiErrorType getErrorType() { return errorType; }
public int getStatusCode() { return statusCode; }
}
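One way to put these types to work is a small factory that translates HTTP status codes into typed exceptions before they propagate; a sketch:
public class ApiErrors {
    public static ApiScrapingException fromStatusCode(int statusCode, String body) {
        switch (statusCode) {
            case 401:
            case 403:
                return new ApiScrapingException(ApiErrorType.AUTHENTICATION_FAILED,
                        "Authentication failed: " + body, statusCode);
            case 429:
                return new ApiScrapingException(ApiErrorType.RATE_LIMIT_EXCEEDED,
                        "Rate limit exceeded", statusCode);
            default:
                return new ApiScrapingException(ApiErrorType.UNKNOWN_ERROR,
                        "Unexpected status " + statusCode + ": " + body, statusCode);
        }
    }
}
Callers can then switch on getErrorType() to decide whether to retry, re-authenticate, or give up.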
Monitoring and Logging
Implement comprehensive logging for production environments:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
public class MonitoredApiScraper {
private static final Logger logger = LoggerFactory.getLogger(MonitoredApiScraper.class);
private final MeterRegistry meterRegistry;
public MonitoredApiScraper(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
}
public JsonNode fetchWithMetrics(String endpoint) throws Exception {
Timer.Sample sample = Timer.start(meterRegistry);
try {
logger.info("Fetching data from endpoint: {}", endpoint);
JsonNode result = performRequest(endpoint);
meterRegistry.counter("api.requests.success").increment();
logger.info("Successfully fetched data from: {}", endpoint);
return result;
} catch (Exception e) {
meterRegistry.counter("api.requests.error").increment();
logger.error("Failed to fetch data from: {}", endpoint, e);
throw e;
} finally {
sample.stop(Timer.builder("api.request.duration").register(meterRegistry));
}
}
private JsonNode performRequest(String endpoint) throws Exception {
// Implementation details
return new ObjectMapper().createObjectNode();
}
}
Handling Complex Scenarios
When APIs don't provide all the data you need, you might need to combine API scraping with traditional web scraping techniques. For dynamic content that requires JavaScript execution, consider using tools that can handle AJAX requests and dynamic loading.
Combining API and Web Scraping
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
public class HybridScraper {
private final ApiScraper apiScraper;
private final WebDriver webDriver;
public HybridScraper() {
this.apiScraper = new ApiScraper();
// Initialize WebDriver for cases where API data is insufficient
this.webDriver = new ChromeDriver();
}
// CombinedData and combineData(...) are illustrative placeholders for your own merge logic
public CombinedData scrapeComplete(String productId) throws Exception {
// First, try to get data from API
JsonNode apiData = apiScraper.fetchApiData("https://example.com/api/products/" + productId);
// If API doesn't have all required data, scrape the web page
if (needsWebScraping(apiData)) {
String webData = scrapeWebPage("/products/" + productId);
return combineData(apiData, webData);
}
return new CombinedData(apiData);
}
private boolean needsWebScraping(JsonNode apiData) {
// Logic to determine if web scraping is needed
return apiData.get("reviews") == null;
}
private String scrapeWebPage(String url) {
webDriver.get("https://example.com" + url);
return webDriver.findElement(By.className("reviews")).getText();
}
}
Console Commands and Testing
Use these Maven dependencies in your pom.xml:
<dependencies>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.15.2</version>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>4.11.0</version>
</dependency>
</dependencies>
Test your API scraper with these commands:
# Compile the project
mvn compile
# Run tests
mvn test
# Run with specific API endpoint
java -cp target/classes ApiScraper https://api.example.com/data
# Monitor HTTP traffic (useful for debugging)
java -cp target/classes -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8080 -Dhttps.proxyHost=localhost -Dhttps.proxyPort=8080 ApiScraper https://api.example.com/data
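The run commands above assume ApiScraper defines a main method; a minimal sketch that prints the fetched JSON:
public static void main(String[] args) throws Exception {
    // args[0] is the endpoint URL, e.g. https://api.example.com/data
    ApiScraper scraper = new ApiScraper();
    JsonNode result = scraper.fetchApiData(args[0]);
    System.out.println(result.toPrettyString());
}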
Performance Optimization Tips
- Use connection pooling to reuse HTTP connections
- Implement caching for frequently accessed endpoints
- Use parallel processing for multiple API calls when rate limits allow (see the sketch after this list)
- Monitor memory usage when processing large datasets
- Implement circuit breakers for unreliable APIs
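A minimal sketch of parallel fetching with the asynchronous HttpClient and CompletableFuture; the endpoint list is illustrative, and you should only fan out like this when the API's rate limits allow it:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelFetcher {
    private final HttpClient httpClient = HttpClient.newHttpClient();

    // Start all requests concurrently, then wait for every response body
    public List<String> fetchAll(List<String> urls) {
        List<CompletableFuture<String>> futures = urls.stream()
                .map(url -> HttpRequest.newBuilder().uri(URI.create(url)).GET().build())
                .map(request -> httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body))
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join) // blocks until each response arrives
                .collect(Collectors.toList());
    }
}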
For applications requiring browser-level interaction or handling complex authentication flows, consider integrating your Java API scraper with browser automation tools.
Conclusion
Scraping data from REST APIs using Java provides a robust, scalable approach to data collection. The built-in HttpClient in Java 11+ offers excellent performance for most use cases, while libraries like OkHttp and Spring WebClient provide additional features for complex scenarios.
Key considerations for successful API scraping include proper authentication handling, respect for rate limits, comprehensive error handling, and efficient data processing. When APIs don't provide sufficient data, consider combining API scraping with traditional web scraping techniques for complete data collection solutions.
Remember to always monitor your applications, implement proper logging, and ensure your scraping activities comply with the API provider's terms of service and rate limiting policies.