How do I set custom HTTP headers with jsoup?

Setting Custom HTTP Headers with Jsoup

Jsoup is a powerful HTML parsing library for Java that allows you to scrape and parse HTML from web pages. When web scraping, you often need to set custom HTTP headers to simulate browser requests, handle authentication, bypass basic anti-bot measures, or interact with web servers that require specific headers.

Basic Header Setting

To set custom HTTP headers with Jsoup, use the header() method on the Connection object before executing the request.

Quick Example

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Connection;

public class BasicHeaderExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "My Custom Bot 1.0")
                    .header("Accept", "text/html")
                    .get();

            System.out.println(doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Setup and Dependencies

Maven Dependency

Add Jsoup to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>

Gradle Dependency

For Gradle projects, add to your build.gradle:

implementation 'org.jsoup:jsoup:1.17.2'

Common Header Scenarios

1. Browser Simulation

Simulate a real browser to avoid detection:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserSimulation {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8")
                    .header("Accept-Language", "en-US,en;q=0.5")
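                    // Jsoup 1.17.2 decodes gzip and deflate only; advertising "br" risks a Brotli response it cannot read
                    .header("Accept-Encoding", "gzip, deflate")
                    // The JDK's HttpURLConnection treats Connection as a restricted header and may ignore this value;
                    // keep-alive is its default behavior anyway
                    .header("Connection", "keep-alive")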
                    .header("Upgrade-Insecure-Requests", "1")
                    .header("Sec-Fetch-Dest", "document")
                    .header("Sec-Fetch-Mode", "navigate")
                    .header("Sec-Fetch-Site", "none")
                    .header("Cache-Control", "max-age=0")
                    .get();

            System.out.println("Page title: " + doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. API Authentication

Handle authentication headers:

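import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AuthenticationExample {
    public static void main(String[] args) {
        try {
            // Bearer token authentication.
            // ignoreContentType(true) is required for JSON APIs: by default Jsoup
            // rejects responses that are not text/* or XML with UnsupportedMimeTypeException.
            Connection.Response bearer = Jsoup.connect("https://api.example.com/data")
                    .header("Authorization", "Bearer your-api-token-here")
                    .header("Accept", "application/json") // a GET has no body, so Accept rather than Content-Type
                    .ignoreContentType(true)
                    .execute();
            System.out.println(bearer.body());

            // Basic authentication (alternative method), encoded with an explicit charset
            String credentials = Base64.getEncoder()
                    .encodeToString("username:password".getBytes(StandardCharsets.UTF_8));
            Connection.Response basic = Jsoup.connect("https://api.example.com/data")
                    .header("Authorization", "Basic " + credentials)
                    .ignoreContentType(true)
                    .execute();

            // API key in custom header
            Connection.Response keyed = Jsoup.connect("https://api.example.com/data")
                    .header("X-API-Key", "your-api-key")
                    .header("X-RapidAPI-Host", "example.rapidapi.com")
                    .ignoreContentType(true)
                    .execute();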

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
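
The Basic scheme used in the example above comes down to Base64-encoding the string username:password and prefixing it with "Basic ". A minimal sketch of that step as a reusable helper, using only the JDK; the class name BasicAuthHeader and its build method are hypothetical (they are not part of Jsoup), and UTF-8 is assumed as the credential charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not part of Jsoup): builds the value for an
// "Authorization" header using the HTTP Basic scheme.
final class BasicAuthHeader {
    private BasicAuthHeader() {}

    static String build(String username, String password) {
        String credentials = username + ":" + password;
        // Use an explicit charset so the result does not depend on the
        // platform default encoding.
        byte[] bytes = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(build("username", "password"));
        // prints "Basic dXNlcm5hbWU6cGFzc3dvcmQ="
    }
}
```

The result can then be passed straight to Jsoup, for example .header("Authorization", BasicAuthHeader.build(user, pass)).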

3. Referer and Origin Headers

For sites that check referrer information:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ReferrerExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com/protected-page")
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
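                    .header("Referer", "https://example.com/login")
                    // Origin is a restricted header in the JDK's HttpURLConnection and may be silently
                    // dropped unless the JVM runs with -Dsun.net.http.allowRestrictedHeaders=true
                    .header("Origin", "https://example.com")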
                    .get();

            System.out.println(doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Advanced Header Management

Multiple Headers with Method Chaining

import org.jsoup.Jsoup;
import org.jsoup.Connection;
import org.jsoup.nodes.Document;
import java.util.Map;
import java.util.HashMap;

public class AdvancedHeaderExample {
    public static void main(String[] args) {
        try {
            // Method 1: Chaining headers
            Connection connection = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Custom Bot 1.0")
                    .header("Accept", "text/html")
                    .header("Accept-Language", "en-US")
                    .header("Custom-Header", "Custom-Value")
                    .timeout(10000);

            Document doc = connection.get();

            // Method 2: Using a map for multiple headers
            Map<String, String> headers = new HashMap<>();
            headers.put("User-Agent", "Custom Bot 1.0");
            headers.put("Accept", "application/json");
            headers.put("X-Requested-With", "XMLHttpRequest");

            Document doc2 = Jsoup.connect("https://api.example.com")
                    .headers(headers)
                    .ignoreContentType(true) // needed if the server answers with JSON, as the Accept header requests
                    .get();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
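
When header sets are built as maps (Method 2 above), it is often convenient to keep one map of defaults and override a few entries per request. The following is a sketch of one way to do that with only the JDK; HeaderSets and merge are hypothetical names, not Jsoup API. It assumes header names should be compared case-insensitively, so "accept" overrides "Accept" instead of creating a second entry:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Jsoup): merges default headers with
// per-request overrides, treating header names case-insensitively.
final class HeaderSets {
    private HeaderSets() {}

    static Map<String, String> merge(Map<String, String> defaults,
                                     Map<String, String> overrides) {
        // CASE_INSENSITIVE_ORDER makes "accept" and "Accept" the same key.
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        merged.putAll(defaults);
        merged.putAll(overrides); // values from overrides replace the defaults
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of(
                "User-Agent", "Custom Bot 1.0",
                "Accept", "text/html");
        Map<String, String> overrides = Map.of("accept", "application/json");

        Map<String, String> headers = merge(defaults, overrides);
        System.out.println(headers.size());        // prints 2
        System.out.println(headers.get("ACCEPT")); // prints application/json
    }
}
```

The merged map can be handed to Jsoup.connect(url).headers(headers). Note that a TreeMap iterates in alphabetical order, not insertion order, so this sketch is a poor fit if the order in which headers are set matters to you.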

POST Requests with Custom Headers

import org.jsoup.Jsoup;
import org.jsoup.Connection;
import org.jsoup.nodes.Document;

public class PostWithHeadersExample {
    public static void main(String[] args) {
        try {
            Connection.Response response = Jsoup.connect("https://example.com/api/submit")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .header("X-Requested-With", "XMLHttpRequest")
                    .header("User-Agent", "Mozilla/5.0 Custom Bot")
                    .data("username", "user")
                    .data("password", "pass")
                    .method(Connection.Method.POST)
                    .execute();

            Document doc = response.parse();
            System.out.println("Response: " + doc.text());

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
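
The Content-Type header in the POST example above declares a form-encoded body, which is the format Jsoup produces from .data() pairs. To make that format concrete, here is a rough JDK-only sketch of form encoding; FormBodySketch is a hypothetical name, and the code illustrates the encoding rules rather than Jsoup's internal implementation:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of application/x-www-form-urlencoded bodies.
final class FormBodySketch {
    private FormBodySketch() {}

    static String encode(Map<String, String> fields) {
        // Each key and value is percent-encoded, pairs are joined with '&'.
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "user");
        fields.put("password", "p&ss word");
        System.out.println(encode(fields));
        // prints "username=user&password=p%26ss+word"
    }
}
```

If the server expects a different body format, such as JSON, the Content-Type header and the body must change together; a mismatch between the two is a common cause of rejected requests.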

Common Header Types

| Header | Purpose | Example |
|--------|---------|---------|
| User-Agent | Identifies the client application | Mozilla/5.0 (Windows NT 10.0; Win64; x64) |
| Accept | Specifies acceptable content types | text/html,application/json |
| Authorization | Authentication credentials | Bearer token or Basic encoded |
| Referer | Previous page URL | https://example.com/previous |
| Content-Type | Format of request body | application/json |
| X-API-Key | Custom API authentication | your-api-key-here |
| Accept-Language | Preferred language | en-US,en;q=0.9 |
| Cookie | Session/state information | sessionid=abc123 |

Error Handling and Best Practices

import org.jsoup.Jsoup;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class RobustHeaderExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Mozilla/5.0 (compatible; Bot 1.0)")
                    .header("Accept", "text/html,application/xhtml+xml")
                    .timeout(10000)
                    .followRedirects(true)
                    .get();

            System.out.println("Success: " + doc.title());

        } catch (HttpStatusException e) {
            System.err.println("HTTP error: " + e.getStatusCode() + " " + e.getUrl());
        } catch (UnsupportedMimeTypeException e) {
            System.err.println("Unsupported MIME type: " + e.getMimeType());
        } catch (IOException e) {
            System.err.println("IO error: " + e.getMessage());
        }
    }
}

Key Points to Remember

  • Always set a User-Agent: Many sites block requests without proper User-Agent headers
  • Respect robots.txt: Check the website's robots.txt file before scraping
  • Rate limiting: Don't overwhelm servers with too many requests
  • Legal compliance: Ensure you comply with the website's terms of service
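  • Header order: Jsoup stores headers in the order you set them, but the underlying HTTP client may add or reorder headers on the wire, so don't rely on a specific order
  • Case sensitivity: HTTP header names are case-insensitive by specification, but a few non-compliant servers expect conventional casing, so use the standard form (e.g. User-Agent)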
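
The rate-limiting point above can be sketched as a small helper that enforces a minimum gap between requests. This is one simple approach under stated assumptions: RequestPacer is a hypothetical class (not part of Jsoup), it is meant for single-threaded use, and the interval you choose should reflect what the target site tolerates:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not part of Jsoup): enforces a minimum gap
// between consecutive requests. Not thread-safe.
final class RequestPacer {
    private final Duration minInterval;
    private Instant lastRequest; // null until the first request is recorded

    RequestPacer(Duration minInterval) {
        this.minInterval = minInterval;
    }

    // How long to wait before a request may be sent at time 'now'.
    Duration delayBefore(Instant now) {
        if (lastRequest == null) {
            return Duration.ZERO;
        }
        Instant earliest = lastRequest.plus(minInterval);
        return now.isBefore(earliest) ? Duration.between(now, earliest) : Duration.ZERO;
    }

    // Records that a request was sent at 'sentAt'.
    void markSent(Instant sentAt) {
        lastRequest = sentAt;
    }

    // Sleeps if the minimum interval has not elapsed, then records the send time.
    void pace() throws InterruptedException {
        Duration wait = delayBefore(Instant.now());
        if (!wait.isZero()) {
            Thread.sleep(wait.toMillis());
        }
        markSent(Instant.now());
    }

    public static void main(String[] args) throws InterruptedException {
        RequestPacer pacer = new RequestPacer(Duration.ofMillis(200));
        for (int i = 1; i <= 3; i++) {
            pacer.pace();
            // A real scraper would issue its Jsoup request here.
            System.out.println("request " + i + " at " + Instant.now());
        }
    }
}
```

Calling pacer.pace() immediately before each Jsoup.connect(...) call keeps requests at least the configured interval apart.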

Testing Your Headers

You can test what headers your request sends using services like httpbin.org:

import org.jsoup.Jsoup;

public class HeaderTesting {
    public static void main(String[] args) {
        try {
            // httpbin.org/headers replies with JSON, so skip Jsoup's content-type
            // check and read the raw body instead of parsing a Document.
            String body = Jsoup.connect("https://httpbin.org/headers")
                    .header("User-Agent", "My Custom Bot")
                    .header("Custom-Header", "Test-Value")
                    .ignoreContentType(true)
                    .execute()
                    .body();

            System.out.println(body); // JSON listing the headers the server received
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This comprehensive approach to setting HTTP headers with Jsoup will help you handle most web scraping scenarios effectively while maintaining good practices.
