Setting Custom HTTP Headers with Jsoup
Jsoup is a Java library for fetching and parsing HTML. When scraping, you often need to set custom HTTP headers to mimic browser requests, authenticate, get past basic anti-bot checks, or satisfy servers that require specific headers.
Basic Header Setting
To set custom HTTP headers with Jsoup, use the header() method on the Connection object before executing the request.
Quick Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Connection;

public class BasicHeaderExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "My Custom Bot 1.0")
                    .header("Accept", "text/html")
                    .get();
            System.out.println(doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
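Because header() only records a value on the pending request, you can inspect what has been configured before anything is sent. The following is a minimal sketch, assuming a recent Jsoup release; the class name InspectHeaders and the header values are illustrative and not part of the original example.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class InspectHeaders {

    // Builds a connection without sending it. Values are illustrative only.
    static Connection build() {
        return Jsoup.connect("https://example.com")
                .header("User-Agent", "First Value")
                .header("User-Agent", "My Custom Bot 1.0") // same name: replaces the earlier value
                .header("Accept", "text/html");
    }

    public static void main(String[] args) {
        Connection.Request request = build().request();
        // Lookup by header name is case-insensitive.
        System.out.println(request.header("user-agent"));
        System.out.println(request.hasHeader("Accept"));
    }
}
```

No network traffic happens here; the request is only sent once get(), post(), or execute() is called.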
Setup and Dependencies
Maven Dependency
Add Jsoup to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Gradle Dependency
For Gradle projects, add to your build.gradle:
implementation 'org.jsoup:jsoup:1.17.2'
Common Header Scenarios
1. Browser Simulation
Send the headers a real browser would send, which reduces the chance of being flagged by simple header checks:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserSimulation {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8")
                    .header("Accept-Language", "en-US,en;q=0.5")
                    // "br" is omitted: Jsoup decodes gzip and deflate responses but not Brotli
                    .header("Accept-Encoding", "gzip, deflate")
                    .header("Connection", "keep-alive")
                    .header("Upgrade-Insecure-Requests", "1")
                    .header("Sec-Fetch-Dest", "document")
                    .header("Sec-Fetch-Mode", "navigate")
                    .header("Sec-Fetch-Site", "none")
                    .header("Cache-Control", "max-age=0")
                    .get();
            System.out.println("Page title: " + doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2. API Authentication
Handle authentication headers:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AuthenticationExample {
    public static void main(String[] args) {
        try {
            // Bearer token authentication
            // ignoreContentType(true) is needed when the API replies with JSON;
            // without it Jsoup rejects non-HTML/XML responses
            Document doc = Jsoup.connect("https://api.example.com/data")
                    .header("Authorization", "Bearer your-api-token-here")
                    .header("Accept", "application/json")
                    .ignoreContentType(true)
                    .get();

            // Basic authentication (alternative method)
            // Encode with an explicit charset instead of the platform default
            Document doc2 = Jsoup.connect("https://api.example.com/data")
                    .header("Authorization", "Basic " +
                            Base64.getEncoder().encodeToString(
                                    "username:password".getBytes(StandardCharsets.UTF_8)))
                    .ignoreContentType(true)
                    .get();

            // API key in custom header
            Document doc3 = Jsoup.connect("https://api.example.com/data")
                    .header("X-API-Key", "your-api-key")
                    .header("X-RapidAPI-Host", "example.rapidapi.com")
                    .ignoreContentType(true)
                    .get();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
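The Basic scheme value is the base64 encoding of username:password. As a small sketch using only the JDK, the encoding step can be pulled into a helper. The class and method names here are hypothetical, and UTF-8 is an assumed charset that the target server would need to accept.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    // Returns the value for an "Authorization" header using the Basic scheme.
    static String basicAuth(String username, String password) {
        String credentials = username + ":" + password;
        // Explicit charset, so the result does not depend on the platform default.
        byte[] bytes = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(basicAuth("username", "password"));
        // prints "Basic dXNlcm5hbWU6cGFzc3dvcmQ="
    }
}
```

The returned string can be passed directly as the second argument of header("Authorization", ...).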
3. Referer and Origin Headers
For sites that check referrer information:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ReferrerExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com/protected-page")
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .header("Referer", "https://example.com/login")
                    .header("Origin", "https://example.com")
                    .get();
            System.out.println(doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
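Jsoup's Connection also has dedicated setters for two of these headers, userAgent() and referrer(), which set the User-Agent and Referer request headers. A brief sketch, reusing the URLs from the example above; the class name is illustrative and nothing is sent over the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConvenienceSetters {

    // Configures the request with the dedicated setters instead of header().
    static Connection build() {
        return Jsoup.connect("https://example.com/protected-page")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .referrer("https://example.com/login");
    }

    public static void main(String[] args) {
        Connection.Request request = build().request();
        // The values are stored as ordinary request headers.
        System.out.println(request.header("Referer"));
        System.out.println(request.header("User-Agent"));
    }
}
```

Which form to use is a matter of taste; header() works for any header name, while the setters save you from typos such as the "Referer" spelling.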
Advanced Header Management
Multiple Headers with Method Chaining
import org.jsoup.Jsoup;
import org.jsoup.Connection;
import org.jsoup.nodes.Document;
import java.util.Map;
import java.util.HashMap;

public class AdvancedHeaderExample {
    public static void main(String[] args) {
        try {
            // Method 1: Chaining headers
            Connection connection = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Custom Bot 1.0")
                    .header("Accept", "text/html")
                    .header("Accept-Language", "en-US")
                    .header("Custom-Header", "Custom-Value")
                    .timeout(10000); // milliseconds
            Document doc = connection.get();

            // Method 2: Using a map for multiple headers
            Map<String, String> headers = new HashMap<>();
            headers.put("User-Agent", "Custom Bot 1.0");
            headers.put("Accept", "application/json");
            headers.put("X-Requested-With", "XMLHttpRequest");
            Document doc2 = Jsoup.connect("https://api.example.com")
                    .headers(headers)
                    .ignoreContentType(true) // needed if the server replies with JSON rather than HTML
                    .get();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
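When many requests share the same headers, a session can hold the defaults so they are not repeated on every call. The sketch below assumes Jsoup 1.14.1 or later, where Jsoup.newSession() and Connection.newRequest() are available; the class name and URLs are illustrative.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class SharedHeadersSession {

    // A session holds defaults that each new request starts from.
    static Connection newSession() {
        return Jsoup.newSession()
                .header("User-Agent", "Custom Bot 1.0")
                .header("Accept-Language", "en-US")
                .timeout(10000);
    }

    public static void main(String[] args) {
        Connection session = newSession();

        // Each request copies the session settings at creation time.
        Connection first = session.newRequest().url("https://example.com/a");
        Connection second = session.newRequest()
                .url("https://example.com/b")
                .header("Accept-Language", "de-DE"); // override for this request only

        System.out.println(first.request().header("Accept-Language"));
        System.out.println(second.request().header("Accept-Language"));
        System.out.println(session.request().header("Accept-Language"));
    }
}
```

Calling get() or execute() on first or second would send the request; the override on second does not change the session defaults.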
POST Requests with Custom Headers
import org.jsoup.Jsoup;
import org.jsoup.Connection;
import org.jsoup.nodes.Document;

public class PostWithHeadersExample {
    public static void main(String[] args) {
        try {
            Connection.Response response = Jsoup.connect("https://example.com/api/submit")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .header("X-Requested-With", "XMLHttpRequest")
                    .header("User-Agent", "Mozilla/5.0 Custom Bot")
                    .data("username", "user")
                    .data("password", "pass")
                    .method(Connection.Method.POST)
                    .execute();
            Document doc = response.parse();
            System.out.println("Response: " + doc.text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
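The form example sends key/value pairs. For an endpoint that expects a raw JSON body instead, one possible approach is requestBody() together with a matching Content-Type header. This is a sketch under assumptions: the endpoint, payload, and class name are hypothetical, and the server is assumed to reply with JSON.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class JsonPostExample {

    // Prepares (but does not send) a POST request carrying a raw JSON body.
    static Connection buildJsonPost(String url, String json) {
        return Jsoup.connect(url)
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .requestBody(json)
                .ignoreContentType(true) // allow a non-HTML (JSON) response
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint and payload.
        Connection.Response response =
                buildJsonPost("https://example.com/api/submit", "{\"username\":\"user\"}").execute();
        System.out.println(response.statusCode());
        System.out.println(response.body()); // raw response text, not parsed as HTML
    }
}
```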
Common Header Types
| Header | Purpose | Example |
|--------|---------|---------|
| User-Agent | Identifies the client application | Mozilla/5.0 (Windows NT 10.0; Win64; x64) |
| Accept | Specifies acceptable content types | text/html,application/json |
| Authorization | Authentication credentials | Bearer token or Basic encoded |
| Referer | Previous page URL | https://example.com/previous |
| Content-Type | Format of request body | application/json |
| X-API-Key | Custom API authentication | your-api-key-here |
| Accept-Language | Preferred language | en-US,en;q=0.9 |
| Cookie | Session/state information | sessionid=abc123 |
Error Handling and Best Practices
import org.jsoup.Jsoup;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class RobustHeaderExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .header("User-Agent", "Mozilla/5.0 (compatible; Bot 1.0)")
                    .header("Accept", "text/html,application/xhtml+xml")
                    .timeout(10000)
                    .followRedirects(true)
                    .get();
            System.out.println("Success: " + doc.title());
        } catch (HttpStatusException e) {
            System.err.println("HTTP error: " + e.getStatusCode() + " " + e.getUrl());
        } catch (UnsupportedMimeTypeException e) {
            System.err.println("Unsupported MIME type: " + e.getMimeType());
        } catch (IOException e) {
            System.err.println("IO error: " + e.getMessage());
        }
    }
}
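An alternative to catching HttpStatusException is to ask Jsoup not to throw on 4xx/5xx responses and to branch on the status code and response headers yourself. The sketch below is one way to do that. Treating 429 and 503 as retryable is an assumption you should adjust for the target site, and the class and method names are hypothetical.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class StatusAwareFetch {

    // Assumption: 429 and 503 are treated as temporary and worth retrying later.
    static boolean isRetryable(int statusCode) {
        return statusCode == 429 || statusCode == 503;
    }

    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup.connect("https://example.com")
                .header("User-Agent", "Mozilla/5.0 (compatible; Bot 1.0)")
                .timeout(10000)
                .ignoreHttpErrors(true) // return 4xx/5xx responses instead of throwing
                .execute();

        int status = response.statusCode();
        if (status == 200) {
            System.out.println("Success: " + response.parse().title());
        } else if (isRetryable(status)) {
            // Retry-After is optional; header() returns null when it is absent.
            System.err.println("Retry later, Retry-After: " + response.header("Retry-After"));
        } else {
            System.err.println("Giving up, HTTP " + status);
        }
    }
}
```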
Key Points to Remember
- Always set a User-Agent: Many sites block requests without proper User-Agent headers
- Respect robots.txt: Check the website's robots.txt file before scraping
- Rate limiting: Don't overwhelm servers with too many requests
- Legal compliance: Ensure you comply with the website's terms of service
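- Header order: Some servers are sensitive to header order. Jsoup records headers in the order you set them, but the underlying HTTP client may add or reorder headers on the wire, so do not rely on exact ordering
- Case sensitivity: HTTP header names are case-insensitive, but some servers are picky, so use the conventional capitalization (for example, User-Agent)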
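The rate-limiting point can be sketched with a small helper that spaces requests a minimum interval apart. This uses only the JDK. The two-second interval, the URLs, and the names are assumptions for illustration, and the actual Jsoup call is left as a placeholder.

```java
import java.util.List;

public class PoliteDelay {

    // Milliseconds still to wait so that requests are at least minIntervalMs apart.
    static long remainingDelay(long lastRequestAtMs, long nowMs, long minIntervalMs) {
        long elapsed = nowMs - lastRequestAtMs;
        return Math.max(0, minIntervalMs - elapsed);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        long minIntervalMs = 2000; // assumed politeness interval
        long lastRequestAt = 0;    // no earlier request, so the first one is not delayed

        for (String url : urls) {
            Thread.sleep(remainingDelay(lastRequestAt, System.currentTimeMillis(), minIntervalMs));
            lastRequestAt = System.currentTimeMillis();
            System.out.println("Would fetch " + url); // replace with a Jsoup request
        }
    }
}
```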
Testing Your Headers
You can test what headers your request sends using services like httpbin.org:
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class HeaderTesting {
    public static void main(String[] args) {
        try {
            // httpbin.org/headers responds with JSON, so skip Jsoup's content-type check
            // and read the raw body instead of parsing it as HTML
            Connection.Response response = Jsoup.connect("https://httpbin.org/headers")
                    .header("User-Agent", "My Custom Bot")
                    .header("Custom-Header", "Test-Value")
                    .ignoreContentType(true)
                    .execute();
            System.out.println(response.body()); // Shows all headers sent
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
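If you would rather not depend on an external service, headers can also be checked against a throwaway local server. The sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer to echo one request header back as HTML. The class and method names are hypothetical, and the echoed value is not HTML-escaped, so it suits simple test values only.

```java
import com.sun.net.httpserver.HttpServer;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LocalHeaderEcho {

    // Starts a local server, sends one Jsoup request, and returns the header value the server saw.
    static String echoHeader(String name, String value) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0); // port 0 = any free port
        server.createContext("/", exchange -> {
            // Echo the received header value back inside a small HTML page.
            String seen = exchange.getRequestHeaders().getFirst(name);
            byte[] body = ("<html><body><p id=\"seen\">" + seen + "</p></body></html>")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            Document doc = Jsoup.connect("http://127.0.0.1:" + port + "/")
                    .header(name, value)
                    .get();
            return doc.getElementById("seen").text();
        } finally {
            server.stop(0); // always shut the server down
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(echoHeader("Custom-Header", "Test-Value"));
    }
}
```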
Together, these techniques cover most header-related needs when scraping with Jsoup while keeping your requests well-behaved.