How do I set custom HTTP headers with jsoup?

Jsoup is a Java library for fetching and parsing HTML. When scraping web pages, you may need to set custom HTTP headers to mimic a browser request, handle authentication, or otherwise control how the web server responds.

To set custom HTTP headers with Jsoup, you'll use the header() method of the Connection object before executing the request. Here's a step-by-step example of how to do this:

  1. Include Jsoup in your project. If you're using Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version> <!-- Check for the latest version on https://jsoup.org/download -->
</dependency>
  2. Use the Jsoup.connect() method to create a connection to the desired URL.

  3. Use the header() method on the Connection object to set custom headers.

  4. Execute the request using the get() or post() methods, depending on the type of request you want to make.

Here's an example of setting custom HTTP headers using Jsoup in Java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Connection;

public class JsoupSetHeadersExample {
    public static void main(String[] args) {
        try {
            // The URL you want to connect to
            String url = "https://example.com";

            // Create the connection and set custom headers
            Connection connection = Jsoup.connect(url)
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
                    .header("Accept-Language", "en-US,en;q=0.5")
                    // Add any other headers you need here
                    .header("Custom-Header", "Custom-Value");

            // Execute the request and retrieve the response document
            Document document = connection.get(); // or use .post() for POST requests

            // Do something with the document
            System.out.println(document.title());

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This Java code snippet sets up a connection to "https://example.com" with custom HTTP headers, including a custom User-Agent, Accept, Accept-Language, and a Custom-Header. It then executes a GET request and prints out the title of the HTML document.
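
If the same headers are needed for many requests, chaining header() calls each time gets repetitive. One option is to build the header set once as a Map and pass it to jsoup's Connection.headers(Map<String, String>) method. The sketch below shows only the map-building part so that it compiles without jsoup; the HeaderSets class and browserHeaders() method are hypothetical names chosen for illustration, not part of jsoup.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of jsoup): one place to define a reusable header set.
final class HeaderSets {
    private HeaderSets() {}

    /** Returns the browser-like headers from the example above as a read-only map. */
    static Map<String, String> browserHeaders() {
        // LinkedHashMap keeps the headers in the order they were added.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        headers.put("Custom-Header", "Custom-Value");
        return Collections.unmodifiableMap(headers);
    }

    // With jsoup on the classpath, the map can be applied in one call:
    //   Document doc = Jsoup.connect("https://example.com")
    //           .headers(HeaderSets.browserHeaders())
    //           .get();
}
```

Keeping the headers in one method means a change to, say, the User-Agent string only has to be made in one place.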

Remember to check the website's robots.txt file and terms of service before scraping to ensure that you're allowed to scrape their pages and that you respect their scraping policies. Additionally, make sure not to overload the website's servers with too many requests in a short period of time.
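
As a rough illustration of spacing out requests, the sketch below enforces a minimum interval between calls using only the Java standard library. PoliteThrottle is a hypothetical helper, not a jsoup feature, and the right interval is an assumption that depends on the site you are requesting from.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of jsoup): enforces a minimum gap between requests.
final class PoliteThrottle {
    private final long minIntervalNanos;
    private long nextAllowedNanos;

    PoliteThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextAllowedNanos = System.nanoTime(); // the first call is allowed immediately
    }

    /** Blocks until the minimum interval has passed since the previous call returned. */
    synchronized void await() {
        long waitNanos = nextAllowedNanos - System.nanoTime();
        if (waitNanos > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can react to it.
                Thread.currentThread().interrupt();
            }
        }
        nextAllowedNanos = System.nanoTime() + minIntervalNanos;
    }
}
```

In a scraping loop, you would create one instance (for example new PoliteThrottle(1000) for roughly one request per second) and call await() immediately before each connection.get().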
