What is jsoup and what is it used for?

jsoup is a Java library for parsing and working with real-world HTML. It provides a convenient API for fetching URLs and for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

What is jsoup used for?

  • Scraping and Parsing HTML: jsoup is widely used for scraping data from websites. It can parse HTML from a URL, file, or string, and then you can use its API to extract data, making it a handy tool for web scraping tasks.
  • Cleaning HTML: jsoup can sanitize user-submitted content against a safelist of permitted tags and attributes, which helps prevent XSS attacks.
  • Manipulating HTML: jsoup allows you to change the HTML elements, attributes, and text.
  • Extracting Data: You can extract and manipulate data using DOM traversal or CSS selectors to get specific elements from the document.
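
As a rough illustration of the parsing and extraction uses listed above, the sketch below parses an HTML string and pulls out link text and targets with a CSS selector. The class name SelectorSketch, the helper extractLinks, and the sample markup are made up for this example; Jsoup.parse, select, text, and attr are the jsoup calls being shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {
    // Hypothetical helper: returns "text -> href" for every anchor that has an href attribute.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // "a[href]" is a CSS selector matching <a> elements that carry an href attribute
        for (Element a : doc.select("a[href]")) {
            links.add(a.text() + " -> " + a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch
        String html = "<ul><li><a href='/docs'>Docs</a></li><li><a href='/blog'>Blog</a></li></ul>";
        extractLinks(html).forEach(System.out::println);
    }
}
```

The same select call works on a Document fetched from a URL or loaded from a file, so the extraction logic does not need to know where the HTML came from.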

Example Usage of jsoup

The following example demonstrates basic usage of jsoup to fetch the title of a webpage:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class JsoupExample {
    public static void main(String[] args) {
        try {
            // Connect to the URL and fetch the page as a parsed Document
            Document doc = Jsoup.connect("http://example.com/").get();

            // Extract the contents of the page's <title> element
            String title = doc.title();
            System.out.println("Title of the page: " + title);

        } catch (IOException e) {
            // get() throws IOException on network errors and HTTP error responses
            e.printStackTrace();
        }
    }
}

In this example, Jsoup.connect creates a connection to the specified URL, and get() fetches the page and parses it into a Document. We then call doc.title() to read the title of the web page.
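
For real sites you will often want to set a user agent and a timeout and to tell HTTP errors apart from other failures. The following is a minimal sketch of that, not a recommended configuration: the class name FetchSketch, the helper prepare, the user agent string, and the 10-second timeout are all assumptions chosen for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchSketch {
    // Hypothetical helper: builds a request without sending it.
    // The user agent string and timeout are illustrative values only.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("example-bot/0.1 (hypothetical)")
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        try {
            // The request is only sent when get() is called
            Document doc = prepare("http://example.com/").get();
            System.out.println("Title of the page: " + doc.title());
        } catch (HttpStatusException e) {
            // Thrown when the server answers with an HTTP error status
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            // Covers timeouts, DNS failures, and other I/O problems
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```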

jsoup Features

  • DOM navigation: Navigate the HTML document using methods such as child(int), children(), parent(), and nextElementSibling().
  • CSS Selectors: Use CSS selectors to find elements.
  • Element manipulation: Modify the HTML elements, attributes, and text.
  • Text extraction: Extract and manipulate the text content of elements.
  • HTML cleaning: Clean user-submitted content against a safelist to help prevent XSS attacks.
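
The manipulation and cleaning features above can be sketched as follows. This assumes jsoup 1.14.2 or later, where the Safelist class is available. The class name ManipulateAndCleanSketch, the helpers addNofollow and sanitize, and the sample inputs are invented for this example, and Safelist.basic() is only one of several built-in safelists, so treat the choice as illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class ManipulateAndCleanSketch {
    // Hypothetical helper: adds rel="nofollow" to every link and returns the rewritten body HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        // attr(name, value) on the selected elements sets the attribute on each of them
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Hypothetical helper: removes any tags and attributes the basic safelist does not permit.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Sample inputs invented for this sketch
        System.out.println(addNofollow("<p>See <a href='/docs'>the docs</a></p>"));
        System.out.println(sanitize("<p>Hello <script>alert(1)</script><b>world</b></p>"));
    }
}
```

Cleaning works on a body fragment and returns a new HTML string, so the untrusted input itself is never modified in place.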

Getting Started with jsoup

To use jsoup, you can include it in your Maven or Gradle build:

Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Gradle:

implementation 'org.jsoup:jsoup:1.14.3'

Alternatively, you can download the jar file directly from the jsoup website and include it in your project's classpath.

Please note that web scraping should be done responsibly and in compliance with the website's terms of service and relevant laws. Always check the website's robots.txt file and terms of service to understand what is and isn't allowed. Excessive requests can also overload a site's servers and may lead to your IP address being blocked, so be respectful and considerate when writing your web scraping tools.
