How do I parse an HTML string with jsoup?

Parsing an HTML string with jsoup is straightforward. jsoup is a Java library commonly used to parse HTML. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Here's a simple example of how to parse an HTML string with jsoup in Java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) {
        String htmlString = "<html><head><title>Sample Title</title></head>"
                          + "<body><p>Parsed HTML into a doc.</p></body></html>";

        // Parse the HTML string with jsoup
        Document doc = Jsoup.parse(htmlString);

        // Get the title from the parsed HTML
        String title = doc.title();
        System.out.println("Title: " + title);

        // Get the text of the body
        String bodyText = doc.body().text();
        System.out.println("Body text: " + bodyText);

        // If you want to select elements using CSS selectors
        Elements paragraphs = doc.select("p");
        for (Element paragraph : paragraphs) {
            System.out.println("Paragraph text: " + paragraph.text());
        }
    }
}

In the above example, we first import the necessary classes from the jsoup library. We then define an HTML string that we want to parse. Using the Jsoup.parse method, we parse the HTML string into a Document object, which represents the entire HTML document in a structured form.
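Because Jsoup.parse always builds a complete document, it also accepts incomplete or sloppy markup. The following is a minimal sketch of that behavior, not part of the original example: the class name ParseFragmentSketch and the helper paragraphTexts are hypothetical names chosen for illustration, and the input string is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class ParseFragmentSketch {
    // Hypothetical helper: parses the given HTML and returns the text of each <p> element.
    public static List<String> paragraphTexts(String html) {
        // Jsoup.parse wraps the input in html/head/body and closes unclosed tags
        Document doc = Jsoup.parse(html);
        return doc.select("p").eachText();
    }

    public static void main(String[] args) {
        // Neither <p> tag is closed, and there is no <html> or <body> wrapper
        String sloppyHtml = "<p>First paragraph<p>Second paragraph";

        for (String text : paragraphTexts(sloppyHtml)) {
            System.out.println("Paragraph text: " + text);
        }
    }
}
```

If you only ever have a fragment intended for the body, Jsoup.parseBodyFragment(String) is an alternative entry point that also returns a Document.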

Once we have the Document object, we can use it to perform various operations:

  • doc.title() extracts the title of the HTML page.
  • doc.body().text() returns the combined text of everything inside the <body> tag, with the markup stripped.
  • doc.select("p") uses a CSS selector to get all <p> elements within the document.
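CSS selectors can also match on attributes, which is a common next step after parsing. The sketch below is an illustration under assumptions rather than part of the original example: the class name LinkSelectorSketch, the helper absoluteLinks, and the example.com base URL are all hypothetical. It uses the two-argument Jsoup.parse(html, baseUri) overload so that relative links can be resolved with absUrl.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkSelectorSketch {
    // Hypothetical helper: returns the absolute URL of every <a> element that has an href attribute.
    public static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI is used to resolve relative links such as "/docs"
        Document doc = Jsoup.parse(html, baseUri);

        List<String> links = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            links.add(link.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs\">Docs</a> and "
                    + "<a href=\"https://jsoup.org/\">jsoup</a></p>";

        // Assumed base URL, for illustration only
        for (String url : absoluteLinks(html, "https://example.com/")) {
            System.out.println("Link: " + url);
        }
    }
}
```

Without a base URI, absUrl returns an empty string for relative links, so pass one whenever the HTML string came from a known page.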

After running the code, the output will be as follows:

Title: Sample Title
Body text: Parsed HTML into a doc.
Paragraph text: Parsed HTML into a doc.

Make sure to include jsoup in your project's dependencies. If you're using Maven, you can add the following dependency to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- Use the latest version available -->
</dependency>

If you're using Gradle, add this to your build.gradle:

dependencies {
    implementation 'org.jsoup:jsoup:1.14.3' // Use the latest version available
}

The version shown here (1.14.3) may be out of date, so check for the current jsoup release before adding it to your project.
