How can I select elements using CSS selectors in jsoup?

jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

To select elements using CSS selectors in jsoup, call the select method on a Document or Element object. The method takes a CSS selector string as its argument and returns an Elements object, which is a list of the Element instances that match the selector.

Here is a step-by-step guide on how to use CSS selectors with jsoup:

  1. Include jsoup in your project: If you're using Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- Use the latest version available -->
</dependency>

For Gradle, add this to your build.gradle:

implementation 'org.jsoup:jsoup:1.14.3'

Or if you are not using a build system, download the jar file from the official website and add it to your project's classpath.

  2. Parse an HTML string or load from a URL: You can parse an HTML string directly or load content from a URL to get a Document object.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        // Or load from a URL
        // Document doc = Jsoup.connect("http://example.com").get();
    }
}
  3. Select elements using CSS selectors: Use the select method with the CSS selector string to get elements.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com").get();

        // Select all the anchor tags
        Elements links = doc.select("a");

        // Select elements with a specific class
        Elements elementsWithClass = doc.select(".myclass");

        // Select elements with a specific ID
        Element elementWithId = doc.select("#myid").first(); // ID should be unique

        // Select elements that contain a certain text
        Elements elementsContainingText = doc.select("div:contains(Your Text Here)");

        // Iterate over the results
        for (Element link : links) {
            System.out.println("Link: " + link.attr("href"));
            System.out.println("Text: " + link.text());
        }
    }
}
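
The select method is also available on an Element, which limits the search to that element and its descendants. The following is a minimal sketch of that scoped usage; the class name, the countLinksIn helper, and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScopedSelectExample {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
            "<div id=\"nav\"><a href=\"/home\">Home</a><a href=\"/about\">About</a></div>"
          + "<div id=\"footer\"><a href=\"/contact\">Contact</a></div>";

    // Counts the anchor tags inside the element with the given id.
    // Returns 0 when no element has that id.
    static int countLinksIn(String id) {
        Document doc = Jsoup.parse(HTML);
        Element container = doc.getElementById(id);
        if (container == null) {
            return 0;
        }
        // The search starts at container, so links elsewhere in the
        // document are not included.
        Elements links = container.select("a");
        return links.size();
    }

    public static void main(String[] args) {
        System.out.println("Links in nav: " + countLinksIn("nav"));
        System.out.println("Links in footer: " + countLinksIn("footer"));
    }
}
```

Scoping the query this way can be easier to read than one long selector when you already hold a reference to the container element.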

jsoup's CSS selector syntax is similar to that of CSS and jQuery. Here are some common patterns:

  • tag: selects all elements with the given tag name, e.g., a for anchor tags.
  • .class: selects all elements with the given class name.
  • #id: selects the element with the given ID.
  • [attribute]: selects all elements with the given attribute.
  • ancestor descendant: selects all descendants of ancestor that match the descendant selector.
  • parent > child: selects child elements that are direct children of parent.
  • prev + next: selects each next element whose immediately preceding sibling is prev.
  • prev ~ siblings: selects all sibling elements that follow prev and share its parent.
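
To make the patterns above concrete, here is a small sketch that runs several of them against one HTML fragment and counts the matches. The fragment, the class name, and the count helper are assumptions made up for this example, not part of jsoup.

```java
import org.jsoup.Jsoup;

public class SelectorPatternsExample {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
            "<div id=\"main\">"
          + "<h2>Title</h2>"
          + "<p class=\"intro\">First</p>"
          + "<p>Second</p>"
          + "<ul><li><a href=\"/a\">A</a></li><li><span>no link</span></li></ul>"
          + "</div>";

    // Parses the sample markup and returns how many elements match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "p",        // tag
            ".intro",   // class
            "#main",    // id
            "[href]",   // attribute present
            "div li",   // descendant at any depth
            "div > p",  // direct child only
            "div > li", // li is nested inside ul, so it is not a direct child of div
            "h2 + p",   // the p immediately after h2
            "h2 ~ p"    // every p sibling that follows h2
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Comparing "div li" with "div > li", and "h2 + p" with "h2 ~ p", shows the difference between the descendant and child combinators and between the two sibling combinators.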

Remember, the select method returns an Elements collection, which you can iterate over or narrow down with methods like first() to get the first matched element. When nothing matches, select returns an empty collection and first() returns null, so if you expect a single result, check for null before using it to avoid a NullPointerException.
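
One way to handle the not-found case is sketched below. It uses selectFirst, which returns the first match or null, and falls back to a default value. The textOrDefault helper and the class name are hypothetical, and selectFirst assumes a reasonably recent jsoup release such as the 1.14.3 version used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeSelectExample {
    // Returns the text of the first element matching the selector,
    // or the fallback value when nothing matches.
    static String textOrDefault(Document doc, String selector, String fallback) {
        Element match = doc.selectFirst(selector); // null when there is no match
        return (match != null) ? match.text() : fallback;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div id=\"myid\">Hello</div>");
        System.out.println(textOrDefault(doc, "#myid", "(not found)"));
        System.out.println(textOrDefault(doc, "#missing", "(not found)"));
    }
}
```

The same null check applies if you prefer doc.select(selector).first(); alternatively, test the Elements collection with isEmpty() before reading from it.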
