jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
To select elements using CSS selectors in jsoup, call the select method on a Document or Element object. The select method takes a CSS selector string as its argument and returns an Elements collection (the Elements class) containing every element that matches the selector.
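Because select is available on both Document and Element, the same selector can be scoped to the whole page or to one subtree. The following is a minimal sketch of that difference; the class name, the sample markup, and the ids "nav" and "main" are hypothetical and chosen only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectScopeExample {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
          + "<div id=\"main\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>";

    // Counts every anchor in the whole document.
    static int countAllLinks() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a").size();
    }

    // Counts only the anchors inside the element with id "main".
    static int countMainLinks() {
        Document doc = Jsoup.parse(HTML);
        Element main = doc.getElementById("main");
        Elements links = main.select("a");
        return links.size();
    }

    public static void main(String[] args) {
        System.out.println("All links: " + countAllLinks());
        System.out.println("Links in #main: " + countMainLinks());
    }
}
```

Calling select on an Element searches that element and its descendants, which is a convenient way to narrow a query once a container has been located.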
Here is a step-by-step guide on how to use CSS selectors with jsoup:
- Include jsoup in your project: If you're using Maven, add the following dependency to your pom.xml:
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version> <!-- Use the latest version available -->
    </dependency>
For Gradle, add this to your build.gradle:
    implementation 'org.jsoup:jsoup:1.14.3'
If you are not using a build system, download the jar file from the official website and add it to your project's classpath.
- Parse an HTML string or load from a URL: You can parse an HTML string directly or load content from a URL to get a Document object.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupExample {
        public static void main(String[] args) {
            String html = "<html><head><title>First parse</title></head>"
                    + "<body><p>Parsed HTML into a doc.</p></body></html>";
            Document doc = Jsoup.parse(html);
            // Or load from a URL (get() throws IOException, so declare or catch it)
            // Document doc = Jsoup.connect("http://example.com").get();
        }
    }
- Select elements using CSS selectors: Use the select method with the CSS selector string to get elements.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://example.com").get();
            // Select all the anchor tags
            Elements links = doc.select("a");
            // Select elements with a specific class
            Elements elementsWithClass = doc.select(".myclass");
            // Select the element with a specific ID (IDs should be unique);
            // first() returns null if nothing matched
            Element elementWithId = doc.select("#myid").first();
            // Select div elements that contain a certain text
            Elements elementsContainingText = doc.select("div:contains(Your Text Here)");
            // Iterate over the results
            for (Element link : links) {
                System.out.println("Link: " + link.attr("href"));
                System.out.println("Text: " + link.text());
            }
        }
    }
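The parse and select steps above can also be tried without network access. The sketch below parses an inline HTML string and then queries it; the class name, helper methods, and sample markup (including the "/one" and "/two" links) are hypothetical and exist only to make the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineSelectExample {
    // Hypothetical sample page, used only for illustration.
    static final String HTML =
            "<html><head><title>First parse</title></head><body>"
          + "<p class=\"myclass\">Parsed HTML into a doc.</p>"
          + "<a href=\"/one\">One</a><a href=\"/two\">Two</a>"
          + "</body></html>";

    // Parses the sample page and returns its <title> text.
    static String pageTitle() {
        Document doc = Jsoup.parse(HTML);
        return doc.title();
    }

    // Collects the href attribute of every anchor, in document order.
    static List<String> linkHrefs() {
        Document doc = Jsoup.parse(HTML);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println("Title: " + pageTitle());
        System.out.println("Links: " + linkHrefs());
    }
}
```

Working against a fixed string like this makes it easier to check a selector before pointing it at a live page.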
jsoup's CSS selector syntax is similar to that of CSS and jQuery. Here are some common patterns:
- tag: selects all elements with the given tag name, e.g., a for anchor tags.
- .class: selects all elements with the given class name.
- #id: selects the element with the given ID.
- [attribute]: selects all elements with the given attribute.
- ancestor descendant: selects all descendants of ancestor that match the descendant selector.
- parent > child: selects all elements that are direct children of parent.
- prev + next: selects all next elements that are immediately preceded by a sibling prev.
- prev ~ siblings: selects all sibling elements that follow the prev element and have the same parent.
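To see how these patterns differ in practice, the sketch below runs each one against a small fixed document and counts the matches. The class name, the count helper, and the markup are hypothetical; the markup is written without whitespace between tags so that the structure is easy to reason about.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorPatternsExample {
    // Hypothetical sample markup: a div holding a heading, two direct
    // paragraphs, and one paragraph nested inside a section, followed
    // by one anchor with an href attribute and one without.
    static final String HTML =
            "<div id=\"box\" class=\"panel\">"
          + "<h2>Title</h2><p>First</p><p>Second</p>"
          + "<section><p>Nested</p></section>"
          + "</div>"
          + "<a href=\"/x\">with href</a><a>no href</a>";

    // Returns how many elements in the sample document match the selector.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "p", ".panel", "#box", "[href]",
            "div p", "div > p", "h2 + p", "h2 ~ p"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

Here "div p" matches all three paragraphs, while "div > p" skips the one nested inside the section. Likewise "h2 + p" matches only the paragraph directly after the heading, whereas "h2 ~ p" matches both sibling paragraphs.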
Remember, the select method returns an Elements collection, which you can iterate over, or use methods like first() to get the first matched element. When nothing matches, select returns an empty Elements collection and first() returns null, so if you expect only one result, check for null before using it to avoid a NullPointerException.
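One way to handle the missing-element case is to wrap the lookup in a small helper that supplies a fallback value. This is only a sketch: the class name, the firstText helper, and its fallback parameter are hypothetical and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeFirstExample {
    // Returns the text of the first element matching cssQuery,
    // or the fallback when the selector matches nothing.
    static String firstText(String html, String cssQuery, String fallback) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first(); // null when there is no match
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        String html = "<p id=\"myid\">Hello</p>";
        System.out.println(firstText(html, "#myid", "(not found)"));
        System.out.println(firstText(html, "#missing", "(not found)"));
    }
}
```

Keeping the null check in one place means callers never dereference a missing element directly.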