Yes, to a limited extent. jsoup selectors are based on CSS selectors, and CSS itself has no regular-expression syntax, but jsoup adds a few regex-aware extensions: the :matches(regex) and :matchesOwn(regex) pseudo-selectors test an element's text, and the [attr~=regex] attribute selector tests an attribute value. These only decide whether an element is selected; they do not give you access to the matched substring or to capture groups.
If you need that extra control, use jsoup's CSS selector syntax to narrow down the set of elements first, and then apply Java's own regular expression classes to each element.
Here's an example of how you might do this in Java:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsoupWithRegex {
    public static void main(String[] args) {
        String html = "<div><p>Data 123</p><p>Other 456</p><p>Last 789</p></div>";
        Document doc = Jsoup.parse(html);

        // First, use jsoup to select the elements you're interested in
        Elements elements = doc.select("p");

        // Define your regular expression
        Pattern pattern = Pattern.compile("\\d{3}");

        for (Element element : elements) {
            // For each element, get the text and apply the regular expression
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                // If the regular expression matches, do something with the element
                System.out.println("Matched text: " + matcher.group());
            }
        }
    }
}
In this example, we're selecting all <p> elements and then using Java's Pattern and Matcher classes to find sequences of three digits within the text of each element.
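The main reason to do the matching in Java rather than in the selector is access to capture groups. Here is a minimal sketch of that idea; the class name RegexExtract and the helper extract are hypothetical names made up for this illustration, not part of jsoup, and the helper assumes the pattern defines at least one capture group.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Hypothetical helper (not a jsoup API): for every element selected by
    // cssQuery whose text matches the pattern, collect capture group 1.
    // Assumes the pattern has at least one capture group.
    public static List<String> extract(Document doc, String cssQuery, Pattern pattern) {
        List<String> results = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            Matcher matcher = pattern.matcher(element.text());
            if (matcher.find()) {
                results.add(matcher.group(1));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<div><p>Data 123</p><p>Other 456</p><p>No digits</p></div>";
        Document doc = Jsoup.parse(html);

        // Only paragraphs of the form "Data <number>" match; the number is captured
        Pattern labelled = Pattern.compile("^Data (\\d+)$");
        System.out.println(extract(doc, "p", labelled));
    }
}
```

Keeping the selector and the pattern as separate arguments lets jsoup do the structural narrowing while the regular expression only has to describe the text.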
If you only need to filter elements by their text content, jsoup's own text-matching pseudo-selectors, :contains and :matches, may be enough. :contains does a plain, case-insensitive substring match, while :matches takes a regular expression:
:contains(text): selects elements that contain the specified text.
:matches(regex): selects elements whose text matches the specified regular expression.
Here's an example using the :matches pseudo-selector:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupMatchesExample {
    public static void main(String[] args) {
        String html = "<div><p>Data 123</p><p>Other 456</p><p>Last 789</p></div>";
        Document doc = Jsoup.parse(html);

        // Use jsoup's :matches pseudo-selector to select elements matching a regex
        Elements elements = doc.select("p:matches(\\d{3})");

        for (Element element : elements) {
            System.out.println("Matched element: " + element);
        }
    }
}
This will select <p> elements whose text contains a sequence of three digits.
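One detail that is easy to miss: :matches tests an element's text including the text of its descendants, whereas the related :matchesOwn pseudo-selector tests only the element's own text. The sketch below illustrates the difference; the class name MatchesVsMatchesOwn and the count helper are hypothetical names used only for this example.

```java
import org.jsoup.Jsoup;

public class MatchesVsMatchesOwn {
    // Hypothetical helper: number of elements in the parsed HTML selected by cssQuery
    public static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // The digits live in the <p>, not directly in the <div>
        String html = "<div>Summary <p>Data 123</p></div>";

        // :matches sees the combined text "Summary Data 123", so the <div> is selected
        System.out.println(count(html, "div:matches(\\d{3})"));

        // :matchesOwn sees only "Summary", so the <div> is not selected
        System.out.println(count(html, "div:matchesOwn(\\d{3})"));

        // The <p> itself owns the digits, so it is selected
        System.out.println(count(html, "p:matchesOwn(\\d{3})"));
    }
}
```

If a selector like div:matches(...) returns more ancestors than expected, switching to :matchesOwn is usually the first thing to try.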
Remember that the selector is written as a Java string literal, so every backslash in the regular expression has to be escaped: the regex \d{3} is written as "\\d{3}" in Java source.
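The same escaping rule applies to jsoup's attribute selector [attr~=regex], which tests an attribute value against a regular expression. The following is a small sketch under the assumption that you want image elements whose src ends in a common image extension; the class name AttributeRegexExample and the helper imageSources are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class AttributeRegexExample {
    // Hypothetical helper: src values of <img> elements whose src contains
    // ".png", ".jpg" or ".jpeg", ignoring case because of the (?i) flag
    public static List<String> imageSources(String html) {
        Document doc = Jsoup.parse(html);
        // The regex \.(png|jpe?g) needs its backslash doubled in Java source
        return doc.select("img[src~=(?i)\\.(png|jpe?g)]").eachAttr("src");
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.PNG\"><img src=\"b.gif\"><img src=\"c.jpeg\">";
        System.out.println(imageSources(html));
    }
}
```

As with :matches, the pattern only needs to be found somewhere in the attribute value; add anchors such as ^ and $ to the expression if you need a full match.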