How can I select elements using CSS selectors in jsoup?

jsoup is a Java library for working with real-world HTML. It provides an API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. CSS selector support is exposed through the select() method, which is available on Document, Element, and Elements.

Getting Started with jsoup

1. Add jsoup to Your Project

Maven (pom.xml):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>

Gradle (build.gradle):

implementation 'org.jsoup:jsoup:1.17.2'

2. Parse HTML Content

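import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The three options below are alternatives; each needs its own variable name
// if they share a scope. Later snippets use doc, and any Document works the same way.

// Parse HTML string
String html = "<html><body><div class='content'><p>Hello World</p></div></body></html>";
Document doc = Jsoup.parse(html);

// Load from URL (throws IOException on network or HTTP errors)
Document fromUrl = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0")
    .timeout(5000) // milliseconds
    .get();

// Parse from file (throws IOException if the file cannot be read)
Document fromFile = Jsoup.parse(new File("example.html"), "UTF-8");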
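
When HTML comes from a string rather than a URL, relative links have nothing to resolve against unless a base URI is supplied. The sketch below assumes a made-up fixture and base URI (https://example.com/), and the class and method names are hypothetical; it uses the two-argument Jsoup.parse(html, baseUri) together with absUrl().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {
    // Parses an HTML string with a base URI so relative links can be resolved.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        // absUrl returns an empty string when no absolute URL can be built
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href='/docs/start'>Docs</a></p>";
        // prints https://example.com/docs/start
        System.out.println(resolveFirstLink(html, "https://example.com/"));
    }
}
```

Documents loaded with Jsoup.connect(...).get() already carry the page URL as their base URI, so this mainly matters for strings and files.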

Basic CSS Selectors

Tag Selectors

// Select all paragraphs
Elements paragraphs = doc.select("p");

// Select all images
Elements images = doc.select("img");

// Select all form inputs
Elements inputs = doc.select("input");

Class Selectors

// Select elements with specific class
Elements content = doc.select(".content");
Elements highlights = doc.select(".highlight");

// Multiple classes (elements with both classes)
Elements items = doc.select(".item.active");

ID Selectors

// Select element by ID
Element header = doc.select("#header").first();
Element mainContent = doc.select("#main-content").first();

// Safe ID selection with null check
Element nav = doc.select("#navigation").first();
if (nav != null) {
    System.out.println("Navigation found: " + nav.text());
}
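
To see the tag, class, and ID selectors above side by side, here is a small self-contained sketch. The HTML fixture and the class name BasicSelectorSketch are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BasicSelectorSketch {
    // Hypothetical fixture: three list items with different class combinations
    static final String HTML =
        "<div id='header'>Site</div>"
        + "<ul>"
        + "<li class='item active'>One</li>"
        + "<li class='item'>Two</li>"
        + "<li class='active'>Three</li>"
        + "</ul>";

    // Returns how many elements in the fixture match the given selector.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("li"));           // 3: every list item
        System.out.println(count(".item"));        // 2: items carrying class "item"
        System.out.println(count(".item.active")); // 1: only the item with both classes

        Document doc = Jsoup.parse(HTML);
        // getElementById is an alternative to select("#header").first()
        Element header = doc.getElementById("header");
        System.out.println(header.text());         // Site
    }
}
```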

Attribute Selectors

// Elements with specific attribute
Elements links = doc.select("a[href]");
Elements required = doc.select("input[required]");

// Attribute with specific value
Elements externalLinks = doc.select("a[target=_blank]");
Elements emailInputs = doc.select("input[type=email]");

// Attribute contains value
Elements images = doc.select("img[src*='.jpg']");
Elements socialLinks = doc.select("a[href*='facebook.com']");

// Attribute starts with value
Elements httpsLinks = doc.select("a[href^='https://']");

// Attribute ends with value
Elements pdfLinks = doc.select("a[href$='.pdf']");
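
The attribute operators are easier to compare against a fixed input. The following sketch runs them over a made-up set of four anchors (the fixture and class name are assumptions, not part of any real page).

```java
import org.jsoup.Jsoup;

class AttributeSelectorSketch {
    // Hypothetical fixture: three links with href, one anchor without
    static final String HTML =
        "<a href='https://example.com/report.pdf'>Report</a>"
        + "<a href='http://example.com/page'>Page</a>"
        + "<a href='mailto:hi@example.com'>Mail</a>"
        + "<a name='anchor'>No href</a>";

    // Returns how many elements in the fixture match the given selector.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));                // 3: attribute present
        System.out.println(count("a[href^='https://']"));    // 1: starts with
        System.out.println(count("a[href$='.pdf']"));        // 1: ends with
        System.out.println(count("a[href*='example.com']")); // 3: contains (mailto link included)
    }
}
```

Note that the "contains" operator also matches the mailto: link here, which is a common reason to combine it with a prefix check.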

Advanced CSS Selectors

Hierarchical Selectors

// Descendant selector (any level)
Elements articleParagraphs = doc.select("article p");

// Direct child selector
Elements directChildren = doc.select("ul > li");

// Adjacent sibling selector
Elements nextElements = doc.select("h2 + p");

// General sibling selector
Elements siblings = doc.select("h2 ~ p");
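
The four combinators differ only in which relatives they accept, which is easiest to see on one small tree. This is a sketch over an invented article fragment; the class and method names are hypothetical.

```java
import java.util.List;

import org.jsoup.Jsoup;

class HierarchySketch {
    // Hypothetical fixture: one paragraph is nested one level deeper than the others
    static final String HTML =
        "<article>"
        + "<h2>Title</h2>"
        + "<p>Intro</p>"
        + "<div><p>Nested</p></div>"
        + "<p>Outro</p>"
        + "</article>";

    // Returns the text of each element matched by the selector, in document order.
    static List<String> texts(String css) {
        return Jsoup.parse(HTML).select(css).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("article p"));   // [Intro, Nested, Outro]
        System.out.println(texts("article > p")); // [Intro, Outro]
        System.out.println(texts("h2 + p"));      // [Intro]
        System.out.println(texts("h2 ~ p"));      // [Intro, Outro]
    }
}
```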

Pseudo-selectors

// First and last elements
Element firstItem = doc.select("li:first-child").first();
Element lastItem = doc.select("li:last-child").first();

// Nth elements
Elements evenRows = doc.select("tr:nth-child(even)");
Elements oddRows = doc.select("tr:nth-child(odd)");
Element thirdItem = doc.select("li:nth-child(3)").first();

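// Elements containing text (case-insensitive substring match, not an exact match)
Elements newsItems = doc.select("div:contains(news)");          // searches own and descendant text
Elements matchingText = doc.select("p:containsOwn(some text)"); // searches the element's own text only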

// Empty elements
Elements emptyDivs = doc.select("div:empty");
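
The difference between :contains and :containsOwn is easy to miss: the first looks at the text of the element and all of its descendants, the second only at text directly inside the element. A minimal sketch, assuming a made-up two-div fixture and hypothetical names:

```java
import java.util.List;

import org.jsoup.Jsoup;

class TextPseudoSketch {
    // Hypothetical fixture: "outer" holds its text in a child <p>, "inner" holds it directly
    static final String HTML =
        "<div id='outer'><p>Latest news today</p></div>"
        + "<div id='inner'>News archive</div>";

    // Returns the id of each element matched by the selector.
    static List<String> ids(String css) {
        return Jsoup.parse(HTML).select(css).eachAttr("id");
    }

    public static void main(String[] args) {
        // Both divs match: matching is case-insensitive and includes descendant text
        System.out.println(ids("div:contains(news)"));    // [outer, inner]
        // Only "inner" matches: "outer" has no text of its own
        System.out.println(ids("div:containsOwn(news)")); // [inner]
    }
}
```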

Practical Examples

Web Scraping Example

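import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebScrapingExample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://news.ycombinator.com")
            .userAgent("Mozilla/5.0")
            .get();

        // Extract news titles.
        // Note: class names such as "storylink" depend on the site's markup
        // and may have changed; inspect the live page and adjust the selector.
        Elements titles = doc.select("a.storylink");
        for (Element title : titles) {
            System.out.println("Title: " + title.text());
            // absUrl resolves relative links against the page URL
            System.out.println("URL: " + title.absUrl("href"));
        }

        // Extract user scores
        Elements scores = doc.select("span.score");
        for (Element score : scores) {
            System.out.println("Score: " + score.text());
        }
    }
}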

Form Data Extraction

public void extractFormData(Document doc) {
    // Get all forms in the document
    Elements forms = doc.select("form");

    for (Element form : forms) {
        System.out.println("Form action: " + form.attr("action"));

        // Get input fields
        Elements inputs = form.select("input");
        for (Element input : inputs) {
            String type = input.attr("type");
            String name = input.attr("name");
            String value = input.attr("value");

            System.out.printf("Input - Type: %s, Name: %s, Value: %s%n", 
                            type, name, value);
        }

        // Get select options
        Elements selects = form.select("select");
        for (Element select : selects) {
            Elements options = select.select("option");
            System.out.println("Select options:");
            for (Element option : options) {
                System.out.println("  " + option.text() + " = " + option.attr("value"));
            }
        }
    }
}
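
If the goal is the data a browser would submit rather than a dump of every field, jsoup's FormElement can collect it. The sketch below is one possible approach, assuming a made-up signup form; the class and method names are hypothetical. It relies on Elements.forms() and FormElement.formData(), which skip unchecked checkboxes and pick the selected option of a select.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

class FormDataSketch {
    // Returns the name/value pairs the first form in the HTML would submit.
    static Map<String, String> submittedValues(String html) {
        Document doc = Jsoup.parse(html);
        List<FormElement> forms = doc.select("form").forms();
        Map<String, String> values = new LinkedHashMap<>();
        if (forms.isEmpty()) {
            return values; // no form found
        }
        for (Connection.KeyVal kv : forms.get(0).formData()) {
            values.put(kv.key(), kv.value());
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical fixture: the checkbox is unchecked, so it is not submitted
        String html =
            "<form action='/signup'>"
            + "<input type='text' name='user' value='alice'>"
            + "<input type='checkbox' name='news' value='yes'>"
            + "<select name='plan'>"
            + "<option value='free'>Free</option>"
            + "<option value='pro' selected>Pro</option>"
            + "</select>"
            + "</form>";
        System.out.println(submittedValues(html));
    }
}
```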

Table Data Extraction

public void extractTableData(Document doc) {
    Elements tables = doc.select("table");

    for (Element table : tables) {
        // Extract headers (assumes the table has a <thead>; otherwise this matches nothing)
        Elements headers = table.select("thead tr th");
        System.out.println("Headers:");
        for (Element header : headers) {
            System.out.print(header.text() + "\t");
        }
        System.out.println();

        // Extract rows
        Elements rows = table.select("tbody tr");
        for (Element row : rows) {
            Elements cells = row.select("td");
            for (Element cell : cells) {
                System.out.print(cell.text() + "\t");
            }
            System.out.println();
        }
    }
}
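
Printing is fine for inspection, but most callers want the table as data. The following sketch collects rows into lists and does not assume a thead, since many real tables lack one. The fixture and names are made up, and nested tables are not handled.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableRowsSketch {
    // Returns each row of every table as a list of cell texts (header cells included).
    static List<List<String>> rows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> out = new ArrayList<>();
        for (Element row : doc.select("table tr")) {
            List<String> cells = new ArrayList<>();
            // Loop explicitly so empty cells are kept and columns stay aligned
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            out.add(cells);
        }
        return out;
    }

    public static void main(String[] args) {
        String html =
            "<table>"
            + "<tr><th>Name</th><th>Qty</th></tr>"
            + "<tr><td>Apple</td><td></td></tr>"
            + "</table>";
        // Two rows of two cells each; the second cell of the second row is an empty string
        System.out.println(rows(html));
    }
}
```

The explicit inner loop is deliberate: Elements.eachText() skips elements without text, which would silently shift columns when a cell is empty.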

Best Practices and Error Handling

Safe Element Access

// Always check for null when expecting single elements
Element element = doc.select("#myid").first();
if (element != null) {
    String text = element.text();
    String href = element.attr("href");
}

// Use isEmpty() for collections
Elements elements = doc.select(".myclass");
if (!elements.isEmpty()) {
    for (Element el : elements) {
        // Process elements
    }
}

// Get text with fallback
String title = doc.select("title").text();
if (title.isEmpty()) {
    title = "No title found";
}
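
The null check and the fallback can be folded into one small helper. This is a sketch rather than a jsoup feature: textOr is a hypothetical name, and it uses selectFirst(), which returns null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SafeTextSketch {
    // Returns the text of the first match, or the fallback when there is
    // no match or the matched element has no text.
    static String textOr(Element root, String css, String fallback) {
        Element el = root.selectFirst(css);
        if (el == null) {
            return fallback;
        }
        String text = el.text();
        return text.isEmpty() ? fallback : text;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<html><head><title>Home</title></head><body><p></p></body></html>");
        System.out.println(textOr(doc, "title", "No title found")); // Home
        System.out.println(textOr(doc, "#missing", "n/a"));         // n/a
        System.out.println(textOr(doc, "p", "empty"));              // empty
    }
}
```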

Combining Selectors

// Complex selector combinations
Elements items = doc.select("div.article:has(img):contains(breaking news)");
Elements links = doc.select("a[href]:not([href^='mailto:'])");
Elements validInputs = doc.select("input[type=text]:not([disabled])");

Performance Tips

// Cache frequently used selections
Elements navigationLinks = doc.select("nav a");

// Prefer specific selectors so you only get, and only process, the elements you need
Elements specificElements = doc.select("div.content > p.highlight");
// Broader alternative: doc.select("p.highlight") matches every p.highlight in the document

// Limit search scope when possible
Element contentDiv = doc.select("#content").first();
if (contentDiv != null) {
    Elements internalLinks = contentDiv.select("a");
}
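
When the same selector runs against many documents, the query string can be parsed once and reused. The sketch below assumes a jsoup version that offers Element.select(Evaluator), which the 1.17.2 release used above does; the class name and fixture are hypothetical, and any speed-up should be measured rather than assumed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

class CompiledSelectorSketch {
    // Parsed once, then reused for every document
    private static final Evaluator HIGHLIGHTS =
        QueryParser.parse("div.content > p.highlight");

    // Counts highlighted paragraphs that are direct children of div.content.
    static int countHighlights(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(HIGHLIGHTS).size();
    }

    public static void main(String[] args) {
        String html =
            "<div class='content'>"
            + "<p class='highlight'>a</p>"
            + "<section><p class='highlight'>b</p></section>"
            + "</div>"
            + "<p class='highlight'>c</p>";
        System.out.println(countHighlights(html)); // 1: only "a" is a direct child of div.content
    }
}
```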

CSS Selector Reference

| Selector | Description | Example |
|----------|-------------|---------|
| tag | Tag name | div, p, a |
| .class | Class name | .content, .highlight |
| #id | Element ID | #header, #main |
| [attr] | Has attribute | [href], [required] |
| [attr=value] | Attribute equals | [type=text] |
| [attr*=value] | Attribute contains | [href*=github] |
| [attr^=value] | Attribute starts with | [href^=https] |
| [attr$=value] | Attribute ends with | [href$=.pdf] |
| parent child | Descendant | div p |
| parent > child | Direct child | ul > li |
| prev + next | Adjacent sibling | h2 + p |
| prev ~ sibling | General sibling | h2 ~ p |
| :first-child | First child | li:first-child |
| :last-child | Last child | li:last-child |
| :nth-child(n) | Nth child | :nth-child(2) |
| :contains(text) | Contains text | :contains(news) |
| :empty | Empty element | div:empty |

The select() method always returns an Elements collection, which is empty (never null) when nothing matches. Iterate over it directly, or use first() and last() (both return null on an empty collection), get(index), and isEmpty() to access specific elements.
