Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM, CSS, and jQuery-like methods. Retrieving attribute values is one of the most common tasks when scraping web content.
Basic Attribute Extraction
To retrieve an element's attribute value using Jsoup, follow these steps:
- Parse the HTML to create a Document object
- Select the element using CSS selectors or traversal methods
- Extract the attribute value using the attr() method
Simple Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupAttributeExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
            + "<body><p><a href='https://example.com' title='Example Link'>Click here</a></p></body></html>";
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        // Extract different attributes
        String href = link.attr("href");
        String title = link.attr("title");
        String text = link.text();
        System.out.println("Href: " + href); // https://example.com
        System.out.println("Title: " + title); // Example Link
        System.out.println("Text: " + text); // Click here
    }
}
Multiple Attribute Extraction
When working with multiple elements, you can extract attributes from all matching elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MultipleAttributesExample {
    public static void main(String[] args) {
        String html = "<div>"
            + "<img src='image1.jpg' alt='First Image' width='100'>"
            + "<img src='image2.jpg' alt='Second Image' width='200'>"
            + "<img src='image3.jpg' alt='Third Image' width='150'>"
            + "</div>";
        Document doc = Jsoup.parse(html);
        Elements images = doc.select("img");
        for (Element img : images) {
            String src = img.attr("src");
            String alt = img.attr("alt");
            String width = img.attr("width");
            System.out.printf("Image: %s, Alt: %s, Width: %s%n", src, alt, width);
        }
    }
}
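When only one attribute is needed from every match, the loop above can be shortened with Elements.eachAttr(), which returns the values as a list. The following is a minimal sketch; the class name EachAttrSketch and the helper imageSources are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

import java.util.List;

public class EachAttrSketch {
    // Hypothetical helper: collects the src value of every <img> that has one.
    public static List<String> imageSources(String html) {
        Elements images = Jsoup.parse(html).select("img[src]");
        // eachAttr() skips elements that lack the attribute.
        return images.eachAttr("src");
    }

    public static void main(String[] args) {
        String html = "<div>"
            + "<img src='image1.jpg' alt='First Image'>"
            + "<img alt='No source here'>"
            + "<img src='image2.jpg' alt='Second Image'>"
            + "</div>";
        System.out.println(imageSources(html)); // [image1.jpg, image2.jpg]
    }
}
```

The explicit loop remains the better fit when several attributes of the same element are read together, as in the example above.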
Fetching from Remote URLs
When working with remote HTML pages, use Jsoup's connect() method with proper error handling:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class RemoteAttributeExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0")
                .timeout(5000) // milliseconds
                .get();
            // Extract all link attributes
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String href = link.attr("href");
                String text = link.text().trim();
                if (!href.isEmpty()) {
                    System.out.println("Link: " + href + " -> " + text);
                }
            }
        } catch (IOException e) {
            System.err.println("Error fetching page: " + e.getMessage());
        }
    }
}
Advanced Attribute Handling
Check if Attribute Exists
// first() returns null when nothing matches, so guard before calling hasAttr()
Element element = doc.select("img").first();
if (element != null && element.hasAttr("alt")) {
    String alt = element.attr("alt");
    System.out.println("Alt text: " + alt);
} else {
    System.out.println("No alt attribute found");
}
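Because attr() returns an empty string both for a missing attribute and for one that is present but empty (such as alt=''), hasAttr() is the way to tell the two apart. A minimal sketch of that distinction follows; the class name MissingVsEmptySketch and the method describe are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class MissingVsEmptySketch {
    // Hypothetical helper: classifies an attribute on the first <img> in the HTML.
    public static String describe(String html, String attrName) {
        Element img = Jsoup.parse(html).selectFirst("img");
        if (img == null) {
            return "no element";
        }
        if (!img.hasAttr(attrName)) {
            return "missing";
        }
        return img.attr(attrName).isEmpty() ? "empty" : "present";
    }

    public static void main(String[] args) {
        String html = "<img src='photo.png' alt=''>";
        System.out.println(describe(html, "src"));   // present
        System.out.println(describe(html, "alt"));   // empty
        System.out.println(describe(html, "title")); // missing
    }
}
```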
Get Absolute URLs
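// Convert relative URLs to absolute URLs. This needs a base URI:
// Jsoup.connect() sets it automatically, but with Jsoup.parse(html) you must
// pass one via Jsoup.parse(html, baseUri), or abs:href returns an empty string.
Element link = doc.select("a").first();
String absoluteHref = link.attr("abs:href");
System.out.println("Absolute URL: " + absoluteHref);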
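To make the role of the base URI concrete, here is a small sketch that parses the same HTML string with and without one. The class name AbsoluteUrlSketch, the helper resolveFirstLink, and the example.com paths are assumptions for illustration; absUrl("href") is used as the method equivalent of attr("abs:href").

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUrlSketch {
    // Hypothetical helper: resolves the href of the first link against baseUri.
    public static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/page.html'>Docs</a>";
        // With a base URI, the relative path is resolved against it.
        System.out.println(resolveFirstLink(html, "https://example.com/"));
        // prints https://example.com/docs/page.html

        // Without a base URI there is nothing to resolve against,
        // so the result is an empty string.
        System.out.println(resolveFirstLink(html, "").isEmpty()); // true
    }
}
```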
Default Values for Missing Attributes
// attr() returns an empty string (never null) when the attribute is missing,
// so test for emptiness to substitute a default value
String title = element.attr("title");
if (title.isEmpty()) {
    title = "No title available";
}

// Or use a helper method
public static String getAttrOrDefault(Element element, String attr, String defaultValue) {
    String value = element.attr(attr);
    return value.isEmpty() ? defaultValue : value;
}
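The same default-value pattern can be extended to numeric attributes such as width, which attr() always returns as a string. The sketch below is one possible approach, not part of the Jsoup API; the class name NumericAttrSketch and the helper intAttrOrDefault are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NumericAttrSketch {
    // Hypothetical helper: parses an attribute as an int, falling back to
    // defaultValue when the attribute is missing, empty, or not a number.
    public static int intAttrOrDefault(Element element, String attr, int defaultValue) {
        String raw = element.attr(attr).trim();
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        Element img = Jsoup.parse("<img src='image1.jpg' width='100'>").selectFirst("img");
        System.out.println(intAttrOrDefault(img, "width", 0));   // 100
        System.out.println(intAttrOrDefault(img, "height", -1)); // -1
    }
}
```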
Common Use Cases
Extracting Form Data
Elements forms = doc.select("form");
for (Element form : forms) {
    String action = form.attr("action");
    String method = form.attr("method");
    System.out.println("Form submits to: " + action + " via " + method);
    // Extract input fields
    Elements inputs = form.select("input");
    for (Element input : inputs) {
        String name = input.attr("name");
        String type = input.attr("type");
        String value = input.attr("value");
        System.out.printf("Input: %s (type: %s, value: %s)%n", name, type, value);
    }
}
Extracting Meta Tags
Elements metaTags = doc.select("meta");
for (Element meta : metaTags) {
    String name = meta.attr("name");
    String property = meta.attr("property");
    String content = meta.attr("content");
    if (!name.isEmpty()) {
        System.out.println("Meta " + name + ": " + content);
    } else if (!property.isEmpty()) {
        System.out.println("Property " + property + ": " + content);
    }
}
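When a single known meta tag is wanted rather than a full listing, an attribute selector can target it directly instead of looping over every tag. The following sketch assumes Open Graph style tags (property="og:..."); the class name MetaLookupSketch and the helper openGraph are hypothetical names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaLookupSketch {
    // Hypothetical helper: returns the content of <meta property="og:NAME">,
    // or an empty string when no such tag exists.
    public static String openGraph(String html, String name) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.selectFirst("meta[property=\"og:" + name + "\"]");
        return meta == null ? "" : meta.attr("content");
    }

    public static void main(String[] args) {
        String html = "<head>"
            + "<meta name='description' content='A sample page'>"
            + "<meta property='og:title' content='Sample Title'>"
            + "</head>";
        System.out.println(openGraph(html, "title"));           // Sample Title
        System.out.println(openGraph(html, "image").isEmpty()); // true
    }
}
```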
Setup and Dependencies
Maven Dependency
Add the Jsoup dependency to your pom.xml (version 1.17.2 is shown here; newer releases may be available):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Gradle Dependency
For Gradle projects, add to your build.gradle:
dependencies {
    implementation 'org.jsoup:jsoup:1.17.2'
}
Best Practices
- Always handle exceptions when fetching remote content
- Set appropriate timeouts to avoid hanging requests
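- Use specific CSS selectors, such as a[href] instead of a, so that unwanted elements are filtered out by the selector rather than in your own code
- Use hasAttr() when you need to tell a missing attribute from an empty one, since attr() returns an empty string in both cases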
- Use absolute URLs when working with links and images from remote pages
- Set a user agent when connecting to websites to avoid blocking
Check the official Jsoup documentation for the latest version and additional features.