How can I parse HTML content using JSoup in Java?
JSoup is a powerful Java library for parsing HTML content, manipulating DOM elements, and extracting data from web pages. It provides a convenient API similar to jQuery for selecting and manipulating HTML elements, making it an excellent choice for web scraping and HTML processing tasks in Java applications.
What is JSoup?
JSoup is an open-source Java library that parses real-world HTML into a DOM tree using an HTML5-compliant parser. It provides a fluent API for finding, extracting, and manipulating data, handles malformed HTML gracefully, and is particularly useful for web scraping, data extraction, and HTML cleaning tasks.
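As a minimal sketch of both behaviors, the example below parses a fragment with unclosed tags and sanitizes untrusted markup with Jsoup.clean(). It assumes JSoup 1.14 or later, where the Safelist class replaced the older Whitelist:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class MalformedHtmlExample {
    public static void main(String[] args) {
        // JSoup repairs unclosed tags and builds a well-formed DOM tree
        Document doc = Jsoup.parse("<p>Unclosed paragraph<div>Stray div");
        System.out.println(doc.body().html());

        // Jsoup.clean() strips any markup not permitted by the given Safelist;
        // the <script> tag and its contents are removed entirely
        String unsafe = "<p>Hello <script>alert('xss')</script><b>world</b></p>";
        String safe = Jsoup.clean(unsafe, Safelist.basic());
        System.out.println(safe); // <p>Hello <b>world</b></p>
    }
}
```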
Setting Up JSoup
Maven Dependency
Add JSoup to your Maven project by including this dependency in your pom.xml:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```
Gradle Dependency
For Gradle projects, add this to your build.gradle:

```groovy
implementation 'org.jsoup:jsoup:1.17.2'
```
Basic HTML Parsing
Parsing HTML from String
The most basic way to use JSoup is parsing HTML content from a string:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicJSoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title></head>"
                + "<body><p class='content'>Hello, World!</p>"
                + "<div id='main'><a href='https://example.com'>Link</a></div>"
                + "</body></html>";

        // Parse the HTML string into a Document
        Document doc = Jsoup.parse(html);

        // Extract the page title
        String title = doc.title();
        System.out.println("Title: " + title);

        // Extract text content via a CSS selector
        String content = doc.select("p.content").text();
        System.out.println("Content: " + content);
    }
}
```
Parsing HTML from URL
JSoup can directly fetch and parse HTML from web URLs:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class URLParsingExample {
    public static void main(String[] args) {
        try {
            // Fetch and parse HTML from a URL
            Document doc = Jsoup.connect("https://example.com")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(5000)
                    .get();

            // Extract data
            String title = doc.title();
            Elements links = doc.select("a[href]");

            System.out.println("Page title: " + title);
            System.out.println("Number of links: " + links.size());
        } catch (IOException e) {
            System.err.println("Error fetching page: " + e.getMessage());
        }
    }
}
```
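Jsoup.connect() returns a Connection whose settings can be chained before the request is sent. A brief sketch of a few commonly used options (the header and referrer values here are purely illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectionOptionsExample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://example.com")
                .header("Accept-Language", "en-US") // add a custom request header
                .referrer("https://google.com")     // set the Referer header
                .maxBodySize(1024 * 1024)           // cap the response body at 1 MB
                .followRedirects(true)              // follow 3xx redirects (the default)
                .get();

        System.out.println(doc.title());
    }
}
```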
Parsing HTML from File
You can also parse HTML content from local files:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

public class FileParsingExample {
    public static void main(String[] args) {
        try {
            File input = new File("path/to/your/file.html");
            // Pass the charset explicitly, or null to let JSoup detect it
            // from the file's BOM or <meta charset> tag
            Document doc = Jsoup.parse(input, "UTF-8");

            // Process the document
            String title = doc.title();
            System.out.println("Document title: " + title);
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}
```
CSS Selectors and Element Selection
JSoup supports powerful CSS selectors for finding elements:
Basic Selectors
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {
    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class='container'>"
                + "<h1 id='title'>Main Title</h1>"
                + "<p class='text'>First paragraph</p>"
                + "<p class='text highlight'>Second paragraph</p>"
                + "<ul><li>Item 1</li><li>Item 2</li></ul>"
                + "</div></body></html>";

        Document doc = Jsoup.parse(html);

        // Select by tag
        Elements paragraphs = doc.select("p");

        // Select by class
        Elements textElements = doc.select(".text");

        // Select by ID
        Element title = doc.select("#title").first();

        // Select by tag and class combined
        Elements highlighted = doc.select("p.highlight");

        // Complex selectors: descendants of div.container
        Elements listItems = doc.select("div.container ul li");

        System.out.println("Paragraphs found: " + paragraphs.size());
        System.out.println("Title text: " + (title != null ? title.text() : "Not found"));
    }
}
```
Advanced Selectors
```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AdvancedSelectors {
    public static void demonstrateSelectors(Document doc) {
        // Attribute selectors
        Elements linksWithHref = doc.select("a[href]");
        Elements externalLinks = doc.select("a[href^=http]"); // href starts with "http"
        Elements pdfLinks = doc.select("a[href$=.pdf]");      // href ends with ".pdf"

        // Pseudo-selectors
        Element firstParagraph = doc.select("p:first-child").first();
        Elements evenTableRows = doc.select("tr:nth-child(even)");

        // Combinators
        Elements directChildren = doc.select("div > p"); // direct children only
        Elements descendants = doc.select("div p");      // descendants at any depth
        Elements siblings = doc.select("h1 + p");        // immediately following sibling

        // Text content selectors
        Elements containsText = doc.select("p:contains(specific text)");
        Elements matchesRegex = doc.select("p:matches(\\d+)"); // text matches a regex
    }
}
```
Data Extraction and Manipulation
Extracting Text and Attributes
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DataExtraction {
    public static void main(String[] args) {
        String html = "<div><a href='https://example.com' title='Example Site'>Visit Example</a>"
                + "<img src='image.jpg' alt='Sample Image' width='300'>"
                + "<p data-id='123'>This is a paragraph with custom data.</p></div>";

        Document doc = Jsoup.parse(html);

        // Extract text content and attributes from a link
        Element link = doc.select("a").first();
        if (link != null) {
            String linkText = link.text();
            String href = link.attr("href");
            String title = link.attr("title");
            System.out.println("Link text: " + linkText);
            System.out.println("Link URL: " + href);
            System.out.println("Link title: " + title);
        }

        // Extract image attributes
        Element image = doc.select("img").first();
        if (image != null) {
            String src = image.attr("src");
            String alt = image.attr("alt");
            String width = image.attr("width");
            System.out.println("Image source: " + src);
            System.out.println("Alt text: " + alt);
            System.out.println("Width: " + width);
        }

        // Extract custom data attributes
        Element paragraph = doc.select("p[data-id]").first();
        if (paragraph != null) {
            String dataId = paragraph.attr("data-id");
            String text = paragraph.text();
            System.out.println("Data ID: " + dataId);
            System.out.println("Paragraph text: " + text);
        }
    }
}
```
Modifying HTML Content
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLModification {
    public static void main(String[] args) {
        String html = "<html><body><div id='content'><p>Original content</p></div></body></html>";
        Document doc = Jsoup.parse(html);

        // Modify text content
        Element content = doc.select("#content p").first();
        if (content != null) {
            content.text("Modified content");
        }

        // Add new elements
        Element contentDiv = doc.select("#content").first();
        if (contentDiv != null) {
            contentDiv.append("<p>New paragraph added</p>");
            contentDiv.prepend("<h2>Added Title</h2>");
        }

        // Modify attributes
        Element paragraph = doc.select("p").first();
        if (paragraph != null) {
            paragraph.attr("class", "modified");
            paragraph.attr("data-modified", "true");
        }

        // Remove elements whose text contains "Modified"
        // (the original paragraph's text was changed above)
        Elements toRemove = doc.select("p:contains(Modified)");
        toRemove.remove();

        // Output the modified HTML
        System.out.println(doc.html());
    }
}
```
Web Scraping with JSoup
Complete Web Scraping Example
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WebScrapingExample {

    static class Article {
        private final String title;
        private final String url;
        private final String summary;

        public Article(String title, String url, String summary) {
            this.title = title;
            this.url = url;
            this.summary = summary;
        }

        @Override
        public String toString() {
            return "Article{title='" + title + "', url='" + url + "', summary='" + summary + "'}";
        }
    }

    public static List<Article> scrapeArticles(String url) {
        List<Article> articles = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(10000)
                    .followRedirects(true)
                    .get();

            Elements articleElements = doc.select("article, .article, .post");

            for (Element article : articleElements) {
                String title = extractTitle(article);
                String articleUrl = extractUrl(article);
                String summary = extractSummary(article);

                if (title != null && !title.isEmpty()) {
                    articles.add(new Article(title, articleUrl, summary));
                }
            }
        } catch (IOException e) {
            System.err.println("Error scraping articles: " + e.getMessage());
        }
        return articles;
    }

    private static String extractTitle(Element article) {
        Element titleElement = article.select("h1, h2, h3, .title, .headline").first();
        return titleElement != null ? titleElement.text().trim() : null;
    }

    private static String extractUrl(Element article) {
        Element linkElement = article.select("a[href]").first();
        // absUrl() resolves relative links against the document's base URI,
        // which Jsoup.connect() sets to the fetched URL
        return linkElement != null ? linkElement.absUrl("href") : null;
    }

    private static String extractSummary(Element article) {
        Element summaryElement = article.select("p, .summary, .excerpt").first();
        return summaryElement != null ? summaryElement.text().trim() : "";
    }

    public static void main(String[] args) {
        List<Article> articles = scrapeArticles("https://example-news-site.com");
        articles.forEach(System.out::println);
    }
}
```
Advanced JSoup Features
Handling Forms and POST Requests
```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Map;

public class FormHandling {
    public static void submitForm() {
        try {
            // First, get the form page
            Document formPage = Jsoup.connect("https://example.com/login")
                    .get();

            // Extract any hidden form fields or CSRF tokens
            String csrfToken = formPage.select("input[name=_token]").attr("value");

            // Submit form data
            Connection.Response response = Jsoup.connect("https://example.com/login")
                    .data("username", "your_username")
                    .data("password", "your_password")
                    .data("_token", csrfToken)
                    .method(Connection.Method.POST)
                    .execute();

            // Get cookies from response
            Map<String, String> cookies = response.cookies();

            // Use cookies for subsequent requests
            Document loggedInPage = Jsoup.connect("https://example.com/dashboard")
                    .cookies(cookies)
                    .get();

            System.out.println("Logged in successfully");
        } catch (IOException e) {
            System.err.println("Form submission failed: " + e.getMessage());
        }
    }
}
```
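If you need the page returned by the POST itself (for example, to check for an error message), the Connection.Response can be parsed directly. A small sketch against the same hypothetical login endpoint:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ParseResponseExample {
    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup.connect("https://example.com/login")
                .data("username", "your_username")
                .method(Connection.Method.POST)
                .execute();

        // parse() turns the response body into a Document,
        // so you can inspect the page the POST returned
        Document result = response.parse();
        System.out.println("Response title: " + result.title());
        System.out.println("Status: " + response.statusCode());
    }
}
```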
Error Handling and Best Practices
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.SocketTimeoutException;

public class RobustScraping {
    public static Document safeConnect(String url, int maxRetries) {
        int retries = 0;
        while (retries < maxRetries) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; JavaBot)")
                        .timeout(10000)
                        .followRedirects(true)
                        .ignoreHttpErrors(true)
                        .get();
            } catch (SocketTimeoutException e) {
                retries++;
                System.err.println("Timeout occurred, retry " + retries + "/" + maxRetries);
                if (retries < maxRetries) {
                    try {
                        Thread.sleep(2000); // Wait before retrying
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            } catch (IOException e) {
                System.err.println("Connection failed: " + e.getMessage());
                break;
            }
        }
        return null;
    }

    public static void safeExtraction(Document doc) {
        if (doc == null) {
            System.err.println("Document is null, cannot extract data");
            return;
        }

        // Safe element selection with empty checks
        Elements titles = doc.select("h1");
        if (!titles.isEmpty()) {
            String title = titles.first().text();
            System.out.println("Title: " + title);
        } else {
            System.out.println("No title found");
        }

        // Safe attribute extraction
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href");
            String text = link.text();
            if (!href.isEmpty() && !text.isEmpty()) {
                System.out.println("Link: " + text + " -> " + href);
            }
        }
    }
}
```
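Because ignoreHttpErrors(true) suppresses exceptions for 4xx/5xx responses, it is worth checking the status code explicitly before trusting the parsed document. A minimal sketch using execute():

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class StatusCheckExample {
    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup.connect("https://example.com")
                .ignoreHttpErrors(true) // don't throw on 4xx/5xx
                .execute();

        if (response.statusCode() == 200) {
            Document doc = response.parse();
            System.out.println("Fetched: " + doc.title());
        } else {
            System.err.println("HTTP error: " + response.statusCode()
                    + " " + response.statusMessage());
        }
    }
}
```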
Performance Optimization
Efficient Memory Usage
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class PerformanceOptimization {
    public static void optimizedParsing() {
        String html = "<!-- Large HTML content -->";

        // Parse with an explicit parser; Parser.htmlParser() is the default,
        // and Parser.xmlParser() is available for XML documents
        Document doc = Jsoup.parse(html, "", Parser.htmlParser());

        // Process elements in batches so intermediate results can be
        // garbage-collected between batches, reducing peak memory usage
        Elements elements = doc.select("div");
        int batchSize = 100;

        for (int i = 0; i < elements.size(); i += batchSize) {
            int end = Math.min(i + batchSize, elements.size());
            Elements batch = new Elements(elements.subList(i, end));
            processBatch(batch);
        }
    }

    private static void processBatch(Elements batch) {
        for (Element element : batch) {
            // Process individual elements
            String text = element.text();
            // Do something with the text
        }
    }
}
```
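A related micro-optimization: selectFirst() stops at the first match, whereas select(...).first() collects every matching element before taking the first. A short sketch of the difference:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectFirstExample {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><p>a</p><p>b</p><p>c</p></div>");

        // Collects all three <p> elements, then takes the first
        Element viaSelect = doc.select("p").first();

        // Short-circuits after the first <p>; returns null if nothing matches
        Element viaSelectFirst = doc.selectFirst("p");

        System.out.println(viaSelect.text() + " == " + viaSelectFirst.text());
    }
}
```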
Comparison with Other Technologies
While JSoup is excellent for server-side HTML parsing in Java, it does not execute JavaScript. For JavaScript-heavy websites, you may need a browser-automation tool such as Puppeteer to render dynamic content or handle interactive authentication, after which the resulting HTML can still be handed to JSoup for parsing.
Conclusion
JSoup is a powerful and flexible library for HTML parsing in Java applications. It provides an intuitive API for selecting, extracting, and manipulating HTML content, making it ideal for web scraping, data extraction, and HTML processing tasks. Key benefits include:
- Easy to use: jQuery-like syntax for element selection
- Robust parsing: Handles malformed HTML gracefully
- Rich API: Comprehensive methods for data extraction and manipulation
- Network support: Built-in HTTP client for web scraping
- Memory efficient: Optimized for processing large HTML documents
Whether you're building a web scraper, processing HTML files, or extracting data from web pages, JSoup provides the tools you need to work effectively with HTML content in Java applications.
Remember to always respect robots.txt files, implement proper rate limiting, and follow ethical web scraping practices when using JSoup for web scraping projects.
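As a starting point for rate limiting, a minimal sketch of a polite crawler that pauses between requests (the one-second delay and URLs are illustrative; check each site's robots.txt and terms for its actual rules):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.List;

public class PoliteCrawler {
    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> urls = List.of(
                "https://example.com/page1",
                "https://example.com/page2");

        for (String url : urls) {
            Document doc = Jsoup.connect(url).get();
            System.out.println(url + " -> " + doc.title());
            Thread.sleep(1000); // rate limit: pause between requests
        }
    }
}
```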