Jsoup is a powerful Java library for parsing and manipulating HTML documents. It provides a convenient API to extract and manipulate data using DOM, CSS selectors, and jQuery-like methods. This guide shows how to integrate jsoup into your Android application.
Step 1: Add jsoup Dependency
Add the jsoup library to the dependencies section of your build.gradle (Module: app) file:
dependencies {
    implementation 'org.jsoup:jsoup:1.17.2' // Use the latest version
    // other dependencies...
}
After adding the dependency, sync your project by clicking "Sync Now" in Android Studio.
Step 2: Add Internet Permission
Add the Internet permission to your AndroidManifest.xml file:
<uses-permission android:name="android.permission.INTERNET" />
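Apps targeting Android 9 (API level 28) or higher block cleartext HTTP traffic by default. HTTPS URLs need no extra configuration; if you must load plain http:// URLs, also reference a network security config: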
<application
    android:networkSecurityConfig="@xml/network_security_config"
    ...>
</application>
Create res/xml/network_security_config.xml:
<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
    <domain-config cleartextTrafficPermitted="true">
        <domain includeSubdomains="true">example.com</domain>
    </domain-config>
</network-security-config>
Step 3: Modern Threading with ExecutorService
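Network requests must run off the main thread (Android throws NetworkOnMainThreadException otherwise), and AsyncTask has been deprecated since API level 30, so use a modern threading approach. Here's an example using ExecutorService: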
Java Implementation
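import android.os.Bundle;
import android.os.Handler;
import android.os.Looper;
import android.util.Log;
import android.widget.TextView;
import android.widget.Toast;
import androidx.appcompat.app.AppCompatActivity;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MainActivity extends AppCompatActivity {
    private ExecutorService executor;
    private Handler mainHandler;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        // Background pool for network work; the handler posts results to the UI thread
        executor = Executors.newFixedThreadPool(4);
        mainHandler = new Handler(Looper.getMainLooper());
        fetchWebsiteData();
    }

    private void fetchWebsiteData() {
        executor.execute(() -> {
            try {
                // Connect to the website (blocking call, so it runs on the executor)
                Document document = Jsoup.connect("https://example.com")
                        .userAgent("Mozilla/5.0 (Android)")
                        .timeout(10000) // milliseconds
                        .get();
                // Extract data
                String title = document.title();
                Elements links = document.select("a[href]");
                // Update UI on main thread
                mainHandler.post(() -> {
                    TextView titleView = findViewById(R.id.titleTextView);
                    titleView.setText(title);
                    TextView linksCount = findViewById(R.id.linksCountTextView);
                    linksCount.setText("Links found: " + links.size());
                });
            } catch (IOException e) {
                Log.e("MainActivity", "Error fetching website", e);
                mainHandler.post(() -> {
                    Toast.makeText(this, "Error loading website", Toast.LENGTH_SHORT).show();
                });
            }
        });
    }

    @Override
    protected void onDestroy() {
        super.onDestroy();
        // Stop accepting new tasks when the activity goes away
        executor.shutdown();
    }
}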
Kotlin with Coroutines
import android.os.Bundle
import android.util.Log
import android.widget.TextView
import android.widget.Toast
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import java.io.IOException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import org.jsoup.Jsoup

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        fetchWebsiteData()
    }

    private fun fetchWebsiteData() {
        lifecycleScope.launch {
            try {
                // Run the blocking network call on the IO dispatcher
                val result = withContext(Dispatchers.IO) {
                    val document = Jsoup.connect("https://example.com")
                        .userAgent("Mozilla/5.0 (Android)")
                        .timeout(10000) // milliseconds
                        .get()
                    WebsiteData(
                        title = document.title(),
                        linkCount = document.select("a[href]").size,
                        description = document.select("meta[name=description]").attr("content")
                    )
                }
                // Update UI (lifecycleScope resumes on the main thread)
                findViewById<TextView>(R.id.titleTextView).text = result.title
                findViewById<TextView>(R.id.linksCountTextView).text = "Links: ${result.linkCount}"
                findViewById<TextView>(R.id.descriptionTextView).text = result.description
            } catch (e: IOException) {
                Log.e("MainActivity", "Error fetching website", e)
                Toast.makeText(this@MainActivity, "Error loading website", Toast.LENGTH_SHORT).show()
            }
        }
    }
}

data class WebsiteData(
    val title: String,
    val linkCount: Int,
    val description: String
)
Advanced jsoup Usage
Parsing HTML from Different Sources
// From URL
Document doc1 = Jsoup.connect("https://example.com").get();
// From HTML string
String html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>";
Document doc2 = Jsoup.parse(html);
// From file
File input = new File("/path/to/file.html");
Document doc3 = Jsoup.parse(input, "UTF-8", "https://example.com/");
CSS Selectors and Data Extraction
// Select elements by tag
Elements paragraphs = document.select("p");
// Select by class
Elements articles = document.select(".article");
// Select by ID
Element header = document.selectFirst("#header");
// Complex selectors
Elements productPrices = document.select("div.product .price");
// Extract attributes
for (Element link : document.select("a[href]")) {
    String url = link.attr("href");
    String text = link.text();
    System.out.println(text + " -> " + url);
}
Connection Configuration
Document document = Jsoup.connect("https://example.com")
        .userAgent("Mozilla/5.0 (Android)")
        .timeout(10000)
        .followRedirects(true)
        .header("Accept-Language", "en-US,en;q=0.9")
        .cookie("session", "abc123")
        .get();
Best Practices
Error Handling
Always wrap jsoup operations in try-catch blocks and handle network errors gracefully:
try {
    Document document = Jsoup.connect(url).get();
    // Process document
} catch (HttpStatusException e) {
    Log.e("Jsoup", "HTTP error: " + e.getStatusCode());
} catch (SocketTimeoutException e) {
    Log.e("Jsoup", "Timeout error");
} catch (IOException e) {
    Log.e("Jsoup", "Connection error", e);
}
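Transient failures such as timeouts are often worth retrying before giving up. The following is a minimal sketch of one way to do that with exponential backoff. RetryingFetcher and withRetry are hypothetical names, not part of jsoup, and the attempt count and delays are illustrative. Note that HttpStatusException is also an IOException, so a real app would probably skip retries for 4xx responses.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of jsoup): retries an I/O task with exponential backoff.
final class RetryingFetcher {
    private RetryingFetcher() {}

    /**
     * Runs the task up to maxAttempts times. After each failed attempt it sleeps
     * initialDelayMs, then twice that, and so on. Only IOException is retried;
     * any other exception propagates immediately.
     */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // double the wait before the next attempt
            }
        }
    }
}
```

A call might look like RetryingFetcher.withRetry(() -> Jsoup.connect(url).get(), 3, 500), run on a background thread because the helper blocks while it sleeps.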
Performance Tips
- Set appropriate timeouts
- Reuse one session for multiple requests (Jsoup.newSession(), available in recent jsoup releases) so settings and cookies are shared
- Cache parsed documents when possible
- Limit the number of concurrent requests
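As an illustration of the caching tip, here is a minimal sketch of a small in-memory cache with least-recently-used eviction and a time-to-live. PageCache is a hypothetical class, not a jsoup API; the value type is generic so it could hold a parsed Document or just the extracted fields, and the size and TTL are assumptions to tune per app.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: in-memory LRU cache whose entries expire after ttlMs.
final class PageCache<V> {
    private static final class CachedValue<V> {
        final V value;
        final long expiresAtMs;

        CachedValue(V value, long expiresAtMs) {
            this.value = value;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final long ttlMs;
    private final Map<String, CachedValue<V>> entries;

    PageCache(int maxEntries, long ttlMs) {
        this.ttlMs = ttlMs;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cache is over capacity.
        this.entries = new LinkedHashMap<String, CachedValue<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedValue<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached value for key if it is still fresh; otherwise calls
     * the loader, stores the result, and returns it. The lock is held while
     * loading, which keeps the sketch simple but serializes loads.
     */
    synchronized V get(String key, Callable<V> loader) throws Exception {
        long now = System.currentTimeMillis();
        CachedValue<V> cached = entries.get(key);
        if (cached != null && cached.expiresAtMs > now) {
            return cached.value;
        }
        V fresh = loader.call();
        entries.put(key, new CachedValue<>(fresh, now + ttlMs));
        return fresh;
    }
}
```

With a cache created as new PageCache<Document>(20, 60_000), a lookup could be cache.get(url, () -> Jsoup.connect(url).get()), again from a background thread.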
Ethical Considerations
- Respect robots.txt files
- Add delays between requests
- Use appropriate User-Agent headers
- Follow website terms of service
- Consider rate limiting your requests
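One simple way to add delays is a gate that enforces a minimum gap between requests to the same host. The sketch below assumes a fixed gap per host. PoliteDelay and its await method are hypothetical names, and a suitable gap depends on the site.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: enforces a minimum gap between requests to the same host.
final class PoliteDelay {
    private final long minGapMs;
    // Earliest time (monotonic clock, in ms) at which each host may be hit again
    private final Map<String, Long> nextAllowedMs = new HashMap<>();

    PoliteDelay(long minGapMs) {
        this.minGapMs = minGapMs;
    }

    /** Blocks until a request to host is allowed, then reserves the next slot. */
    void await(String host) throws InterruptedException {
        long waitMs;
        synchronized (this) {
            long now = System.nanoTime() / 1_000_000;
            Long previous = nextAllowedMs.get(host);
            long allowedAt = (previous == null) ? now : Math.max(now, previous);
            nextAllowedMs.put(host, allowedAt + minGapMs);
            waitMs = allowedAt - now;
        }
        // Sleep outside the lock so requests to other hosts are not held up
        if (waitMs > 0) {
            Thread.sleep(waitMs);
        }
    }
}
```

Calling gate.await("example.com") on the background thread before each Jsoup.connect(...) call spaces out requests to that host without delaying requests to other hosts.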
Limitations
- JavaScript content: jsoup only parses static HTML and cannot execute JavaScript
- Dynamic content: Content loaded via AJAX won't be available
- Real-time data: jsoup provides a snapshot of the page at request time
For JavaScript-heavy sites, consider rendering the page in a WebView inside the app, or fetching it server-side with a headless browser such as Chrome driven by a tool like Selenium.
Troubleshooting
Common Issues
- SSL/TLS errors: Check the server's certificate and, if needed, adjust the network security config
- Timeout errors: Increase timeout values or check network connectivity
- Empty results: Verify CSS selectors and HTML structure
- 403/404 errors: Check URL validity and add proper headers
Debugging Tips
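// This standard Java property requests the system proxy settings, which helps when
// inspecting traffic through a debugging proxy; it does not enable jsoup logging
System.setProperty("java.net.useSystemProxies", "true");
// Log response details; ignoreHttpErrors(true) returns 4xx/5xx responses instead of throwing
Connection.Response response = Jsoup.connect(url).ignoreHttpErrors(true).execute();
System.out.println("Status: " + response.statusCode());
System.out.println("Content-Type: " + response.contentType());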