How do I use jsoup in an Android application?

Jsoup is a powerful Java library for parsing and manipulating HTML documents. It provides a convenient API to extract and manipulate data using DOM, CSS selectors, and jQuery-like methods. This guide shows how to integrate jsoup into your Android application.

Step 1: Add jsoup Dependency

Add the jsoup library to your build.gradle (Module: app) file in the dependencies section:

dependencies {
    implementation 'org.jsoup:jsoup:1.17.2' // Use the latest version
    // other dependencies...
}

After adding the dependency, sync your project by clicking "Sync Now" in Android Studio.

Step 2: Add Internet Permission

Add the Internet permission to your AndroidManifest.xml file:

<uses-permission android:name="android.permission.INTERNET" />

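Apps targeting Android 9 (API level 28) and higher block cleartext HTTP traffic by default. HTTPS URLs need no extra configuration. If you must load plain http:// URLs, reference a network security config from the <application> element: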

<application
    android:networkSecurityConfig="@xml/network_security_config"
    ...>
</application>

Create res/xml/network_security_config.xml:

<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
    <domain-config cleartextTrafficPermitted="true">
        <domain includeSubdomains="true">example.com</domain>
    </domain-config>
</network-security-config>

Step 3: Modern Threading with ExecutorService

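Android throws NetworkOnMainThreadException if you make a network request on the main thread, so jsoup connections must run in the background. AsyncTask was deprecated in API level 30, so use a modern threading approach instead. Here's an example using ExecutorService: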

Java Implementation

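import android.os.Bundle;
import android.os.Handler;
import android.os.Looper;
import android.util.Log;
import android.widget.TextView;
import android.widget.Toast;

import androidx.appcompat.app.AppCompatActivity;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MainActivity extends AppCompatActivity {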
    private ExecutorService executor;
    private Handler mainHandler;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        executor = Executors.newFixedThreadPool(4);
        mainHandler = new Handler(Looper.getMainLooper());

        fetchWebsiteData();
    }

    private void fetchWebsiteData() {
        executor.execute(() -> {
            try {
                // Connect to the website
                Document document = Jsoup.connect("https://example.com")
                    .userAgent("Mozilla/5.0 (Android)")
                    .timeout(10000)
                    .get();

                // Extract data
                String title = document.title();
                Elements links = document.select("a[href]");

                // Update UI on main thread
                mainHandler.post(() -> {
                    TextView titleView = findViewById(R.id.titleTextView);
                    titleView.setText(title);

                    TextView linksCount = findViewById(R.id.linksCountTextView);
                    linksCount.setText("Links found: " + links.size());
                });

            } catch (IOException e) {
                Log.e("MainActivity", "Error fetching website", e);
                mainHandler.post(() -> {
                    Toast.makeText(this, "Error loading website", Toast.LENGTH_SHORT).show();
                });
            }
        });
    }

    @Override
    protected void onDestroy() {
        super.onDestroy();
        executor.shutdown();
    }
}
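
The pattern in the activity above (do the blocking work on a background executor, then hand the result back to the main thread) can be factored into a small reusable helper. The following is a minimal sketch rather than an established API: `BackgroundTask` and `IoCall` are hypothetical names introduced here, and `java.util.function.Consumer` assumes API level 24+ or core library desugaring.

```java
import java.io.IOException;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Hypothetical helper: runs blocking I/O on one executor and reports back on another. */
final class BackgroundTask {

    /** A unit of blocking work that may fail with an IOException, such as a jsoup request. */
    interface IoCall<T> {
        T run() throws IOException;
    }

    private BackgroundTask() {}

    static <T> void run(Executor background, Executor callback, IoCall<T> work,
                        Consumer<T> onSuccess, Consumer<IOException> onError) {
        background.execute(() -> {
            try {
                // Blocking call happens on the background executor
                T result = work.run();
                // Deliver the result on the callback executor (the main thread on Android)
                callback.execute(() -> onSuccess.accept(result));
            } catch (IOException e) {
                callback.execute(() -> onError.accept(e));
            }
        });
    }
}
```

In the activity, `background` would be the `executor` field and `callback` could be `mainHandler::post`, with the jsoup call passed as `work`.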

Kotlin with Coroutines

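import android.os.Bundle
import android.util.Log
import android.widget.TextView
import android.widget.Toast
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import java.io.IOException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import org.jsoup.Jsoup

class MainActivity : AppCompatActivity() {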

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        fetchWebsiteData()
    }

    private fun fetchWebsiteData() {
        lifecycleScope.launch {
            try {
                val result = withContext(Dispatchers.IO) {
                    val document = Jsoup.connect("https://example.com")
                        .userAgent("Mozilla/5.0 (Android)")
                        .timeout(10000)
                        .get()

                    WebsiteData(
                        title = document.title(),
                        linkCount = document.select("a[href]").size,
                        description = document.select("meta[name=description]").attr("content")
                    )
                }

                // Update UI (automatically on main thread)
                findViewById<TextView>(R.id.titleTextView).text = result.title
                findViewById<TextView>(R.id.linksCountTextView).text = "Links: ${result.linkCount}"
                findViewById<TextView>(R.id.descriptionTextView).text = result.description

            } catch (e: IOException) {
                Log.e("MainActivity", "Error fetching website", e)
                Toast.makeText(this@MainActivity, "Error loading website", Toast.LENGTH_SHORT).show()
            }
        }
    }
}

data class WebsiteData(
    val title: String,
    val linkCount: Int,
    val description: String
)

Advanced jsoup Usage

Parsing HTML from Different Sources

// From URL
Document doc1 = Jsoup.connect("https://example.com").get();

// From HTML string
String html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>";
Document doc2 = Jsoup.parse(html);

// From file
File input = new File("/path/to/file.html");
Document doc3 = Jsoup.parse(input, "UTF-8", "https://example.com/");

CSS Selectors and Data Extraction

// Select elements by tag
Elements paragraphs = document.select("p");

// Select by class
Elements articles = document.select(".article");

// Select by ID
Element header = document.selectFirst("#header");

// Complex selectors
Elements productPrices = document.select("div.product .price");

// Extract attributes
for (Element link : document.select("a[href]")) {
    String url = link.absUrl("href"); // absolute URL; attr("href") returns the raw, possibly relative value
    String text = link.text();
    System.out.println(text + " -> " + url);
}

Connection Configuration

Document document = Jsoup.connect("https://example.com")
    .userAgent("Mozilla/5.0 (Android)")
    .timeout(10000)
    .followRedirects(true)
    .header("Accept-Language", "en-US,en;q=0.9")
    .cookie("session", "abc123")
    .get();

Best Practices

Error Handling

Always wrap jsoup operations in try-catch blocks and handle network errors gracefully:

try {
    Document document = Jsoup.connect(url).get();
    // Process document
} catch (HttpStatusException e) {
    Log.e("Jsoup", "HTTP error: " + e.getStatusCode());
} catch (SocketTimeoutException e) {
    Log.e("Jsoup", "Timeout error");
} catch (IOException e) {
    Log.e("Jsoup", "Connection error", e);
}
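
Handling errors gracefully usually also means telling the user something more useful than a stack trace. The sketch below maps common JDK network exceptions to short messages. `NetworkErrors` and the message texts are illustrative assumptions, not part of jsoup or Android. jsoup's own HttpStatusException is left out so the example depends only on the JDK, and it would need its own branch placed before the generic fallback.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLException;

/** Hypothetical helper that turns network exceptions into user-facing messages. */
final class NetworkErrors {

    private NetworkErrors() {}

    static String describe(IOException e) {
        // Check the specific subclasses first; all of them extend IOException
        if (e instanceof SocketTimeoutException) {
            return "The request timed out. Please try again.";
        }
        if (e instanceof UnknownHostException) {
            return "Could not reach the server. Check your connection.";
        }
        if (e instanceof SSLException) {
            return "A secure connection could not be established.";
        }
        return "Could not load the page.";
    }
}
```

The returned string can be shown in the Toast from Step 3 in place of a fixed message.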

Performance Tips

  • Set appropriate timeouts
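  • Reuse one session for multiple requests to the same site (for example via Jsoup.newSession(), available since jsoup 1.14.1) so settings and cookies are shared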
  • Cache parsed documents when possible
  • Limit the number of concurrent requests
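
As one way to act on the caching tip, here is a minimal sketch of a small in-memory cache keyed by URL that evicts the least recently used entry once it is full. `LruPageCache` is a hypothetical name, and the value type is left generic so it can hold a parsed Document or just the extracted data. On Android, `android.util.LruCache` offers similar behavior out of the box.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache: keeps at most maxEntries pages, evicting the least recently used. */
final class LruPageCache<V> {

    private final Map<String, V> entries;

    LruPageCache(final int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be >= 1");
        }
        // accessOrder = true makes iteration order "least recently accessed first"
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached page for this URL, or null if it is not cached. */
    synchronized V get(String url) {
        return entries.get(url);
    }

    synchronized void put(String url, V page) {
        entries.put(url, page);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A caller would check `get(url)` before issuing a request and `put(url, result)` after a successful fetch. This sketch has no expiry, so it suits pages that change rarely.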

Ethical Considerations

  • Respect robots.txt files
  • Add delays between requests
  • Use appropriate User-Agent headers
  • Follow website terms of service
  • Consider rate limiting your requests
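
The delay and rate-limiting points above can be sketched as a small throttle that enforces a minimum interval between requests. `RequestThrottle` is a hypothetical class written for this guide, and the interval you pick is an assumption to tune per site.

```java
/** Hypothetical throttle: spaces requests at least minIntervalMillis apart. */
final class RequestThrottle {

    private final long minIntervalMillis;
    private long nextAllowedAt = Long.MIN_VALUE; // earliest time the next request may start

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("minIntervalMillis must be >= 0");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next request slot and returns how many milliseconds the
     * caller should wait, given the current time in milliseconds.
     */
    synchronized long reserve(long nowMillis) {
        long startAt = Math.max(nowMillis, nextAllowedAt);
        nextAllowedAt = startAt + minIntervalMillis;
        return startAt - nowMillis;
    }

    /** Blocks the calling (background) thread until the next request may be sent. */
    void await() throws InterruptedException {
        long waitMillis = reserve(System.nanoTime() / 1_000_000L);
        if (waitMillis > 0) {
            Thread.sleep(waitMillis);
        }
    }
}
```

Calling `await()` right before each `Jsoup.connect(...)` on the background thread spaces the requests out. It must never be called on the main thread, since it sleeps.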

Limitations

  • JavaScript content: jsoup only parses static HTML and cannot execute JavaScript
  • Dynamic content: Content loaded via AJAX won't be available
  • Real-time data: jsoup provides a snapshot of the page at request time

For JavaScript-heavy sites, consider rendering the page in a WebView on the device, or running a headless browser such as Chrome (driven by a tool like Selenium) on a server and parsing the HTML it produces.

Troubleshooting

Common Issues

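  1. SSL/TLS errors: Check that the URL uses https:// and that the server certificate is valid; add a network security config only if you need cleartext HTTP or custom trust anchors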
  2. Timeout errors: Increase timeout values or check network connectivity
  3. Empty results: Verify CSS selectors and HTML structure
  4. 403/404 errors: Check URL validity and add proper headers
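
For intermittent timeouts (issue 2 above), retrying a few times with a growing delay often helps more than raising the timeout alone. The following is a minimal sketch: `Retry`, `IoCall`, and the parameter values are illustrative assumptions, not part of jsoup.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical helper: retries a blocking call on timeout, doubling the delay each time. */
final class Retry {

    /** A blocking call that may fail with an IOException, such as a jsoup request. */
    interface IoCall<T> {
        T run() throws IOException;
    }

    private Retry() {}

    static <T> T withBackoff(IoCall<T> call, int maxAttempts, long initialDelayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMillis = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.run();
            } catch (SocketTimeoutException e) {
                // Only timeouts are retried; other IOExceptions propagate immediately
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2;
            }
        }
    }
}
```

A call such as `Retry.withBackoff(() -> Jsoup.connect(url).timeout(10000).get(), 3, 500)` would make up to three attempts, waiting 500 ms and then 1000 ms between them. Because it sleeps, it belongs on a background thread.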

Debugging Tips

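// Ask the runtime to use the system proxy settings (helpful when inspecting
// traffic through a debugging proxy); this does not turn on any jsoup logging
System.setProperty("java.net.useSystemProxies", "true");

// Log response details; ignoreHttpErrors(true) lets you inspect 4xx/5xx
// responses instead of getting an HttpStatusException
Connection.Response response = Jsoup.connect(url)
    .ignoreHttpErrors(true)
    .execute();
System.out.println("Status: " + response.statusCode());
System.out.println("Content-Type: " + response.contentType());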
