Jsoup is a popular Java library for working with HTML documents. It provides a very convenient API to extract and manipulate data using the best of DOM, CSS, and jQuery-like methods. To use jsoup in an Android application, follow these steps:
Step 1: Add jsoup Dependency
First, add the jsoup library to your project by adding the following line to the dependencies section of your build.gradle (Module: app) file:
dependencies {
    implementation 'org.jsoup:jsoup:1.14.3' // Use the latest version available
    // other dependencies...
}
After adding the dependency, click on "Sync Now" in the bar that appears at the top to sync your project with the updated Gradle files.
Step 2: Internet Permission
Since web scraping usually involves network operations, your Android app needs permission to access the Internet. Add the following line to your AndroidManifest.xml, as a direct child of the <manifest> element:
<uses-permission android:name="android.permission.INTERNET" />
Step 3: Use jsoup in Your Activity or Fragment
You can use jsoup to parse HTML from a string, a file, or directly from a URL. However, network operations in Android should not be performed on the main thread. You'll need to use AsyncTask, Thread, HandlerThread, or an equivalent concurrency utility. Here, we'll use AsyncTask for simplicity:
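import android.os.AsyncTask;
import android.os.Bundle;
import android.widget.TextView;
import androidx.appcompat.app.AppCompatActivity;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MainActivity extends AppCompatActivity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        // Start the network request off the main thread
        new FetchWebsiteData().execute();
    }

    private class FetchWebsiteData extends AsyncTask<Void, Void, Void> {
        // Stays null if the request fails
        String title;

        @Override
        protected Void doInBackground(Void... params) {
            try {
                // Connect to the website (runs on a background thread).
                // Apps targeting Android 9 (API 28) or higher block cleartext
                // http:// traffic by default; prefer an https:// URL.
                Document document = Jsoup.connect("http://example.com").get();
                // Get the HTML document title
                title = document.title();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }

        @Override
        protected void onPostExecute(Void aVoid) {
            super.onPostExecute(aVoid);
            // Back on the main thread: set the title into a TextView or any other UI element
            TextView titleView = findViewById(R.id.titleTextView);
            if (title != null) {
                titleView.setText(title);
            }
        }
    }
}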
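Since AsyncTask is deprecated in newer Android versions, the same flow (do the work on a background thread, then hand the result back for the UI) can also be sketched with java.util.concurrent. The following is a minimal plain-Java sketch, not a definitive implementation: BackgroundTaskRunner and its methods are hypothetical names, and the callback Executor stands in for the Android main thread so that the example runs outside Android.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical helper: runs a task on a worker thread and delivers the result to a callback. */
class BackgroundTaskRunner {
    // Single worker thread that plays the role of doInBackground()
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Executor on which the callback runs (assumption: the main thread in a real app)
    private final Executor callbackExecutor;

    BackgroundTaskRunner(Executor callbackExecutor) {
        this.callbackExecutor = callbackExecutor;
    }

    /** Runs the task in the background; the callback receives its result, or null on failure. */
    void run(Callable<String> task, Consumer<String> onResult) {
        worker.execute(() -> {
            String result;
            try {
                result = task.call();
            } catch (Exception e) {
                // Covers IOException from a failed network request
                result = null;
            }
            final String delivered = result;
            // Plays the role of onPostExecute()
            callbackExecutor.execute(() -> onResult.accept(delivered));
        });
    }

    /** Stops the worker thread once no more tasks are needed. */
    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        // Runnable::run executes the callback directly on the worker thread (demo only)
        BackgroundTaskRunner runner = new BackgroundTaskRunner(Runnable::run);
        CountDownLatch done = new CountDownLatch(1);
        runner.run(() -> "Example Domain", title -> {
            System.out.println("Title: " + title);
            done.countDown();
        });
        done.await();
        runner.shutdown();
    }
}
```

In an Activity, assuming AndroidX is available, one might pass ContextCompat.getMainExecutor(this) as the callback executor and () -> Jsoup.connect(url).get().title() as the task, so that the TextView is updated on the main thread.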
Notes:
- AsyncTask is deprecated as of Android 11 (API level 30). For background operations, prefer more modern approaches such as java.util.concurrent or Kotlin Coroutines.
- Always declare the required permissions and handle exceptions properly, because connecting to a network can throw an IOException.
- Web scraping can be resource-intensive and potentially disruptive to the target website, so always use it responsibly and ethically. Respect robots.txt rules and the website's terms of service.
- Content that is loaded dynamically via JavaScript won't be available to jsoup when it fetches HTML directly from a URL. For such cases, you might need a WebView or a headless browser that can execute JavaScript.
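To make the robots.txt point concrete, here is a deliberately simplified sketch of checking a path against the Disallow rules of the "User-agent: *" group before fetching it. RobotsTxtCheck and isAllowed are hypothetical names. The sketch assumes you have already downloaded the robots.txt text, and it ignores Allow rules, wildcard patterns, and agent-specific groups, so treat it as an illustration rather than a complete parser.

```java
import java.util.Locale;

/** Hypothetical, simplified robots.txt check for the "User-agent: *" group only. */
class RobotsTxtCheck {
    /** Returns false if the path starts with any Disallow prefix in the wildcard group. */
    static boolean isAllowed(String robotsTxt, String path) {
        boolean inWildcardGroup = false;
        boolean prevWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop comments and surrounding whitespace
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean isWildcard = value.equals("*");
                // Consecutive User-agent lines share one group of rules
                inWildcardGroup = prevWasUserAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                prevWasUserAgent = true;
            } else {
                prevWasUserAgent = false;
                // An empty Disallow value means nothing is disallowed
                if (inWildcardGroup && field.equals("disallow")
                        && !value.isEmpty() && path.startsWith(value)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/index.html"));
        System.out.println(isAllowed(robots, "/private/data.html"));
    }
}
```

A check like this would run before the jsoup request, skipping any URL whose path is disallowed.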
Remember to replace "http://example.com" with the URL you intend to scrape, and modify the scraping logic according to the structure of the HTML you are working with.