Jsoup is a Java library designed to parse, extract, and manipulate data from HTML documents. To extract data from a table using Jsoup, you will need to:
- Parse the HTML document to create a Document object.
- Use the Jsoup selector syntax to find the table and its elements.
- Iterate through the rows and cells of the table to extract the required data.
Here's a step-by-step guide:
Step 1: Add Jsoup to Your Project
If you're using Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- Use the latest version available -->
</dependency>
For Gradle, add this line to your build.gradle:
implementation 'org.jsoup:jsoup:1.14.3'
Or download the JAR directly from the Jsoup website if you're not using a build tool that manages dependencies.
Step 2: Parse the HTML Document
You can parse an HTML document from a String, a File, or directly from a URL. The example below fetches and parses a page from a URL.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableScraper {
    public static void main(String[] args) {
        // Example of parsing HTML from a URL
        String url = "http://example.com/tablePage.html";
        try {
            // Fetch the page over HTTP and parse it into a Document
            Document document = Jsoup.connect(url).get();
            // Now you can extract data from the document
        } catch (IOException e) {
            // Thrown on network failures and HTTP error responses
            e.printStackTrace();
        }
    }
}
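The URL case is shown above; the String and File cases can be sketched in a similar way. This is a minimal illustration rather than a definitive pattern: the class name ParseSources, the sample HTML, and the temporary file are all made up for the example, and the file is assumed to be UTF-8 encoded.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ParseSources {
    // Parse HTML that is already in memory; no network or file access is involved.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse HTML from a file on disk, naming the charset explicitly (assumed UTF-8 here).
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample markup, used only for this demonstration
        String html = "<html><head><title>Demo</title></head>"
                + "<body><table id=\"myTable\"><tr><td>A</td></tr></table></body></html>";

        Document fromString = fromString(html);
        System.out.println(fromString.title()); // prints "Demo"

        // Write the same markup to a temporary file and parse it back
        File temp = File.createTempFile("tablePage", ".html");
        temp.deleteOnExit();
        Files.write(temp.toPath(), html.getBytes(StandardCharsets.UTF_8));

        Document fromFile = fromFile(temp);
        System.out.println(fromFile.getElementById("myTable").text()); // prints "A"
    }
}
```

Parsing from a String is also convenient for unit tests, since the extraction logic can be exercised without a live page.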
Step 3: Select the Table and Extract Data
Use the Jsoup selector API to find the table and iterate through its rows and cells to extract the data you need.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableScraper {
    public static void main(String[] args) {
        String url = "http://example.com/tablePage.html";
        try {
            Document document = Jsoup.connect(url).get();

            // Assuming the table has a unique ID
            Element table = document.getElementById("myTable");
            // If the table does not have an ID, you can use other selectors
            // Elements tables = document.select("table.myClass"); // Using class
            // Element firstTable = document.select("table").first(); // Select the first table

            // getElementById returns null when no element has that ID
            if (table == null) {
                System.err.println("No table with ID 'myTable' found");
                return;
            }

            // Select the rows (tr elements) of the table
            Elements rows = table.select("tr");

            // Iterate over each row
            for (Element row : rows) {
                // Select the cells (td or th elements) of the row
                Elements cells = row.select("th, td");

                // Iterate over each cell
                for (Element cell : cells) {
                    // Extract the text from the cell
                    String text = cell.text();
                    System.out.println(text);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In this example, document.getElementById("myTable") selects the table with the ID myTable. The select("tr") method is then used to get all the row elements within the table. The nested for-each loops iterate over each row and cell, extracting the text with cell.text().
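Printing each cell is fine for a quick check, but in practice the values are usually collected into a data structure. The following is one possible sketch of the same row and cell iteration that returns a list of rows. The class name TableToList, the method extract, and the sample HTML are assumptions made for illustration, and the HTML is parsed from a String so the example runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TableToList {
    // Returns one inner list per <tr>, holding the text of each <th>/<td> in order.
    // An empty list is returned when no element with the given ID exists.
    static List<List<String>> extract(String html, String tableId) {
        Document document = Jsoup.parse(html);
        Element table = document.getElementById(tableId);
        List<List<String>> result = new ArrayList<>();
        if (table == null) {
            return result;
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample table
        String html = "<table id=\"myTable\">"
                + "<tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Ann</td><td>30</td></tr>"
                + "</table>";
        System.out.println(extract(html, "myTable")); // prints [[Name, Age], [Ann, 30]]
    }
}
```

Note that table.select("tr") matches every tr descendant, so a table nested inside a cell would contribute its rows as well.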
Keep in mind that:
- If the table has a thead and tbody, you might want to handle them separately.
- If you're interested in extracting attributes or HTML content instead of text, use cell.attr("attributeName") or cell.html().
- Adjust your selector syntax based on the specific structure of the HTML table you're working with.
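As a rough illustration of the first two points, the sketch below reads the header cells from thead, maps each tbody row to those header names, and separately collects href attributes from links inside body cells. It assumes a table with the ID myTable that has both thead and tbody; the class name HeaderAwareTable, its method names, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderAwareTable {
    // Builds one map per body row, keyed by the header text in the same column.
    static List<Map<String, String>> extractRecords(String html) {
        Document document = Jsoup.parse(html);

        // Header names come from the thead only
        List<String> headers = new ArrayList<>();
        for (Element th : document.select("table#myTable thead th")) {
            headers.add(th.text());
        }

        // Data rows come from the tbody only
        List<Map<String, String>> records = new ArrayList<>();
        for (Element row : document.select("table#myTable tbody tr")) {
            Elements cells = row.select("td");
            Map<String, String> record = new LinkedHashMap<>();
            // Stop at the shorter of the two lists in case a row is ragged
            for (int i = 0; i < cells.size() && i < headers.size(); i++) {
                record.put(headers.get(i), cells.get(i).text());
            }
            records.add(record);
        }
        return records;
    }

    // Collects the href attribute of every link found in a body cell.
    static List<String> extractLinks(String html) {
        Document document = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element link : document.select("table#myTable tbody td a[href]")) {
            links.add(link.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample table with explicit thead and tbody
        String html = "<table id=\"myTable\">"
                + "<thead><tr><th>Name</th><th>Site</th></tr></thead>"
                + "<tbody><tr><td>Ann</td>"
                + "<td><a href=\"https://example.com/ann\">home</a></td></tr></tbody>"
                + "</table>";
        System.out.println(extractRecords(html)); // prints [{Name=Ann, Site=home}]
        System.out.println(extractLinks(html));   // prints [https://example.com/ann]
    }
}
```

A LinkedHashMap is used so that each record keeps the column order of the table.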
This is a simple example, but Jsoup is very powerful and can handle much more complex scenarios with its full range of selector syntax and manipulation methods.
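To give a flavor of the richer selector syntax, here is a small sketch that pulls out a single column with the :nth-child pseudo-selector and counts data rows with :has. It again assumes a table with the ID myTable; the class name ColumnSelector, its methods, and the sample data are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ColumnSelector {
    // Returns the text of the td cells in the given column (1-based, as in CSS).
    // Header cells are skipped because the selector matches td only, not th.
    static List<String> column(String html, int columnIndex) {
        Document document = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        String query = "table#myTable tr > td:nth-child(" + columnIndex + ")";
        for (Element cell : document.select(query)) {
            values.add(cell.text());
        }
        return values;
    }

    // Counts rows that contain at least one td, which excludes header-only rows.
    static int dataRowCount(String html) {
        return Jsoup.parse(html).select("table#myTable tr:has(td)").size();
    }

    public static void main(String[] args) {
        // Hypothetical sample table
        String html = "<table id=\"myTable\">"
                + "<tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Ann</td><td>30</td></tr>"
                + "<tr><td>Bob</td><td>25</td></tr>"
                + "</table>";
        System.out.println(column(html, 2));    // prints [30, 25]
        System.out.println(dataRowCount(html)); // prints 2
    }
}
```

Cells that use colspan shift the child positions, so a positional selector like this is only reliable for regular tables.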