HtmlUnit is a "GUI-less" (headless) browser for Java programs, often used for web scraping, web testing, and browser automation. It models the behavior of a real browser, including JavaScript execution, AJAX requests, cookies, and more. Here's how you can use HtmlUnit to extract data from a web page:
Step 1: Set up Maven Dependency
If you use Maven, you can add the HtmlUnit dependency to your pom.xml file:
<dependencies>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.50.0</version> <!-- Make sure to use the latest version -->
    </dependency>
</dependencies>
For non-Maven users, you'll need to download the HtmlUnit JAR files and include them in your project's classpath.
Step 2: Create a WebClient Instance
The WebClient class is the starting point for using HtmlUnit. It represents a web browser.
import com.gargoylesoftware.htmlunit.WebClient;

public class HtmlUnitExample {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient()) {
            // Configure the webClient if needed
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            // Rest of the code goes here
        }
    }
}
Step 3: Load a Web Page
Using the WebClient, you can load a page by calling the getPage method:
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// ...
HtmlPage page = webClient.getPage("http://example.com");
Step 4: Extract Data
Once you have the HtmlPage object, you can extract data by using XPath, CSS selectors, or by working with the DOM API provided by HtmlUnit.
XPath Example
import com.gargoylesoftware.htmlunit.html.HtmlElement;

// ...
HtmlElement element = page.getFirstByXPath("//div[@id='content']");
if (element != null) {
    String contentText = element.asText();
    System.out.println(contentText);
}
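If you want to experiment with the XPath expressions themselves before wiring up HtmlUnit, the JDK's built-in javax.xml.xpath API evaluates the same expressions against an in-memory document with no extra dependencies. A minimal sketch (the tiny HTML string and the XPathDemo class name are just illustrative):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XPathDemo {
    // Evaluates the same "//div[@id='content']" expression used above,
    // but against a small in-memory document instead of a live page.
    static String extract() throws Exception {
        String html = "<html><body><div id='content'>Hello, XPath!</div></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//div[@id='content']", doc, XPathConstants.NODE);
        return node == null ? null : node.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract()); // prints "Hello, XPath!"
    }
}
```

Note that the JDK parser requires well-formed XML, so this is only a sandbox for the expressions; real-world HTML is more forgiving, which is exactly what HtmlUnit's own parser handles for you.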
CSS Selectors Example
HtmlUnit also supports standard CSS selectors through querySelector and querySelectorAll:

import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;

// ...
DomNodeList<DomNode> items = page.querySelectorAll("div.item");
for (DomNode item : items) {
    DomNode title = item.querySelector("h2.title");
    if (title != null) {
        System.out.println(title.asText());
    }
}
DOM API Example
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlHeading2;

// ...
HtmlDivision div = page.getHtmlElementById("content");
HtmlHeading2 heading = div.getFirstByXPath("./h2");
if (heading != null) {
    System.out.println(heading.asText());
}
Step 5: Close the WebClient
It's good practice to close the WebClient when you're done with it to free up system resources:
webClient.close();
Or, as shown in the earlier examples, you can use try-with-resources, which closes the WebClient automatically.
Error Handling
Make sure to handle exceptions properly. HtmlUnit can throw various exceptions, such as IOException for network errors and FailingHttpStatusCodeException for HTTP error status codes.
import java.io.IOException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;

// ...
try {
    // ... HtmlUnit operations ...
} catch (FailingHttpStatusCodeException e) {
    System.err.println("HTTP error status: " + e.getStatusCode());
} catch (IOException e) {
    System.err.println("Network error: " + e.getMessage());
}

Catching the specific exception types lets you react differently to an HTTP error page versus a failed connection, rather than swallowing everything with a generic catch block.
Conclusion
HtmlUnit provides a high-level API to interact with web pages like a real browser, which makes it powerful for web scraping tasks. It can handle JavaScript and AJAX, deal with forms, and navigate through pages just like a human user, but without the overhead of a graphical interface. Remember to always respect the terms of service and robots.txt of the websites you scrape data from.
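HtmlUnit does not check robots.txt for you, so that courtesy is your code's responsibility. As a simplified illustration of the rule (this sketch ignores Allow lines, wildcards, and per-agent groups, which a real crawler should handle), a prefix-based check might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Collects the Disallow path prefixes listed under "User-agent: *".
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        for (String line : robotsTxt.split("\\R")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                inStarGroup = l.substring(11).trim().equals("*");
            } else if (inStarGroup && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed if it does not start with any disallowed prefix.
    static boolean isAllowed(String path, String robotsTxt) {
        return disallowedPrefixes(robotsTxt).stream().noneMatch(path::startsWith);
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed("/public/page", robots));  // prints "true"
        System.out.println(isAllowed("/private/data", robots)); // prints "false"
    }
}
```

You would fetch the site's /robots.txt once, cache the body, and call a check like this before each getPage request.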