HtmlUnit is a Java library that lets Java programs simulate a web browser. It can be used for scraping web content, testing web applications, or any other programmatic interaction with web pages.
To save a web page to the local file system using HtmlUnit, you'd typically perform the following steps:
- Create a `WebClient` instance to simulate a browser.
- Navigate to the desired URL by using the `getPage` method.
- Retrieve the page's content.
- Write the content to a local file.
Here's an example of how you might implement this in Java:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class HtmlUnitSavePageExample {
        public static void main(String[] args) {
            // Create a new WebClient instance; try-with-resources closes it when the block exits
            try (final WebClient webClient = new WebClient()) {
                // Optionally, you can add configuration to the webClient here, like:
                // webClient.getOptions().setJavaScriptEnabled(false);

                // Get the page
                HtmlPage page = webClient.getPage("http://example.com");

                // Get the page as XML (a serialization of the current DOM)
                String pageAsXml = page.asXml();

                // Alternatively, get the page as plain text
                // (asNormalizedText() supersedes asText() in recent 2.x releases)
                String pageAsText = page.asNormalizedText();

                // Save the page content to a local file, using an explicit encoding
                // instead of the platform default
                Path file = Paths.get("savedPage.html");
                try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(pageAsXml); // or use pageAsText if you want plain text
                    System.out.println("Page saved to " + file.toAbsolutePath());
                } catch (IOException e) {
                    e.printStackTrace();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

In this code:
- A `WebClient` object is created to simulate a browser.
- We navigate to `http://example.com` by calling `getPage`.
- We use `asXml()` on the `HtmlPage` object to get the page's markup, serialized from the current DOM. If you prefer the plain text representation of the page, you can use `asNormalizedText()` (`asText()` in older releases) instead.
- We then write this content to a file named `savedPage.html` in the current directory.
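Writing `asXml()` output stores only the markup. If you also want the page's images and other linked resources, `HtmlPage` has a `save(File)` method. The following is a minimal sketch of that alternative, not a definitive implementation: the class and method names (`HtmlUnitSaveWithResourcesSketch`, `savePage`) are made up for illustration, and the comment about not overwriting existing files reflects my understanding of the 2.x behavior, so check the Javadoc for the version you use.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.File;
import java.io.IOException;

class HtmlUnitSaveWithResourcesSketch {

    /**
     * Loads the given URL and saves it with HtmlPage.save(File).
     * Returns the file that was passed in as the target.
     */
    static File savePage(String url, File target) throws IOException {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage(url);
            // Assumption: save(File) does not overwrite an existing file,
            // so pass a target path that does not exist yet.
            page.save(target);
            return target;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: <url> <target.html>");
            return;
        }
        File saved = savePage(args[0], new File(args[1]));
        System.out.println("Page saved to " + saved.getAbsolutePath());
    }
}
```

Compared with writing the string yourself, this hands the file handling to HtmlUnit, which is convenient when you want an offline copy rather than just the markup.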
Note that the `try-with-resources` statement ensures that each resource is closed as soon as its block exits, even if an exception is thrown. This is particularly important for I/O operations and for the `WebClient` instance, which holds browser windows and network resources until it is closed.
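The file-writing step does not depend on HtmlUnit at all, so it can be isolated into a small helper. The sketch below uses only the JDK; `PageWriterSketch` and `writeUtf8` are hypothetical names, and UTF-8 is an assumption about the encoding you want on disk.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PageWriterSketch {

    /** Writes content to target as UTF-8, creating parent directories if needed. */
    static Path writeUtf8(Path target, String content) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // The writer is closed when this block exits, normally or via an exception.
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Write placeholder markup to a temporary file to show the helper in use.
        Path out = Files.createTempFile("savedPage", ".html");
        writeUtf8(out, "<html><body>placeholder</body></html>");
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out.toAbsolutePath());
    }
}
```

In the HtmlUnit example you would pass `page.asXml()` as the content. Keeping the writer in its own `try` block means it is closed independently of the `WebClient`.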
HtmlUnit provides many configuration options to simulate various browser behaviors, such as JavaScript execution, cookie management, and more. You can configure your `WebClient` instance according to your needs.
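As a rough illustration of such configuration, the sketch below builds a `WebClient` with a few commonly adjusted options. The specific values (JavaScript and CSS off, a 10-second timeout, the Chrome profile) are assumptions chosen for a simple page-saving job, not recommendations, and `ConfiguredWebClientSketch` and `newConfiguredClient` are hypothetical names.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

class ConfiguredWebClientSketch {

    /** Creates a WebClient tuned for fetching static markup. The caller must close it. */
    static WebClient newConfiguredClient() {
        // Simulate Chrome rather than the default browser profile.
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        // Skip script execution and CSS processing when only the markup is needed.
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        // Return the page for 4xx/5xx responses instead of throwing an exception.
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        // Timeout in milliseconds for connecting and reading.
        webClient.getOptions().setTimeout(10_000);
        // Keep cookies across requests made with this client.
        webClient.getCookieManager().setCookiesEnabled(true);
        return webClient;
    }

    public static void main(String[] args) {
        try (WebClient webClient = newConfiguredClient()) {
            System.out.println("JavaScript enabled: " + webClient.getOptions().isJavaScriptEnabled());
            System.out.println("Timeout (ms): " + webClient.getOptions().getTimeout());
        }
    }
}
```

Disabling JavaScript makes page loads faster and quieter, but pages that build their content with scripts will then be saved without that content, so leave it enabled for those.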
Remember to include the necessary dependencies in your project's build file (e.g., `pom.xml` for Maven or `build.gradle` for Gradle) to use HtmlUnit. Here's an example for Maven:

    <dependencies>
        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.61.0</version> <!-- Use the latest version available -->
        </dependency>
    </dependencies>

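Replace the version with the latest 2.x release available at the time you're adding the dependency. Note that HtmlUnit 3.x is published under the `org.htmlunit` groupId and uses `org.htmlunit` as its package prefix, so the `com.gargoylesoftware.htmlunit` imports used here apply to the 2.x line.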
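For Gradle, the same coordinates can be declared roughly as follows. This is a sketch that assumes the Groovy DSL and a project applying the `java` plugin, which provides the `implementation` configuration; the version is the same placeholder as in the Maven example.

```groovy
dependencies {
    // Same artifact as the Maven example; use the latest 2.x version available
    implementation 'net.sourceforge.htmlunit:htmlunit:2.61.0'
}
```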