Jsoup is a popular Java library for working with HTML documents. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
When scraping web pages with Jsoup, you might encounter situations where your requests are being blocked by the server, or you want to anonymize your requests. In such cases, using a proxy server can be a solution. Here's how you can configure Jsoup to use a proxy server:
Java System Properties
One of the simplest ways to set a proxy for Jsoup is by setting system properties, which are read by the underlying HttpURLConnection. Because these properties apply to the whole JVM, every connection made through HttpURLConnection will use the same proxy. Add the following lines before you make a connection with Jsoup:
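// Proxy for http:// URLs; the port must be a numeric string such as "8080"
System.setProperty("http.proxyHost", "your.proxy.host");
System.setProperty("http.proxyPort", "your.proxy.port");
// https:// URLs are configured separately
System.setProperty("https.proxyHost", "your.proxy.host");
System.setProperty("https.proxyPort", "your.proxy.port");
// Note: http.proxyUser and http.proxyPassword are not standard JVM properties and
// HttpURLConnection ignores them; for credentials, see Proxy Authentication below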
And then use Jsoup as you normally would:
String url = "http://example.com";
Document doc = Jsoup.connect(url).get();
// ... do something with your document
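To check what the JVM will do with these properties before sending real traffic, one option is to ask the default ProxySelector, which HttpURLConnection consults when no explicit proxy is given. The following is a minimal sketch using only the JDK; the class name `ProxyPropertyCheck`, the host `proxy.example.test`, and the port `8080` are placeholders chosen for illustration and do not come from Jsoup.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

// Hypothetical helper (not part of Jsoup): reports which proxy the JVM's
// default ProxySelector would pick for a given URL.
class ProxyPropertyCheck {

    // Returns the first proxy the default selector proposes for the URL.
    // The selector always returns at least one entry (a direct connection
    // when no proxy applies).
    static Proxy resolve(String url) {
        List<Proxy> candidates = ProxySelector.getDefault().select(URI.create(url));
        return candidates.get(0);
    }

    public static void main(String[] args) {
        // Placeholder values; substitute your own proxy host and port.
        System.setProperty("http.proxyHost", "proxy.example.test");
        System.setProperty("http.proxyPort", "8080");

        Proxy proxy = resolve("http://example.com");
        if (proxy.type() == Proxy.Type.DIRECT) {
            System.out.println("No proxy would be used");
        } else {
            InetSocketAddress address = (InetSocketAddress) proxy.address();
            System.out.println(proxy.type() + " via "
                    + address.getHostString() + ":" + address.getPort());
        }
    }
}
```

Calling `resolve` with an https:// URL is a quick way to confirm whether the https.proxyHost and https.proxyPort properties are also in effect.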
Jsoup Connection Configuration
If you prefer not to use system properties, or if you want to use different proxies for different connections, you can configure the proxy directly on the Jsoup connection:
String url = "http://example.com";
Document doc = Jsoup.connect(url)
    .proxy("your.proxy.host", 8080) // the port is an int; 8080 is a placeholder
    .get();
// ... do something with your document
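When different connections should go through different proxies, one possible approach is to keep a small list of java.net.Proxy objects and cycle through them. The sketch below uses only the JDK; the class `ProxyRotation` and the hosts `proxy-a.example.test` and `proxy-b.example.test` are hypothetical names introduced for illustration.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not part of Jsoup): hands out proxies in round-robin order.
class ProxyRotation {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    // Registers an HTTP proxy; createUnresolved avoids a DNS lookup at this point.
    ProxyRotation add(String host, int port) {
        proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        return this;
    }

    // Returns the next proxy, wrapping around at the end of the list.
    Proxy next() {
        if (proxies.isEmpty()) {
            throw new IllegalStateException("no proxies registered");
        }
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        // Placeholder hosts and ports; substitute your own proxies.
        ProxyRotation rotation = new ProxyRotation()
                .add("proxy-a.example.test", 8080)
                .add("proxy-b.example.test", 8080);
        for (int i = 0; i < 3; i++) {
            System.out.println(rotation.next());
        }
    }
}
```

Each value returned by `next()` can be passed to the overload of Jsoup's proxy method that accepts a java.net.Proxy, for example `Jsoup.connect(url).proxy(rotation.next()).get()`.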
Proxy Authentication
If the proxy server requires authentication, you will need to register a java.net.Authenticator that supplies the credentials. Here's an example of how you can do this:
final String proxyUser = "your.proxy.user";
final String proxyPassword = "your.proxy.password";
Authenticator.setDefault(
    new Authenticator() {
        @Override
        public PasswordAuthentication getPasswordAuthentication() {
            if (getRequestorType() == RequestorType.PROXY) {
                return new PasswordAuthentication(proxyUser, proxyPassword.toCharArray());
            }
            return null;
        }
    }
);
String url = "http://example.com";
Document doc = Jsoup.connect(url)
    .proxy("your.proxy.host", 8080) // the port is an int; 8080 is a placeholder
    .get();
// ... do something with your document
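The Authenticator above hands its credentials to any proxy that asks for them. A slightly stricter variant, sketched below, also checks which host is asking. It assumes the JVM reports the proxy's hostname through getRequestingHost() for proxy challenges; the class name `ScopedProxyAuthenticator` is hypothetical, and the host, user, and password are the same placeholders used elsewhere in this article.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Hypothetical helper (not part of Jsoup): answers only proxy challenges
// that come from one expected proxy host.
class ScopedProxyAuthenticator extends Authenticator {
    private final String proxyHost;
    private final String user;
    private final char[] password;

    ScopedProxyAuthenticator(String proxyHost, String user, char[] password) {
        this.proxyHost = proxyHost;
        this.user = user;
        this.password = password.clone();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        boolean fromProxy = getRequestorType() == RequestorType.PROXY;
        // Assumption: for proxy challenges the requesting host is the proxy itself.
        boolean expectedHost = proxyHost.equalsIgnoreCase(getRequestingHost());
        // Returning null tells the JVM that no credentials are available.
        return (fromProxy && expectedHost) ? new PasswordAuthentication(user, password) : null;
    }

    public static void main(String[] args) {
        // Placeholder values; substitute your own proxy host and credentials.
        Authenticator.setDefault(new ScopedProxyAuthenticator(
                "your.proxy.host", "your.proxy.user", "your.proxy.password".toCharArray()));

        // Simulate a proxy challenge through the JDK's static entry point.
        PasswordAuthentication auth = Authenticator.requestPasswordAuthentication(
                "your.proxy.host", null, 8080, "http", "proxy realm", "basic",
                null, Authenticator.RequestorType.PROXY);
        System.out.println(auth != null
                ? "credentials supplied for " + auth.getUserName()
                : "no credentials supplied");
    }
}
```

Because Authenticator.setDefault installs a single JVM-wide instance, this approach still shares one set of credentials across all connections; it only narrows which challenges receive them.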
Using a SOCKS Proxy
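If you're using a SOCKS proxy instead of an HTTP proxy, you can set the following system properties. They route the JVM's TCP connections through the SOCKS proxy, so they affect more than just Jsoup: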
System.setProperty("socksProxyHost", "your.socks.proxy.host");
System.setProperty("socksProxyPort", "your.socks.proxy.port");
// Use Jsoup as usual
String url = "http://example.com";
Document doc = Jsoup.connect(url).get();
// ... do something with your document
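If only some connections should use the SOCKS proxy, an alternative to the JVM-wide properties is to build a java.net.Proxy of type SOCKS and pass it to a single connection. This is a minimal sketch; the class name `SocksProxyFactory` is hypothetical, and port 1080 is used only because it is the conventional SOCKS port.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper (not part of Jsoup): builds a SOCKS proxy descriptor
// that can be applied to one connection instead of the whole JVM.
class SocksProxyFactory {

    // createUnresolved defers the DNS lookup until the proxy is actually used.
    static Proxy socks(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Placeholder host; substitute your own SOCKS proxy.
        Proxy proxy = socks("your.socks.proxy.host", 1080);
        System.out.println(proxy.type() + " proxy at " + proxy.address());
    }
}
```

The resulting object can then be handed to Jsoup with `Jsoup.connect(url).proxy(proxy).get()`, leaving other connections in the same JVM unaffected.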
Remember to replace "your.proxy.host", "your.proxy.port", "your.proxy.user", and "your.proxy.password" with the actual host, port, username, and password of your proxy server.
Note on Proxy Authentication:
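If your proxy requires authentication, the custom Authenticator shown above is usually the simplest route; handling it manually by setting the appropriate headers is an alternative. Note that since Java 8u111 the JDK disables Basic authentication for HTTPS tunneling by default, controlled by the jdk.http.auth.tunneling.disabledSchemes system property, so the Authenticator may never be consulted for https:// URLs unless that property is cleared at JVM startup. Some proxies also use a different authentication mechanism, such as NTLM, which requires additional setup.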
Important Security Note: Be cautious when using proxies, especially free or public ones, as they can intercept your traffic. Always ensure that the proxy you are using is from a trustworthy source, especially when dealing with sensitive data.