jsoup is a popular Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. When scraping web pages with jsoup, you might encounter redirects. By default, jsoup follows HTTP redirects automatically, up to a fixed internal limit (20 in recent releases).
However, you might want to handle redirects differently, such as:
- Checking if a redirect has occurred.
- Capturing the URL you were redirected to.
- Limiting the number of redirects.
- Disabling redirects entirely.
Below are examples of how to handle redirects when scraping with jsoup:
1. Checking for Redirects and Capturing the Final URL
You can check if a redirect has occurred by comparing the requested URL with the final URL after executing the request. Here's how:
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupHandleRedirects {
    public static void main(String[] args) {
        try {
            String originalUrl = "http://example.com";
            // followRedirects(true) is the default; it is spelled out here for clarity
            Connection connection = Jsoup.connect(originalUrl).followRedirects(true);
            Document doc = connection.get();

            // After the request, the response holds the URL that was actually fetched
            String finalUrl = connection.response().url().toString();
            if (!originalUrl.equals(finalUrl)) {
                System.out.println("Redirect occurred!");
                System.out.println("Final URL: " + finalUrl);
            } else {
                System.out.println("No redirect, original URL is the final URL.");
            }
            // Use the document as needed...
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
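A plain string comparison can report a redirect when the two URLs differ only in form, for example in host casing or a missing trailing slash. The sketch below shows one way to compare URLs after light normalization. The class name RedirectCheck and its methods are hypothetical helpers, not part of jsoup, and the code assumes absolute http or https URLs.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not part of jsoup): compares two absolute URLs after light normalization.
class RedirectCheck {

    /** Returns true when the fetched URL differs from the requested one after normalization. */
    static boolean wasRedirected(String requested, String fetched) {
        return !normalize(requested).equals(normalize(fetched));
    }

    /** Lowercases scheme and host, treats an empty path as "/", and drops the fragment. */
    static String normalize(String url) {
        URI u = URI.create(url).normalize();
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        String port = u.getPort() == -1 ? "" : ":" + u.getPort();
        String path = u.getRawPath().isEmpty() ? "/" : u.getRawPath();
        String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
        return scheme + "://" + host + port + path + query;
    }
}
```

With this helper, the check in the example above could call RedirectCheck.wasRedirected(originalUrl, finalUrl) instead of comparing the raw strings.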
2. Limiting the Number of Redirects
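jsoup's public Connection API has no setter for the redirect limit. The maxBodySize method, which is sometimes suggested for this, only limits how many bytes of the response body are read and has no effect on redirects. To enforce your own limit, disable automatic redirects and follow the Location header yourself while counting hops:
import java.io.IOException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupLimitRedirects {
    public static void main(String[] args) {
        try {
            String url = "http://example.com";
            int maxRedirects = 5; // Follow at most 5 redirects
            Connection.Response response = Jsoup.connect(url).followRedirects(false).execute();
            int redirects = 0;
            while (response.statusCode() / 100 == 3 && response.hasHeader("Location")) {
                if (++redirects > maxRedirects) {
                    throw new IOException("More than " + maxRedirects + " redirects");
                }
                // Location may be relative, so resolve it against the current URL
                URL next = new URL(response.url(), response.header("Location"));
                response = Jsoup.connect(next.toString()).followRedirects(false).execute();
            }
            Document doc = response.parse();
            // Use the document as needed...
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}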
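Whatever limit is chosen, a manual redirect loop also benefits from detecting cycles (A redirects to B, B back to A) before the limit is reached. The sketch below isolates that control flow so it can be tested offline. It is an illustration under stated assumptions, not jsoup code: the RedirectWalker class is hypothetical, and an in-memory map of URL to next URL stands in for real HTTP requests.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: walks a redirect chain with a hop limit and cycle detection.
class RedirectWalker {

    /**
     * Follows the chain described by redirectMap (URL -> URL it redirects to).
     * A URL that is not a key in the map is treated as the final destination.
     */
    static String resolve(String start, Map<String, String> redirectMap, int maxRedirects) {
        Set<String> visited = new HashSet<>();
        String current = start;
        int hops = 0;
        while (redirectMap.containsKey(current)) {
            // add() returns false when the URL was seen before, which means a cycle
            if (!visited.add(current)) {
                throw new IllegalStateException("Redirect loop at " + current);
            }
            if (hops == maxRedirects) {
                throw new IllegalStateException("More than " + maxRedirects + " redirects");
            }
            current = redirectMap.get(current);
            hops++;
        }
        return current;
    }
}
```

In a real scraper, the map lookup would be replaced by a request with automatic redirects disabled and a read of the Location header.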
3. Disabling Redirects
If you want to handle redirects manually, you can disable the automatic following of redirects:
import java.io.IOException;
import java.net.URL;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsoupDisableRedirects {
    public static void main(String[] args) {
        try {
            String url = "http://example.com";
            Connection.Response response = Jsoup.connect(url)
                    .followRedirects(false) // Disable redirects
                    .execute();

            // Header names are matched case-insensitively
            if (response.hasHeader("Location")) {
                // The Location value may be relative, so resolve it against the requested URL
                URL redirectUrl = new URL(response.url(), response.header("Location"));
                System.out.println("Redirect to: " + redirectUrl);
                // Optionally, follow the redirect manually:
                // Connection.Response newResponse = Jsoup.connect(redirectUrl.toString()).execute();
            } else {
                System.out.println("No redirect, process the response as needed.");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
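When following a redirect manually, keep in mind that the Location header may hold a relative reference such as /login rather than a full URL. The sketch below shows one way to resolve it against the requested URL using only the standard library. The LocationResolver class is a hypothetical helper, and the code assumes both values are syntactically valid URIs (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;

// Hypothetical helper: turns a Location header value into an absolute URL.
class LocationResolver {

    /** Resolves a possibly relative Location value against the URL that was requested. */
    static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }
}
```

The resolved string can then be passed to Jsoup.connect for the next request.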
Remember to handle redirects according to the requirements of your specific scraping task, and always respect the website's robots.txt file and terms of service. Additionally, be aware of the legal implications of web scraping and proceed ethically.