What is the difference between Jsoup.connect() and Jsoup.parse()?

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.

When we scrape HTML content, we usually need to either fetch the content from a URL or parse an HTML string that we already have. This is where Jsoup.connect() and Jsoup.parse() come into play, each serving a different purpose:

Jsoup.connect()

Jsoup.connect() creates a Connection to a URL. The Connection does not fetch anything by itself: the request is sent only when you call a method such as get() or post(), which downloads the page, parses it, and returns a Document object representing the fetched HTML. Use this method when you want to scrape content directly from a live website.

Here's a basic example of how to use Jsoup.connect():

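import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WebScraper {
    public static void main(String[] args) {
        try {
            String url = "http://example.com";
            // connect() builds the Connection; get() sends the request and parses the response
            Document doc = Jsoup.connect(url).get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // Thrown on network failures, timeouts, or HTTP error responses
            e.printStackTrace();
        }
    }
}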
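
Because Jsoup.connect() returns a Connection rather than a Document, the request can be configured before anything is sent. The following is a minimal sketch of that idea, not a recommended configuration: the class name, the build() helper, the user agent string, and the timeout value are all illustrative assumptions.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConfiguredConnection {
    // Hypothetical helper: builds and configures a Connection without sending a request.
    public static Connection build(String url) {
        return Jsoup.connect(url)
                .userAgent("example-scraper/1.0") // illustrative value
                .timeout(10_000);                 // in milliseconds; illustrative value
    }

    public static void main(String[] args) {
        Connection conn = build("https://example.com/");

        // Nothing has been fetched yet; these calls only read back the request settings.
        System.out.println(conn.request().url());
        System.out.println(conn.request().timeout());

        // Calling conn.get() at this point would send the request and return a Document.
        // It throws IOException on network or HTTP errors, so it needs a try/catch.
    }
}
```

Keeping the configuration separate from the call to get() makes it possible to reuse the same settings for several requests, or to inspect them before any network traffic happens.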

Jsoup.parse()

Jsoup.parse() parses HTML that you already have into a Document object, without making any network request. The most common overload takes a String; other overloads accept a File or an InputStream. This method is useful when the HTML has been downloaded previously, comes from a static file, or was produced by another part of your program.

Here's an example of how to use Jsoup.parse():

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParser {
    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                    + "<body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());
    }
}
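
One practical difference is that a string carries no information about where it came from, so relative links in it cannot be resolved unless a base URI is supplied. The sketch below uses the two-argument overload Jsoup.parse(html, baseUri) for that; the class name, the firstLinkAbsolute() helper, and the example.com base URI are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Hypothetical helper: returns the absolute URL of the first link in the HTML,
    // or an empty string if there is no link or it cannot be resolved.
    public static String firstLinkAbsolute(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs\">Docs</a></p>";

        // With a base URI, the relative href can be resolved to an absolute URL.
        System.out.println(firstLinkAbsolute(html, "https://example.com/"));

        // With an empty base URI, absUrl() has nothing to resolve against
        // and returns an empty string.
        System.out.println(firstLinkAbsolute(html, "").isEmpty());
    }
}
```

When a page is fetched through a Connection instead, jsoup already knows the URL it came from, so this extra argument is only a concern when parsing HTML from strings or other local sources.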

Summary

  • Jsoup.connect(): Creates a Connection to a live URL; calling get() on it fetches and parses the HTML.
  • Jsoup.parse(): Parses HTML you already have (for example, a string) into a Document.

Both routes end with a Document object, which can then be used to traverse and manipulate the HTML using the jsoup API. Jsoup.parse() returns the Document directly, while Jsoup.connect() returns a Connection whose get() method returns it. The choice depends on the source of your HTML content: if you're starting from a URL, use connect(); if you already have the HTML, use parse().
