Can jsoup be used to parse XML documents?

Yes. Although jsoup is primarily an HTML parser, it ships a dedicated XML parser, so the same API (DOM traversal, CSS selectors, and jQuery-like methods) can be used to extract and manipulate XML data.

Key Differences: XML vs HTML Parsing

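When parsing XML with jsoup, pass Parser.xmlParser() explicitly; otherwise the default HTML parser is used. With the XML parser:

  • Case sensitivity: XML tag and attribute names are preserved exactly instead of being lowercased
  • Self-closing tags: Any element may be self-closing (for example <item/>), not only HTML void elements
  • No HTML restructuring: The input is not wrapped in html, head, and body elements, and HTML nesting rules are not applied
  • Namespace prefixes: Prefixed names such as book:title are kept as written

The XML parser is still lenient: it does not validate the input and does not reject malformed XML.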
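
The case-sensitivity difference is easiest to see by parsing the same input with both parsers. The following is a minimal sketch rather than an official jsoup example; the class name ParserComparison and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class ParserComparison {
    // Hypothetical sample input with mixed-case tag names.
    static final String XML = "<Order><LineItem sku=\"A1\">Widget</LineItem></Order>";

    // Tag name of the first element as seen by the default HTML parser.
    static String htmlRootTag() {
        Document doc = Jsoup.parse(XML);        // HTML parser: input ends up inside <body>
        return doc.body().child(0).tagName();   // tag names are normalized to lower case
    }

    // Tag name of the first element as seen by the XML parser.
    static String xmlRootTag() {
        Document doc = Jsoup.parse(XML, "", Parser.xmlParser());
        return doc.child(0).tagName();          // original casing is kept
    }

    public static void main(String[] args) {
        System.out.println("HTML parser: " + htmlRootTag());
        System.out.println("XML parser:  " + xmlRootTag());
    }
}
```

With the HTML parser the element comes back as order; with the XML parser it stays Order.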

Basic XML String Parsing

Here's how to parse a simple XML string (the text block syntax used for the sample requires Java 15 or later):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class BasicXMLParsing {
    public static void main(String[] args) {
        String xml = """
            <bookstore>
                <book id="1" category="fiction">
                    <title>Great Gatsby</title>
                    <author>F. Scott Fitzgerald</author>
                    <price>12.99</price>
                </book>
                <book id="2" category="science">
                    <title>Brief History of Time</title>
                    <author>Stephen Hawking</author>
                    <price>15.99</price>
                </book>
            </bookstore>
            """;

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Select all books
        Elements books = doc.select("book");
        for (Element book : books) {
            String id = book.attr("id");
            String title = book.select("title").text();
            String author = book.select("author").text();
            String price = book.select("price").text();

            System.out.printf("Book %s: %s by %s ($%s)%n", 
                id, title, author, price);
        }
    }
}

Parsing XML Files

To parse an XML file, pass the File, its character set, a base URI, and the XML parser to Jsoup.parse:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.io.File;
import java.io.IOException;

public class XMLFileParsing {
    public static void main(String[] args) {
        try {
            // Parse from file
            File xmlFile = new File("data/books.xml");
            Document doc = Jsoup.parse(xmlFile, "UTF-8", "", Parser.xmlParser());

            // Extract data using CSS selectors
            String title = doc.select("book[id=1] title").text();
            System.out.println("First book title: " + title);

        } catch (IOException e) {
            System.err.println("Error reading XML file: " + e.getMessage());
        }
    }
}
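
When the XML arrives as a stream rather than a file (a classpath resource or a network socket, for example), Jsoup.parse also accepts an InputStream together with a parser. This is a minimal sketch under that assumption; the class XMLStreamParsing and its helper methods are hypothetical names, and an in-memory stream stands in for a real source.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class XMLStreamParsing {
    // Parses XML from any InputStream and returns the title of the book with the given id.
    static String titleOf(InputStream in, String id) throws IOException {
        Document doc = Jsoup.parse(in, "UTF-8", "", Parser.xmlParser());
        return doc.select("book[id=" + id + "] title").text();
    }

    // Convenience wrapper for in-memory XML; rethrows I/O failures unchecked.
    static String titleFromString(String xml, String id) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            return titleOf(in, id);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book id=\"1\"><title>Great Gatsby</title></book></bookstore>";
        System.out.println(titleFromString(xml, "1"));
    }
}
```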

Parsing XML from URLs

You can also parse XML directly from web URLs:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.io.IOException;

public class XMLUrlParsing {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com/data.xml")
                    .parser(Parser.xmlParser())
                    .get();

            // Print the root element's name; doc.root() would return the Document itself
            System.out.println("Root element: " + doc.child(0).tagName());

        } catch (IOException e) {
            System.err.println("Error fetching XML: " + e.getMessage());
        }
    }
}

Working with XML Namespaces

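The XML parser keeps namespace prefixes as part of tag names, and the ns|tag selector syntax matches them:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class XMLNamespaces {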
    public static void main(String[] args) {
        String xml = """
            <root xmlns:book="http://example.com/book">
                <book:catalog>
                    <book:item id="1">
                        <book:title>XML Guide</book:title>
                    </book:item>
                </book:catalog>
            </root>
            """;

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Select elements with namespaces
        Elements items = doc.select("book|item");
        for (Element item : items) {
            String title = item.select("book|title").text();
            System.out.println("Title: " + title);
        }
    }
}

Modifying XML Documents

jsoup allows you to modify XML structure and content:

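import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XMLModification {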
    public static void main(String[] args) {
        String xml = "<catalog><book>Original Title</book></catalog>";

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Modify existing content
        doc.select("book").text("Updated Title");

        // Add new elements
        Element catalog = doc.select("catalog").first();
        catalog.appendElement("book").text("New Book");

        // Output modified XML
        System.out.println(doc.outerHtml());
    }
}
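
Attributes can be edited and elements removed in the same way. The sketch below is illustrative only: the class XMLAttributeEdit, the draft element, and the id value are invented for the example, and pretty-printing is switched off so the result stays on one line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class XMLAttributeEdit {
    // Adds an id to the first <book>, removes all <draft> elements, and returns the XML.
    static String edit(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);   // no added indentation or line breaks

        Element first = doc.selectFirst("book");
        if (first != null) {
            first.attr("id", "b1");                // add or overwrite an attribute
        }
        doc.select("draft").remove();              // drop every <draft> element

        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String xml = "<catalog><book>Original Title</book><draft>tmp</draft></catalog>";
        System.out.println(edit(xml));
    }
}
```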

Best Practices and Limitations

When to Use jsoup for XML

  • Simple to moderate XML parsing needs
  • Web scraping XML content
  • CSS selector-based data extraction
  • Quick XML manipulation tasks

When to Consider Alternatives

For advanced XML processing, consider specialized libraries:

  • JDOM2 or DOM4J: Full XML feature support
  • javax.xml.parsers: Built-in Java XML APIs
  • StAX: Streaming XML processing for large files
  • XPath processors: Complex query requirements
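
For comparison, here is roughly what the same kind of lookup looks like with the built-in javax.xml.parsers and javax.xml.xpath APIs. It is a sketch, not a recommendation of one approach over another; the class JdkXPathExample and the method titleOfBook are hypothetical names. Unlike jsoup, this parser rejects malformed XML with an exception.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class JdkXPathExample {
    // Returns the title of the n-th <book> (1-based), or "" if there is no such book.
    static String titleOfBook(String xml, int position) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/bookstore/book[" + position + "]/title", doc);
        } catch (Exception e) {
            // Malformed XML or an invalid expression ends up here.
            throw new IllegalStateException("Could not parse or query XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bookstore>"
            + "<book id=\"1\"><title>Great Gatsby</title></book>"
            + "<book id=\"2\"><title>Brief History of Time</title></book>"
            + "</bookstore>";
        System.out.println(titleOfBook(xml, 2));
    }
}
```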

Limitations

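  • XPath only through Element.selectXpath (jsoup 1.14.3 and later); older versions offer CSS selectors only
  • No built-in XSLT transformations
  • No DTD or XML Schema validation, and malformed input is not rejected
  • Basic namespace handling: selectors match namespace prefixes literally rather than resolving them to namespace URIs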
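
On the XPath point, jsoup 1.14.3 and later do provide Element.selectXpath, so simple XPath queries are possible without leaving the library. The following sketch assumes such a version is on the classpath; the class JsoupXPathSketch, the method titlesAbove, and the price threshold are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

class JsoupXPathSketch {
    // Returns the titles of all books whose <price> is greater than minPrice.
    // Assumes jsoup 1.14.3 or later, where Element.selectXpath is available.
    static List<String> titlesAbove(String xml, int minPrice) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Elements titles = doc.selectXpath("//book[price > " + minPrice + "]/title");
        return titles.eachText();
    }

    public static void main(String[] args) {
        String xml = "<bookstore>"
            + "<book id=\"1\"><title>Great Gatsby</title><price>12.99</price></book>"
            + "<book id=\"2\"><title>Brief History of Time</title><price>15.99</price></book>"
            + "</bookstore>";
        System.out.println(titlesAbove(xml, 13));
    }
}
```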

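jsoup offers a good balance of simplicity and power for small to medium XML parsing tasks, especially when you're already familiar with CSS selectors or need to parse XML as part of web scraping. For validation, transformations, or very large documents, one of the alternatives above is usually a better fit.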
