Yes, jsoup can parse XML documents. Although it is designed primarily for HTML, it includes a dedicated XML parser (Parser.xmlParser()), and the same DOM traversal, CSS selector, and jQuery-like methods work for extracting and manipulating XML data.
Key Differences: XML vs HTML Parsing
When parsing XML with jsoup, you must use Parser.xmlParser() instead of the default HTML parser. This ensures:
- Case sensitivity: XML tag and attribute names are preserved exactly
- Self-closing tags: Properly handled according to XML standards
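- No HTML-specific rewriting: the input is not wrapped in html, head, or body elements and HTML nesting rules are not applied. The parser is still lenient, so it does not reject malformed XML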
- Namespace preservation: XML namespaces are maintained
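To make the case-sensitivity point concrete, the sketch below parses the same markup with both parsers and reports the tag name each one stores. The class name ParserComparison, the helper tagNameOfSkuElement, and the sample markup are made up for illustration. With jsoup's default settings, the HTML parser should report lineitem and the XML parser LineItem.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class ParserComparison {

    // Hypothetical helper: parses the markup with the given parser and returns
    // the tag name jsoup stored for the first element that has a "sku" attribute.
    static String tagNameOfSkuElement(String markup, Parser parser) {
        Document doc = Jsoup.parse(markup, "", parser);
        Element el = doc.getElementsByAttribute("sku").first();
        return el == null ? null : el.tagName();
    }

    public static void main(String[] args) {
        // Sample markup invented for this comparison
        String markup = "<Order><LineItem sku=\"A1\">Widget</LineItem></Order>";

        // The default HTML parser normalizes tag names to lower case
        System.out.println("HTML parser: "
                + tagNameOfSkuElement(markup, Parser.htmlParser()));

        // The XML parser keeps the tag name exactly as written
        System.out.println("XML parser:  "
                + tagNameOfSkuElement(markup, Parser.xmlParser()));
    }
}
```

Because selectors are matched against the stored names, picking the wrong parser can make a query for a mixed-case tag silently return nothing.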
Basic XML String Parsing
Here's how to parse a simple XML string (the text block syntax used here requires Java 15 or later):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class BasicXMLParsing {
    public static void main(String[] args) {
        String xml = """
            <bookstore>
                <book id="1" category="fiction">
                    <title>Great Gatsby</title>
                    <author>F. Scott Fitzgerald</author>
                    <price>12.99</price>
                </book>
                <book id="2" category="science">
                    <title>Brief History of Time</title>
                    <author>Stephen Hawking</author>
                    <price>15.99</price>
                </book>
            </bookstore>
            """;

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Select all books
        Elements books = doc.select("book");
        for (Element book : books) {
            String id = book.attr("id");
            String title = book.select("title").text();
            String author = book.select("author").text();
            String price = book.select("price").text();
            System.out.printf("Book %s: %s by %s ($%s)%n",
                    id, title, author, price);
        }
    }
}
Parsing XML Files
To parse XML files, use the file-based parsing methods:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.io.File;
import java.io.IOException;

public class XMLFileParsing {
    public static void main(String[] args) {
        try {
            // Parse from file
            File xmlFile = new File("data/books.xml");
            Document doc = Jsoup.parse(xmlFile, "UTF-8", "", Parser.xmlParser());

            // Extract data using CSS selectors
            String title = doc.select("book[id=1] title").text();
            System.out.println("First book title: " + title);
        } catch (IOException e) {
            System.err.println("Error reading XML file: " + e.getMessage());
        }
    }
}
Parsing XML from URLs
You can also parse XML directly from web URLs:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.io.IOException;

public class XMLUrlParsing {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com/data.xml")
                    .parser(Parser.xmlParser())
                    .get();

            // Process the XML document.
            // doc.root() returns the Document node itself, so read the
            // document's first child element to get the XML root element.
            System.out.println("Root element: " + doc.child(0).tagName());
        } catch (IOException e) {
            System.err.println("Error fetching XML: " + e.getMessage());
        }
    }
}
Working with XML Namespaces
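With the XML parser, jsoup keeps namespace prefixes as part of element names, and the ns|tag selector syntax (for example, book|item) matches prefixed elements: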
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class XMLNamespaces {
    public static void main(String[] args) {
        String xml = """
            <root xmlns:book="http://example.com/book">
                <book:catalog>
                    <book:item id="1">
                        <book:title>XML Guide</book:title>
                    </book:item>
                </book:catalog>
            </root>
            """;

        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Select elements by prefix and local name (prefix|name)
        Elements items = doc.select("book|item");
        for (Element item : items) {
            String title = item.select("book|title").text();
            System.out.println("Title: " + title);
        }
    }
}
Modifying XML Documents
jsoup allows you to modify XML structure and content:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XMLModification {
    public static void main(String[] args) {
        String xml = "<catalog><book>Original Title</book></catalog>";
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        // Modify existing content: set the text on the first matching Element
        doc.selectFirst("book").text("Updated Title");

        // Add new elements
        Element catalog = doc.select("catalog").first();
        catalog.appendElement("book").text("New Book");

        // Output modified XML
        System.out.println(doc.outerHtml());
    }
}
Best Practices and Limitations
When to Use jsoup for XML
- Simple to moderate XML parsing needs
- Web scraping XML content
- CSS selector-based data extraction
- Quick XML manipulation tasks
When to Consider Alternatives
For advanced XML processing, consider specialized libraries:
- JDOM2 or DOM4J: Full XML feature support
- javax.xml.parsers: Built-in Java XML APIs
- StAX: Streaming XML processing for large files
- XPath processors: Complex query requirements
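As a rough illustration of the streaming option, the sketch below uses the JDK's built-in StAX API (javax.xml.stream) to collect title text without building a document tree. The class name StaxTitleReader and the method readTitles are made up for this example, and the input reuses the bookstore sample in shortened form. A real large-file reader would stream from a file or network source rather than a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxTitleReader {

    // Hypothetical helper: returns the text of every <title> element,
    // reading the input as a forward-only stream of events.
    static List<String> readTitles(String xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Disable DTDs and external entities as a defensive default
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag
                        titles.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<bookstore>"
                + "<book id=\"1\"><title>Great Gatsby</title></book>"
                + "<book id=\"2\"><title>Brief History of Time</title></book>"
                + "</bookstore>";
        System.out.println(readTitles(xml));
    }
}
```

Unlike jsoup, a StAX reader rejects malformed input with an XMLStreamException, which is useful when strict well-formedness matters.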
Limitations
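- Limited XPath support: selectXpath (jsoup 1.14.3 and later) handles XPath 1.0 queries that return elements; earlier versions have no XPath support
- No built-in XSLT transformations
- No DTD or XML Schema validation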
- No advanced namespace handling
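XPath support depends on the jsoup version: releases from 1.14.3 onward provide Element.selectXpath, which evaluates XPath 1.0 expressions and returns the matching elements. The following is a minimal sketch under that version assumption. The class name XmlXPathSketch, the helper textAt, and the shortened bookstore input are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class XmlXPathSketch {

    // Hypothetical helper: parses the XML and returns the combined text of
    // all elements matched by the XPath expression.
    // Assumes jsoup 1.14.3 or later, where selectXpath is available.
    static String textAt(String xml, String xpath) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Elements matches = doc.selectXpath(xpath);
        return matches.text();
    }

    public static void main(String[] args) {
        String xml = "<bookstore>"
                + "<book id=\"1\"><title>Great Gatsby</title></book>"
                + "<book id=\"2\"><title>Brief History of Time</title></book>"
                + "</bookstore>";

        // Select the title of the book whose id attribute equals 2
        System.out.println(textAt(xml, "//book[@id='2']/title"));
    }
}
```

For queries that need XPath 2.0 features, or for XSLT and schema validation, a dedicated XML library remains the better fit.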
jsoup offers a good balance of simplicity and power for simple to moderate XML parsing tasks, especially when you're already familiar with CSS selectors or need to parse XML as part of web scraping.