How can I use jsoup to extract structured data like JSON-LD or microdata?
Structured data is essential for modern web scraping as it provides machine-readable information about page content. This guide demonstrates how to use jsoup to extract various types of structured data including JSON-LD, microdata, RDFa, and OpenGraph meta tags from web pages.
Understanding Structured Data Types
JSON-LD (JavaScript Object Notation for Linked Data)
JSON-LD is the most common structured data format, embedded in <script> tags with type="application/ld+json".
Microdata
Microdata uses HTML attributes like itemscope, itemtype, and itemprop to embed structured data directly in HTML elements.
RDFa (Resource Description Framework in Attributes)
RDFa uses attributes like typeof, property, and content to add semantic meaning to HTML elements.
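To make the three notations concrete, the sketch below parses a small hypothetical HTML fragment (markup invented for illustration) that carries the same product in all three formats, using Jsoup.parse on an inline string so no network call is needed:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FormatSpotCheck {
    public static void main(String[] args) {
        // One product expressed as JSON-LD, microdata, and RDFa (hypothetical markup)
        String html = "<html><head>"
                + "<script type=\"application/ld+json\">{\"@type\":\"Product\",\"name\":\"Widget\"}</script>"
                + "</head><body>"
                + "<div itemscope itemtype=\"https://schema.org/Product\">"
                + "<span itemprop=\"name\">Widget</span></div>"
                + "<div typeof=\"Product\"><span property=\"name\">Widget</span></div>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        // Each format is reachable through a different selector
        System.out.println("JSON-LD blocks:   " + doc.select("script[type=application/ld+json]").size());
        System.out.println("Microdata scopes: " + doc.select("[itemscope]").size());
        System.out.println("RDFa subjects:    " + doc.select("[typeof]").size());
    }
}
```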
Extracting JSON-LD Data
JSON-LD is the easiest structured data format to extract with jsoup. Here's how to parse it:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLdExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/product-page";
        Document doc = Jsoup.connect(url).get();

        // Select all JSON-LD script tags
        Elements jsonLdScripts = doc.select("script[type=application/ld+json]");
        ObjectMapper mapper = new ObjectMapper();

        for (Element script : jsonLdScripts) {
            // data() is the canonical accessor for raw script contents
            String jsonContent = script.data();
            try {
                JsonNode jsonNode = mapper.readTree(jsonContent);
                // Extract specific data based on schema type
                if (jsonNode.has("@type")) {
                    String type = jsonNode.get("@type").asText();
                    switch (type) {
                        case "Product":
                            extractProductData(jsonNode);
                            break;
                        case "Article":
                            extractArticleData(jsonNode);
                            break;
                        case "Organization":
                            extractOrganizationData(jsonNode);
                            break;
                        default:
                            System.out.println("Unknown type: " + type);
                    }
                }
            } catch (Exception e) {
                System.err.println("Error parsing JSON-LD: " + e.getMessage());
            }
        }
    }

    private static void extractProductData(JsonNode product) {
        String name = product.path("name").asText();
        String description = product.path("description").asText();
        String brand = product.path("brand").path("name").asText();
        JsonNode offers = product.path("offers");
        String price = offers.path("price").asText();
        String currency = offers.path("priceCurrency").asText();
        System.out.printf("Product: %s%nBrand: %s%nPrice: %s %s%nDescription: %s%n",
                name, brand, price, currency, description);
    }

    private static void extractArticleData(JsonNode article) {
        String headline = article.path("headline").asText();
        String author = article.path("author").path("name").asText();
        String datePublished = article.path("datePublished").asText();
        System.out.printf("Article: %s%nAuthor: %s%nPublished: %s%n",
                headline, author, datePublished);
    }

    private static void extractOrganizationData(JsonNode org) {
        String name = org.path("name").asText();
        String url = org.path("url").asText();
        String description = org.path("description").asText();
        System.out.printf("Organization: %s%nURL: %s%nDescription: %s%n",
                name, url, description);
    }
}
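Two shapes the simple switch above will miss in the wild: many sites wrap all entities in a top-level @graph array, and @type itself may be an array of strings rather than a single value. A small normalization sketch (the helper names entities and primaryType are my own, not part of any library):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class JsonLdShapes {
    // Flatten a parsed JSON-LD root into individual entity nodes,
    // unwrapping top-level arrays and @graph containers
    static List<JsonNode> entities(JsonNode root) {
        List<JsonNode> out = new ArrayList<>();
        if (root.isArray()) {
            root.forEach(n -> out.addAll(entities(n)));
        } else if (root.has("@graph")) {
            root.get("@graph").forEach(out::add);
        } else {
            out.add(root);
        }
        return out;
    }

    // @type may be a string or an array of strings; normalize to the first value
    static String primaryType(JsonNode entity) {
        JsonNode type = entity.path("@type");
        if (type.isArray()) return type.path(0).asText("");
        return type.asText("");
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"@graph\":[{\"@type\":[\"Product\",\"Thing\"],\"name\":\"Widget\"}]}";
        JsonNode root = new ObjectMapper().readTree(json);
        for (JsonNode entity : entities(root)) {
            System.out.println(primaryType(entity) + ": " + entity.path("name").asText());
        }
    }
}
```

Running the normalizer before the type switch lets the same extraction methods handle both flat and @graph-wrapped documents.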
Extracting Microdata
Microdata requires parsing HTML attributes to extract structured information:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class MicrodataExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/microdata-page";
        Document doc = Jsoup.connect(url).get();

        // Find all elements with itemscope
        Elements itemScopes = doc.select("[itemscope]");
        for (Element scope : itemScopes) {
            String itemType = scope.attr("itemtype");
            Map<String, String> properties = new HashMap<>();

            // Extract properties from this scope
            Elements props = scope.select("[itemprop]");
            for (Element prop : props) {
                String propertyName = prop.attr("itemprop");
                String propertyValue = extractPropertyValue(prop);
                properties.put(propertyName, propertyValue);
            }

            System.out.println("ItemType: " + itemType);
            properties.forEach((key, value) ->
                    System.out.println("  " + key + ": " + value));
            System.out.println();
        }
    }

    private static String extractPropertyValue(Element element) {
        // Check for specific value attributes first
        if (element.hasAttr("content")) {
            return element.attr("content");
        } else if (element.hasAttr("datetime")) {
            return element.attr("datetime");
        } else if (element.hasAttr("href")) {
            return element.attr("href");
        } else if (element.hasAttr("src")) {
            return element.attr("src");
        } else {
            // Fall back to text content
            return element.text().trim();
        }
    }
}
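You can exercise the value-precedence rules (content, then datetime, href, src, then text) without a network call by parsing an inline fragment. The markup below is hypothetical and exists only to show each branch firing:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PropertyValueDemo {
    // Same precedence logic as extractPropertyValue above
    static String extractPropertyValue(Element element) {
        if (element.hasAttr("content")) return element.attr("content");
        if (element.hasAttr("datetime")) return element.attr("datetime");
        if (element.hasAttr("href")) return element.attr("href");
        if (element.hasAttr("src")) return element.attr("src");
        return element.text().trim();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div itemscope itemtype=\"https://schema.org/Event\">"
                + "<meta itemprop=\"startDate\" content=\"2024-06-01\">"
                + "<time itemprop=\"endDate\" datetime=\"2024-06-02\">June 2</time>"
                + "<a itemprop=\"url\" href=\"/tickets\">Tickets</a>"
                + "<span itemprop=\"name\">Summer Fair</span>"
                + "</div>");
        for (Element prop : doc.select("[itemprop]")) {
            // Each property resolves through a different attribute branch
            System.out.println(prop.attr("itemprop") + " = " + extractPropertyValue(prop));
        }
    }
}
```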
Advanced Microdata Extraction with Nested Items
Handle complex microdata structures with nested items:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.HashMap;
import java.util.Map;

public class AdvancedMicrodataExtractor {
    public static class MicrodataItem {
        private final String type;
        private final Map<String, Object> properties;

        public MicrodataItem(String type) {
            this.type = type;
            this.properties = new HashMap<>();
        }

        // Getters
        public String getType() { return type; }
        public Map<String, Object> getProperties() { return properties; }
    }

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/complex-microdata";
        Document doc = Jsoup.connect(url).get();

        // Only top-level scopes: itemscope elements with no itemscope ancestor
        org.jsoup.select.Elements topLevelScopes =
                doc.select("[itemscope]:not([itemscope] [itemscope])");
        for (Element scope : topLevelScopes) {
            MicrodataItem item = extractMicrodataItem(scope);
            System.out.println("Extracted item: " + item.getType());
            printProperties(item.getProperties(), 0);
        }
    }

    private static MicrodataItem extractMicrodataItem(Element scope) {
        String itemType = scope.attr("itemtype");
        MicrodataItem item = new MicrodataItem(itemType);

        // A property belongs to this item only if its nearest itemscope ancestor
        // is this scope. (A pure CSS filter like [itemprop]:not([itemscope] [itemprop])
        // cannot express this, because the outer scope itself matches [itemscope].)
        for (Element prop : scope.select("[itemprop]")) {
            if (nearestScope(prop) != scope) {
                continue; // Belongs to a nested item; handled by the recursion below
            }
            String propName = prop.attr("itemprop");
            if (prop.hasAttr("itemscope")) {
                // Nested microdata item
                item.getProperties().put(propName, extractMicrodataItem(prop));
            } else {
                // Simple property
                item.getProperties().put(propName, extractPropertyValue(prop));
            }
        }
        return item;
    }

    // Walk up the tree to the closest ancestor carrying itemscope (null if none)
    private static Element nearestScope(Element element) {
        Element parent = element.parent();
        while (parent != null && !parent.hasAttr("itemscope")) {
            parent = parent.parent();
        }
        return parent;
    }

    private static void printProperties(Map<String, Object> properties, int indent) {
        String indentStr = "  ".repeat(indent);
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getValue() instanceof MicrodataItem) {
                MicrodataItem nested = (MicrodataItem) entry.getValue();
                System.out.println(indentStr + entry.getKey() + " (" + nested.getType() + "):");
                printProperties(nested.getProperties(), indent + 1);
            } else {
                System.out.println(indentStr + entry.getKey() + ": " + entry.getValue());
            }
        }
    }

    private static String extractPropertyValue(Element element) {
        if (element.hasAttr("content")) return element.attr("content");
        if (element.hasAttr("datetime")) return element.attr("datetime");
        if (element.hasAttr("href")) return element.attr("href");
        if (element.hasAttr("src")) return element.attr("src");
        return element.text().trim();
    }
}
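A quick sanity check of the top-level selector, using a nested Product/Offer fragment (hypothetical markup). The :not([itemscope] [itemscope]) clause excludes any itemscope that sits inside another itemscope:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TopLevelScopeDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div itemscope itemtype=\"https://schema.org/Product\">"
                + "<span itemprop=\"name\">Widget</span>"
                + "<div itemprop=\"offers\" itemscope itemtype=\"https://schema.org/Offer\">"
                + "<span itemprop=\"price\">9.99</span>"
                + "</div></div>");

        // All scopes, nested included
        Elements all = doc.select("[itemscope]");
        // Only scopes with no itemscope ancestor
        Elements topLevel = doc.select("[itemscope]:not([itemscope] [itemscope])");

        System.out.println(all.size() + " scopes total, " + topLevel.size() + " top-level");
    }
}
```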
Extracting OpenGraph and Meta Tags
OpenGraph meta tags provide social media-friendly structured data:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class MetaTagExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/social-page";
        Document doc = Jsoup.connect(url).get();

        // Extract OpenGraph tags
        Map<String, String> openGraph = new HashMap<>();
        Elements ogTags = doc.select("meta[property^=og:]");
        for (Element tag : ogTags) {
            String property = tag.attr("property").substring(3); // Remove "og:" prefix
            String content = tag.attr("content");
            openGraph.put(property, content);
        }

        // Extract Twitter Card tags
        Map<String, String> twitterCard = new HashMap<>();
        Elements twitterTags = doc.select("meta[name^=twitter:]");
        for (Element tag : twitterTags) {
            String name = tag.attr("name").substring(8); // Remove "twitter:" prefix
            String content = tag.attr("content");
            twitterCard.put(name, content);
        }

        // Extract standard meta tags
        Map<String, String> metaTags = new HashMap<>();
        Elements standardMeta = doc.select("meta[name]");
        for (Element tag : standardMeta) {
            String name = tag.attr("name");
            String content = tag.attr("content");
            if (!name.startsWith("twitter:")) {
                metaTags.put(name, content);
            }
        }

        System.out.println("OpenGraph Data:");
        openGraph.forEach((key, value) -> System.out.println("  og:" + key + " = " + value));
        System.out.println("\nTwitter Card Data:");
        twitterCard.forEach((key, value) -> System.out.println("  twitter:" + key + " = " + value));
        System.out.println("\nStandard Meta Tags:");
        metaTags.forEach((key, value) -> System.out.println("  " + key + " = " + value));
    }
}
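OpenGraph URLs such as og:image are occasionally relative, which breaks downstream consumers. jsoup's absUrl resolves any attribute against the document's base URI (set automatically by connect, or passed explicitly to parse), a minimal sketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteMetaUrls {
    public static void main(String[] args) {
        // Base URI (second argument) lets jsoup resolve relative URLs
        Document doc = Jsoup.parse(
                "<head><meta property=\"og:image\" content=\"/img/cover.png\"></head>",
                "https://example.com/article");

        Element og = doc.selectFirst("meta[property=og:image]");
        System.out.println(og.attr("content"));   // raw value: /img/cover.png
        System.out.println(og.absUrl("content")); // resolved: https://example.com/img/cover.png
    }
}
```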
Complete Structured Data Extractor
Here's a comprehensive extractor that handles multiple structured data formats:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Reuses MicrodataItem, extractMicrodataItem and printProperties from
// AdvancedMicrodataExtractor above
public class UniversalStructuredDataExtractor {
    private final ObjectMapper jsonMapper;
    private final ExecutorService executor;

    public UniversalStructuredDataExtractor() {
        this.jsonMapper = new ObjectMapper();
        this.executor = Executors.newFixedThreadPool(4);
    }

    public void extractAllStructuredData(String url) throws Exception {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; StructuredDataBot/1.0)")
                .timeout(10000)
                .get();

        // Extract different types of structured data concurrently
        CompletableFuture<Void> jsonLdFuture = CompletableFuture.runAsync(() -> {
            try {
                extractJsonLd(doc);
            } catch (Exception e) {
                System.err.println("JSON-LD extraction failed: " + e.getMessage());
            }
        }, executor);
        CompletableFuture<Void> microdataFuture =
                CompletableFuture.runAsync(() -> extractMicrodata(doc), executor);
        CompletableFuture<Void> metaFuture =
                CompletableFuture.runAsync(() -> extractMetaTags(doc), executor);
        CompletableFuture<Void> rdFaFuture =
                CompletableFuture.runAsync(() -> extractRDFa(doc), executor);

        // Wait for all extractions to complete
        CompletableFuture.allOf(jsonLdFuture, microdataFuture, metaFuture, rdFaFuture).join();
    }

    private void extractJsonLd(Document doc) throws Exception {
        Elements scripts = doc.select("script[type=application/ld+json]");
        System.out.println("=== JSON-LD Data ===");
        for (Element script : scripts) {
            try {
                JsonNode json = jsonMapper.readTree(script.data());
                System.out.println(jsonMapper.writerWithDefaultPrettyPrinter().writeValueAsString(json));
            } catch (Exception e) {
                System.err.println("Failed to parse JSON-LD: " + e.getMessage());
            }
        }
    }

    private void extractMicrodata(Document doc) {
        Elements scopes = doc.select("[itemscope]");
        System.out.println("\n=== Microdata ===");
        for (Element scope : scopes) {
            // Skip nested items; they'll be handled recursively by extractMicrodataItem
            boolean nested = false;
            for (Element parent : scope.parents()) {
                if (parent.hasAttr("itemscope")) {
                    nested = true;
                    break;
                }
            }
            if (nested) continue;

            MicrodataItem item = extractMicrodataItem(scope);
            System.out.println("Type: " + item.getType());
            printProperties(item.getProperties(), 1);
        }
    }

    private void extractMetaTags(Document doc) {
        System.out.println("\n=== Meta Tags ===");
        // OpenGraph
        Elements ogTags = doc.select("meta[property^=og:]");
        if (!ogTags.isEmpty()) {
            System.out.println("OpenGraph:");
            ogTags.forEach(tag -> System.out.println("  " + tag.attr("property") + " = " + tag.attr("content")));
        }
        // Twitter Cards
        Elements twitterTags = doc.select("meta[name^=twitter:]");
        if (!twitterTags.isEmpty()) {
            System.out.println("Twitter Cards:");
            twitterTags.forEach(tag -> System.out.println("  " + tag.attr("name") + " = " + tag.attr("content")));
        }
        // Standard meta tags
        Elements metaTags = doc.select("meta[name]:not([name^=twitter:])");
        if (!metaTags.isEmpty()) {
            System.out.println("Standard Meta:");
            metaTags.forEach(tag -> System.out.println("  " + tag.attr("name") + " = " + tag.attr("content")));
        }
    }

    private void extractRDFa(Document doc) {
        System.out.println("\n=== RDFa Data ===");
        Elements rdFaElements = doc.select("[typeof], [property]");
        for (Element element : rdFaElements) {
            if (element.hasAttr("typeof")) {
                System.out.println("Type: " + element.attr("typeof"));
            }
            if (element.hasAttr("property")) {
                String property = element.attr("property");
                String content = element.hasAttr("content")
                        ? element.attr("content") : element.text();
                System.out.println("  " + property + " = " + content);
            }
        }
    }

    public void shutdown() {
        executor.shutdown();
    }
}
Best Practices and Error Handling
Robust JSON-LD Parsing
When parsing JSON-LD, always handle malformed JSON gracefully:
// ObjectMapper is thread-safe and can be shared as a static field
private static final ObjectMapper jsonMapper = new ObjectMapper();

private static List<JsonNode> parseJsonLdSafely(Document doc) {
    List<JsonNode> results = new ArrayList<>();
    Elements scripts = doc.select("script[type=application/ld+json]");
    for (Element script : scripts) {
        String content = script.data().trim();
        if (content.isEmpty()) continue;
        try {
            // Handle both single objects and arrays
            JsonNode node = jsonMapper.readTree(content);
            if (node.isArray()) {
                node.forEach(results::add);
            } else {
                results.add(node);
            }
        } catch (Exception e) {
            System.err.println("Skipping malformed JSON-LD: " + e.getMessage());
            // Log the problematic content for debugging
            System.err.println("Content: " + content.substring(0, Math.min(100, content.length())));
        }
    }
    return results;
}
Performance Optimization
For large-scale scraping, optimize your extraction process:
public class OptimizedExtractor {
    private static final int TIMEOUT_MS = 10000;
    private static final String USER_AGENT = "Mozilla/5.0 (compatible; DataExtractor/1.0)";

    public StructuredData extractWithCache(String url, boolean useCache) throws Exception {
        // getCachedData/cacheData and the StructuredData holder are application-specific
        // placeholders; back them with your preferred cache and data model
        if (useCache) {
            StructuredData cached = getCachedData(url);
            if (cached != null) return cached;
        }

        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MS)
                .followRedirects(true)
                .maxBodySize(1024 * 1024) // 1 MB limit
                .get();

        StructuredData data = new StructuredData();
        // Extract only what you need
        data.setJsonLd(extractJsonLdData(doc));
        data.setMicrodata(extractMicrodataData(doc));
        data.setMetaTags(extractMetaData(doc));

        if (useCache) {
            cacheData(url, data);
        }
        return data;
    }
}
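The cache hooks above are left abstract. A minimal in-memory backing for them (hypothetical helper, no eviction or TTL, using ConcurrentHashMap for thread safety) could look like this; for production use, a bounded cache such as Caffeine is a better fit:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleUrlCache<V> {
    // Thread-safe map keyed by URL; unbounded, so suitable only for small runs
    private final Map<String, V> cache = new ConcurrentHashMap<>();

    public V get(String url) {
        return cache.get(url); // null on miss
    }

    public void put(String url, V data) {
        cache.put(url, data);
    }

    public static void main(String[] args) {
        SimpleUrlCache<String> cache = new SimpleUrlCache<>();
        cache.put("https://example.com", "extracted-data");
        System.out.println(cache.get("https://example.com"));
    }
}
```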
Integration with Modern Web Applications
When working with JavaScript-heavy sites that dynamically load structured data, consider combining jsoup with other tools. While jsoup handles static HTML efficiently, some websites require JavaScript execution to populate structured data.
For dynamic content, you might need to first render the page with a headless browser before using jsoup to parse the resulting HTML. This approach ensures you capture all structured data, including that loaded via AJAX requests.
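Since jsoup never executes JavaScript, the usual division of labor is: render in a headless browser, then hand the resulting HTML to jsoup. The sketch below assumes you already have the rendered markup as a string (for example from Selenium's getPageSource() or Playwright's content()); renderedHtml here is only a placeholder:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedHtmlParsing {
    // Parse browser-rendered HTML; the base URL enables absUrl() resolution later
    static Document parseRendered(String renderedHtml, String baseUrl) {
        return Jsoup.parse(renderedHtml, baseUrl);
    }

    public static void main(String[] args) {
        // Placeholder standing in for headless-browser output
        String renderedHtml =
                "<script type=\"application/ld+json\">{\"@type\":\"Product\"}</script>";
        Document doc = parseRendered(renderedHtml, "https://example.com/");
        System.out.println("JSON-LD blocks found: "
                + doc.select("script[type=application/ld+json]").size());
    }
}
```

From here, all of the extraction techniques shown earlier (JSON-LD, microdata, meta tags) apply unchanged to the parsed Document.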
Conclusion
Jsoup provides powerful capabilities for extracting structured data from web pages. By combining JSON-LD parsing, microdata extraction, and meta tag analysis, you can build comprehensive data extraction systems. Remember to handle errors gracefully, implement proper caching for performance, and always respect robots.txt and rate limiting when scraping at scale.
The techniques shown here form the foundation for building robust web scraping applications that can extract rich, structured information from modern websites. Whether you're building a price monitoring system, content aggregator, or SEO analysis tool, these structured data extraction methods will help you gather the precise information you need.