How can I use jsoup to extract structured data like JSON-LD or microdata?

Structured data is essential for modern web scraping as it provides machine-readable information about page content. This guide demonstrates how to use jsoup to extract various types of structured data including JSON-LD, microdata, RDFa, and OpenGraph meta tags from web pages.

Understanding Structured Data Types

JSON-LD (JavaScript Object Notation for Linked Data)

JSON-LD is the most common structured data format, embedded in <script> tags with type="application/ld+json".

Microdata

Microdata uses HTML attributes like itemscope, itemtype, and itemprop to embed structured data directly in HTML elements.

RDFa (Resource Description Framework in Attributes)

RDFa uses attributes like typeof, property, and content to add semantic meaning to HTML elements.
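To make these attribute conventions concrete, here is a minimal sketch using hypothetical inline markup (parsed from strings rather than fetched from a live page) that shows how jsoup selects a microdata property and an RDFa property:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StructuredDataFormats {
    // Hypothetical microdata snippet: itemscope/itemtype/itemprop attributes
    static final String MICRODATA_HTML =
        "<div itemscope itemtype=\"https://schema.org/Person\">" +
        "  <span itemprop=\"name\">Jane Doe</span>" +
        "</div>";

    // Hypothetical RDFa snippet: typeof/property attributes
    static final String RDFA_HTML =
        "<div typeof=\"schema:Person\">" +
        "  <span property=\"schema:name\">Jane Doe</span>" +
        "</div>";

    public static String microdataName(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("[itemprop=name]").text();
    }

    public static String rdfaName(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("[property=schema:name]").text();
    }

    public static void main(String[] args) {
        System.out.println(microdataName(MICRODATA_HTML)); // Jane Doe
        System.out.println(rdfaName(RDFA_HTML));           // Jane Doe
    }
}
```

The same attribute-selector syntax drives all of the extractors in the rest of this guide.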

Extracting JSON-LD Data

JSON-LD is the easiest structured data format to extract with jsoup. Here's how to parse it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonLdExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/product-page";
        Document doc = Jsoup.connect(url).get();

        // Select all JSON-LD script tags
        Elements jsonLdScripts = doc.select("script[type=application/ld+json]");

        ObjectMapper mapper = new ObjectMapper();

        for (Element script : jsonLdScripts) {
            String jsonContent = script.data(); // data() returns the raw script body
            try {
                JsonNode jsonNode = mapper.readTree(jsonContent);

                // Extract specific data based on schema type
                if (jsonNode.has("@type")) {
                    String type = jsonNode.get("@type").asText();

                    switch (type) {
                        case "Product":
                            extractProductData(jsonNode);
                            break;
                        case "Article":
                            extractArticleData(jsonNode);
                            break;
                        case "Organization":
                            extractOrganizationData(jsonNode);
                            break;
                        default:
                            System.out.println("Unknown type: " + type);
                    }
                }
            } catch (Exception e) {
                System.err.println("Error parsing JSON-LD: " + e.getMessage());
            }
        }
    }

    private static void extractProductData(JsonNode product) {
        String name = product.path("name").asText();
        String description = product.path("description").asText();
        String brand = product.path("brand").path("name").asText();

        JsonNode offers = product.path("offers");
        String price = offers.path("price").asText();
        String currency = offers.path("priceCurrency").asText();

        System.out.printf("Product: %s%nBrand: %s%nPrice: %s %s%nDescription: %s%n", 
                         name, brand, price, currency, description);
    }

    private static void extractArticleData(JsonNode article) {
        String headline = article.path("headline").asText();
        String author = article.path("author").path("name").asText();
        String datePublished = article.path("datePublished").asText();

        System.out.printf("Article: %s%nAuthor: %s%nPublished: %s%n", 
                         headline, author, datePublished);
    }

    private static void extractOrganizationData(JsonNode org) {
        String name = org.path("name").asText();
        String url = org.path("url").asText();
        String description = org.path("description").asText();

        System.out.printf("Organization: %s%nURL: %s%nDescription: %s%n", 
                         name, url, description);
    }
}
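One caveat the extractor above glosses over: real-world JSON-LD often wraps several entities in a top-level array or a @graph container, and @type itself can be an array. A small helper (the names here are illustrative, not a standard API) can normalize these shapes before dispatching on type:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class JsonLdNormalizer {
    // Flattens a JSON-LD document into a list of entity nodes,
    // unwrapping top-level arrays and @graph containers
    public static List<JsonNode> flatten(JsonNode root) {
        List<JsonNode> nodes = new ArrayList<>();
        if (root.isArray()) {
            root.forEach(n -> nodes.addAll(flatten(n)));
        } else if (root.has("@graph")) {
            root.get("@graph").forEach(nodes::add);
        } else {
            nodes.add(root);
        }
        return nodes;
    }

    // Convenience wrapper: parse a JSON string and flatten it
    public static List<JsonNode> flattenJson(String json) {
        try {
            return flatten(new ObjectMapper().readTree(json));
        } catch (Exception e) {
            throw new RuntimeException("Malformed JSON-LD", e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"@context\":\"https://schema.org\",\"@graph\":[" +
            "{\"@type\":\"Product\",\"name\":\"Widget\"}," +
            "{\"@type\":\"Organization\",\"name\":\"Acme\"}]}";
        System.out.println(flattenJson(json).size()); // 2
    }
}
```

Running each flattened node through the @type switch from the previous example then covers both single-object and @graph-style pages.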

Extracting Microdata

Microdata requires parsing HTML attributes to extract structured information:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class MicrodataExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/microdata-page";
        Document doc = Jsoup.connect(url).get();

        // Find all elements with itemscope
        Elements itemScopes = doc.select("[itemscope]");

        for (Element scope : itemScopes) {
            String itemType = scope.attr("itemtype");
            Map<String, String> properties = new HashMap<>();

            // Extract properties from this scope
            Elements props = scope.select("[itemprop]");

            for (Element prop : props) {
                String propertyName = prop.attr("itemprop");
                String propertyValue = extractPropertyValue(prop);
                properties.put(propertyName, propertyValue);
            }

            System.out.println("ItemType: " + itemType);
            properties.forEach((key, value) -> 
                System.out.println("  " + key + ": " + value));
            System.out.println();
        }
    }

    private static String extractPropertyValue(Element element) {
        // Check for specific value attributes first
        if (element.hasAttr("content")) {
            return element.attr("content");
        } else if (element.hasAttr("datetime")) {
            return element.attr("datetime");
        } else if (element.hasAttr("href")) {
            return element.attr("href");
        } else if (element.hasAttr("src")) {
            return element.attr("src");
        } else {
            // Fall back to text content
            return element.text().trim();
        }
    }
}

Advanced Microdata Extraction with Nested Items

Handle complex microdata structures with nested items:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class AdvancedMicrodataExtractor {
    public static class MicrodataItem {
        private String type;
        private Map<String, Object> properties;

        public MicrodataItem(String type) {
            this.type = type;
            this.properties = new HashMap<>();
        }

        // Getters and setters
        public String getType() { return type; }
        public Map<String, Object> getProperties() { return properties; }
    }

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/complex-microdata";
        Document doc = Jsoup.connect(url).get();

        Elements topLevelScopes = doc.select("[itemscope]:not([itemscope] [itemscope])");

        for (Element scope : topLevelScopes) {
            MicrodataItem item = extractMicrodataItem(scope);
            System.out.println("Extracted item: " + item.getType());
            printProperties(item.getProperties(), 0);
        }
    }

    private static MicrodataItem extractMicrodataItem(Element scope) {
        String itemType = scope.attr("itemtype");
        MicrodataItem item = new MicrodataItem(itemType);

        // A property belongs to its nearest enclosing itemscope, so walk up
        // from each candidate and keep only those owned by this scope
        for (Element prop : scope.select("[itemprop]")) {
            Element owner = prop.parent();
            while (owner != null && !owner.hasAttr("itemscope")) {
                owner = owner.parent();
            }
            if (owner != scope) {
                continue; // owned by a nested item, handled recursively
            }

            String propName = prop.attr("itemprop");

            if (prop.hasAttr("itemscope")) {
                // Nested microdata item
                item.getProperties().put(propName, extractMicrodataItem(prop));
            } else {
                // Simple property
                item.getProperties().put(propName, extractPropertyValue(prop));
            }
        }

        return item;
    }

    private static void printProperties(Map<String, Object> properties, int indent) {
        String indentStr = "  ".repeat(indent);

        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            if (entry.getValue() instanceof MicrodataItem) {
                MicrodataItem nested = (MicrodataItem) entry.getValue();
                System.out.println(indentStr + entry.getKey() + " (" + nested.getType() + "):");
                printProperties(nested.getProperties(), indent + 1);
            } else {
                System.out.println(indentStr + entry.getKey() + ": " + entry.getValue());
            }
        }
    }

    private static String extractPropertyValue(Element element) {
        if (element.hasAttr("content")) return element.attr("content");
        if (element.hasAttr("datetime")) return element.attr("datetime");
        if (element.hasAttr("href")) return element.attr("href");
        if (element.hasAttr("src")) return element.attr("src");
        return element.text().trim();
    }
}
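To see the nearest-itemscope rule in action without fetching a live page, here is a self-contained sketch (hypothetical markup) that pulls a price out of a nested Offer item:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NestedMicrodataDemo {
    // Hypothetical product snippet with a nested Offer item
    static final String HTML =
        "<div itemscope itemtype=\"https://schema.org/Product\">" +
        "  <span itemprop=\"name\">Widget</span>" +
        "  <div itemprop=\"offers\" itemscope itemtype=\"https://schema.org/Offer\">" +
        "    <meta itemprop=\"price\" content=\"19.99\">" +
        "  </div>" +
        "</div>";

    // The price belongs to the nested Offer scope, not the outer Product,
    // so we descend through the offers item before reading it
    public static String nestedPrice(String html) {
        Document doc = Jsoup.parse(html);
        Element product = doc.selectFirst("[itemtype$=Product]");
        Element offer = product.selectFirst("[itemprop=offers][itemscope]");
        return offer.selectFirst("[itemprop=price]").attr("content");
    }

    public static void main(String[] args) {
        System.out.println(nestedPrice(HTML)); // 19.99
    }
}
```

Note that the price lands on the Offer, not the Product — exactly the ownership behavior the recursive extractor above preserves.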

Extracting OpenGraph and Meta Tags

OpenGraph meta tags provide social media-friendly structured data:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class MetaTagExtractor {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/social-page";
        Document doc = Jsoup.connect(url).get();

        // Extract OpenGraph tags
        Map<String, String> openGraph = new HashMap<>();
        Elements ogTags = doc.select("meta[property^=og:]");

        for (Element tag : ogTags) {
            String property = tag.attr("property").substring(3); // Remove "og:" prefix
            String content = tag.attr("content");
            openGraph.put(property, content);
        }

        // Extract Twitter Card tags
        Map<String, String> twitterCard = new HashMap<>();
        Elements twitterTags = doc.select("meta[name^=twitter:]");

        for (Element tag : twitterTags) {
            String name = tag.attr("name").substring(8); // Remove "twitter:" prefix
            String content = tag.attr("content");
            twitterCard.put(name, content);
        }

        // Extract standard meta tags
        Map<String, String> metaTags = new HashMap<>();
        Elements standardMeta = doc.select("meta[name]");

        for (Element tag : standardMeta) {
            String name = tag.attr("name");
            String content = tag.attr("content");
            if (!name.startsWith("twitter:")) {
                metaTags.put(name, content);
            }
        }

        System.out.println("OpenGraph Data:");
        openGraph.forEach((key, value) -> System.out.println("  og:" + key + " = " + value));

        System.out.println("\nTwitter Card Data:");
        twitterCard.forEach((key, value) -> System.out.println("  twitter:" + key + " = " + value));

        System.out.println("\nStandard Meta Tags:");
        metaTags.forEach((key, value) -> System.out.println("  " + key + " = " + value));
    }
}

Complete Structured Data Extractor

Here's a comprehensive extractor that handles multiple structured data formats:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Reuses MicrodataItem, extractMicrodataItem, and printProperties
// from the AdvancedMicrodataExtractor example above
public class UniversalStructuredDataExtractor {
    private final ObjectMapper jsonMapper;
    private final ExecutorService executor;

    public UniversalStructuredDataExtractor() {
        this.jsonMapper = new ObjectMapper();
        this.executor = Executors.newFixedThreadPool(4);
    }

    public void extractAllStructuredData(String url) throws Exception {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; StructuredDataBot/1.0)")
                .timeout(10000)
                .get();

        // Extract different types of structured data concurrently
        CompletableFuture<Void> jsonLdFuture = CompletableFuture.runAsync(() -> {
            try {
                extractJsonLd(doc);
            } catch (Exception e) {
                System.err.println("JSON-LD extraction failed: " + e.getMessage());
            }
        }, executor);

        CompletableFuture<Void> microdataFuture = CompletableFuture.runAsync(() -> {
            extractMicrodata(doc);
        }, executor);

        CompletableFuture<Void> metaFuture = CompletableFuture.runAsync(() -> {
            extractMetaTags(doc);
        }, executor);

        CompletableFuture<Void> rdFaFuture = CompletableFuture.runAsync(() -> {
            extractRDFa(doc);
        }, executor);

        // Wait for all extractions to complete
        CompletableFuture.allOf(jsonLdFuture, microdataFuture, metaFuture, rdFaFuture).join();
    }

    private void extractJsonLd(Document doc) throws Exception {
        Elements scripts = doc.select("script[type=application/ld+json]");
        System.out.println("=== JSON-LD Data ===");

        for (Element script : scripts) {
            try {
                JsonNode json = jsonMapper.readTree(script.data());
                System.out.println(jsonMapper.writerWithDefaultPrettyPrinter().writeValueAsString(json));
            } catch (Exception e) {
                System.err.println("Failed to parse JSON-LD: " + e.getMessage());
            }
        }
    }

    private void extractMicrodata(Document doc) {
        Elements scopes = doc.select("[itemscope]");
        System.out.println("\n=== Microdata ===");

        for (Element scope : scopes) {
            // Elements.select searches descendants (which would include this
            // scope itself), so test the ancestors directly with is()
            if (scope.parents().is("[itemscope]")) {
                continue; // Skip nested items, they'll be handled recursively
            }

            MicrodataItem item = extractMicrodataItem(scope);
            System.out.println("Type: " + item.getType());
            printProperties(item.getProperties(), 1);
        }
    }

    private void extractMetaTags(Document doc) {
        System.out.println("\n=== Meta Tags ===");

        // OpenGraph
        Elements ogTags = doc.select("meta[property^=og:]");
        if (!ogTags.isEmpty()) {
            System.out.println("OpenGraph:");
            ogTags.forEach(tag -> System.out.println("  " + tag.attr("property") + " = " + tag.attr("content")));
        }

        // Twitter Cards
        Elements twitterTags = doc.select("meta[name^=twitter:]");
        if (!twitterTags.isEmpty()) {
            System.out.println("Twitter Cards:");
            twitterTags.forEach(tag -> System.out.println("  " + tag.attr("name") + " = " + tag.attr("content")));
        }

        // Standard meta tags
        Elements metaTags = doc.select("meta[name]:not([name^=twitter:])");
        if (!metaTags.isEmpty()) {
            System.out.println("Standard Meta:");
            metaTags.forEach(tag -> System.out.println("  " + tag.attr("name") + " = " + tag.attr("content")));
        }
    }

    private void extractRDFa(Document doc) {
        System.out.println("\n=== RDFa Data ===");

        Elements rdFaElements = doc.select("[typeof], [property]");
        for (Element element : rdFaElements) {
            if (element.hasAttr("typeof")) {
                System.out.println("Type: " + element.attr("typeof"));
            }
            if (element.hasAttr("property")) {
                String property = element.attr("property");
                String content = element.hasAttr("content") ? 
                    element.attr("content") : element.text();
                System.out.println("  " + property + " = " + content);
            }
        }
    }

    public void shutdown() {
        executor.shutdown();
    }
}

Best Practices and Error Handling

Robust JSON-LD Parsing

When parsing JSON-LD, always handle malformed JSON gracefully:

private static final ObjectMapper jsonMapper = new ObjectMapper();

private static List<JsonNode> parseJsonLdSafely(Document doc) {
    List<JsonNode> results = new ArrayList<>();
    Elements scripts = doc.select("script[type=application/ld+json]");

    for (Element script : scripts) {
        String content = script.data().trim(); // data() returns the raw, unescaped script body
        if (content.isEmpty()) continue;

        try {
            // Handle both single objects and arrays
            JsonNode node = jsonMapper.readTree(content);
            if (node.isArray()) {
                node.forEach(results::add);
            } else {
                results.add(node);
            }
        } catch (Exception e) {
            System.err.println("Skipping malformed JSON-LD: " + e.getMessage());
            // Log the problematic content for debugging
            System.err.println("Content: " + content.substring(0, Math.min(100, content.length())));
        }
    }

    return results;
}

Performance Optimization

For large-scale scraping, optimize your extraction process:

// StructuredData, getCachedData, cacheData, and the extract*Data helpers
// referenced below are application-specific placeholders, not jsoup APIs
public class OptimizedExtractor {
    private static final int TIMEOUT_MS = 10000;
    private static final String USER_AGENT = "Mozilla/5.0 (compatible; DataExtractor/1.0)";

    public StructuredData extractWithCache(String url, boolean useCache) throws Exception {
        // Implement caching logic here
        if (useCache) {
            StructuredData cached = getCachedData(url);
            if (cached != null) return cached;
        }

        Document doc = Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MS)
                .followRedirects(true)
                .maxBodySize(1024 * 1024) // 1MB limit
                .get();

        StructuredData data = new StructuredData();

        // Extract only what you need
        data.setJsonLd(extractJsonLdData(doc));
        data.setMicrodata(extractMicrodataData(doc));
        data.setMetaTags(extractMetaData(doc));

        if (useCache) {
            cacheData(url, data);
        }

        return data;
    }
}
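The caching hooks above are left abstract. As one possible shape, here is a minimal thread-safe in-memory cache built on ConcurrentHashMap — an illustrative sketch, not a production cache, since it has no eviction or expiry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SimpleCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    // Returns the cached value for key, computing and storing it on a miss;
    // computeIfAbsent guarantees the loader runs at most once per key
    public V getOrCompute(K key, Function<K, V> loader) {
        return store.computeIfAbsent(key, loader);
    }

    public int size() {
        return store.size();
    }

    public static void main(String[] args) {
        SimpleCache<String, String> cache = new SimpleCache<>();
        String first = cache.getOrCompute("https://example.com", url -> "fetched:" + url);
        String second = cache.getOrCompute("https://example.com", url -> "refetched:" + url);
        System.out.println(first.equals(second)); // true: second call hits the cache
        System.out.println(cache.size());         // 1
    }
}
```

For real scraping workloads you would typically add time-based expiry (so stale pages are refetched) and a size bound, or delegate to a library such as Caffeine.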

Integration with Modern Web Applications

When working with JavaScript-heavy sites that dynamically load structured data, consider combining jsoup with other tools. While jsoup handles static HTML efficiently, some websites require JavaScript execution to populate structured data.

For dynamic content, you might need to first render the page with a headless browser before using jsoup to parse the resulting HTML. This approach ensures you capture all structured data, including that loaded via AJAX requests.
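As a sketch of that pipeline — assuming Selenium's Java bindings and a local Chrome installation, both of which are outside jsoup itself — you might render the page first and hand the resulting HTML to jsoup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class RenderedPageExtractor {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/js-heavy-page"); // hypothetical URL
            // getPageSource() returns the DOM after JavaScript has executed
            String renderedHtml = driver.getPageSource();
            Document doc = Jsoup.parse(renderedHtml);
            // From here, any of the extraction techniques above apply
            doc.select("script[type=application/ld+json]")
               .forEach(script -> System.out.println(script.data()));
        } finally {
            driver.quit();
        }
    }
}
```

Depending on the page you may also need an explicit wait for the structured data to appear before reading the page source.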

Conclusion

Jsoup provides powerful capabilities for extracting structured data from web pages. By combining JSON-LD parsing, microdata extraction, and meta tag analysis, you can build comprehensive data extraction systems. Remember to handle errors gracefully, implement proper caching for performance, and always respect robots.txt and rate limiting when scraping at scale.

The techniques shown here form the foundation for building robust web scraping applications that can extract rich, structured information from modern websites. Whether you're building a price monitoring system, content aggregator, or SEO analysis tool, these structured data extraction methods will help you gather the precise information you need.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
