How do I parse and extract data from JSON-LD or structured data using Beautiful Soup?

JSON-LD (JavaScript Object Notation for Linked Data) is a structured data format commonly used by websites to provide machine-readable information about their content. This data is particularly valuable for web scraping because it often contains clean, well-organized information about products, articles, events, and other entities. Beautiful Soup, combined with Python's built-in json module, provides an excellent toolkit for extracting and parsing this structured data.

Understanding JSON-LD Structure

JSON-LD data is typically embedded in HTML pages inside <script> tags whose type attribute is application/ld+json. The data usually follows the Schema.org vocabulary and can contain rich information about:

  • Products (price, availability, reviews)
  • Articles (author, publication date, content)
  • Events (date, location, organizer)
  • Organizations (contact info, social profiles)
  • Breadcrumbs and navigation data
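
For example, a product page might embed a block like the following (a minimal, hypothetical snippet; real pages usually carry many more fields):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "sku": "EW-001",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>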

Basic JSON-LD Extraction with Beautiful Soup

Here's a fundamental approach to extracting JSON-LD data from web pages:

import requests
from bs4 import BeautifulSoup
import json

def extract_json_ld(url):
    """Extract all JSON-LD data from a webpage"""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all script tags with JSON-LD type
    json_scripts = soup.find_all('script', type='application/ld+json')

    json_data = []
    for script in json_scripts:
        # script.string is None when the tag is empty or wraps multiple nodes
        if not script.string:
            continue
        try:
            data = json.loads(script.string)
            json_data.append(data)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON-LD: {e}")

    return json_data

# Usage example
url = "https://example.com/product-page"
structured_data = extract_json_ld(url)
print(json.dumps(structured_data, indent=2))

Advanced JSON-LD Parsing Techniques

Handling Multiple JSON-LD Blocks

Many websites include multiple JSON-LD blocks for different types of data. Here's how to handle and categorize them:

def categorize_json_ld(json_data_list):
    """Categorize JSON-LD data by type"""
    categorized = {
        'Product': [],
        'Article': [],
        'Organization': [],
        'BreadcrumbList': [],
        'Other': []
    }

    for data in json_data_list:
        if isinstance(data, list):
            # Handle arrays of JSON-LD objects
            for item in data:
                schema_type = item.get('@type', 'Other')
                if schema_type in categorized:
                    categorized[schema_type].append(item)
                else:
                    categorized['Other'].append(item)
        else:
            schema_type = data.get('@type', 'Other')
            if schema_type in categorized:
                categorized[schema_type].append(data)
            else:
                categorized['Other'].append(data)

    return categorized
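
A quick usage sketch, reusing extract_json_ld from above (the URL is a placeholder):

# Count the JSON-LD blocks on a page by Schema.org type
data = extract_json_ld("https://example.com/product-page")
by_type = categorize_json_ld(data)
for schema_type, items in by_type.items():
    print(f"{schema_type}: {len(items)} block(s)")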

Extracting Specific Product Information

For e-commerce sites, product data extraction is particularly valuable:

def extract_product_data(url):
    """Extract product-specific data from JSON-LD"""
    json_data = extract_json_ld(url)
    categorized = categorize_json_ld(json_data)

    products = []
    for product in categorized.get('Product', []):
        # 'brand' may be a nested object or a plain string
        brand = product.get('brand')
        if isinstance(brand, dict):
            brand = brand.get('name')

        product_info = {
            'name': product.get('name'),
            'description': product.get('description'),
            'price': None,
            'currency': None,
            'availability': None,
            'brand': brand,
            'sku': product.get('sku'),
            'image': product.get('image'),
            'rating': None,
            'review_count': None
        }

        # Extract offer information ('offers' may be a dict or a list)
        offers = product.get('offers', {})
        if isinstance(offers, list):
            offers = offers[0] if offers else {}  # take the first offer

        product_info['price'] = offers.get('price')
        product_info['currency'] = offers.get('priceCurrency')
        product_info['availability'] = offers.get('availability')

        # Extract rating information
        rating_data = product.get('aggregateRating', {})
        if rating_data:
            product_info['rating'] = rating_data.get('ratingValue')
            product_info['review_count'] = rating_data.get('reviewCount')

        products.append(product_info)

    return products
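
For example, against a hypothetical product page:

# Print a compact summary of each product found on the page
for p in extract_product_data("https://example.com/product-page"):
    print(f"{p['name']}: {p['price']} {p['currency']} ({p['availability']})")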

Handling Complex Nested Structures

JSON-LD data can contain deeply nested objects. Here's how to navigate complex structures:

def safe_get_nested(data, keys, default=None):
    """Safely get nested dictionary values"""
    for key in keys:
        if isinstance(data, dict) and key in data:
            data = data[key]
        else:
            return default
    return data

def extract_article_data(url):
    """Extract article data with nested author information"""
    json_data = extract_json_ld(url)
    categorized = categorize_json_ld(json_data)

    articles = []
    for article in categorized.get('Article', []):
        article_info = {
            'headline': article.get('headline'),
            'description': article.get('description'),
            'date_published': article.get('datePublished'),
            'date_modified': article.get('dateModified'),
            'author_name': safe_get_nested(article, ['author', 'name']),
            'author_url': safe_get_nested(article, ['author', 'url']),
            'publisher_name': safe_get_nested(article, ['publisher', 'name']),
            'publisher_logo': safe_get_nested(article, ['publisher', 'logo', 'url']),
            'image': article.get('image'),
            'word_count': article.get('wordCount')
        }

        # Handle author as array
        author = article.get('author')
        if isinstance(author, list) and author:
            article_info['author_name'] = author[0].get('name')
            article_info['author_url'] = author[0].get('url')

        articles.append(article_info)

    return articles
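
To see why safe_get_nested is worth the extra function, compare it with chained access on hypothetical article objects whose fields vary in shape:

# Publishers appear both as nested objects and as bare strings in the wild
article_a = {'publisher': {'name': 'Example News'}}
article_b = {'publisher': 'Example News'}

print(safe_get_nested(article_a, ['publisher', 'name']))  # 'Example News'
print(safe_get_nested(article_b, ['publisher', 'name']))  # None, no crash
# By contrast, article_b.get('publisher', {}).get('name') raises
# AttributeError because the value is a str, not a dict.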

Error Handling and Validation

Robust JSON-LD extraction requires proper error handling:

import logging

def robust_json_ld_extraction(url, timeout=10):
    """Extract JSON-LD with comprehensive error handling"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        json_scripts = soup.find_all('script', type='application/ld+json')

        if not json_scripts:
            logging.warning(f"No JSON-LD found on {url}")
            return []

        valid_data = []
        for i, script in enumerate(json_scripts):
            if not script.string:
                logging.warning(f"Empty JSON-LD script tag {i} on {url}")
                continue

            try:
                # Clean the JSON string
                json_str = script.string.strip()
                data = json.loads(json_str)

                # Validate basic structure
                if isinstance(data, dict) and '@type' in data:
                    valid_data.append(data)
                elif isinstance(data, list):
                    valid_data.extend([item for item in data if isinstance(item, dict) and '@type' in item])
                else:
                    logging.warning(f"Invalid JSON-LD structure in script {i} on {url}")

            except json.JSONDecodeError as e:
                logging.error(f"JSON decode error in script {i} on {url}: {e}")
                continue

        return valid_data

    except requests.RequestException as e:
        logging.error(f"Request error for {url}: {e}")
        return []
    except Exception as e:
        logging.error(f"Unexpected error processing {url}: {e}")
        return []
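
Configure logging before calling it so the warnings and errors surface somewhere useful (a minimal setup; the URL and timeout are placeholders):

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

data = robust_json_ld_extraction("https://example.com/article", timeout=15)
print(f"Extracted {len(data)} valid JSON-LD object(s)")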

JavaScript Implementation

For developers working with Node.js, here's how to extract JSON-LD using JavaScript:

const axios = require('axios');
const cheerio = require('cheerio');

async function extractJsonLD(url) {
    try {
        const response = await axios.get(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
        });

        const $ = cheerio.load(response.data);
        const jsonLDElements = $('script[type="application/ld+json"]');

        const jsonData = [];
        jsonLDElements.each((index, element) => {
            try {
                const jsonText = $(element).html();
                if (jsonText) {
                    const data = JSON.parse(jsonText);
                    jsonData.push(data);
                }
            } catch (error) {
                console.error(`Error parsing JSON-LD at index ${index}:`, error);
            }
        });

        return jsonData;
    } catch (error) {
        console.error('Error fetching or parsing page:', error);
        return [];
    }
}

// Usage
extractJsonLD('https://example.com/product')
    .then(data => console.log(JSON.stringify(data, null, 2)));

Working with Schema.org Types

Different Schema.org types require specific extraction approaches. Here's a comprehensive extractor:

class SchemaExtractor:
    def __init__(self, url):
        self.url = url
        self.json_data = robust_json_ld_extraction(url)

    def extract_breadcrumbs(self):
        """Extract breadcrumb navigation data"""
        breadcrumbs = []
        for data in self.json_data:
            if data.get('@type') == 'BreadcrumbList':
                items = data.get('itemListElement', [])
                for item in items:
                    breadcrumbs.append({
                        'name': item.get('name'),
                        'url': item.get('item'),
                        'position': item.get('position')
                    })
        # 'position' may be missing or None, so fall back to 0 when sorting
        return sorted(breadcrumbs, key=lambda x: x.get('position') or 0)

    def extract_organization(self):
        """Extract organization information"""
        for data in self.json_data:
            if data.get('@type') == 'Organization':
                return {
                    'name': data.get('name'),
                    'url': data.get('url'),
                    'logo': data.get('logo'),
                    'contact_point': data.get('contactPoint', {}),
                    'social_media': data.get('sameAs', []),
                    'address': data.get('address', {})
                }
        return None

    def extract_events(self):
        """Extract event information"""
        events = []
        for data in self.json_data:
            if data.get('@type') == 'Event':
                events.append({
                    'name': data.get('name'),
                    'description': data.get('description'),
                    'start_date': data.get('startDate'),
                    'end_date': data.get('endDate'),
                    'location': data.get('location', {}),
                    'organizer': data.get('organizer', {}),
                    'offers': data.get('offers', [])
                })
        return events
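
Usage, against a placeholder URL:

extractor = SchemaExtractor("https://example.com/event-page")

for crumb in extractor.extract_breadcrumbs():
    print(f"{crumb['position']}. {crumb['name']} -> {crumb['url']}")

org = extractor.extract_organization()
if org:
    print(f"Organization: {org['name']} ({org['url']})")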

Performance Optimization

For large-scale scraping operations, consider these optimizations:

import concurrent.futures
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_extraction(url):
    """Cache JSON-LD extraction results"""
    return robust_json_ld_extraction(url)

def batch_extract_json_ld(urls, max_workers=5):
    """Extract JSON-LD from multiple URLs concurrently"""
    results = {}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(cached_extraction, url): url for url in urls}

        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                results[url] = data
            except Exception as exc:
                logging.error(f'URL {url} generated an exception: {exc}')
                results[url] = []

    return results
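
For example, a small batch run (placeholder URLs):

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/article/3",
]
results = batch_extract_json_ld(urls, max_workers=3)
for url, data in results.items():
    print(f"{url}: {len(data)} JSON-LD object(s)")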

Best Practices and Tips

  1. Always validate JSON structure before processing to avoid runtime errors
  2. Handle missing fields gracefully using safe dictionary access methods
  3. Respect robots.txt and implement proper rate limiting (a minimal per-host limiter is sketched after this list)
  4. Use appropriate headers to avoid being blocked by anti-bot measures
  5. Cache results when processing multiple pages from the same domain
  6. Log extraction activities for debugging and monitoring purposes
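
For point 3, here is a minimal per-host rate limiter, assuming a single-threaded crawl (the function name and one-second default are illustrative, not a library API):

import time
from urllib.parse import urlparse

import requests

_last_request_at = {}

def polite_get(url, min_interval=1.0, **kwargs):
    """requests.get() that spaces out calls to the same host."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request_at.get(host, 0.0)
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_request_at[host] = time.monotonic()
    return requests.get(url, timeout=10, **kwargs)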

Alternative Approaches

While Beautiful Soup works well for JSON-LD extraction, dedicated libraries such as extruct can parse JSON-LD, Microdata, and RDFa in a single pass, which may be a better fit when you need every structured-data format a page exposes.

Conclusion

JSON-LD extraction with Beautiful Soup provides a powerful method for accessing structured data from websites. By following these patterns and best practices, you can build robust scrapers that efficiently extract valuable information from web pages while handling the complexities and edge cases commonly encountered in real-world scraping.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
