How do you use Cheerio to extract data from embedded scripts or JSON?

Extracting data from embedded scripts and JSON within HTML documents is a common requirement in web scraping. Many websites embed structured data, configuration objects, or dynamic content within <script> tags. Cheerio, a server-side implementation of the jQuery API, provides powerful methods for locating and extracting this embedded data efficiently.

Understanding Embedded Data Types

Modern websites commonly embed data in several formats (the first two are sketched in the snippet after this list):

  • JSON-LD structured data for SEO and schema markup
  • Inline JavaScript variables containing application state
  • Configuration objects for frameworks like React or Vue
  • Data attributes in script tags
  • Base64 encoded data within scripts
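
As a quick illustration, here is a hypothetical page head that combines JSON-LD markup with an inline state variable (the __APP_STATE__ name is a placeholder, not a standard):

<head>
    <!-- JSON-LD structured data -->
    <script type="application/ld+json">
        {"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
    </script>
    <!-- Inline application state (hypothetical variable name) -->
    <script>
        window.__APP_STATE__ = {"user": null, "locale": "en"};
    </script>
</head>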

Basic Script Tag Extraction

The most straightforward approach involves selecting script tags and extracting their content:

const cheerio = require('cheerio');
const axios = require('axios');

async function extractScriptData(url) {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract all script tag contents
    const scriptContents = [];
    $('script').each((index, element) => {
        const scriptText = $(element).html();
        if (scriptText) {
            scriptContents.push(scriptText.trim());
        }
    });

    return scriptContents;
}
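
A quick usage sketch (the URL is a placeholder; substitute a real page):

// Hypothetical target; any URL returning HTML will do
extractScriptData('https://example.com')
    .then(scripts => console.log(`Found ${scripts.length} inline script blocks`))
    .catch(error => console.error('Request failed:', error.message));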

Extracting JSON-LD Structured Data

JSON-LD (JavaScript Object Notation for Linked Data) is commonly used for structured data markup:

function extractJsonLD(html) {
    const $ = cheerio.load(html);
    const jsonLDScripts = [];

    $('script[type="application/ld+json"]').each((index, element) => {
        try {
            const jsonText = $(element).html();
            const jsonData = JSON.parse(jsonText);
            jsonLDScripts.push(jsonData);
        } catch (error) {
            console.error('Failed to parse JSON-LD:', error);
        }
    });

    return jsonLDScripts;
}

// Usage example
const html = `
<html>
<head>
    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Example Product",
        "price": "29.99",
        "availability": "InStock"
    }
    </script>
</head>
</html>`;

const structuredData = extractJsonLD(html);
console.log(structuredData[0].name); // "Example Product"
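
Note that a single JSON-LD block may contain a top-level array of entities or an @graph wrapper around several of them. A small helper (a sketch, assuming only these common shapes) normalizes the results into a flat entity list:

function flattenJsonLD(jsonLDScripts) {
    const entities = [];
    for (const data of jsonLDScripts) {
        if (Array.isArray(data)) {
            entities.push(...data);           // top-level array of entities
        } else if (data && Array.isArray(data['@graph'])) {
            entities.push(...data['@graph']); // @graph wrapper
        } else if (data) {
            entities.push(data);              // single entity
        }
    }
    return entities;
}

// Find the first Product entity, regardless of how it was wrapped
const product = flattenJsonLD(structuredData).find(e => e['@type'] === 'Product');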

Extracting Inline JavaScript Variables

Many websites embed configuration or data as JavaScript variables:

function extractJavaScriptVariables(html, variableName) {
    const $ = cheerio.load(html);
    let extractedData = null;

    $('script').each((index, element) => {
        const scriptContent = $(element).html();
        if (scriptContent && scriptContent.includes(variableName)) {
            // Regular expression to match variable assignment
            // (assumes variableName contains no regex metacharacters)
            const regex = new RegExp(`${variableName}\\s*=\\s*({.*?});`, 's');
            const match = scriptContent.match(regex);

            if (match) {
                try {
                    extractedData = JSON.parse(match[1]);
                } catch (error) {
                    console.error('Failed to parse JavaScript variable:', error);
                }
            }
        }
    });

    return extractedData;
}

// Example usage
const htmlWithJS = `
<script>
    var appConfig = {
        "apiUrl": "https://api.example.com",
        "version": "1.2.3",
        "features": ["feature1", "feature2"]
    };
</script>`;

const config = extractJavaScriptVariables(htmlWithJS, 'appConfig');
console.log(config.apiUrl); // "https://api.example.com"
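
One caveat: the lazy {.*?} pattern stops at the first }; sequence, so it fails when the object itself contains a nested };. The brace-counting sketch below handles nesting (it still ignores braces inside string literals, so treat it as a starting point rather than a full parser):

function extractBalancedObject(source, anchorToken) {
    const anchor = source.indexOf(anchorToken);
    if (anchor === -1) return null;
    const open = source.indexOf('{', anchor);
    if (open === -1) return null;

    let depth = 0;
    for (let i = open; i < source.length; i++) {
        if (source[i] === '{') depth++;
        else if (source[i] === '}') depth--;
        if (depth === 0) return source.slice(open, i + 1); // balanced span found
    }
    return null; // unbalanced braces
}

// extractBalancedObject(scriptContent, 'appConfig') returns the raw object
// literal, which can then go through JSON.parse if it is valid JSON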

Advanced Pattern Matching for Complex Data

For more complex scenarios, use sophisticated regular expressions and parsing strategies:

class ScriptDataExtractor {
    constructor(html) {
        this.$ = cheerio.load(html);
    }

    // Extract React props or state
    extractReactData() {
        const reactData = [];

        this.$('script').each((index, element) => {
            const content = this.$(element).html();
            if (content && content.includes('React')) {
                // Look for React.createElement or ReactDOM.render patterns
                const patterns = [
                    /React\.createElement\([^,]+,\s*({.*?})/gs,
                    /ReactDOM\.render\([^,]+,\s*({.*?})/gs
                ];

                patterns.forEach(pattern => {
                    const matches = content.match(pattern);
                    if (matches) {
                        matches.forEach(match => {
                            try {
                                const propsMatch = match.match(/{.*?}/s);
                                if (propsMatch) {
                                    reactData.push(propsMatch[0]);
                                }
                            } catch (error) {
                                console.error('Error parsing React data:', error);
                            }
                        });
                    }
                });
            }
        });

        return reactData;
    }

    // Extract data from specific script sources
    extractFromScriptSrc(srcPattern) {
        const scripts = [];

        this.$('script[src]').each((index, element) => {
            const src = this.$(element).attr('src');
            if (src && src.match(srcPattern)) {
                scripts.push({
                    src: src,
                    attributes: this.$(element).get(0).attribs
                });
            }
        });

        return scripts;
    }

    // Extract base64 encoded data
    extractBase64Data() {
        const base64Data = [];

        this.$('script').each((index, element) => {
            const content = this.$(element).html();
            if (content) {
                // Look for base64 patterns
                const base64Pattern = /data:([^;]+);base64,([A-Za-z0-9+/=]+)/g;
                let match;

                while ((match = base64Pattern.exec(content)) !== null) {
                    base64Data.push({
                        mimeType: match[1],
                        data: match[2],
                        decoded: Buffer.from(match[2], 'base64').toString('utf-8')
                    });
                }
            }
        });

        return base64Data;
    }
}
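
A brief usage sketch for the class (the analytics pattern is an arbitrary example):

const extractor = new ScriptDataExtractor(html);
console.log(extractor.extractFromScriptSrc(/analytics/)); // external scripts whose src mentions "analytics"
console.log(extractor.extractBase64Data());               // decoded data: URIs found in inline scripts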

Handling Dynamic Content and State Management

Modern web applications often use state management libraries that embed initial state:

function extractReduxState(html) {
    const $ = cheerio.load(html);
    let reduxState = null;

    $('script').each((index, element) => {
        const content = $(element).html();
        if (content && content.includes('__INITIAL_STATE__')) {
            // Common pattern: initial state serialized onto window.__INITIAL_STATE__
            const statePattern = /window\.__INITIAL_STATE__\s*=\s*({.*?});/s;
            const match = content.match(statePattern);

            if (match) {
                try {
                    reduxState = JSON.parse(match[1]);
                } catch (error) {
                    console.error('Failed to parse Redux state:', error);
                }
            }
        }
    });

    return reduxState;
}
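
Frameworks that serialize state into a dedicated JSON script tag are simpler still, since no regex is needed. Sites built with Next.js, for example, typically embed their initial props in a <script id="__NEXT_DATA__" type="application/json"> element:

function extractNextData(html) {
    const $ = cheerio.load(html);
    // The tag contains pure JSON, so it parses directly
    const raw = $('script#__NEXT_DATA__').html();
    if (!raw) return null;

    try {
        return JSON.parse(raw);
    } catch (error) {
        console.error('Failed to parse __NEXT_DATA__:', error);
        return null;
    }
}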

Error Handling and Validation

Robust script extraction requires proper error handling:

function safeExtractScriptData(html, options = {}) {
    const {
        validateJSON = true,
        maxScriptSize = 1024 * 1024 // 1 MB limit
    } = options;

    const $ = cheerio.load(html);
    const results = {
        jsonLD: [],
        inlineData: [],
        errors: []
    };

    $('script').each((index, element) => {
        try {
            const content = $(element).html();
            const type = $(element).attr('type');

            // Skip oversized scripts
            if (content && content.length > maxScriptSize) {
                results.errors.push(`Script ${index} exceeds size limit`);
                return;
            }

            if (type === 'application/ld+json' && content) {
                try {
                    const parsed = JSON.parse(content);
                    if (!validateJSON || (parsed && typeof parsed === 'object')) {
                        results.jsonLD.push(parsed);
                    }
                } catch (jsonError) {
                    results.errors.push(`Invalid JSON-LD in script ${index}: ${jsonError.message}`);
                }
            } else if (content && !type) {
                // Regular script content
                results.inlineData.push({
                    index,
                    content: content.substring(0, 1000), // Truncate for safety
                    size: content.length
                });
            }
        } catch (error) {
            results.errors.push(`Error processing script ${index}: ${error.message}`);
        }
    });

    return results;
}
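
A usage sketch, assuming html has already been fetched (the size limit here is arbitrary):

const report = safeExtractScriptData(html, { maxScriptSize: 512 * 1024 });
console.log(`JSON-LD blocks: ${report.jsonLD.length}`);
console.log(`Inline scripts: ${report.inlineData.length}`);
console.log(`Errors: ${report.errors.length}`);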

Integration with Web Scraping Workflows

When building comprehensive scraping solutions, consider integrating script extraction with other tools. For JavaScript-heavy applications, you might need to combine Cheerio with browser automation tools. The guide How to handle AJAX requests using Puppeteer provides guidance for scenarios where Cheerio alone isn't sufficient for dynamic content.

Best Practices and Performance Optimization

Memory Management

function optimizedScriptExtraction(html) {
    const $ = cheerio.load(html, {
        xmlMode: false,
        decodeEntities: false // Faster parsing
    });

    // Collect parsed blocks keyed by script index
    const results = new Map();

    $('script[type="application/ld+json"]').each((index, element) => {
        const content = $(element).html();
        if (content && content.length < 50000) { // Reasonable size limit
            try {
                results.set(index, JSON.parse(content));
            } catch (error) {
                console.warn(`Skipping malformed JSON-LD at index ${index}`);
            }
        }
    });

    return Array.from(results.values());
}

Selective Extraction

function selectiveScriptExtraction(html, selectors) {
    const $ = cheerio.load(html);
    const extracted = {};

    // Only process scripts matching specific criteria
    Object.entries(selectors).forEach(([key, selector]) => {
        extracted[key] = [];
        $(selector).each((index, element) => {
            const content = $(element).html();
            if (content) {
                extracted[key].push(content);
            }
        });
    });

    return extracted;
}

// Usage
const selectors = {
    jsonLD: 'script[type="application/ld+json"]',
    analytics: 'script[src*="analytics"]',
    config: 'script:contains("config")'
};

const extracted = selectiveScriptExtraction(html, selectors);
console.log(Object.keys(extracted)); // ['jsonLD', 'analytics', 'config']

For complex single-page applications where script extraction becomes challenging, the guide How to crawl a single page application (SPA) using Puppeteer offers alternative approaches that might be more suitable.

Conclusion

Extracting data from embedded scripts and JSON using Cheerio is a powerful technique for accessing structured data that's not readily available in standard HTML elements. By combining proper selectors, regular expressions, and error handling, you can reliably extract valuable information from modern web applications.

The key to successful script extraction lies in understanding the specific patterns used by target websites, implementing robust parsing logic, and maintaining proper error handling to ensure your scraping operations remain stable and efficient.

Remember to respect robots.txt files and rate limits when implementing these techniques in production environments, and always validate extracted data before using it in your applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
