How do I use Simple HTML DOM to scrape content inside an iframe?

Scraping iframe Content with Simple HTML DOM

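Simple HTML DOM cannot reach an iframe's content by parsing the parent page alone, because the parent's HTML contains only the iframe element and its src attribute; the framed document lives at a separate URL. Scraping it therefore takes two steps: read the src from the parent page, then load and parse that URL.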

Understanding iframe Limitations

Server-side advantage: The same-origin policy is enforced by browsers, so a PHP script using Simple HTML DOM can request an iframe's URL regardless of its origin. You still have to fetch the iframe's source URL directly, because the parent page's HTML does not include the framed document.

Step-by-Step Process

1. Extract iframe Source URL

First, scrape the parent page to find the iframe's src attribute:

<?php
include('simple_html_dom.php');

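// Load the parent page containing the iframe
$html = file_get_html('https://example.com/page-with-iframe.html');
if (!$html) {
    // file_get_html() returns false on failure; calling find() on it would be a fatal error
    exit("Failed to load parent page\n");
}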

// Find the iframe element
$iframe = $html->find('iframe', 0);

if ($iframe) {
    $iframe_src = $iframe->src;
    echo "iframe source: " . $iframe_src . "\n";
} else {
    echo "No iframe found\n";
}
?>

2. Handle Relative URLs

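An iframe's src is often relative or protocol-relative (starting with //). Convert it to an absolute URL before requesting it:

<?php
function makeAbsoluteUrl($base_url, $relative_url) {
    // If already absolute, return as-is
    if (preg_match('/^https?:\/\//i', $relative_url)) {
        return $relative_url;
    }

    // Parse base URL, keeping a non-default port if present
    $base_parts = parse_url($base_url);
    $base_host = $base_parts['scheme'] . '://' . $base_parts['host']
        . (isset($base_parts['port']) ? ':' . $base_parts['port'] : '');

    // Protocol-relative URL: //cdn.example.com/iframe.html
    if (strpos($relative_url, '//') === 0) {
        return $base_parts['scheme'] . ':' . $relative_url;
    }

    // An absolute path (/path/to/iframe) is used as-is; a relative path
    // (path/to/iframe) is appended to the directory of the base path
    if (strpos($relative_url, '/') !== 0) {
        $base_path = $base_parts['path'] ?? '/';
        $relative_url = substr($base_path, 0, strrpos($base_path, '/') + 1) . $relative_url;
    }

    // Set the query string/fragment aside, then collapse "." and ".." segments
    $cut = strcspn($relative_url, '?#');
    $path = substr($relative_url, 0, $cut);
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $trailing = ($segments && preg_match('#/\.{0,2}$#', $path)) ? '/' : '';
    return $base_host . '/' . implode('/', $segments) . $trailing . substr($relative_url, $cut);
}

// Usage
$base_url = 'https://example.com/parent-page.html';
$iframe_src = '../content/iframe.html';
$absolute_url = makeAbsoluteUrl($base_url, $iframe_src);
// https://example.com/content/iframe.html
?>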

3. Complete iframe Scraping Example

Here's a robust implementation with error handling:

<?php
include('simple_html_dom.php');

function scrapeIframeContent($parent_url, $iframe_selector = 'iframe') {
    try {
        // Load parent page
        $parent_html = file_get_html($parent_url);
        if (!$parent_html) {
            throw new Exception("Failed to load parent page");
        }

        // Find iframe
        $iframe = $parent_html->find($iframe_selector, 0);
        if (!$iframe) {
            throw new Exception("iframe not found");
        }

        // Get iframe source URL
        $iframe_src = $iframe->src;
        if (empty($iframe_src)) {
            throw new Exception("iframe src attribute is empty");
        }

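        // Make URL absolute if needed
        if (!preg_match('/^https?:\/\//i', $iframe_src)) {
            $base_parts = parse_url($parent_url);
            $base_host = $base_parts['scheme'] . '://' . $base_parts['host'];

            if (strpos($iframe_src, '//') === 0) {
                // Protocol-relative URL: reuse the parent page's scheme
                $iframe_src = $base_parts['scheme'] . ':' . $iframe_src;
            } elseif (strpos($iframe_src, '/') === 0) {
                $iframe_src = $base_host . $iframe_src;
            } else {
                // Resolve against the directory of the parent page's path
                $base_path = $base_parts['path'] ?? '/';
                $base_dir = substr($base_path, 0, strrpos($base_path, '/') + 1);
                $iframe_src = $base_host . $base_dir . $iframe_src;
            }
        }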

        // Load iframe content
        $iframe_html = file_get_html($iframe_src);
        if (!$iframe_html) {
            throw new Exception("Failed to load iframe content from: " . $iframe_src);
        }

        // Clean up
        $parent_html->clear();

        return $iframe_html;

    } catch (Exception $e) {
        echo "Error: " . $e->getMessage() . "\n";
        return false;
    }
}

// Usage example
$parent_url = 'https://example.com/page-with-iframe.html';
$iframe_content = scrapeIframeContent($parent_url);

if ($iframe_content) {
    // Extract specific data from iframe
    $title = $iframe_content->find('title', 0);
    if ($title) {
        echo "iframe title: " . $title->plaintext . "\n";
    }

    // Find all links in iframe
    $links = $iframe_content->find('a');
    foreach ($links as $link) {
        echo "Link: " . $link->href . " - " . $link->plaintext . "\n";
    }

    // Extract table data
    $table_rows = $iframe_content->find('table tr');
    foreach ($table_rows as $row) {
        $cells = $row->find('td');
        $row_data = [];
        foreach ($cells as $cell) {
            $row_data[] = trim($cell->plaintext);
        }
        if (!empty($row_data)) {
            echo "Row: " . implode(' | ', $row_data) . "\n";
        }
    }

    // Clean up
    $iframe_content->clear();
}
?>

Handling Multiple iframes

When dealing with multiple iframes on a page:

<?php
function scrapeMultipleIframes($parent_url) {
    $parent_html = file_get_html($parent_url);
    if (!$parent_html) return false;

    $iframes = $parent_html->find('iframe');
    $results = [];

    foreach ($iframes as $index => $iframe) {
        $iframe_src = $iframe->src;
        if (empty($iframe_src)) continue;

        // Make absolute URL with makeAbsoluteUrl(), the helper defined in step 2,
        // so relative paths resolve against the parent page's directory
        $iframe_src = makeAbsoluteUrl($parent_url, $iframe_src);

        // Load iframe content
        $iframe_html = file_get_html($iframe_src);
        if ($iframe_html) {
            $results[$index] = [
                'url' => $iframe_src,
                'title' => $iframe_html->find('title', 0)->plaintext ?? '',
                'content' => $iframe_html
            ];
        }
    }

    $parent_html->clear();
    return $results;
}
?>

Common Issues and Solutions

1. Dynamic iframe Loading

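Simple HTML DOM parses only the static HTML returned by the server and does not execute JavaScript. If the iframe's src is set by a script, the attribute will be missing or a placeholder in the HTML you receive, and you'll need a browser automation tool such as Selenium instead.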
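
Before reaching for browser automation, it can help to check whether the src found in the static HTML is usable at all. The following is a minimal sketch of such a check; the helper name hasStaticIframeSrc() and the list of placeholder prefixes are assumptions made for illustration and are not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: decides whether an iframe's src, as read from the
// static HTML, looks like a document that can be fetched directly.
function hasStaticIframeSrc($src) {
    // Cast first so a missing attribute (false/null) becomes an empty string
    $src = trim((string) $src);

    // Missing or empty src: the frame is probably populated by JavaScript
    if ($src === '') {
        return false;
    }

    // Values that never point at a fetchable document
    // (an assumed, non-exhaustive list)
    foreach (['about:', 'javascript:', 'data:'] as $prefix) {
        if (stripos($src, $prefix) === 0) {
            return false;
        }
    }

    return true;
}

// Usage: these would normally be $iframe->src values from the parent page
foreach (['/embed/chart.html', '', 'about:blank', 'javascript:void(0)'] as $src) {
    echo var_export($src, true) . ' => '
        . (hasStaticIframeSrc($src) ? 'fetch directly' : 'needs a browser') . "\n";
}
```

A false result only suggests that the frame is filled in by a script; it does not prove it, so treat it as a hint for when to switch tools.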

2. Authentication Required

For password-protected iframes, you may need to handle cookies and sessions:

<?php
// Set up context with cookies
$context = stream_context_create([
    'http' => [
        'header' => "Cookie: session_id=abc123\r\n"
    ]
]);

$iframe_html = file_get_html($iframe_url, false, $context);
?>

3. Rate Limiting

Add delays between requests to avoid being blocked:

<?php
// Add delay between requests
sleep(1); // Wait 1 second
$iframe_html = file_get_html($iframe_url);
?>
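
When several iframes are fetched in a loop, a fixed sleep(1) adds a full second even if the previous request already took that long. One possible refinement, sketched below, is to enforce a minimum interval between requests instead. The throttle() helper and the example URLs are hypothetical.

```php
<?php
// Hypothetical helper: sleeps just long enough to keep at least
// $min_interval seconds between consecutive calls.
function throttle($min_interval = 1.0) {
    static $last_call = null;

    if ($last_call !== null) {
        $elapsed = microtime(true) - $last_call;
        if ($elapsed < $min_interval) {
            // usleep() takes microseconds
            usleep((int) round(($min_interval - $elapsed) * 1000000));
        }
    }

    $last_call = microtime(true);
}

// Usage: call throttle() before each request in a loop
$urls = ['https://example.com/frame-a.html', 'https://example.com/frame-b.html'];
foreach ($urls as $url) {
    throttle(1.0);
    // $iframe_html = file_get_html($url);
    echo "Would fetch: " . $url . "\n";
}
```

The first call returns immediately; later calls wait only for whatever remains of the interval.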

Best Practices

  1. Error Handling: Always check if file_get_html() returns false
  2. Memory Management: Call $html->clear() to free memory
  3. Respect Rate Limits: Add delays between requests
  4. Check robots.txt: Ensure you're allowed to scrape the content
  5. Handle Timeouts: Set appropriate timeout values for slow-loading iframes
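
For the timeout point, one option is to pass a stream context to file_get_html(), in the same way as the cookie example earlier. The sketch below builds such a context with PHP's standard HTTP context options; the helper name buildScrapeContext(), the default of 10 seconds, and the User-Agent string are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: builds a stream context with a read timeout and a
// User-Agent string, for use with file_get_html() or file_get_contents().
function buildScrapeContext($timeout_seconds = 10, $user_agent = 'MyScraper/1.0') {
    return stream_context_create([
        'http' => [
            'timeout' => $timeout_seconds, // read timeout, in seconds
            'user_agent' => $user_agent,
        ],
    ]);
}

$context = buildScrapeContext(5);

// Usage (requires simple_html_dom.php to be included):
// $iframe_html = file_get_html($iframe_url, false, $context);
// if (!$iframe_html) { /* timed out or failed to load */ }

// Show the options that will be applied to the request
print_r(stream_context_get_options($context));
```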

Modern Alternatives

Consider these more robust alternatives to Simple HTML DOM:

  • Goutte: Symfony-based web scraper with better error handling (now deprecated in favor of the HttpBrowser class from Symfony's BrowserKit component)
  • Guzzle + DOMDocument: More control over HTTP requests
  • Roach: Modern PHP web scraping framework
  • Browser automation: Selenium, Puppeteer, or Playwright for JavaScript-heavy sites

Simple HTML DOM works well for basic iframe scraping, but modern tools offer better performance and reliability for complex scenarios.
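
As a rough illustration of the DOMDocument route, the sketch below extracts iframe src values with PHP's built-in DOM extension. It parses an inline HTML string so that it runs without any HTTP client; the function name extractIframeSources() and the sample markup are assumptions, and in practice the HTML would come from Guzzle or a similar client.

```php
<?php
// Extracts the src attribute of every iframe in an HTML string using
// PHP's built-in DOM extension (ext-dom). Expects a non-empty string.
function extractIframeSources($html) {
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Usage with an inline sample document
$html = '<html><body><iframe src="/embed/a.html"></iframe>'
      . '<iframe></iframe><iframe src="https://example.com/b.html"></iframe></body></html>';
print_r(extractIframeSources($html));
```

The returned values are still raw attribute values, so relative ones need the same URL resolution step as with Simple HTML DOM.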
