How can I scrape data from single-page applications (SPAs) using PHP?

Scraping Single-Page Applications (SPAs) with PHP presents unique challenges because SPAs rely heavily on JavaScript to dynamically render content after the initial page load. Traditional PHP scraping methods like cURL and DOMDocument only retrieve the static HTML, missing the dynamic content generated by JavaScript frameworks like React, Vue.js, or Angular.

Understanding the SPA Challenge

SPAs load minimal HTML initially and use JavaScript to fetch data and render content dynamically. When you use standard PHP scraping tools, you'll typically see:

<!DOCTYPE html>
<html>
<head>
    <title>My SPA</title>
</head>
<body>
    <div id="root"></div>
    <script src="app.js"></script>
</body>
</html>

The actual content you need is rendered by JavaScript after the page loads, making it invisible to traditional scraping methods.
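You can confirm this without any network access by parsing the kind of shell markup shown above with DOMDocument and checking that the application's root node is empty. A minimal sketch (the markup and the root id mirror the example and are purely illustrative):

```php
<?php

// Simulate what a plain HTTP fetch returns for a typical SPA:
// a shell document whose root node has no rendered content.
$staticHtml = <<<HTML
<!DOCTYPE html>
<html>
<head><title>My SPA</title></head>
<body>
    <div id="root"></div>
    <script src="app.js"></script>
</body>
</html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($staticHtml);
$xpath = new DOMXPath($dom);

// The root node exists, but no JavaScript has run, so it is empty
$root = $xpath->query('//div[@id="root"]')->item(0);
$isEmptyShell = $root !== null && trim($root->textContent) === '';

var_dump($isEmptyShell); // bool(true)
```

A check like this is a cheap way to decide at runtime whether a plain HTTP fetch is enough or a headless browser is required.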

Method 1: Using Headless Browsers with PuPHPeteer

The most effective approach is using a headless browser that can execute JavaScript. PuPHPeteer is a PHP bridge to Puppeteer (note that the original project is no longer actively maintained, so check for a maintained fork before using it in production):

Installation

composer require nesk/puphpeteer
npm install @nesk/puphpeteer

Basic SPA Scraping Example

<?php
require_once 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;

class SPAScraper
{
    private $puppeteer;

    public function __construct()
    {
        $this->puppeteer = new Puppeteer();
    }

    public function scrapeSPA($url, $waitForSelector = null)
    {
        try {
            $browser = $this->puppeteer->launch([
                'headless' => true,
                'args' => ['--no-sandbox', '--disable-setuid-sandbox']
            ]);

            $page = $browser->newPage();

            // Set viewport for consistent rendering
            $page->setViewport([
                'width' => 1280,
                'height' => 720
            ]);

            // Navigate to the SPA
            $page->goto($url, [
                'waitUntil' => 'networkidle2',
                'timeout' => 30000
            ]);

            // Wait for specific content to load if selector provided
            if ($waitForSelector) {
                $page->waitForSelector($waitForSelector, [
                    'timeout' => 10000
                ]);
            }

            // Get the rendered HTML
            $content = $page->content();

            $browser->close();

            return $content;

        } catch (Exception $e) {
            throw new Exception("SPA scraping failed: " . $e->getMessage());
        }
    }

    public function extractDataFromSPA($url, $selectors)
    {
        try {
            $browser = $this->puppeteer->launch(['headless' => true]);
            $page = $browser->newPage();

            $page->goto($url, ['waitUntil' => 'networkidle2']);

            $data = [];

            foreach ($selectors as $key => $selector) {
                // Wait for element and extract text
                try {
                    $page->waitForSelector($selector, ['timeout' => 5000]);
                    $element = $page->querySelector($selector);
                    // evaluate() expects a JsFunction, not a plain string
                    $data[$key] = $element !== null
                        ? $page->evaluate(
                            \Nesk\Rialto\Data\JsFunction::createWithParameters(['el'])
                                ->body('return el.textContent;'),
                            $element
                        )
                        : null;
                } catch (Exception $e) {
                    $data[$key] = null;
                }
            }

            $browser->close();

            return $data;

        } catch (Exception $e) {
            throw new Exception("Data extraction failed: " . $e->getMessage());
        }
    }
}

// Usage example
$scraper = new SPAScraper();

// Scrape a React application
$selectors = [
    'title' => 'h1.main-title',
    'description' => '.product-description',
    'price' => '.price-display',
    'availability' => '.stock-status'
];

$data = $scraper->extractDataFromSPA('https://example-spa.com/product/123', $selectors);
print_r($data);

Method 2: Detecting SPAs with Goutte

Goutte is a popular PHP scraping library, but it cannot execute JavaScript on its own. You can still use it to fetch the static shell, detect that a page is an SPA, and fall back to a headless browser:

<?php
require_once 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

class SPAGoutte
{
    private $client;

    public function __construct()
    {
        // Note: Goutte alone cannot execute JavaScript
        // This example shows the limitation
        $this->client = new Client(HttpClient::create([
            'timeout' => 30,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (compatible; PHP Scraper)'
            ]
        ]));
    }

    public function scrapeWithFallback($url)
    {
        // First, try to get static content
        $crawler = $this->client->request('GET', $url);

        // Check if content is dynamically loaded
        $scriptTags = $crawler->filter('script')->count();
        $divCount = $crawler->filter('div')->count();

        // Crude heuristic: many scripts but little markup suggests an SPA
        if ($scriptTags > 5 && $divCount < 5) {
            // Likely an SPA, need different approach
            throw new Exception("This appears to be an SPA. Use headless browser approach.");
        }

        return $crawler;
    }
}

Method 3: Chrome/Chromium via Shell Commands

For environments where installing Node.js dependencies is challenging, you can control Chrome directly:

<?php

class ChromeHeadless
{
    private $chromePath;

    public function __construct($chromePath = '/usr/bin/google-chrome')
    {
        $this->chromePath = $chromePath;
    }

    public function scrapeSPA($url, $waitTime = 5)
    {
        // Create a temporary file for the rendered HTML
        $tempHtml = tempnam(sys_get_temp_dir(), 'spa_content');

        // Chrome command with necessary flags; escapeshellarg() already
        // quotes its argument, so no extra quotes are added around %s
        $command = sprintf(
            '%s --headless --disable-gpu --disable-software-rasterizer ' .
            '--disable-dev-shm-usage --no-sandbox --dump-dom ' .
            '--virtual-time-budget=%d %s > %s 2>/dev/null',
            escapeshellarg($this->chromePath),
            $waitTime * 1000, // Convert to milliseconds
            escapeshellarg($url),
            escapeshellarg($tempHtml)
        );

        exec($command, $output, $returnCode);

        if ($returnCode !== 0) {
            unlink($tempHtml);
            throw new Exception("Chrome execution failed");
        }

        $content = file_get_contents($tempHtml);
        unlink($tempHtml);

        return $content;
    }

    public function scrapeWithScreenshot($url, $screenshotPath = null)
    {
        $screenshot = $screenshotPath ?: tempnam(sys_get_temp_dir(), 'spa_screenshot') . '.png';

        // escapeshellarg() handles quoting; no extra quotes needed
        $command = sprintf(
            '%s --headless --disable-gpu --window-size=1280,720 ' .
            '--screenshot=%s %s 2>/dev/null',
            escapeshellarg($this->chromePath),
            escapeshellarg($screenshot),
            escapeshellarg($url)
        );

        exec($command, $output, $returnCode);

        return $returnCode === 0 ? $screenshot : false;
    }
}

// Usage
$chrome = new ChromeHeadless();
$content = $chrome->scrapeSPA('https://example-spa.com', 3);

// Parse with DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$titles = $xpath->query('//h1[@class="title"]');
foreach ($titles as $title) {
    echo $title->textContent . "\n";
}

Method 4: API Endpoint Discovery

Many SPAs load data through API calls. You can intercept these calls and scrape the APIs directly:

<?php

class SPAApiScraper
{
    private $session;

    public function __construct()
    {
        $this->session = curl_init();
        curl_setopt_array($this->session, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_SSL_VERIFYPEER => false, // Note: re-enable certificate verification in production
            CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; API Scraper)'
        ]);
    }

    public function discoverApiEndpoints($url)
    {
        // Use browser automation to monitor network requests
        // This is a simplified example - you'd need Puppeteer for full implementation
        $content = $this->getPageContent($url);

        // Look for common API patterns in JavaScript
        preg_match_all('/(?:fetch|axios|XMLHttpRequest)[^"\']*["\']([^"\']*api[^"\']*)["\']/', $content, $matches);

        return array_unique($matches[1] ?? []);
    }

    public function scrapeApiEndpoint($apiUrl, $headers = [])
    {
        curl_setopt($this->session, CURLOPT_URL, $apiUrl);
        curl_setopt($this->session, CURLOPT_HTTPHEADER, array_merge([
            'Accept: application/json',
            'Content-Type: application/json'
        ], $headers));

        $response = curl_exec($this->session);
        $httpCode = curl_getinfo($this->session, CURLINFO_HTTP_CODE);

        if ($httpCode !== 200) {
            throw new Exception("API request failed with code: $httpCode");
        }

        return json_decode($response, true);
    }

    private function getPageContent($url)
    {
        curl_setopt($this->session, CURLOPT_URL, $url);
        return curl_exec($this->session);
    }

    public function __destruct()
    {
        curl_close($this->session);
    }
}

// Usage example
$apiScraper = new SPAApiScraper();

// Direct API scraping if you know the endpoint
$productData = $apiScraper->scrapeApiEndpoint(
    'https://api.example.com/products/123',
    ['Authorization: Bearer your-token-here']
);

echo json_encode($productData, JSON_PRETTY_PRINT);
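The endpoint-discovery regex used by discoverApiEndpoints() can be exercised offline by running it against a sample of bundled JavaScript. The sample code below is invented for illustration:

```php
<?php

// Sample of the kind of JavaScript an SPA bundle might contain (illustrative)
$bundledJs = <<<'JS'
fetch("/api/v1/products").then(r => r.json());
axios.get('/api/v1/reviews?page=1');
const logo = "/static/logo.png"; // not matched: no fetch/axios prefix
JS;

// Same pattern used by discoverApiEndpoints()
preg_match_all(
    '/(?:fetch|axios|XMLHttpRequest)[^"\']*["\']([^"\']*api[^"\']*)["\']/',
    $bundledJs,
    $matches
);

$endpoints = array_unique($matches[1] ?? []);
print_r($endpoints); // /api/v1/products and /api/v1/reviews?page=1
```

Note the pattern only catches URLs containing the literal substring "api"; real bundles often use relative paths or minified helper names, so treat this as a starting heuristic, not a complete discovery tool.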

Advanced Techniques for Complex SPAs

Handling Authentication

Many SPAs require authentication. Here's how to handle login flows:

<?php

class AuthenticatedSPAScraper
{
    private $puppeteer;

    public function __construct()
    {
        $this->puppeteer = new \Nesk\Puphpeteer\Puppeteer();
    }

    public function loginAndScrape($loginUrl, $credentials, $targetUrl)
    {
        $browser = $this->puppeteer->launch(['headless' => false]); // false shows the browser window, useful for debugging login flows
        $page = $browser->newPage();

        // Navigate to login page
        $page->goto($loginUrl);

        // Fill login form
        $page->waitForSelector('#username');
        $page->type('#username', $credentials['username']);
        $page->type('#password', $credentials['password']);

        // Submit form and wait for navigation
        $page->click('#login-button');
        $page->waitForNavigation(['waitUntil' => 'networkidle2']);

        // Now navigate to target page
        $page->goto($targetUrl);
        $page->waitForSelector('.protected-content');

        $content = $page->content();
        $browser->close();

        return $content;
    }
}

Handling Infinite Scroll

For SPAs with infinite scroll, you need to simulate scrolling:

public function scrapeInfiniteScroll($url, $scrollCount = 5)
{
    $browser = $this->puppeteer->launch(['headless' => true]);
    $page = $browser->newPage();

    $page->goto($url);
    $page->waitForSelector('.content-item');

    // Scroll and wait for new content
    for ($i = 0; $i < $scrollCount; $i++) {
        $page->evaluate('window.scrollTo(0, document.body.scrollHeight)');
        $page->waitFor(2000); // Pause for new content (newer Puppeteer versions use waitForTimeout)
    }

    $content = $page->content();
    $browser->close();

    return $content;
}

Performance Optimization Tips

  1. Reuse Browser Instances: Keep browsers open between requests when scraping multiple pages
  2. Disable Images: Speed up loading by disabling image loading
  3. Use Request Interception: Block unnecessary resources
public function optimizedScraping($urls)
{
    $browser = $this->puppeteer->launch([
        'headless' => true
        // Don't pass --disable-javascript here: SPAs need JS to render.
        // Heavy resources are blocked via request interception below instead.
    ]);

    $results = [];

    foreach ($urls as $url) {
        $page = $browser->newPage();

        // Block images and other resources
        $page->setRequestInterception(true);
        $page->on('request', function($request) {
            if (in_array($request->resourceType(), ['image', 'stylesheet', 'font'])) {
                $request->abort();
            } else {
                $request->continue();
            }
        });

        $page->goto($url);
        $results[] = $page->content();
        $page->close();
    }

    $browser->close();
    return $results;
}

Best Practices and Considerations

  1. Respect robots.txt: Always check the website's robots.txt file
  2. Implement Rate Limiting: Avoid overwhelming servers with too many requests
  3. Handle Errors Gracefully: SPAs can be unpredictable; implement robust error handling
  4. Monitor Memory Usage: Headless browsers can consume significant memory
  5. Keep Dependencies Updated: Browser automation tools evolve rapidly
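For point 2, a small fixed-interval limiter is often enough. The sketch below (class name and rate are illustrative) sleeps just long enough to keep calls under a configured requests-per-second budget:

```php
<?php

// Minimal fixed-interval rate limiter: enforces a minimum gap between calls
class RateLimiter
{
    private $minIntervalMicros;
    private $lastRequestAt = 0.0;

    public function __construct(float $requestsPerSecond = 1.0)
    {
        $this->minIntervalMicros = (int) (1000000 / $requestsPerSecond);
    }

    // Sleep only as long as needed since the previous call
    public function throttle(): void
    {
        $elapsedMicros = (microtime(true) - $this->lastRequestAt) * 1000000;
        if ($elapsedMicros < $this->minIntervalMicros) {
            usleep((int) ($this->minIntervalMicros - $elapsedMicros));
        }
        $this->lastRequestAt = microtime(true);
    }
}

// Two throttled "requests" at 5 req/s should take at least ~0.2s in total
$limiter = new RateLimiter(5.0);
$start = microtime(true);
$limiter->throttle(); // first call proceeds immediately
$limiter->throttle(); // second call waits out the remaining interval
$elapsedSeconds = microtime(true) - $start;
```

Call throttle() before each page load or API request; because it measures elapsed time rather than sleeping a flat amount, slow responses don't add unnecessary extra delay.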

When working with complex SPAs, you might also want to learn about how to crawl a single page application (SPA) using Puppeteer for more specialized techniques, or explore how to handle AJAX requests using Puppeteer to better understand the underlying mechanics of SPA data loading.

Conclusion

Scraping SPAs with PHP requires moving beyond traditional HTTP clients to browser automation tools. While this adds complexity, it provides access to the full rendered content that users see. Choose the method that best fits your environment constraints and scraping requirements. For production applications, consider using dedicated scraping services or APIs when available, as they often provide more reliable and faster access to data than rendering full JavaScript applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
