How can I implement concurrent scraping using PHP?

Concurrent scraping allows you to process multiple URLs simultaneously, dramatically improving the performance of your web scraping operations. While PHP traditionally operates in a single-threaded manner, there are several effective approaches to implement concurrent scraping that can significantly reduce the time required to extract data from multiple sources.

Understanding Concurrent vs Sequential Scraping

Sequential scraping processes one URL at a time, waiting for each request to complete before moving to the next. This approach is inefficient when dealing with multiple URLs, as network latency and response times create bottlenecks. Concurrent scraping, on the other hand, initiates multiple requests simultaneously, allowing your script to work on other tasks while waiting for network responses.
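For contrast, a sequential baseline might look like the sketch below (URLs and option values are illustrative). Its total runtime is roughly the sum of all individual response times, which is exactly the cost the concurrent approaches in this article avoid:

```php
<?php
// Sequential baseline: each request blocks until the previous one finishes,
// so N URLs cost roughly N times the average response time.
function sequentialScrape(array $urls): array {
    $results = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $results[$url] = curl_exec($ch);
        curl_close($ch);
    }
    return $results;
}
?>
```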

Method 1: Using cURL Multi-Handle

The most straightforward approach to concurrent scraping in PHP is cURL's multi-handle functionality. It lets a single process drive multiple HTTP requests in parallel, polling them as a group instead of waiting on each one individually.

Basic cURL Multi Implementation

<?php
function concurrentScrape($urls, $options = []) {
    $multiHandle = curl_multi_init();
    $curlHandles = [];
    $results = [];

    // Initialize individual cURL handles
    foreach ($urls as $index => $url) {
        $curlHandles[$index] = curl_init();

        curl_setopt_array($curlHandles[$index], [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 30,
            CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; PHP Scraper)',
            CURLOPT_SSL_VERIFYPEER => true, // keep TLS verification on; disable only against trusted test hosts
        ]);

        // Merge custom options if provided
        if (!empty($options)) {
            curl_setopt_array($curlHandles[$index], $options);
        }

        curl_multi_add_handle($multiHandle, $curlHandles[$index]);
    }

    // Execute all requests concurrently
    $running = null;
    do {
        $status = curl_multi_exec($multiHandle, $running);
        if ($running) {
            // Wait for socket activity; back off briefly if select() reports an error
            // to avoid a busy loop on platforms where it returns -1
            if (curl_multi_select($multiHandle) === -1) {
                usleep(1000);
            }
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect results
    foreach ($curlHandles as $index => $handle) {
        $results[$index] = [
            'content' => curl_multi_getcontent($handle),
            'info' => curl_getinfo($handle),
            'error' => curl_error($handle)
        ];

        curl_multi_remove_handle($multiHandle, $handle);
        curl_close($handle);
    }

    curl_multi_close($multiHandle);
    return $results;
}

// Example usage
$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    'https://api.example.com/data'
];

$results = concurrentScrape($urls);

foreach ($results as $index => $result) {
    if (empty($result['error']) && $result['info']['http_code'] === 200) {
        echo "Success for URL $index: " . strlen($result['content']) . " bytes\n";
    } else {
        echo "Error for URL $index: " . $result['error'] . "\n";
    }
}
?>

Advanced cURL Multi with Rate Limiting

For production environments, you'll want to implement rate limiting and better error handling:

<?php
class ConcurrentScraper {
    private $maxConcurrent = 10;
    private $delay = 1000000; // 1 second in microseconds

    public function __construct($maxConcurrent = 10, $delayMicroseconds = 1000000) {
        $this->maxConcurrent = $maxConcurrent;
        $this->delay = $delayMicroseconds;
    }

    public function scrapeUrls($urls, $options = []) {
        $results = [];
        $urlChunks = array_chunk($urls, $this->maxConcurrent);

        foreach ($urlChunks as $chunkIndex => $chunk) {
            $chunkResults = $this->processBatch($chunk, $options);
            $results = array_merge($results, $chunkResults);

            // Rate limiting between batches (no delay after the last one)
            if ($chunkIndex < count($urlChunks) - 1) {
                usleep($this->delay);
            }
        }

        return $results;
    }

    private function processBatch($urls, $options) {
        $multiHandle = curl_multi_init();
        $curlHandles = [];
        $results = [];

        foreach ($urls as $index => $url) {
            $curlHandles[$index] = curl_init();

            $defaultOptions = [
                CURLOPT_URL => $url,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT => 30,
                CURLOPT_CONNECTTIMEOUT => 10,
                CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; Advanced PHP Scraper)',
                CURLOPT_SSL_VERIFYPEER => true, // keep TLS verification on; disable only against trusted test hosts
                CURLOPT_ENCODING => '', // Enable compression
            ];

            curl_setopt_array($curlHandles[$index], array_merge($defaultOptions, $options));
            curl_multi_add_handle($multiHandle, $curlHandles[$index]);
        }

        // Execute with improved polling
        $running = null;
        do {
            $status = curl_multi_exec($multiHandle, $running);
            if ($running) {
                // Wait up to 100ms for activity; back off if select() fails
                if (curl_multi_select($multiHandle, 0.1) === -1) {
                    usleep(1000);
                }
            }
        } while ($running > 0 && $status === CURLM_OK);

        // Collect results with error handling
        foreach ($curlHandles as $index => $handle) {
            $content = curl_multi_getcontent($handle);
            $info = curl_getinfo($handle);
            $error = curl_error($handle);

            $results[$index] = [
                'url' => $info['url'],
                'content' => $content,
                'http_code' => $info['http_code'],
                'total_time' => $info['total_time'],
                'content_type' => $info['content_type'],
                'error' => $error,
                'success' => empty($error) && $info['http_code'] >= 200 && $info['http_code'] < 300
            ];

            curl_multi_remove_handle($multiHandle, $handle);
            curl_close($handle);
        }

        curl_multi_close($multiHandle);
        return $results;
    }
}

// Usage example
$scraper = new ConcurrentScraper(5, 500000); // 5 concurrent, 0.5s delay
$urls = ['https://example.com/1', 'https://example.com/2'];
$results = $scraper->scrapeUrls($urls);
?>

Method 2: Using ReactPHP for Asynchronous Scraping

ReactPHP provides true asynchronous programming capabilities for PHP, offering more advanced concurrent scraping options:

<?php
require 'vendor/autoload.php';

use Psr\Http\Message\ResponseInterface;
use React\EventLoop\Loop;
use React\Http\Browser;

function asyncScrapeWithReact($urls) {
    // Since reactphp/http v1.5 the Browser uses the global loop by default
    $browser = new Browser();

    $promises = [];
    foreach ($urls as $url) {
        $promises[] = $browser->get($url)
            ->then(function (ResponseInterface $response) use ($url) {
                return [
                    'url' => $url,
                    'status' => $response->getStatusCode(),
                    'content' => (string) $response->getBody(),
                    'headers' => $response->getHeaders()
                ];
            }, function (Throwable $error) use ($url) {
                // Second then() argument handles rejections across promise versions
                return [
                    'url' => $url,
                    'error' => $error->getMessage(),
                    'status' => null,
                    'content' => null
                ];
            });
    }

    $results = [];
    React\Promise\all($promises)->then(function ($responses) use (&$results) {
        $results = $responses;
    });

    Loop::run();
    return $results;
}
?>

Method 3: Using Guzzle HTTP Client with Concurrent Requests

Guzzle provides excellent built-in support for concurrent HTTP requests:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise;

class GuzzleConcurrentScraper {
    private $client;
    private $concurrency;

    public function __construct($concurrency = 10) {
        $this->concurrency = $concurrency;
        $this->client = new Client([
            'timeout' => 30,
            'connect_timeout' => 10,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (compatible; Guzzle Scraper)'
            ]
        ]);
    }

    public function scrapeUrls($urls, $options = []) {
        $promises = [];

        foreach ($urls as $index => $url) {
            $promises[$index] = $this->client->getAsync($url, $options)
                ->then(function ($response) use ($url) {
                    return [
                        'url' => $url,
                        'status_code' => $response->getStatusCode(),
                        'content' => $response->getBody()->getContents(),
                        'headers' => $response->getHeaders(),
                        'success' => true
                    ];
                })
                ->otherwise(function ($exception) use ($url) {
                    return [
                        'url' => $url,
                        'error' => $exception->getMessage(),
                        'success' => false
                    ];
                });
        }

        // Wait for all promises to settle; otherwise() already converted
        // rejections into error arrays, so every settled entry is fulfilled
        $responses = Promise\settle($promises)->wait();

        $results = [];
        foreach ($responses as $index => $response) {
            $results[$index] = $response['value'];
        }

        return $results;
    }

    public function scrapeUrlsBatched($urls, $batchSize = null, $options = []) {
        $batchSize = $batchSize ?: $this->concurrency;
        $results = [];

        foreach (array_chunk($urls, $batchSize) as $i => $batch) {
            $batchResults = $this->scrapeUrls($batch, $options);
            $results = array_merge($results, $batchResults);

            // Optional delay between batches (skipped after the last one)
            if (($i + 1) * $batchSize < count($urls)) {
                sleep(1);
            }
        }

        return $results;
    }
}

// Usage
$scraper = new GuzzleConcurrentScraper(5);
$urls = [
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/2',
    'https://httpbin.org/json',
    'https://httpbin.org/html'
];

$results = $scraper->scrapeUrls($urls);
foreach ($results as $result) {
    if ($result['success']) {
        echo "✓ {$result['url']}: {$result['status_code']}\n";
    } else {
        echo "✗ {$result['url']}: {$result['error']}\n";
    }
}
?>

Best Practices for Concurrent Scraping

1. Implement Proper Rate Limiting

<?php
class RateLimitedScraper {
    private $requestsPerSecond;
    private $lastRequestTime = 0;

    public function __construct($requestsPerSecond = 10) {
        $this->requestsPerSecond = $requestsPerSecond;
    }

    public function respectRateLimit() {
        $currentTime = microtime(true);
        $timeSinceLastRequest = $currentTime - $this->lastRequestTime;
        $minInterval = 1.0 / $this->requestsPerSecond;

        if ($timeSinceLastRequest < $minInterval) {
            $sleepTime = $minInterval - $timeSinceLastRequest;
            usleep((int) ($sleepTime * 1000000)); // usleep expects an integer
        }

        $this->lastRequestTime = microtime(true);
    }
}
?>
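The throttling logic above can be exercised in isolation. This self-contained sketch reimplements it as a closure (with a hypothetical 20 requests/second limit so the demo runs quickly) and checks that five consecutive calls are spaced out by at least the minimum interval:

```php
<?php
// Minimal standalone version of the throttle in RateLimitedScraper,
// written as a closure so the demonstration runs on its own.
$requestsPerSecond = 20;   // hypothetical limit, just for the demo
$lastRequestTime = 0.0;

$respectRateLimit = function () use ($requestsPerSecond, &$lastRequestTime) {
    $minInterval = 1.0 / $requestsPerSecond;
    $elapsed = microtime(true) - $lastRequestTime;
    if ($elapsed < $minInterval) {
        usleep((int) (($minInterval - $elapsed) * 1000000));
    }
    $lastRequestTime = microtime(true);
};

$start = microtime(true);
for ($i = 0; $i < 5; $i++) {
    $respectRateLimit(); // would precede each HTTP request
}
$elapsed = microtime(true) - $start;

// 5 calls at 20 req/s need 4 gaps of ~0.05s; allow a little timing slop
echo $elapsed >= 0.18 ? "throttled\n" : "too fast\n";
```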

2. Handle Errors Gracefully

<?php
function robustConcurrentScrape($urls, $maxRetries = 3) {
    $scraper = new ConcurrentScraper();
    $results = [];
    $failed = [];

    foreach ($urls as $url) {
        $attempts = 0;
        $success = false;

        while ($attempts < $maxRetries && !$success) {
            try {
                $result = $scraper->scrapeUrls([$url]);
                if ($result[0]['success']) {
                    $results[] = $result[0];
                    $success = true;
                }
            } catch (Exception $e) {
                error_log("Scraping failed for $url: " . $e->getMessage());
            }

            $attempts++;
            if (!$success && $attempts < $maxRetries) {
                sleep(pow(2, $attempts)); // Exponential backoff
            }
        }

        // Record the URL as failed only after all retries are exhausted
        if (!$success) {
            $failed[] = $url;
        }
    }

    return ['results' => $results, 'failed' => $failed];
}
?>

3. Monitor Memory Usage

<?php
function monitoredConcurrentScrape($urls, $memoryLimit = '256M') {
    $memoryLimitBytes = return_bytes($memoryLimit);

    foreach (array_chunk($urls, 50) as $batch) {
        if (memory_get_usage(true) > $memoryLimitBytes * 0.8) {
            gc_collect_cycles();
            if (memory_get_usage(true) > $memoryLimitBytes * 0.8) {
                error_log("Memory usage still high after GC, stopping early");
                break;
            }
        }

        // Process batch...
    }
}

function return_bytes($val) {
    $val = trim($val);
    $last = strtolower($val[strlen($val)-1]);
    $val = (int) $val;

    switch($last) {
        // Intentional fall-through: 'g' multiplies by 1024 three times,
        // 'm' twice, 'k' once
        case 'g': $val *= 1024;
        case 'm': $val *= 1024;
        case 'k': $val *= 1024;
    }

    return $val;
}
?>
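As a quick sanity check, `return_bytes` expands PHP's shorthand ini notation as shown below (the function is copied from above, lightly commented, so the snippet runs standalone):

```php
<?php
// Standalone copy of return_bytes for a self-contained check.
function return_bytes($val) {
    $val = trim($val);
    $last = strtolower($val[strlen($val) - 1]);
    $val = (int) $val;

    switch ($last) {
        // Intentional fall-through: 'g' multiplies by 1024 three times
        case 'g': $val *= 1024;
        case 'm': $val *= 1024;
        case 'k': $val *= 1024;
    }

    return $val;
}

echo return_bytes('256M'); // 268435456 (256 * 1024 * 1024)
?>
```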

Performance Optimization Tips

1. Connection Pooling

Reuse HTTP connections when possible to reduce overhead:

<?php
$client = new GuzzleHttp\Client([
    'curl' => [
        CURLOPT_MAXCONNECTS => 50,
        CURLOPT_FORBID_REUSE => false,
    ]
]);
?>

2. Response Streaming

For large responses, use streaming to reduce memory usage:

<?php
$promise = $client->getAsync($url, [
    'sink' => fopen('/tmp/large-file.html', 'w') // write the body straight to disk
]);
?>

3. Selective Content Loading

Only download the content you need:

<?php
$options = [
    'headers' => [
        'Range' => 'bytes=0-1023' // First 1KB, honored only if the server supports Range requests
    ]
];
?>

Comparing with Other Technologies

While PHP provides several options for concurrent scraping, it's worth noting that languages like Node.js with Puppeteer offer different approaches to handling multiple pages in parallel, especially when dealing with JavaScript-heavy websites.

For complex scraping scenarios involving dynamic content, you might need to integrate PHP with headless browsers, though this adds complexity compared to the HTTP-only approaches shown above.

Conclusion

Concurrent scraping in PHP can dramatically improve your data extraction performance. The cURL multi-handle approach provides the most straightforward implementation, while libraries like Guzzle and ReactPHP offer more sophisticated features. Choose the method that best fits your specific requirements, always considering factors like rate limiting, error handling, and resource management.

Remember to respect website robots.txt files, implement appropriate delays between requests, and monitor your resource usage to ensure stable and efficient scraping operations. With proper implementation, concurrent scraping can reduce your data collection time from hours to minutes while maintaining reliability and respecting target servers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
