How can I use Guzzle to scrape REST APIs effectively?
Guzzle is a powerful PHP HTTP client library well suited to scraping REST APIs. Unlike traditional web scraping, which parses HTML content, API scraping with Guzzle involves making structured HTTP requests to endpoints that return JSON, XML, or other machine-readable formats. This approach is usually faster and more reliable than scraping rendered web pages.
Setting up Guzzle for API Scraping
Installation and Basic Configuration
First, install Guzzle via Composer:
composer require guzzlehttp/guzzle
Create a basic Guzzle client optimized for API scraping:
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;

$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'timeout' => 30.0,
    'headers' => [
        'User-Agent' => 'API-Scraper/1.0',
        'Accept' => 'application/json',
        'Content-Type' => 'application/json'
    ],
    'verify' => true, // SSL verification
    'http_errors' => false // Handle errors manually
]);
Making Basic API Requests
Here's how to make different types of API requests:
// GET request
$response = $client->get('users', [
    'query' => [
        'page' => 1,
        'limit' => 100,
        'sort' => 'created_at'
    ]
]);

$data = json_decode($response->getBody(), true);
$statusCode = $response->getStatusCode();

// POST request with JSON payload
$response = $client->post('users', [
    RequestOptions::JSON => [
        'name' => 'John Doe',
        'email' => 'john@example.com'
    ]
]);

// PUT request
$response = $client->put('users/123', [
    RequestOptions::JSON => [
        'name' => 'Jane Doe'
    ]
]);

// DELETE request
$response = $client->delete('users/123');
Authentication Strategies
API Key Authentication
Many APIs use API keys for authentication:
$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'headers' => [
        // Use whichever header your API expects; most APIs need only one
        'X-API-Key' => 'your-api-key-here',
        'Authorization' => 'Bearer your-token-here'
    ]
]);

// Or pass the key as a query parameter
$response = $client->get('data', [
    'query' => [
        'api_key' => 'your-api-key-here'
    ]
]);
OAuth 2.0 Authentication
For OAuth 2.0 protected APIs:
class OAuth2ApiScraper
{
    private $client;
    private $accessToken;

    public function __construct($clientId, $clientSecret, $baseUri)
    {
        $this->client = new Client(['base_uri' => $baseUri]);
        $this->accessToken = $this->getAccessToken($clientId, $clientSecret);
    }

    private function getAccessToken($clientId, $clientSecret)
    {
        // Note: many OAuth 2.0 servers expect form-encoded credentials
        // (RequestOptions::FORM_PARAMS) rather than a JSON body
        $response = $this->client->post('oauth/token', [
            RequestOptions::JSON => [
                'grant_type' => 'client_credentials',
                'client_id' => $clientId,
                'client_secret' => $clientSecret
            ]
        ]);

        $data = json_decode($response->getBody(), true);
        return $data['access_token'];
    }

    public function makeAuthenticatedRequest($endpoint, $method = 'GET', $options = [])
    {
        $options['headers']['Authorization'] = 'Bearer ' . $this->accessToken;
        return $this->client->request($method, $endpoint, $options);
    }
}
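Here's a brief usage sketch (the base URI, endpoint, and credentials are placeholders):

$scraper = new OAuth2ApiScraper('client-id', 'client-secret', 'https://api.example.com/');
$response = $scraper->makeAuthenticatedRequest('users', 'GET', [
    'query' => ['page' => 1]
]);
$users = json_decode($response->getBody(), true);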
Basic Authentication
For APIs using HTTP Basic Authentication:
$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'auth' => ['username', 'password']
]);

// Or build the Authorization header directly
$credentials = base64_encode('username:password');
$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'headers' => [
        'Authorization' => 'Basic ' . $credentials
    ]
]);
Error Handling and Response Validation
Robust error handling is crucial for effective API scraping:
class ApiScraper
{
    private $client;
    private $maxRetries = 3;

    public function __construct(Client $client)
    {
        // Pass a client created with 'http_errors' => true so that
        // 4xx/5xx responses throw RequestException and reach the catch below
        $this->client = $client;
    }

    public function scrapeWithRetry($endpoint, $options = [])
    {
        $retries = 0;

        while ($retries < $this->maxRetries) {
            try {
                $response = $this->client->get($endpoint, $options);

                // Validate response
                if ($this->isValidResponse($response)) {
                    return $this->parseResponse($response);
                }

                throw new Exception('Invalid response received');
            } catch (\GuzzleHttp\Exception\RequestException $e) {
                $retries++;

                if ($e->hasResponse()) {
                    $statusCode = $e->getResponse()->getStatusCode();

                    // Handle different HTTP status codes
                    switch ($statusCode) {
                        case 429: // Rate limited
                            $this->handleRateLimit($e->getResponse());
                            break;
                        case 401: // Unauthorized - re-authenticate before retrying
                            $this->refreshAuthentication();
                            break;
                        case 500:
                        case 502:
                        case 503: // Server errors - retry
                            sleep(pow(2, $retries)); // Exponential backoff
                            break;
                        default:
                            throw $e; // Don't retry for other errors
                    }
                }

                if ($retries >= $this->maxRetries) {
                    throw $e;
                }
            }
        }
    }

    private function refreshAuthentication()
    {
        // Placeholder: re-acquire tokens or credentials for your API here
    }

    private function isValidResponse($response)
    {
        $statusCode = $response->getStatusCode();
        $contentType = $response->getHeaderLine('Content-Type');

        return $statusCode >= 200 && $statusCode < 300
            && strpos($contentType, 'application/json') !== false;
    }

    private function parseResponse($response)
    {
        $body = $response->getBody()->getContents();
        $data = json_decode($body, true);

        if (json_last_error() !== JSON_ERROR_NONE) {
            throw new Exception('Invalid JSON response: ' . json_last_error_msg());
        }

        return $data;
    }

    private function handleRateLimit($response)
    {
        $retryAfter = $response->getHeaderLine('Retry-After');
        $delay = $retryAfter ? (int)$retryAfter : 60;

        echo "Rate limited. Waiting {$delay} seconds...\n";
        sleep($delay);
    }
}
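Here's how the scraper above might be used; note the client is created with 'http_errors' => true so failed requests raise RequestException:

$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'http_errors' => true
]);

$scraper = new ApiScraper($client);
$users = $scraper->scrapeWithRetry('users', ['query' => ['page' => 1]]);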
Pagination and Data Collection
Most APIs implement pagination for large datasets:
class PaginatedApiScraper
{
    private $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
    }

    public function scrapeAllPages($endpoint, $params = [])
    {
        $allData = [];
        $page = 1;
        $hasMoreData = true;

        while ($hasMoreData) {
            $params['page'] = $page;
            $params['per_page'] = 100; // Adjust based on API limits

            $response = $this->client->get($endpoint, [
                'query' => $params
            ]);

            $data = json_decode($response->getBody(), true);

            // Different pagination patterns
            if (isset($data['data'])) {
                $allData = array_merge($allData, $data['data']);
                // Assume more data exists while pages come back full
                $hasMoreData = count($data['data']) === $params['per_page'];
            } elseif (isset($data['items'])) {
                $allData = array_merge($allData, $data['items']);
                $hasMoreData = $data['has_more'] ?? false;
            } else {
                // Handle direct array response
                $allData = array_merge($allData, $data);
                $hasMoreData = count($data) === $params['per_page'];
            }

            $page++;

            // Add delay to respect rate limits
            usleep(250000); // 250ms delay
        }

        return $allData;
    }

    public function scrapeCursorPagination($endpoint, $params = [])
    {
        $allData = [];
        $cursor = null;

        do {
            if ($cursor) {
                $params['cursor'] = $cursor;
            }

            $response = $this->client->get($endpoint, [
                'query' => $params
            ]);

            $data = json_decode($response->getBody(), true);
            $allData = array_merge($allData, $data['data']);
            $cursor = $data['next_cursor'] ?? null;

            usleep(250000); // Rate limiting delay
        } while ($cursor);

        return $allData;
    }
}
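For example, a usage sketch against a hypothetical page-numbered users endpoint:

$scraper = new PaginatedApiScraper(new Client([
    'base_uri' => 'https://api.example.com/'
]));

$allUsers = $scraper->scrapeAllPages('users', ['sort' => 'created_at']);
echo count($allUsers) . " users collected\n";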
Performance Optimization
Connection Pooling and Keep-Alive
Configure Guzzle for optimal performance:
$client = new Client([
    'base_uri' => 'https://api.example.com/',
    'curl' => [
        CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4,
        CURLOPT_TCP_KEEPALIVE => 1,
        CURLOPT_TCP_KEEPIDLE => 10,
        CURLOPT_TIMEOUT => 30,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 3
    ],
    'verify' => true,
    'version' => '1.1' // Use HTTP/1.1 for better compatibility
]);
Concurrent Requests
For high-performance scraping, use concurrent requests:
use GuzzleHttp\Pool;
use GuzzleHttp\Promise\Utils;

class ConcurrentApiScraper
{
    private $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
    }

    public function scrapeMultipleEndpoints($endpoints, $concurrency = 10)
    {
        // Generator, so requests are only dispatched as the pool consumes them
        $requests = function ($endpoints) {
            foreach ($endpoints as $endpoint) {
                yield $this->client->getAsync($endpoint);
            }
        };

        $results = [];
        $pool = new Pool($this->client, $requests($endpoints), [
            'concurrency' => $concurrency,
            'fulfilled' => function ($response, $index) use (&$results) {
                $results[$index] = json_decode($response->getBody(), true);
            },
            'rejected' => function ($reason, $index) use (&$results) {
                $results[$index] = ['error' => $reason->getMessage()];
            }
        ]);

        $promise = $pool->promise();
        $promise->wait();

        return $results;
    }

    public function scrapeAsync($endpoints)
    {
        $promises = [];
        foreach ($endpoints as $key => $endpoint) {
            $promises[$key] = $this->client->getAsync($endpoint);
        }

        // Utils::settle() waits for every promise without failing fast
        // (the Promise\settle() function was removed in guzzlehttp/promises 2.0)
        $responses = Utils::settle($promises)->wait();

        $results = [];
        foreach ($responses as $key => $response) {
            if ($response['state'] === 'fulfilled') {
                $results[$key] = json_decode(
                    $response['value']->getBody(),
                    true
                );
            } else {
                $results[$key] = ['error' => $response['reason']->getMessage()];
            }
        }

        return $results;
    }
}
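A short usage sketch, assuming relative endpoint paths resolved against the client's base_uri:

$scraper = new ConcurrentApiScraper(new Client([
    'base_uri' => 'https://api.example.com/'
]));

$results = $scraper->scrapeMultipleEndpoints([
    'users?page=1',
    'users?page=2',
    'products?page=1'
], 5);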
Rate Limiting and Best Practices
Implement sophisticated rate limiting:
class RateLimitedScraper
{
    private $client;
    private $lastRequestTime = 0;
    private $minDelay = 1000000; // 1 second in microseconds
    private $requestCount = 0;
    private $hourlyLimit = 1000;
    private $hourlyReset;

    public function __construct()
    {
        $this->client = new Client();
        $this->hourlyReset = time() + 3600;
    }

    public function makeRequest($endpoint, $options = [])
    {
        $this->enforceRateLimit();

        $response = $this->client->get($endpoint, $options);

        // Update rate limit info from response headers
        $this->updateRateLimitInfo($response);

        return $response;
    }

    private function enforceRateLimit()
    {
        // Check hourly limit
        if (time() > $this->hourlyReset) {
            $this->requestCount = 0;
            $this->hourlyReset = time() + 3600;
        }

        if ($this->requestCount >= $this->hourlyLimit) {
            $sleepTime = $this->hourlyReset - time();
            echo "Hourly limit reached. Sleeping for {$sleepTime} seconds...\n";
            sleep($sleepTime);
            $this->requestCount = 0;
            $this->hourlyReset = time() + 3600;
        }

        // Enforce minimum delay between requests
        $timeSinceLastRequest = microtime(true) * 1000000 - $this->lastRequestTime;
        if ($timeSinceLastRequest < $this->minDelay) {
            usleep((int)($this->minDelay - $timeSinceLastRequest));
        }

        $this->lastRequestTime = microtime(true) * 1000000;
        $this->requestCount++;
    }

    private function updateRateLimitInfo($response)
    {
        // Parse rate limit headers (names vary by API)
        $remaining = $response->getHeaderLine('X-RateLimit-Remaining');
        $reset = $response->getHeaderLine('X-RateLimit-Reset');

        // Compare against '' so a remaining count of "0" is still handled
        if ($remaining !== '' && $reset !== '') {
            $resetTime = (int)$reset;
            $remainingRequests = (int)$remaining;

            if ($remainingRequests < 10 && $resetTime > time()) {
                $sleepTime = $resetTime - time();
                echo "Approaching rate limit. Sleeping for {$sleepTime} seconds...\n";
                sleep($sleepTime);
            }
        }
    }
}
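Usage is then a drop-in replacement for calling the client directly (the URL is illustrative):

$scraper = new RateLimitedScraper();
$response = $scraper->makeRequest('https://api.example.com/users');
$users = json_decode($response->getBody(), true);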
Advanced Features and Middleware
Custom Middleware for Logging and Monitoring
use GuzzleHttp\Middleware;
use GuzzleHttp\HandlerStack;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

class ApiScraperWithMiddleware
{
    private $client;

    public function __construct()
    {
        $stack = HandlerStack::create();

        // Add request logging middleware
        $stack->push(Middleware::mapRequest(function (RequestInterface $request) {
            echo "Making request to: " . $request->getUri() . "\n";
            return $request;
        }));

        // Add response logging
        $stack->push(Middleware::mapResponse(function (ResponseInterface $response) {
            echo "Response status: " . $response->getStatusCode() . "\n";
            return $response;
        }));

        // Add retry middleware
        $stack->push(Middleware::retry(
            function ($retries, $request, $response, $exception) {
                return $retries < 3 && (
                    $exception instanceof \GuzzleHttp\Exception\ConnectException
                    || ($response && $response->getStatusCode() >= 500)
                );
            },
            function ($retries) {
                return 1000 * pow(2, $retries); // Delay in milliseconds, exponential backoff
            }
        ));

        $this->client = new Client(['handler' => $stack]);
    }

    public function getClient()
    {
        return $this->client;
    }
}
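With the stack in place, every request made through the exposed client is logged and automatically retried on connection errors or 5xx responses; for example (URL illustrative):

$scraper = new ApiScraperWithMiddleware();
$response = $scraper->getClient()->get('https://api.example.com/users');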
When working with complex web applications that combine API scraping with browser automation, you might also need to handle AJAX requests or manage authentication flows for comprehensive data collection.
Data Storage and Processing
Store scraped API data efficiently:
class ApiDataProcessor
{
    private $pdo;

    public function __construct($dsn, $username, $password)
    {
        $this->pdo = new PDO($dsn, $username, $password, [
            PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
            PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC
        ]);
    }

    public function storeApiData($data, $source)
    {
        // NOW() is MySQL syntax; adjust for other databases
        $stmt = $this->pdo->prepare("
            INSERT INTO scraped_data (source, data, scraped_at)
            VALUES (?, ?, NOW())
        ");

        $stmt->execute([$source, json_encode($data)]);
        return $this->pdo->lastInsertId();
    }

    public function batchInsert($dataArray, $source)
    {
        $this->pdo->beginTransaction();

        try {
            $stmt = $this->pdo->prepare("
                INSERT INTO scraped_data (source, data, scraped_at)
                VALUES (?, ?, NOW())
            ");

            foreach ($dataArray as $data) {
                $stmt->execute([$source, json_encode($data)]);
            }

            $this->pdo->commit();
        } catch (Exception $e) {
            $this->pdo->rollBack();
            throw $e;
        }
    }
}
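The class assumes a scraped_data table already exists. A minimal MySQL schema sketch (the column types are assumptions, not prescribed by the class) could be created like this:

$pdo->exec("
    CREATE TABLE IF NOT EXISTS scraped_data (
        id INT AUTO_INCREMENT PRIMARY KEY,
        source VARCHAR(255) NOT NULL,
        data JSON NOT NULL,
        scraped_at DATETIME NOT NULL
    )
");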
Conclusion
Guzzle provides a robust foundation for REST API scraping with its comprehensive HTTP client capabilities. Key strategies for effective API scraping include proper authentication handling, intelligent error recovery, respectful rate limiting, and efficient data processing. By implementing these patterns, you can build reliable and scalable API scraping solutions that respect service boundaries while maximizing data collection efficiency.
Remember to always review API documentation, respect rate limits, and implement proper error handling to ensure your scraping operations remain stable and maintainable over time.