How do I set custom headers when fetching HTML with Simple HTML DOM?
Setting custom headers when fetching HTML with Simple HTML DOM is essential for many web scraping scenarios. While Simple HTML DOM doesn't provide built-in header functionality, you can achieve this using PHP's stream context, cURL, or by integrating with HTTP libraries like Guzzle. This guide covers multiple approaches to handle custom headers effectively.
Understanding Simple HTML DOM Limitations
Simple HTML DOM parser is primarily designed for parsing HTML content rather than fetching it. The library's file_get_html() function uses PHP's file_get_contents() internally, which has only limited HTTP header support. However, this limitation can be worked around in several ways.
Method 1: Using Stream Context with file_get_html()
The most straightforward approach is to create a stream context with custom headers and pass it to Simple HTML DOM:
<?php
require_once 'simple_html_dom.php';

// Create stream context with custom headers
$context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'header' => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            // Note: do not send 'Accept-Encoding: gzip' here, because
            // file_get_contents() will not decode a compressed response
            'Connection: keep-alive',
            'Referer: https://example.com',
            'Authorization: Bearer your-token-here'
        ],
        'timeout' => 30,
        'follow_location' => true,
        'max_redirects' => 3
    ]
]);

// Fetch HTML with custom headers (the third argument is the stream context)
$html = file_get_html('https://example.com/page', false, $context);

if ($html) {
    // Parse the content
    $title = $html->find('title', 0)->plaintext;
    echo "Page title: " . $title . "\n";
    // Clean up
    $html->clear();
} else {
    echo "Failed to fetch the page\n";
}
?>
Method 2: Using cURL with Simple HTML DOM
For more control over HTTP requests, combine cURL with Simple HTML DOM's string parsing capabilities:
<?php
require_once 'simple_html_dom.php';

function fetchWithCustomHeaders($url, $headers = []) {
    $ch = curl_init();

    // Default headers
    $defaultHeaders = [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Connection: keep-alive'
    ];

    // Merge custom headers with defaults. Note that array_merge() appends
    // numeric-keyed entries, so a duplicate header name will be sent twice.
    $allHeaders = array_merge($defaultHeaders, $headers);

    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 3,
        CURLOPT_TIMEOUT => 30,
        CURLOPT_HTTPHEADER => $allHeaders,
        CURLOPT_SSL_VERIFYPEER => true,      // set to false only for local testing
        CURLOPT_ENCODING => 'gzip, deflate'  // cURL decompresses these automatically
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if (curl_errno($ch)) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception("cURL error: " . $error);
    }
    curl_close($ch);

    if ($httpCode !== 200) {
        throw new Exception("HTTP error: " . $httpCode);
    }

    return $response;
}

// Usage example
try {
    $customHeaders = [
        'Authorization: Bearer your-api-token',
        'X-Custom-Header: your-value',
        'Referer: https://referring-site.com'
    ];

    $htmlContent = fetchWithCustomHeaders('https://example.com/api/data', $customHeaders);

    // Parse with Simple HTML DOM
    $html = str_get_html($htmlContent);
    if ($html) {
        // Extract data
        foreach ($html->find('.product') as $product) {
            $name = $product->find('.product-name', 0)->plaintext;
            $price = $product->find('.price', 0)->plaintext;
            echo "Product: {$name}, Price: {$price}\n";
        }
        $html->clear();
    }
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
?>
Method 3: Integration with Guzzle HTTP
For larger applications, consider pairing the Guzzle HTTP client with Simple HTML DOM:
<?php
require_once 'simple_html_dom.php';
require_once 'vendor/autoload.php'; // Composer autoload

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

class WebScraperWithHeaders {
    private $client;

    public function __construct() {
        $this->client = new Client([
            'timeout' => 30,
            'allow_redirects' => [
                'max' => 3,
                'strict' => true,
                'referer' => true
            ],
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language' => 'en-US,en;q=0.5',
                'Accept-Encoding' => 'gzip, deflate',
                'Connection' => 'keep-alive'
            ]
        ]);
    }

    public function scrapeWithHeaders($url, $additionalHeaders = []) {
        try {
            // Per-request headers are merged with the client defaults above
            $response = $this->client->request('GET', $url, [
                'headers' => $additionalHeaders
            ]);
            $htmlContent = $response->getBody()->getContents();

            // Parse with Simple HTML DOM
            return str_get_html($htmlContent);
        } catch (RequestException $e) {
            throw new Exception("Request failed: " . $e->getMessage());
        }
    }

    public function scrapeProtectedContent($url, $token) {
        $headers = [
            'Authorization' => 'Bearer ' . $token,
            'X-Requested-With' => 'XMLHttpRequest',
            'Content-Type' => 'application/json'
        ];
        return $this->scrapeWithHeaders($url, $headers);
    }
}

// Usage example
$scraper = new WebScraperWithHeaders();

try {
    // Scrape with custom headers
    $html = $scraper->scrapeWithHeaders('https://api.example.com/data', [
        'X-API-Key' => 'your-api-key',
        'X-Client-Version' => '1.0.0'
    ]);

    if ($html) {
        foreach ($html->find('.data-item') as $item) {
            echo $item->plaintext . "\n";
        }
        $html->clear();
    }
} catch (Exception $e) {
    echo "Scraping failed: " . $e->getMessage() . "\n";
}
?>
Advanced Header Management
Session and Cookie Handling
When working with sites that require session management, combine headers with cookie handling:
<?php
function scrapeWithSession($loginUrl, $targetUrl, $credentials) {
    $cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

    // Login request: cookies set by the server are written to $cookieJar
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $loginUrl,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query($credentials),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR => $cookieJar,
        CURLOPT_HTTPHEADER => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Content-Type: application/x-www-form-urlencoded',
            'X-Requested-With: XMLHttpRequest'
        ]
    ]);
    curl_exec($ch);
    curl_close($ch);

    // Authenticated request: saved cookies are sent back via CURLOPT_COOKIEFILE
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $targetUrl,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE => $cookieJar,
        CURLOPT_HTTPHEADER => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept: text/html,application/xhtml+xml',
            'Referer: ' . $loginUrl
        ]
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    // Clean up
    unlink($cookieJar);

    return str_get_html($response);
}
?>
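A hypothetical call to the function above might look like this. The URLs and the 'username'/'password' field names are assumptions; they must match the target site's actual login form:

```php
<?php
// Hypothetical usage of scrapeWithSession() as defined above;
// all URLs and form field names are placeholders
$html = scrapeWithSession(
    'https://example.com/login',
    'https://example.com/account/dashboard',
    ['username' => 'your-user', 'password' => 'your-pass']
);

if ($html) {
    echo $html->find('title', 0)->plaintext . "\n";
    $html->clear();
}
?>
```

Note that many login forms also require a CSRF token scraped from the login page itself; this sketch omits that step.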
JavaScript Equivalent: Node.js and Cheerio
For developers working in JavaScript, the same approach works in Node.js using Axios for the HTTP request and Cheerio (a rough JavaScript counterpart to Simple HTML DOM) for parsing:
const axios = require('axios');
const cheerio = require('cheerio');

async function fetchWithCustomHeaders(url, customHeaders = {}) {
    const defaultHeaders = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive'
    };

    // Spread order means customHeaders override the defaults
    const headers = { ...defaultHeaders, ...customHeaders };

    try {
        const response = await axios.get(url, {
            headers,
            timeout: 30000,
            maxRedirects: 3
        });
        return cheerio.load(response.data);
    } catch (error) {
        throw new Error(`Request failed: ${error.message}`);
    }
}

// Usage example
(async () => {
    try {
        const customHeaders = {
            'Authorization': 'Bearer your-api-token',
            'X-Custom-Header': 'your-value',
            'Referer': 'https://referring-site.com'
        };

        const $ = await fetchWithCustomHeaders('https://example.com/api/data', customHeaders);

        // Extract data
        $('.product').each((index, element) => {
            const name = $(element).find('.product-name').text();
            const price = $(element).find('.price').text();
            console.log(`Product: ${name}, Price: ${price}`);
        });
    } catch (error) {
        console.error('Error:', error.message);
    }
})();
Common Headers for Web Scraping
Here are essential headers for successful web scraping:
<?php
$commonHeaders = [
    // Browser identification
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',

    // Content preferences
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.9',
    'Accept-Encoding: gzip, deflate, br',

    // Connection settings
    'Connection: keep-alive',
    'Upgrade-Insecure-Requests: 1',

    // Security headers
    'Sec-Fetch-Dest: document',
    'Sec-Fetch-Mode: navigate',
    'Sec-Fetch-Site: none',

    // Custom application headers
    'X-Requested-With: XMLHttpRequest',
    'Cache-Control: no-cache'
];
?>
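The stream-context and cURL examples expect headers as "Name: value" strings, while Guzzle expects an associative array. A small helper can convert one format to the other (a sketch; headersToAssoc() is not part of any library, just a name chosen here):

```php
<?php
// Convert "Name: value" header strings (cURL/stream-context style)
// into an associative array (Guzzle style)
function headersToAssoc(array $headers): array {
    $assoc = [];
    foreach ($headers as $line) {
        // Split on the first colon only, since header values may contain colons
        [$name, $value] = explode(':', $line, 2);
        $assoc[trim($name)] = trim($value);
    }
    return $assoc;
}

// Example: reuse the $commonHeaders list above with Guzzle
// headersToAssoc(['Accept-Language: en-US,en;q=0.9'])
// returns ['Accept-Language' => 'en-US,en;q=0.9']
?>
```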
Error Handling and Best Practices
Always implement proper error handling when working with custom headers:
<?php
function robustScrapeWithHeaders($url, $headers = [], $maxRetries = 3) {
    $attempt = 0;

    while ($attempt < $maxRetries) {
        try {
            $ch = curl_init();
            curl_setopt_array($ch, [
                CURLOPT_URL => $url,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_HTTPHEADER => $headers,
                CURLOPT_TIMEOUT => 30,
                CURLOPT_CONNECTTIMEOUT => 10,
                CURLOPT_FOLLOWLOCATION => true
            ]);

            $response = curl_exec($ch);
            $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            $error = curl_error($ch);
            curl_close($ch);

            if ($error) {
                throw new Exception("cURL error: " . $error);
            }

            if ($httpCode >= 200 && $httpCode < 300) {
                return str_get_html($response);
            } elseif ($httpCode === 429) {
                // Rate limited: wait and retry with exponential backoff
                sleep(pow(2, $attempt));
                $attempt++;
                continue;
            } else {
                throw new Exception("HTTP error: " . $httpCode);
            }
        } catch (Exception $e) {
            $attempt++;
            if ($attempt >= $maxRetries) {
                throw $e;
            }
            sleep(1); // Brief pause before retry
        }
    }

    return false;
}
?>
Command Line Examples
For testing header configurations, you can use curl from the command line:
# Test basic custom headers
curl -H "User-Agent: Custom-Bot/1.0" \
     -H "Authorization: Bearer token123" \
     -H "Accept: text/html" \
     https://example.com/api/data

# Test with multiple headers and follow redirects
curl -L -H "User-Agent: Mozilla/5.0 (compatible; CustomBot/1.0)" \
     -H "Referer: https://example.com" \
     -H "X-API-Key: your-api-key" \
     -o response.html \
     https://api.example.com/data

# Test with cookie support
curl -c cookies.txt -b cookies.txt \
     -H "User-Agent: Custom-Agent" \
     -H "Content-Type: application/json" \
     https://example.com/login
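To confirm exactly which headers a server receives (as opposed to which ones you think you are sending), you can point the same command at a header-echo service such as httpbin.org, a public testing service assumed reachable here:

```shell
# httpbin.org/headers responds with a JSON document listing
# the request headers it actually received
curl -s -H "User-Agent: Custom-Bot/1.0" \
     -H "X-API-Key: test-key" \
     https://httpbin.org/headers
```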
Integration with Modern Tools
Simple HTML DOM only parses the markup the server returns, so for pages that build their content with JavaScript, fetching raw HTML is not enough. In those cases, render the page first with a headless browser (for example Puppeteer or Selenium) and pass the resulting HTML to str_get_html(); for content behind a login, combine custom headers with the cookie-based session technique shown above.
Performance Considerations
When using custom headers with Simple HTML DOM, the following techniques help keep connection overhead and memory usage under control.
Connection Pooling
<?php
class PooledScraper {
    private static $curlMultiHandle;
    private static $curlHandles = [];

    public static function initPool($maxConnections = 10) {
        self::$curlMultiHandle = curl_multi_init();
        curl_multi_setopt(self::$curlMultiHandle, CURLMOPT_MAXCONNECTS, $maxConnections);
    }

    public static function addRequest($url, $headers = []) {
        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER => $headers,
            CURLOPT_TIMEOUT => 30
        ]);
        curl_multi_add_handle(self::$curlMultiHandle, $ch);
        self::$curlHandles[] = $ch;
        return $ch;
    }

    public static function executeAll() {
        $running = null;
        do {
            curl_multi_exec(self::$curlMultiHandle, $running);
            curl_multi_select(self::$curlMultiHandle);
        } while ($running > 0);

        $results = [];
        foreach (self::$curlHandles as $ch) {
            $response = curl_multi_getcontent($ch);
            $results[] = str_get_html($response);
            curl_multi_remove_handle(self::$curlMultiHandle, $ch);
            curl_close($ch);
        }

        // Reset the handle list so the pool can be reused for another batch
        self::$curlHandles = [];

        return $results;
    }
}
?>
Memory Management
<?php
function scrapeWithMemoryControl($urls, $headers = []) {
    $results = [];

    foreach ($urls as $url) {
        // Process one URL at a time to control memory.
        // fetchWithCustomHeaders() (from Method 2) returns a string,
        // so parse it with str_get_html() before using DOM methods.
        $html = str_get_html(fetchWithCustomHeaders($url, $headers));

        if ($html) {
            // Extract only the needed data (extractData() stands in for your own logic)
            $data = extractData($html);
            $results[] = $data;

            // Clean up immediately
            $html->clear();
            unset($html);

            // Force garbage collection periodically
            if (count($results) % 100 === 0) {
                gc_collect_cycles();
            }
        }
    }

    return $results;
}
?>
Security Considerations
When setting custom headers, be mindful of security implications:
Header Sanitization
<?php
function sanitizeHeaders($headers) {
    $sanitized = [];

    foreach ($headers as $header) {
        // Strip CR/LF to prevent header injection
        $header = str_replace(["\r", "\n"], '', $header);

        // Keep only entries that look like "Name: value"
        if (preg_match('/^[a-zA-Z0-9\-]+:\s*.+$/', $header)) {
            $sanitized[] = $header;
        }
    }

    return $sanitized;
}

// Never pass user-supplied headers to cURL without sanitizing them first
$userHeaders = $_POST['headers'] ?? [];
$safeHeaders = sanitizeHeaders($userHeaders);
?>
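As a quick illustration of what the function above does (a sketch relying on sanitizeHeaders() as defined; the header names and values are made up):

```php
<?php
// A CRLF injection attempt: the attacker tries to smuggle in a second header
$input = [
    'X-Token: abc123',
    "X-Evil: ok\r\nInjected-Header: attack",
    'not a header at all'
];

$safe = sanitizeHeaders($input);
print_r($safe);
// The CR/LF pair is stripped, so the "injected" header collapses into the
// value of X-Evil instead of becoming its own header line, and the
// malformed third entry is dropped entirely.
?>
```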
Conclusion
Setting custom headers with Simple HTML DOM requires working around the library's limitations by using PHP's native HTTP capabilities or integrating with more powerful HTTP clients. The stream context method works well for simple scenarios, while cURL provides more control, and Guzzle offers enterprise-grade features. Choose the approach that best fits your project's complexity and requirements.
Remember to always respect website terms of service, implement proper rate limiting, and handle errors gracefully. Custom headers are powerful tools for web scraping, but they should be used responsibly and ethically.