DiDOM is a simple and fast HTML and XML parser for PHP. Setting custom headers for HTTP requests is not a feature of DiDOM itself, as DiDOM is primarily concerned with parsing rather than fetching web content.
However, when you are web scraping, you often need to fetch the HTML content from the web first. To set custom headers for HTTP requests, you would generally use a separate HTTP client, such as cURL in PHP or libraries like Guzzle, and then parse the fetched content with DiDOM.
Here's how you can set custom headers using cURL in PHP and then parse the content with DiDOM:
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Initialize cURL session
$ch = curl_init();

// Set the URL you want to scrape
curl_setopt($ch, CURLOPT_URL, 'http://example.com');

// Set custom headers (cURL expects a list of "Name: value" strings)
$headers = [
    'User-Agent: My Custom User Agent',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    // Add more headers as needed
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

// Return the transfer as a string instead of outputting it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Execute the cURL session
$html = curl_exec($ch);

// Read the error message (empty on success) before closing the handle
$error = curl_error($ch);

// Close cURL session
curl_close($ch);

// Check if the request was successful
if ($html !== false) {
    // Create a DiDOM document instance with the retrieved HTML content
    $document = new Document($html);

    // Now you can use the DiDOM methods to parse the content
    // For example, find an element with the class 'example'
    $elements = $document->find('.example');
    foreach ($elements as $element) {
        echo $element->text(), PHP_EOL;
    }
} else {
    echo "Error fetching the page: ", $error, PHP_EOL;
}
If you prefer using a more modern library like Guzzle for making HTTP requests, here's how you would do it:
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;
use GuzzleHttp\Client;

$client = new Client();

// Set custom headers
$headers = [
    'User-Agent' => 'My Custom User Agent',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    // Add more headers as needed
];

// Send a GET request
$response = $client->request('GET', 'http://example.com', ['headers' => $headers]);

// Get the body of the response
$html = (string) $response->getBody();

// Create a DiDOM document instance with the retrieved HTML content
$document = new Document($html);

// Now you can use the DiDOM methods to parse the content
// For example, find an element with the class 'example'
$elements = $document->find('.example');
foreach ($elements as $element) {
    echo $element->text(), PHP_EOL;
}
In both examples, we first use an HTTP client (cURL or Guzzle) to send a GET request with custom headers and fetch the HTML content of a web page. We then create a new Document object from the fetched HTML and use DiDOM's parsing capabilities to extract the data we need.
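The two clients take headers in different shapes: cURL expects a list of "Name: value" strings, while Guzzle takes an associative array. If you want to define your headers once and reuse them with either client, a small conversion helper may be enough. The sketch below is only an illustration; toCurlHeaders is a hypothetical function name, not part of DiDOM, cURL, or Guzzle.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a library API):
// converts a Guzzle-style associative header array into the list of
// "Name: value" strings that CURLOPT_HTTPHEADER expects.
function toCurlHeaders(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        // A header may carry several values as an array; join them with commas
        $value = is_array($value) ? implode(', ', $value) : $value;
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Headers written once, in the associative form used by the Guzzle example
$headers = [
    'User-Agent' => 'My Custom User Agent',
    'Accept'     => 'text/html,application/xhtml+xml',
];

// The same headers in the list form used by the cURL example
$curlHeaders = toCurlHeaders($headers);
print_r($curlHeaders);
```

The result can be passed directly to curl_setopt($ch, CURLOPT_HTTPHEADER, $curlHeaders), while the original associative array can still be passed to Guzzle under the 'headers' request option.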