How can I extract and parse HTML content from Guzzle responses?

When working with web scraping in PHP, Guzzle is one of the most popular HTTP client libraries for making requests. However, once you receive an HTML response, you need to parse and extract the specific data you're looking for. This guide covers various methods to extract and parse HTML content from Guzzle responses effectively.

Basic HTML Extraction from Guzzle Response

First, let's start with a basic example of how to get HTML content from a Guzzle response:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://example.com');

// Get the HTML content as a string
$html = $response->getBody()->getContents();

// Or convert to string directly
$html = (string) $response->getBody();

echo $html;
?>
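
Note that the response body is a PSR-7 stream, so getContents() reads from the current position: a second call returns an empty string unless you rewind. The (string) cast rewinds seekable streams for you. A minimal illustration, reusing $response from above:

<?php
$body = $response->getBody();

$html = $body->getContents();  // reads to the end of the stream
$again = $body->getContents(); // empty string: the pointer is now at the end

$body->rewind();               // seek back to the start (seekable streams only)
$again = $body->getContents(); // full HTML again
?>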

Using DOMDocument for HTML Parsing

PHP's built-in DOMDocument class is excellent for parsing HTML and provides robust DOM manipulation capabilities:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://example.com');
$html = $response->getBody()->getContents();

// Create a DOMDocument instance
$dom = new DOMDocument();

// Suppress warnings for malformed HTML
libxml_use_internal_errors(true);

// Load the HTML
$dom->loadHTML($html);

// Clear any libxml errors
libxml_clear_errors();

// Extract data using DOM methods
$titles = $dom->getElementsByTagName('title');
if ($titles->length > 0) {
    echo "Page Title: " . $titles->item(0)->textContent . "\n";
}

// Extract all links
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    echo "Link: " . $link->getAttribute('href') . " - " . $link->textContent . "\n";
}
?>

Using DOMXPath for Advanced Querying

For more complex HTML parsing, DOMXPath provides powerful query capabilities:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://news.ycombinator.com');
$html = $response->getBody()->getContents();

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Create XPath object
$xpath = new DOMXPath($dom);

// Extract story links (note: Hacker News markup changes over time;
// as of this writing, story titles live in <span class="titleline">)
$articleTitles = $xpath->query('//span[@class="titleline"]/a');
foreach ($articleTitles as $title) {
    echo "Article: " . trim($title->textContent) . "\n";
    echo "URL: " . $title->getAttribute('href') . "\n\n";
}

// Extract elements with specific attributes
$metaTags = $xpath->query('//meta[@name="description"]');
if ($metaTags->length > 0) {
    echo "Description: " . $metaTags->item(0)->getAttribute('content') . "\n";
}

// Complex XPath queries
$specificDivs = $xpath->query('//div[contains(@class, "article") and @id]');
foreach ($specificDivs as $div) {
    echo "Div ID: " . $div->getAttribute('id') . "\n";
}
?>
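
DOMXPath::query() also accepts a context node as a second argument, which lets you scope a relative query to a single element. A short sketch (the div/h2 structure here is hypothetical; $dom and $xpath are from the example above):

<?php
$articles = $xpath->query('//div[contains(@class, "article")]');
foreach ($articles as $article) {
    // The leading ".//" makes the query relative to $article
    $heading = $xpath->query('.//h2', $article)->item(0);
    if ($heading) {
        echo "Heading: " . trim($heading->textContent) . "\n";
    }
}
?>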

Using Simple HTML DOM Parser

For a more forgiving, jQuery-like API, you can use the Simple HTML DOM Parser library. Install it via Composer first:

composer require sunra/php-simple-html-dom-parser

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Sunra\PhpSimple\HtmlDomParser;

$client = new Client();
$response = $client->request('GET', 'https://example.com');
$html = $response->getBody()->getContents();

// Parse HTML with Simple HTML DOM
$dom = HtmlDomParser::str_get_html($html);

if ($dom) {
    // Extract title
    $title = $dom->find('title', 0);
    if ($title) {
        echo "Title: " . $title->plaintext . "\n";
    }

    // Extract all paragraphs
    $paragraphs = $dom->find('p');
    foreach ($paragraphs as $p) {
        echo "Paragraph: " . trim($p->plaintext) . "\n";
    }

    // Extract specific classes
    $articles = $dom->find('.article');
    foreach ($articles as $article) {
        echo "Article HTML: " . $article->outertext . "\n";
    }

    // Extract with attribute filters
    $images = $dom->find('img[src]');
    foreach ($images as $img) {
        echo "Image: " . $img->src . " Alt: " . $img->alt . "\n";
    }
}
?>
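
find() also works on individual nodes, so you can scope lookups to one element, and the parser keeps internal references that are worth freeing when you process many pages. A brief sketch (the .article/h2 markup is hypothetical; $dom is from the example above):

<?php
foreach ($dom->find('.article') as $article) {
    $heading = $article->find('h2', 0); // nested find() on a node
    if ($heading) {
        echo "Heading: " . trim($heading->plaintext) . "\n";
    }
}

// Release the parser's internal references before loading the next page
$dom->clear();
?>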

Using Symfony DomCrawler

Symfony's DomCrawler component provides a more intuitive API for HTML parsing. Install it together with the CssSelector component so you can use CSS selectors:

composer require symfony/dom-crawler symfony/css-selector

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$response = $client->request('GET', 'https://example.com');
$html = $response->getBody()->getContents();

// Create Crawler instance
$crawler = new Crawler($html);

// Extract data using CSS selectors
$title = $crawler->filter('title')->text();
echo "Title: " . $title . "\n";

// Extract multiple elements
$links = $crawler->filter('a')->each(function (Crawler $node, $i) {
    return [
        'text' => $node->text(),
        'href' => $node->attr('href')
    ];
});

foreach ($links as $link) {
    echo "Link: " . $link['href'] . " - " . $link['text'] . "\n";
}

// Pseudo-selectors work too (:first-child matches a <p> that is
// its parent's first child, not simply the first <p> on the page)
$firstParagraph = $crawler->filter('p:first-child')->text();
echo "First paragraph: " . $firstParagraph . "\n";

// Extract attributes
$metaDescription = $crawler->filter('meta[name="description"]')->attr('content');
echo "Meta description: " . $metaDescription . "\n";
?>
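
One caveat: filter()->text() and attr() throw an exception when nothing matches, so guard optional elements with count() (Crawler is countable):

<?php
$description = $crawler->filter('meta[name="description"]');
if ($description->count() > 0) {
    echo "Meta description: " . $description->attr('content') . "\n";
} else {
    echo "No meta description on this page\n";
}
?>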

Handling Different Response Encodings

When working with international websites, proper encoding handling is crucial:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://example.com');

// Get the content type header
$contentType = $response->getHeader('Content-Type')[0] ?? '';

// Extract charset from content type
$charset = 'UTF-8'; // default
if (preg_match('/charset=([^;]+)/', $contentType, $matches)) {
    $charset = trim($matches[1], '"\'');
}

$html = $response->getBody()->getContents();

// Convert encoding if necessary
if (strtoupper($charset) !== 'UTF-8') {
    $html = mb_convert_encoding($html, 'UTF-8', $charset);
}

// Now parse with proper encoding: the XML prolog tells libxml the
// string is UTF-8 (loadHTML() otherwise assumes ISO-8859-1)
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();
?>
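
If the Content-Type header carries no charset at all, you can fall back to detection. mb_detect_encoding() is heuristic, so treat this as a best-effort sketch (the candidate list is an assumption to adapt to the sites you scrape):

<?php
$detected = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
if ($detected !== false && $detected !== 'UTF-8') {
    $html = mb_convert_encoding($html, 'UTF-8', $detected);
}
?>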

Extracting Structured Data (JSON-LD)

Many modern websites include structured data that's easier to parse than HTML:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://example.com');
$html = $response->getBody()->getContents();

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Extract JSON-LD structured data
$jsonLdScripts = $xpath->query('//script[@type="application/ld+json"]');
foreach ($jsonLdScripts as $script) {
    $jsonData = json_decode($script->textContent, true);
    if ($jsonData) {
        echo "Structured data found:\n";
        print_r($jsonData);
    }
}

// Extract Open Graph meta tags
$ogTags = $xpath->query('//meta[starts-with(@property, "og:")]');
foreach ($ogTags as $tag) {
    $property = $tag->getAttribute('property');
    $content = $tag->getAttribute('content');
    echo "OG Tag - $property: $content\n";
}
?>
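
Once decoded, JSON-LD is just a nested PHP array, so you can filter nodes by @type. A short sketch assuming a schema.org Product node (field names vary from site to site):

<?php
// $jsonData is a decoded JSON-LD block from the loop above
$nodes = isset($jsonData['@graph']) ? $jsonData['@graph'] : [$jsonData];
foreach ($nodes as $node) {
    if (($node['@type'] ?? '') === 'Product') {
        echo "Product: " . ($node['name'] ?? 'unknown') . "\n";
        echo "Price: " . ($node['offers']['price'] ?? 'n/a') . "\n";
    }
}
?>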

Error Handling and Best Practices

Always implement proper error handling when parsing HTML content:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

function parseHtmlFromUrl($url) {
    $client = new Client([
        'timeout' => 30,
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        ]
    ]);

    try {
        $response = $client->request('GET', $url);

        // Check if response is HTML
        $contentType = $response->getHeader('Content-Type')[0] ?? '';
        if (strpos($contentType, 'text/html') === false) {
            throw new Exception("Response is not HTML: $contentType");
        }

        $html = $response->getBody()->getContents();

        if (empty($html)) {
            throw new Exception("Empty HTML response");
        }

        $dom = new DOMDocument();
        libxml_use_internal_errors(true);

        if (!$dom->loadHTML($html)) {
            throw new Exception("Failed to parse HTML");
        }

        libxml_clear_errors();
        return $dom;

    } catch (RequestException $e) {
        echo "HTTP Error: " . $e->getMessage() . "\n";
        return null;
    } catch (Exception $e) {
        echo "Parsing Error: " . $e->getMessage() . "\n";
        return null;
    }
}

// Usage
$dom = parseHtmlFromUrl('https://example.com');
if ($dom) {
    $xpath = new DOMXPath($dom);
    $title = $xpath->query('//title')->item(0);
    if ($title) {
        echo "Title: " . $title->textContent . "\n";
    }
}
?>
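
For transient failures such as timeouts or 5xx responses, Guzzle also ships retry middleware you can push onto the handler stack. A minimal sketch; the policy below (3 attempts, linear backoff) is an assumption to adapt:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry up to 3 times on connection errors or 5xx responses
    function ($retries, RequestInterface $request, ResponseInterface $response = null, $exception = null) {
        if ($retries >= 3) {
            return false;
        }
        return $exception !== null || ($response !== null && $response->getStatusCode() >= 500);
    },
    // Delay: wait 1s, 2s, 3s between attempts (milliseconds)
    function ($retries) {
        return 1000 * $retries;
    }
));

$client = new Client(['handler' => $stack, 'timeout' => 30]);
?>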

Performance Optimization

For large-scale scraping operations, consider these performance optimizations:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// Concurrent requests for better performance
$client = new Client();
$urls = [
    'https://example1.com',
    'https://example2.com',
    'https://example3.com'
];

$requests = array_map(function ($url) {
    return new Request('GET', $url);
}, $urls);

$pool = new Pool($client, $requests, [
    'concurrency' => 5,
    'fulfilled' => function ($response, $index) use ($urls) {
        $html = $response->getBody()->getContents();
        echo "Processing: " . $urls[$index] . "\n";

        // Parse HTML efficiently
        $dom = new DOMDocument();
        libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();

        // Extract only what you need
        $title = $dom->getElementsByTagName('title')->item(0);
        if ($title) {
            echo "Title: " . $title->textContent . "\n";
        }
    },
    'rejected' => function ($reason, $index) use ($urls) {
        // $reason is typically a RequestException
        echo "Failed: " . $urls[$index] . " - " . $reason->getMessage() . "\n";
    },
]);

$promise = $pool->promise();
$promise->wait();
?>
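
If you prefer to collect all responses before parsing, Pool::batch() runs the same pool and returns an array of results (responses or exceptions, keyed by request index), reusing $client, $requests and $urls from above:

<?php
$results = Pool::batch($client, $requests, ['concurrency' => 5]);

foreach ($results as $i => $result) {
    if ($result instanceof \Psr\Http\Message\ResponseInterface) {
        echo $urls[$i] . ": HTTP " . $result->getStatusCode() . "\n";
    } else {
        // On failure, the entry is the exception that rejected the request
        echo $urls[$i] . " failed: " . $result->getMessage() . "\n";
    }
}
?>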

Conclusion

Extracting and parsing HTML content from Guzzle responses can be accomplished through various methods, each with its own advantages. DOMDocument and DOMXPath provide robust, built-in solutions for complex parsing tasks, while libraries like Simple HTML DOM Parser and Symfony DomCrawler offer more user-friendly APIs. The choice depends on your specific requirements, the complexity of the HTML structure, and performance considerations.

For modern web applications that heavily rely on JavaScript, you might need to consider browser automation tools that can handle dynamic content or use specialized techniques for AJAX-heavy websites.
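
For example, Symfony Panther drives a real headless Chrome while exposing the familiar Crawler API. A minimal sketch, assuming chromedriver is installed on your machine ('#app' is a placeholder selector for your target page):

composer require symfony/panther

<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');

// Wait for JavaScript-rendered content before reading it
$client->waitFor('#app');
echo "Title: " . $crawler->filter('title')->text() . "\n";

$client->quit();
?>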

Remember to always implement proper error handling, respect robots.txt files, and consider rate limiting to be a responsible web scraper. With these techniques and best practices, you'll be able to efficiently extract the data you need from HTML responses in your PHP applications.
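
The simplest form of rate limiting is Guzzle's built-in delay request option, which waits the given number of milliseconds before each request is sent, as in this sketch ($client and $urls are placeholders for your own setup):

<?php
foreach ($urls as $url) {
    // Roughly one request per second
    $response = $client->request('GET', $url, ['delay' => 1000]);
    // ... parse $response with any of the techniques above
}
?>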

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
