How do I extract image URLs from a webpage using Simple HTML DOM?

Extracting image URLs from webpages is a common web scraping task, and Simple HTML DOM provides an efficient way to accomplish this in PHP. This guide covers various methods to extract image URLs, handle different image formats, and implement best practices for robust image extraction.

Basic Image URL Extraction

The most straightforward approach to extract image URLs is to target all <img> elements and retrieve their src attributes:

<?php
require_once 'simple_html_dom.php';

// Load HTML from URL (file_get_html returns false on failure)
$html = file_get_html('https://example.com');
if (!$html) {
    exit("Failed to load the page\n");
}

// Find all img elements and extract src attributes
$images = [];
foreach($html->find('img') as $img) {
    $src = $img->src;
    if (!empty($src)) {
        $images[] = $src;
    }
}

// Display extracted image URLs
foreach($images as $image_url) {
    echo $image_url . "\n";
}

// Clean up memory
$html->clear();
?>

Advanced Image Extraction Techniques

Extracting Images with Specific Classes or IDs

Often, you'll need to target specific images based on their CSS classes or IDs:

<?php
require_once 'simple_html_dom.php';

$html = file_get_html('https://example.com');

// Extract images with specific class
$product_images = [];
foreach($html->find('img.product-image') as $img) {
    $product_images[] = $img->src;
}

// Extract images with specific ID
$hero_image = $html->find('#hero-image', 0);
if ($hero_image) {
    echo "Hero image: " . $hero_image->src . "\n";
}

// Extract images within specific containers
$gallery_images = [];
foreach($html->find('.gallery img') as $img) {
    $gallery_images[] = $img->src;
}

$html->clear();
?>

Handling Different Image Attributes

Modern websites often use additional attributes for responsive images or lazy loading:

<?php
require_once 'simple_html_dom.php';

function extractAllImageSources($html) {
    $images = [];

    foreach($html->find('img') as $img) {
        $image_data = [];

        // Standard src attribute
        if (!empty($img->src)) {
            $image_data['src'] = $img->src;
        }

        // Data attributes for lazy loading
        if (!empty($img->{'data-src'})) {
            $image_data['data-src'] = $img->{'data-src'};
        }

        if (!empty($img->{'data-lazy-src'})) {
            $image_data['data-lazy-src'] = $img->{'data-lazy-src'};
        }

        // Srcset for responsive images
        if (!empty($img->srcset)) {
            $image_data['srcset'] = $img->srcset;
        }

        // Alt text for context
        if (!empty($img->alt)) {
            $image_data['alt'] = $img->alt;
        }

        if (!empty($image_data)) {
            $images[] = $image_data;
        }
    }

    return $images;
}

$html = file_get_html('https://example.com');
$all_images = extractAllImageSources($html);

foreach($all_images as $image) {
    echo "Image data: " . json_encode($image) . "\n";
}

$html->clear();
?>
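
The srcset value collected above is a single string that bundles several candidate URLs with width or density descriptors. The following is a minimal sketch of splitting it into individual URLs. The function name parseSrcset is hypothetical (it is not part of Simple HTML DOM), and the sketch assumes candidates are separated by a comma followed by whitespace, which is a simplification of the full HTML parsing rules:

```php
<?php
/**
 * Split a srcset attribute value into its candidate URLs.
 * Hypothetical helper for illustration; not part of Simple HTML DOM.
 * Assumption: candidates are separated by a comma followed by whitespace.
 */
function parseSrcset(string $srcset): array {
    $candidates = [];

    foreach (preg_split('/,\s+/', trim($srcset)) as $candidate) {
        // Each candidate looks like "URL [descriptor]", e.g. "large.jpg 1200w"
        $parts = preg_split('/\s+/', trim($candidate, ", \t\n\r"), 2);

        if ($parts[0] !== '') {
            $candidates[] = [
                'url' => $parts[0],
                'descriptor' => $parts[1] ?? '',
            ];
        }
    }

    return $candidates;
}

$srcset = 'small.jpg 480w, medium.jpg 800w, large.jpg 1200w';
foreach (parseSrcset($srcset) as $candidate) {
    echo $candidate['url'] . ' (' . $candidate['descriptor'] . ")\n";
}
```

Each returned URL can then be treated like a regular src value, for example converted to an absolute URL or filtered by extension.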

Converting Relative URLs to Absolute URLs

Many websites use relative URLs for images, so you'll need to convert them to absolute URLs:

<?php
require_once 'simple_html_dom.php';

function convertToAbsoluteUrl($relative_url, $base_url) {
    // If already absolute, return as-is
    if (filter_var($relative_url, FILTER_VALIDATE_URL)) {
        return $relative_url;
    }

    $parsed_base = parse_url($base_url);
    $base = $parsed_base['scheme'] . '://' . $parsed_base['host'];

    // Handle protocol-relative URLs
    if (substr($relative_url, 0, 2) == '//') {
        return $parsed_base['scheme'] . ':' . $relative_url;
    }

    // Handle absolute paths
    if (substr($relative_url, 0, 1) == '/') {
        return $base . $relative_url;
    }

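    // Handle relative paths: resolve against the directory of the base URL.
    // rtrim() avoids a double slash when that directory is the site root.
    $base_path = isset($parsed_base['path']) ? rtrim(dirname($parsed_base['path']), '/\\') : '';
    return $base . $base_path . '/' . $relative_url;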
}

function extractAbsoluteImageUrls($url) {
    $html = file_get_html($url);
    $images = [];

    foreach($html->find('img') as $img) {
        if (!empty($img->src)) {
            $absolute_url = convertToAbsoluteUrl($img->src, $url);
            $images[] = $absolute_url;
        }
    }

    $html->clear();
    return $images;
}

$website_url = 'https://example.com';
$image_urls = extractAbsoluteImageUrls($website_url);

foreach($image_urls as $url) {
    echo $url . "\n";
}
?>
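
The conversion above leaves "../" and "./" segments in the result untouched. As a rough sketch of how such paths could be collapsed, the two hypothetical helpers below (normalizePath and resolveRelativePath, neither of which comes from Simple HTML DOM) resolve a plain relative path against the page URL. They assume an http(s) page URL with a scheme and host, and they ignore ports and query strings:

```php
<?php
/**
 * Collapse "." and ".." segments in a URL path.
 * Hypothetical helper for illustration. Empty segments (duplicate
 * slashes, trailing slash) are dropped as a simplification.
 */
function normalizePath(string $path): string {
    $output = [];

    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            // Step up one directory, but never above the root
            array_pop($output);
        } elseif ($segment !== '.' && $segment !== '') {
            $output[] = $segment;
        }
    }

    return '/' . implode('/', $output);
}

/**
 * Resolve a plain relative path against the URL of the page it came from.
 * Assumption: $pageUrl contains a scheme and a host.
 */
function resolveRelativePath(string $relative, string $pageUrl): string {
    $parts = parse_url($pageUrl);
    $path = $parts['path'] ?? '/';

    // Directory of the page: everything up to and including the last slash
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $parts['scheme'] . '://' . $parts['host'] . normalizePath($dir . $relative);
}

echo resolveRelativePath('../images/photo.jpg', 'https://example.com/blog/posts/index.html') . "\n";
```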

Filtering Images by File Extension

To extract only specific types of images, you can filter by file extension:

<?php
require_once 'simple_html_dom.php';

function filterImagesByExtension($image_urls, $allowed_extensions = ['jpg', 'jpeg', 'png', 'gif', 'webp']) {
    $filtered_images = [];

    foreach($image_urls as $url) {
        // Cast to string: parse_url() returns null when the URL has no path
        $path_info = pathinfo((string) parse_url($url, PHP_URL_PATH));
        $extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';

        if (in_array($extension, $allowed_extensions)) {
            $filtered_images[] = $url;
        }
    }

    return $filtered_images;
}

$html = file_get_html('https://example.com');
$all_image_urls = [];

foreach($html->find('img') as $img) {
    if (!empty($img->src)) {
        $all_image_urls[] = $img->src;
    }
}

// Filter for common image formats
$image_urls = filterImagesByExtension($all_image_urls);

echo "Found " . count($image_urls) . " valid images:\n";
foreach($image_urls as $url) {
    echo $url . "\n";
}

$html->clear();
?>
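
Because the filter looks only at the path component, query strings and fragments do not affect the result, while image URLs without any extension (common on CDNs) are dropped. The sketch below isolates that lookup in a hypothetical helper named getUrlExtension so the behavior is easy to check; the example URLs are made up for illustration:

```php
<?php
/**
 * Return the lowercase file extension of a URL's path, or '' if none.
 * Hypothetical helper for illustration.
 */
function getUrlExtension(string $url): string {
    $path = parse_url($url, PHP_URL_PATH);

    // parse_url() yields null or false when there is no usable path
    if (!is_string($path)) {
        return '';
    }

    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

$examples = [
    'https://cdn.example.com/img/photo.JPG?width=300', // query string is ignored
    'https://example.com/image/12345',                 // no extension at all
    '/assets/logo.svg#icon',                           // fragment is ignored
];

foreach ($examples as $url) {
    echo $url . ' => "' . getUrlExtension($url) . "\"\n";
}
```

If extension-less image URLs matter for your use case, checking the Content-Type response header is one alternative to filtering by extension.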

Extracting Background Images from CSS

Sometimes images are defined as CSS background images rather than <img> elements:

<?php
require_once 'simple_html_dom.php';

function extractBackgroundImages($html) {
    $background_images = [];

    // Find elements with style attributes
    foreach($html->find('[style]') as $element) {
        $style = $element->style;

        // Look for background-image in style attribute
        // Excluding ")" from the capture stops the match at the end of url(...)
        if (preg_match('/background-image:\s*url\(["\']?([^"\')]+)["\']?\)/', $style, $matches)) {
            $background_images[] = $matches[1];
        }
    }

    return $background_images;
}

$html = file_get_html('https://example.com');

// Extract regular images
$img_sources = [];
foreach($html->find('img') as $img) {
    if (!empty($img->src)) {
        $img_sources[] = $img->src;
    }
}

// Extract background images
$bg_images = extractBackgroundImages($html);

echo "Regular images: " . count($img_sources) . "\n";
echo "Background images: " . count($bg_images) . "\n";

$all_images = array_merge($img_sources, $bg_images);
$unique_images = array_unique($all_images);

foreach($unique_images as $image) {
    echo $image . "\n";
}

$html->clear();
?>
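
A single style attribute can reference more than one image, for example with layered backgrounds or the background shorthand, and preg_match stops at the first match. The following sketch uses preg_match_all to collect every url(...) reference in a style string. The helper name extractCssUrls is hypothetical, and note that it matches url(...) in any property, not only background-image, and does not filter out data: URIs:

```php
<?php
/**
 * Collect every url(...) reference from an inline style string.
 * Hypothetical helper for illustration; not part of Simple HTML DOM.
 */
function extractCssUrls(string $style): array {
    // Matches url(x), url("x") and url('x'), case-insensitively,
    // tolerating whitespace inside the parentheses
    preg_match_all('/url\(\s*["\']?([^"\')]+?)\s*["\']?\s*\)/i', $style, $matches);

    return $matches[1];
}

$style = 'background-image: url("a.png"), url(b.jpg); color: red';
foreach (extractCssUrls($style) as $url) {
    echo $url . "\n";
}
```

Inside extractBackgroundImages, the result for each element's style could be merged into the output array with array_merge.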

Complete Image Extraction Class

Here's a comprehensive class that combines all the techniques above:

<?php
require_once 'simple_html_dom.php';

class ImageExtractor {
    private $base_url;
    private $allowed_extensions;

    public function __construct($base_url, $allowed_extensions = ['jpg', 'jpeg', 'png', 'gif', 'webp', 'svg']) {
        $this->base_url = $base_url;
        $this->allowed_extensions = $allowed_extensions;
    }

    public function extractImages($url) {
        $html = file_get_html($url);
        if (!$html) {
            throw new Exception("Failed to load HTML from: $url");
        }

        $images = [];

        // Extract from img elements
        $images = array_merge($images, $this->extractImgElements($html));

        // Extract from background images
        $images = array_merge($images, $this->extractBackgroundImages($html));

        // Convert to absolute URLs
        $images = array_map(function($url) {
            return $this->convertToAbsoluteUrl($url);
        }, $images);

        // Filter by extension
        $images = $this->filterByExtension($images);

        // Remove duplicates
        $images = array_unique($images);

        $html->clear();
        return array_values($images);
    }

    private function extractImgElements($html) {
        $images = [];

        foreach($html->find('img') as $img) {
            // Try different source attributes
            $src = $img->src ?: $img->{'data-src'} ?: $img->{'data-lazy-src'};

            if (!empty($src)) {
                $images[] = $src;
            }
        }

        return $images;
    }

    private function extractBackgroundImages($html) {
        $images = [];

        foreach($html->find('[style]') as $element) {
            // Excluding ")" from the capture stops the match at the end of url(...)
            if (preg_match('/background-image:\s*url\(["\']?([^"\')]+)["\']?\)/', $element->style, $matches)) {
                $images[] = $matches[1];
            }
        }

        return $images;
    }

    private function convertToAbsoluteUrl($relative_url) {
        if (filter_var($relative_url, FILTER_VALIDATE_URL)) {
            return $relative_url;
        }

        $parsed_base = parse_url($this->base_url);
        $base = $parsed_base['scheme'] . '://' . $parsed_base['host'];

        if (substr($relative_url, 0, 2) == '//') {
            return $parsed_base['scheme'] . ':' . $relative_url;
        }

        if (substr($relative_url, 0, 1) == '/') {
            return $base . $relative_url;
        }

        // rtrim() avoids a double slash when the base directory is the site root
        $base_path = isset($parsed_base['path']) ? rtrim(dirname($parsed_base['path']), '/\\') : '';
        return $base . $base_path . '/' . $relative_url;
    }

    private function filterByExtension($urls) {
        return array_filter($urls, function($url) {
            // Cast to string: parse_url() returns null when the URL has no path
            $path_info = pathinfo((string) parse_url($url, PHP_URL_PATH));
            $extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';
            return in_array($extension, $this->allowed_extensions);
        });
    }
}

// Usage example
$extractor = new ImageExtractor('https://example.com');
$images = $extractor->extractImages('https://example.com/gallery');

echo "Extracted " . count($images) . " images:\n";
foreach($images as $image) {
    echo $image . "\n";
}
?>

Best Practices and Error Handling

When extracting image URLs, handle load failures, send a user agent with the request, and validate URLs before using them:

<?php
require_once 'simple_html_dom.php';

// Always include error handling
try {
    // Set a user agent to reduce the chance of being blocked
    $context = stream_context_create([
        'http' => [
            'user_agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        ]
    ]);

    // Pass the context as the third argument so the user agent is actually sent
    $html = file_get_html('https://example.com', false, $context);

    if (!$html) {
        throw new Exception("Failed to retrieve webpage");
    }

    // Validate URLs before processing.
    // FILTER_VALIDATE_URL rejects relative URLs, so convert them to absolute first if you need them.
    $images = [];
    foreach($html->find('img') as $img) {
        if (!empty($img->src) && filter_var($img->src, FILTER_VALIDATE_URL)) {
            $images[] = $img->src;
        }
    }

    $html->clear();

} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
?>

Alternative Approaches

Simple HTML DOM parses the HTML returned by the server and does not execute JavaScript, so images that are injected by scripts after page load will not appear in its results. For JavaScript-rendered content, consider a headless browser solution such as Puppeteer, combined with waiting strategies that give the page time to finish loading its dynamic content.

Conclusion

Simple HTML DOM provides a lightweight and efficient way to extract image URLs from webpages. By combining element selection, attribute extraction, URL conversion, and proper filtering, you can build robust image extraction tools. Remember to handle edge cases like relative URLs, different image attributes, and background images to ensure comprehensive coverage.

The techniques shown here can be adapted for various use cases, from building image galleries to creating content analysis tools. Always respect website terms of service and implement appropriate rate limiting when scraping multiple pages.
