How do I extract all links from a webpage using Simple HTML DOM?

Simple HTML DOM Parser is a popular PHP library that makes it easy to extract and manipulate HTML elements from web pages. One of the most common web scraping tasks is extracting all links from a webpage, which is essential for building web crawlers, link checkers, or SEO analysis tools. This comprehensive guide will show you multiple methods to extract links using Simple HTML DOM Parser.

Installing Simple HTML DOM Parser

Before you can extract links, you need to install the Simple HTML DOM Parser library. You can install it using Composer:

composer require sunra/php-simple-html-dom-parser

Alternatively, you can download the library directly and include it in your project:

<?php
require_once 'simple_html_dom.php';

Basic Link Extraction

The most straightforward way to extract all links from a webpage is to use the find() method to select all anchor (<a>) tags:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

// Load HTML from URL (file_get_html() returns false on failure)
$html = HtmlDomParser::file_get_html('https://example.com');

if(!$html) {
    die("Failed to load the page\n");
}

// Find all anchor tags
$links = $html->find('a');

// Extract href attributes
foreach($links as $link) {
    if($link->href) {
        echo $link->href . "\n";
    }
}

// Clean up memory
$html->clear();
?>
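
If you prefer to fetch the HTML yourself, for example with cURL for finer control over headers, redirects, and timeouts, you can parse the resulting string with str_get_html() instead. A minimal sketch (the target URL is a placeholder):

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

// Fetch the page with cURL so we control the request ourselves
$ch = curl_init('https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$body = curl_exec($ch);
curl_close($ch);

if($body !== false) {
    // Parse the fetched string instead of letting the parser fetch the URL
    $html = HtmlDomParser::str_get_html($body);

    if($html) {
        foreach($html->find('a') as $link) {
            if($link->href) {
                echo $link->href . "\n";
            }
        }
        $html->clear();
    }
}
?>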

Extracting Links with Additional Information

For more comprehensive link analysis, you might want to extract not just the URL but also the link text, title attributes, and other metadata:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

function extractLinksWithDetails($url) {
    $html = HtmlDomParser::file_get_html($url);

    if(!$html) {
        return false;
    }

    $links = [];

    foreach($html->find('a') as $link) {
        if($link->href) {
            $links[] = [
                'url' => $link->href,
                'text' => trim($link->plaintext),
                'title' => $link->title ?: '',   // missing attributes return false, not null
                'target' => $link->target ?: '',
                'rel' => $link->rel ?: ''
            ];
        }
    }

    $html->clear();
    return $links;
}

// Usage
$links = extractLinksWithDetails('https://example.com');

if($links === false) {
    die("Failed to load the page\n");
}

foreach($links as $link) {
    echo "URL: " . $link['url'] . "\n";
    echo "Text: " . $link['text'] . "\n";
    echo "Title: " . $link['title'] . "\n";
    echo "---\n";
}
?>

Filtering Links by Type

You can filter links based on specific criteria such as internal vs. external links, or links with specific attributes:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

function filterLinks($url, $baseUrl = null) {
    $html = HtmlDomParser::file_get_html($url);

    if(!$html) {
        return false;
    }

    $internalLinks = [];
    $externalLinks = [];
    $emailLinks = [];
    $telephoneLinks = [];

    foreach($html->find('a') as $link) {
        if(!$link->href) continue;

        $href = $link->href;

        // Email links
        if(strpos($href, 'mailto:') === 0) {
            $emailLinks[] = [
                'url' => $href,
                'text' => trim($link->plaintext)
            ];
        }
        // Telephone links
        elseif(strpos($href, 'tel:') === 0) {
            $telephoneLinks[] = [
                'url' => $href,
                'text' => trim($link->plaintext)
            ];
        }
        // Absolute http/https links: internal if they start with the base URL
        elseif(preg_match('/^https?:\/\//', $href)) {
            if($baseUrl && strpos($href, $baseUrl) === 0) {
                $internalLinks[] = [
                    'url' => $href,
                    'text' => trim($link->plaintext)
                ];
            } else {
                $externalLinks[] = [
                    'url' => $href,
                    'text' => trim($link->plaintext)
                ];
            }
        }
        // Relative links are always internal
        else {
            $internalLinks[] = [
                'url' => $href,
                'text' => trim($link->plaintext)
            ];
        }
    }

    $html->clear();

    return [
        'internal' => $internalLinks,
        'external' => $externalLinks,
        'email' => $emailLinks,
        'telephone' => $telephoneLinks
    ];
}

// Usage
$categorizedLinks = filterLinks('https://example.com', 'https://example.com');

echo "Internal Links: " . count($categorizedLinks['internal']) . "\n";
echo "External Links: " . count($categorizedLinks['external']) . "\n";
echo "Email Links: " . count($categorizedLinks['email']) . "\n";
echo "Telephone Links: " . count($categorizedLinks['telephone']) . "\n";
?>
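
Note that prefix matching with strpos() is only a rough heuristic: it misses same-site links written with a different subdomain (such as www.) and can be fooled by hosts like https://example.com.evil.com. A sturdier approach, sketched below, compares hosts with parse_url(); isInternalLink() is a hypothetical helper name, not part of the library:

<?php
// Classify a link as internal or external by comparing hosts
function isInternalLink($href, $baseUrl) {
    $linkHost = parse_url($href, PHP_URL_HOST);

    // Relative URLs have no host component, so they are internal
    if($linkHost === null || $linkHost === false) {
        return true;
    }

    $baseHost = parse_url($baseUrl, PHP_URL_HOST);

    // Compare hosts case-insensitively, treating "www." as equivalent
    $normalize = fn($host) => preg_replace('/^www\./i', '', strtolower($host));

    return $normalize($linkHost) === $normalize($baseHost);
}

// Usage
var_dump(isInternalLink('/about', 'https://example.com'));                         // bool(true)
var_dump(isInternalLink('https://www.example.com/x', 'https://example.com'));      // bool(true)
var_dump(isInternalLink('https://example.com.evil.com/', 'https://example.com'));  // bool(false)
?>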

Advanced Link Extraction with CSS Selectors

Simple HTML DOM Parser supports CSS selectors, allowing for more sophisticated link extraction:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

function extractLinksWithSelectors($url) {
    $html = HtmlDomParser::file_get_html($url);

    if(!$html) {
        return false;
    }

    $results = [];

    // Links in navigation
    $navLinks = $html->find('nav a, .navigation a, .menu a');
    $results['navigation'] = [];
    foreach($navLinks as $link) {
        if($link->href) {
            $results['navigation'][] = [
                'url' => $link->href,
                'text' => trim($link->plaintext)
            ];
        }
    }

    // Links in main content
    $contentLinks = $html->find('main a, .content a, article a');
    $results['content'] = [];
    foreach($contentLinks as $link) {
        if($link->href) {
            $results['content'][] = [
                'url' => $link->href,
                'text' => trim($link->plaintext)
            ];
        }
    }

    // Links with specific classes
    $buttonLinks = $html->find('a.button, a.btn, a.cta');
    $results['buttons'] = [];
    foreach($buttonLinks as $link) {
        if($link->href) {
            $results['buttons'][] = [
                'url' => $link->href,
                'text' => trim($link->plaintext),
                'class' => $link->class
            ];
        }
    }

    // Links that open in new window/tab
    $newWindowLinks = $html->find('a[target="_blank"]');
    $results['new_window'] = [];
    foreach($newWindowLinks as $link) {
        if($link->href) {
            $results['new_window'][] = [
                'url' => $link->href,
                'text' => trim($link->plaintext)
            ];
        }
    }

    $html->clear();
    return $results;
}

// Usage
$categorizedLinks = extractLinksWithSelectors('https://example.com');

foreach($categorizedLinks as $category => $links) {
    echo ucfirst($category) . " Links (" . count($links) . "):\n";
    foreach($links as $link) {
        echo "  - " . $link['text'] . " -> " . $link['url'] . "\n";
    }
    echo "\n";
}
?>
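
Simple HTML DOM's selector syntax also supports attribute operators (^= for prefix, $= for suffix, *= for substring), which are convenient for targeting links by URL pattern. A short sketch:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

$html = HtmlDomParser::file_get_html('https://example.com');

if($html) {
    // Links whose href starts with https
    foreach($html->find('a[href^=https]') as $link) {
        echo "HTTPS link: " . $link->href . "\n";
    }

    // Links pointing to PDF documents
    foreach($html->find('a[href$=.pdf]') as $link) {
        echo "PDF link: " . $link->href . "\n";
    }

    // Links whose URL contains "download"
    foreach($html->find('a[href*=download]') as $link) {
        echo "Download link: " . $link->href . "\n";
    }

    $html->clear();
}
?>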

Handling Large Pages and Memory Management

When working with large pages or processing many pages, memory management becomes important. Also note that Simple HTML DOM refuses to parse documents larger than its MAX_FILE_SIZE constant (600 KB by default in most versions), in which case file_get_html() returns false:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

function extractLinksEfficiently($urls) {
    // Raise the memory limit once, up front, rather than on every iteration
    ini_set('memory_limit', '256M');

    $allLinks = [];

    foreach($urls as $url) {
        echo "Processing: $url\n";

        $html = HtmlDomParser::file_get_html($url);

        if(!$html) {
            echo "Failed to load: $url\n";
            continue;
        }

        $pageLinks = [];
        $links = $html->find('a');

        foreach($links as $link) {
            if($link->href) {
                $pageLinks[] = $link->href;
            }
        }

        $allLinks[$url] = array_unique($pageLinks);

        // Important: Clear memory after each page
        $html->clear();
        unset($html, $pageLinks, $links);

        // Optional: Force garbage collection
        gc_collect_cycles();

        // Be respectful: add delay between requests
        sleep(1);
    }

    return $allLinks;
}

// Usage
$urls = [
    'https://example.com',
    'https://example.com/about',
    'https://example.com/contact'
];

$results = extractLinksEfficiently($urls);

foreach($results as $url => $links) {
    echo "$url has " . count($links) . " links\n";
}
?>

Error Handling and Validation

Robust link extraction should include proper error handling and URL validation:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

function extractLinksWithValidation($url) {
    try {
        // Validate input URL
        if(!filter_var($url, FILTER_VALIDATE_URL)) {
            throw new InvalidArgumentException("Invalid URL provided");
        }

        // Set user agent to avoid blocking
        $context = stream_context_create([
            'http' => [
                'user_agent' => 'Mozilla/5.0 (compatible; LinkExtractor/1.0)',
                'timeout' => 30
            ]
        ]);

        $html = HtmlDomParser::file_get_html($url, false, $context);

        if(!$html) {
            throw new Exception("Failed to load webpage");
        }

        $validLinks = [];
        $invalidLinks = [];

        foreach($html->find('a') as $link) {
            if(!$link->href) continue;

            $href = trim($link->href);

            // Skip empty links, fragment-only links, and non-HTTP schemes
            if(empty($href) || preg_match('/^(#|javascript:|mailto:|tel:)/i', $href)) {
                continue;
            }

            // Convert relative URLs to absolute (naive: does not resolve ../ segments)
            if(strpos($href, '//') === 0) {
                // Protocol-relative URL: inherit the page's scheme
                $href = parse_url($url, PHP_URL_SCHEME) . ':' . $href;
            } elseif(!preg_match('/^https?:\/\//', $href)) {
                $href = rtrim($url, '/') . '/' . ltrim($href, '/');
            }

            // Validate the constructed URL
            if(filter_var($href, FILTER_VALIDATE_URL)) {
                $validLinks[] = [
                    'url' => $href,
                    'text' => trim($link->plaintext),
                    'original' => $link->href
                ];
            } else {
                $invalidLinks[] = $link->href;
            }
        }

        $html->clear();

        return [
            'valid' => $validLinks,
            'invalid' => $invalidLinks,
            'total' => count($validLinks) + count($invalidLinks)
        ];

    } catch(Exception $e) {
        error_log("Link extraction error: " . $e->getMessage());
        return false;
    }
}

// Usage
$result = extractLinksWithValidation('https://example.com');

if($result) {
    echo "Valid links: " . count($result['valid']) . "\n";
    echo "Invalid links: " . count($result['invalid']) . "\n";
    echo "Total processed: " . $result['total'] . "\n";
} else {
    echo "Failed to extract links\n";
}
?>

Integration with Other Tools

For more complex web scraping scenarios, you might want to combine Simple HTML DOM with other tools. Simple HTML DOM only sees the raw HTML the server returns, so it works well for static content; content rendered client-side by JavaScript requires a browser automation tool such as Puppeteer, or a rendering API, to produce the final HTML first.
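
One common pattern is to let a rendering service or headless browser execute the JavaScript and return the final HTML, then parse that string with Simple HTML DOM as usual. A minimal sketch; the rendering endpoint below is a placeholder, so substitute whatever renderer you use:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

// Hypothetical rendering endpoint that returns fully rendered HTML;
// replace with your own headless-browser service or API
$renderEndpoint = 'https://render.example.com/html?url=';
$target = 'https://example.com';

$renderedHtml = file_get_contents($renderEndpoint . urlencode($target));

if($renderedHtml !== false) {
    // Parse the post-JavaScript HTML just like any static page
    $html = HtmlDomParser::str_get_html($renderedHtml);

    if($html) {
        foreach($html->find('a') as $link) {
            if($link->href) {
                echo $link->href . "\n";
            }
        }
        $html->clear();
    }
}
?>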

Best Practices and Tips

  1. Memory Management: Always call $html->clear() after processing to free memory
  2. Rate Limiting: Add delays between requests to be respectful to target servers
  3. User Agent: Set a proper user agent to avoid being blocked
  4. Error Handling: Implement comprehensive error handling for network issues
  5. URL Validation: Validate and normalize URLs before processing
  6. Relative URLs: Convert relative URLs to absolute URLs for consistency (see the sketch below)
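
The naive concatenation used earlier breaks on links containing ../ segments and on pages that live in subdirectories. A more careful sketch follows; resolveUrl() is a hypothetical helper name, not part of Simple HTML DOM:

<?php
// Resolve a (possibly relative) href against the URL of the page it was found on
function resolveUrl($base, $href) {
    // Already absolute
    if(preg_match('/^https?:\/\//i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $scheme = $parts['scheme'] ?? 'https';
    $host = $parts['host'] ?? '';

    // Protocol-relative: //cdn.example.com/script.js
    if(strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative: /about
    if(strpos($href, '/') === 0) {
        return "$scheme://$host$href";
    }

    // Path-relative: resolve against the directory of the base path
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    $segments = [];
    foreach(explode('/', "$dir/$href") as $segment) {
        if($segment === '' || $segment === '.') continue;
        if($segment === '..') {
            array_pop($segments); // step up one directory
        } else {
            $segments[] = $segment;
        }
    }

    return "$scheme://$host/" . implode('/', $segments);
}

// Usage
echo resolveUrl('https://example.com/blog/post.html', '../images/pic.png') . "\n";
// https://example.com/images/pic.png
?>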

Performance Considerations

When extracting links from multiple pages, consider implementing:

  • Caching: Store results to avoid re-processing the same pages (see the sketch after this list)
  • Parallel Processing: Use libraries like ReactPHP for concurrent requests
  • Database Storage: Store results in a database for large-scale operations
  • Queue Systems: Use job queues for processing large numbers of URLs
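
As a starting point for caching, you can memoize fetched HTML on disk, keyed by URL. A minimal sketch, assuming a writable cache/ directory next to the script; fetchHtmlCached() is a hypothetical helper name:

<?php
require_once 'vendor/autoload.php';
use Sunra\PhpSimple\HtmlDomParser;

// Fetch a page's HTML, serving repeat requests from a disk cache
function fetchHtmlCached($url, $cacheDir = __DIR__ . '/cache', $ttl = 3600) {
    if(!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }

    $cacheFile = $cacheDir . '/' . md5($url) . '.html';

    // Serve from cache while the entry is still fresh
    if(is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $body = file_get_contents($url);
    if($body !== false) {
        file_put_contents($cacheFile, $body);
    }

    return $body;
}

// Usage: parse the (possibly cached) HTML with Simple HTML DOM
$body = fetchHtmlCached('https://example.com');

if($body !== false) {
    $html = HtmlDomParser::str_get_html($body);
    if($html) {
        echo count($html->find('a')) . " links found\n";
        $html->clear();
    }
}
?>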

Simple HTML DOM Parser is an excellent choice for extracting links from static HTML content. For websites that rely heavily on JavaScript for navigation between different pages, you may need to combine it with browser automation tools to ensure you capture all dynamically loaded links.

This comprehensive approach to link extraction will help you build robust web scraping applications that can efficiently process and analyze web content while maintaining good performance and reliability.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
