How do I extract all links from an HTML document using Html Agility Pack?
Extracting links from HTML documents is one of the most common web scraping tasks. Html Agility Pack provides powerful methods to parse HTML and extract anchor (<a>) tags efficiently. This guide covers various approaches to extract links, handle different link types, and process link attributes.
Basic Link Extraction
The simplest way to extract all links is to select all anchor tags and retrieve their href attributes:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
// Load HTML from URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// Extract all links using XPath
var links = doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
foreach (var link in links)
{
Console.WriteLine(link);
}
Alternative CSS Selector Approach
You can also use CSS-like selectors with Html Agility Pack extensions:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// Using CSS selector (requires Fizzler.Systems.HtmlAgilityPack NuGet package)
var linkNodes = doc.DocumentNode.QuerySelectorAll("a[href]");
var links = linkNodes
.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList();
Extracting Detailed Link Information
For more comprehensive link extraction, collect the link text along with additional attributes such as title and target:
public class LinkInfo
{
public string Url { get; set; }
public string Text { get; set; }
public string Title { get; set; }
public string Target { get; set; }
public bool IsExternal { get; set; }
}
public static List<LinkInfo> ExtractDetailedLinks(HtmlDocument doc, string baseUrl = "")
{
var links = new List<LinkInfo>();
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes == null) return links;
var baseUri = string.IsNullOrEmpty(baseUrl) ? null : new Uri(baseUrl);
foreach (var node in linkNodes)
{
var href = node.GetAttributeValue("href", "");
if (string.IsNullOrEmpty(href)) continue;
var linkInfo = new LinkInfo
{
Url = href,
Text = node.InnerText?.Trim() ?? "",
Title = node.GetAttributeValue("title", ""),
Target = node.GetAttributeValue("target", "")
};
// Determine if link is external
if (baseUri != null && Uri.TryCreate(baseUri, href, out var absoluteUri))
{
linkInfo.Url = absoluteUri.ToString();
linkInfo.IsExternal = !absoluteUri.Host.Equals(baseUri.Host, StringComparison.OrdinalIgnoreCase);
}
else if (href.StartsWith("http"))
{
linkInfo.IsExternal = true;
}
links.Add(linkInfo);
}
return links;
}
// Usage
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
var detailedLinks = ExtractDetailedLinks(doc, "https://example.com");
foreach (var link in detailedLinks)
{
Console.WriteLine($"URL: {link.Url}");
Console.WriteLine($"Text: {link.Text}");
Console.WriteLine($"External: {link.IsExternal}");
Console.WriteLine("---");
}
Filtering Links by Type
Extract specific types of links based on URL patterns or attributes:
public static class LinkExtractor
{
// Extract only external links (simple substring check on the host; for stricter matching, parse the URL as in ExtractDetailedLinks)
public static List<string> ExtractExternalLinks(HtmlDocument doc, string baseHost)
{
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => href.StartsWith("http") && !href.Contains(baseHost))
.ToList() ?? new List<string>();
}
// Extract only internal links (relative URLs; absolute same-host links are not matched by this filter)
public static List<string> ExtractInternalLinks(HtmlDocument doc)
{
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !href.StartsWith("http") && !href.StartsWith("mailto:") && !href.StartsWith("tel:"))
.ToList() ?? new List<string>();
}
// Extract download links (files)
public static List<string> ExtractDownloadLinks(HtmlDocument doc)
{
var fileExtensions = new[] { ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".zip", ".rar", ".exe" };
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => fileExtensions.Any(ext => href.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
.ToList() ?? new List<string>();
}
// Extract email links
public static List<string> ExtractEmailLinks(HtmlDocument doc)
{
return doc.DocumentNode
.SelectNodes("//a[starts-with(@href, 'mailto:')]")
?.Select(node => node.GetAttributeValue("href", "").Replace("mailto:", ""))
.ToList() ?? new List<string>();
}
}
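A quick usage sketch for these helpers, loading the page the same way as in the earlier examples (the URL is a placeholder):
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
var downloads = LinkExtractor.ExtractDownloadLinks(doc);
var emails = LinkExtractor.ExtractEmailLinks(doc);
Console.WriteLine($"Found {downloads.Count} downloadable files and {emails.Count} email addresses");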
Loading HTML from Different Sources
Html Agility Pack supports loading HTML from various sources:
// From URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// From HTML string
string htmlContent = "<html><body><a href='https://example.com'>Link</a></body></html>";
var doc2 = new HtmlDocument();
doc2.LoadHtml(htmlContent);
// From file
var doc3 = new HtmlDocument();
doc3.Load("path/to/file.html");
// From stream
using var stream = new FileStream("path/to/file.html", FileMode.Open);
var doc4 = new HtmlDocument();
doc4.Load(stream);
Error Handling and Validation
Implement robust error handling when extracting links:
public static List<LinkInfo> SafeExtractLinks(string url, int timeoutSeconds = 30)
{
var links = new List<LinkInfo>();
try
{
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";
// HtmlWeb does not expose a Timeout property; set it on the underlying request instead
web.PreRequest = request =>
{
    request.Timeout = timeoutSeconds * 1000;
    return true;
};
var doc = web.Load(url);
if (doc == null)
{
Console.WriteLine("Failed to load document");
return links;
}
var linkNodes = doc.DocumentNode?.SelectNodes("//a[@href]");
if (linkNodes == null)
{
Console.WriteLine("No links found in document");
return links;
}
foreach (var node in linkNodes)
{
try
{
var href = node.GetAttributeValue("href", "");
if (string.IsNullOrWhiteSpace(href)) continue;
// Resolve the href against the page URL and validate the result
// (Uri.TryCreate requires a Uri instance as the base, not a string)
if (Uri.TryCreate(new Uri(url), href, out var validUri))
{
links.Add(new LinkInfo
{
Url = validUri.ToString(),
Text = HtmlEntity.DeEntitize(node.InnerText?.Trim() ?? ""),
Title = HtmlEntity.DeEntitize(node.GetAttributeValue("title", "")),
Target = node.GetAttributeValue("target", "")
});
}
}
catch (Exception ex)
{
Console.WriteLine($"Error processing link: {ex.Message}");
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Error loading page: {ex.Message}");
}
return links;
}
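Usage is a single call; the URL below is a placeholder:
var links = SafeExtractLinks("https://example.com", timeoutSeconds: 15);
Console.WriteLine($"Extracted {links.Count} valid links");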
Advanced Link Processing
For complex scenarios, you might need to process links found within specific containers or with certain patterns:
public static Dictionary<string, List<string>> ExtractLinksBySection(HtmlDocument doc)
{
var linksBySection = new Dictionary<string, List<string>>();
// Extract links from navigation
var navLinks = doc.DocumentNode
.SelectNodes("//nav//a[@href] | //*[@class='navigation']//a[@href] | //*[@id='nav']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Navigation"] = navLinks;
// Extract links from main content
var contentLinks = doc.DocumentNode
.SelectNodes("//main//a[@href] | //*[@class='content']//a[@href] | //*[@id='content']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Content"] = contentLinks;
// Extract links from footer
var footerLinks = doc.DocumentNode
.SelectNodes("//footer//a[@href] | //*[@class='footer']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Footer"] = footerLinks;
return linksBySection;
}
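For example, to print a per-section summary (assuming doc was loaded as in the earlier examples):
var sections = ExtractLinksBySection(doc);
foreach (var kvp in sections)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value.Count} links");
}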
Installation and Setup
To get started with Html Agility Pack, install it via NuGet:
# Install Html Agility Pack
Install-Package HtmlAgilityPack
# Optional: Install Fizzler for CSS selector support
Install-Package Fizzler.Systems.HtmlAgilityPack
<!-- In your .csproj file -->
<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />
<PackageReference Include="Fizzler.Systems.HtmlAgilityPack" Version="1.2.1" />
Best Practices and Performance Tips
- Reuse HtmlWeb instances for multiple requests to reduce setup overhead (enable UseCookies if you need cookies preserved between requests)
- Set appropriate timeouts to prevent hanging requests
- Use specific XPath expressions instead of selecting all elements when possible
- Handle HTML entities properly using HtmlEntity.DeEntitize()
- Validate URLs before processing to avoid exceptions
- Implement retry logic for failed requests in production scenarios (see the sketch below)
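To illustrate the last point, here is a minimal retry sketch; the attempt count and linear backoff delay are illustrative assumptions, not library defaults:
public static HtmlDocument LoadWithRetry(string url, int maxAttempts = 3)
{
    var web = new HtmlWeb();
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return web.Load(url);
        }
        catch (Exception) when (attempt < maxAttempts)
        {
            // Wait a little longer after each failed attempt before retrying
            System.Threading.Thread.Sleep(1000 * attempt);
        }
    }
}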
Comparison with Other Tools
While Html Agility Pack is excellent for server-side C# applications, you might also consider:
- Puppeteer: For JavaScript-rendered content where handling dynamic page navigation is crucial
- Selenium: When you need full browser automation capabilities
- AngleSharp: A pure C# alternative with W3C compliance
For scenarios requiring JavaScript execution or complex user interactions, tools like Puppeteer become essential for handling AJAX requests and dynamic content.
Conclusion
Html Agility Pack provides a robust and efficient way to extract links from HTML documents in C# applications. Whether you need simple link extraction or complex filtering and categorization, the library offers the flexibility to handle various scenarios. Remember to implement proper error handling and validation to ensure your link extraction code is production-ready.
The examples provided cover the most common use cases, from basic link extraction to advanced filtering and categorization. Choose the approach that best fits your specific requirements and always test with real-world HTML documents to ensure reliability.