How do I extract all links from an HTML document using Html Agility Pack?
Extracting links from HTML documents is one of the most common web scraping tasks. Html Agility Pack provides powerful methods to parse HTML and extract anchor (<a>) tags efficiently. This guide covers various approaches to extract links, handle different link types, and process link attributes.
Basic Link Extraction
The simplest way to extract all links is to select all anchor tags and retrieve their href attributes:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
// Load HTML from URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// Extract all links using XPath
var links = doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
foreach (var link in links)
{
Console.WriteLine(link);
}
Alternative CSS Selector Approach
You can also use CSS-like selectors with Html Agility Pack extensions:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// Using CSS selector (requires Fizzler.Systems.HtmlAgilityPack NuGet package)
var linkNodes = doc.DocumentNode.QuerySelectorAll("a[href]");
var links = linkNodes
.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList();
Extracting Detailed Link Information
For more comprehensive link extraction, collect the link text along with additional attributes such as title and target:
public class LinkInfo
{
public string Url { get; set; }
public string Text { get; set; }
public string Title { get; set; }
public string Target { get; set; }
public bool IsExternal { get; set; }
}
public static List<LinkInfo> ExtractDetailedLinks(HtmlDocument doc, string baseUrl = "")
{
var links = new List<LinkInfo>();
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes == null) return links;
var baseUri = string.IsNullOrEmpty(baseUrl) ? null : new Uri(baseUrl);
foreach (var node in linkNodes)
{
var href = node.GetAttributeValue("href", "");
if (string.IsNullOrEmpty(href)) continue;
var linkInfo = new LinkInfo
{
Url = href,
Text = node.InnerText?.Trim() ?? "",
Title = node.GetAttributeValue("title", ""),
Target = node.GetAttributeValue("target", "")
};
// Determine if link is external
if (baseUri != null && Uri.TryCreate(baseUri, href, out var absoluteUri))
{
linkInfo.Url = absoluteUri.ToString();
linkInfo.IsExternal = !absoluteUri.Host.Equals(baseUri.Host, StringComparison.OrdinalIgnoreCase);
}
else if (href.StartsWith("http"))
{
linkInfo.IsExternal = true;
}
links.Add(linkInfo);
}
return links;
}
// Usage
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
var detailedLinks = ExtractDetailedLinks(doc, "https://example.com");
foreach (var link in detailedLinks)
{
Console.WriteLine($"URL: {link.Url}");
Console.WriteLine($"Text: {link.Text}");
Console.WriteLine($"External: {link.IsExternal}");
Console.WriteLine("---");
}
Filtering Links by Type
Extract specific types of links based on URL patterns or attributes:
public static class LinkExtractor
{
// Extract only external links (simple substring check on the host; for stricter matching, parse the URL as in ExtractDetailedLinks)
public static List<string> ExtractExternalLinks(HtmlDocument doc, string baseHost)
{
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => href.StartsWith("http") && !href.Contains(baseHost))
.ToList() ?? new List<string>();
}
// Extract only internal links (relative URLs; absolute same-host links are not matched by this filter)
public static List<string> ExtractInternalLinks(HtmlDocument doc)
{
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !href.StartsWith("http") && !href.StartsWith("mailto:") && !href.StartsWith("tel:"))
.ToList() ?? new List<string>();
}
// Extract download links (files)
public static List<string> ExtractDownloadLinks(HtmlDocument doc)
{
var fileExtensions = new[] { ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".zip", ".rar", ".exe" };
return doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => fileExtensions.Any(ext => href.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
.ToList() ?? new List<string>();
}
// Extract email links
public static List<string> ExtractEmailLinks(HtmlDocument doc)
{
return doc.DocumentNode
.SelectNodes("//a[starts-with(@href, 'mailto:')]")
?.Select(node => node.GetAttributeValue("href", "").Replace("mailto:", ""))
.ToList() ?? new List<string>();
}
}
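A quick usage sketch for these helpers, loading the page the same way as in the earlier examples (the URL is a placeholder):
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
var downloads = LinkExtractor.ExtractDownloadLinks(doc);
var emails = LinkExtractor.ExtractEmailLinks(doc);
Console.WriteLine($"Found {downloads.Count} downloadable files and {emails.Count} email addresses");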
Loading HTML from Different Sources
Html Agility Pack supports loading HTML from various sources:
// From URL
var web = new HtmlWeb();
var doc = web.Load("https://example.com");
// From HTML string
string htmlContent = "<html><body><a href='https://example.com'>Link</a></body></html>";
var doc2 = new HtmlDocument();
doc2.LoadHtml(htmlContent);
// From file
var doc3 = new HtmlDocument();
doc3.Load("path/to/file.html");
// From stream
using var stream = new FileStream("path/to/file.html", FileMode.Open);
var doc4 = new HtmlDocument();
doc4.Load(stream);
Error Handling and Validation
Implement robust error handling when extracting links:
public static List<LinkInfo> SafeExtractLinks(string url, int timeoutSeconds = 30)
{
var links = new List<LinkInfo>();
try
{
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";
// HtmlWeb does not expose a Timeout property; set it on the underlying request instead
web.PreRequest = request =>
{
    request.Timeout = timeoutSeconds * 1000;
    return true;
};
var doc = web.Load(url);
if (doc == null)
{
Console.WriteLine("Failed to load document");
return links;
}
var linkNodes = doc.DocumentNode?.SelectNodes("//a[@href]");
if (linkNodes == null)
{
Console.WriteLine("No links found in document");
return links;
}
foreach (var node in linkNodes)
{
try
{
var href = node.GetAttributeValue("href", "");
if (string.IsNullOrWhiteSpace(href)) continue;
// Resolve the href against the page URL and validate the result
// (Uri.TryCreate requires a Uri instance as the base, not a string)
if (Uri.TryCreate(new Uri(url), href, out var validUri))
{
links.Add(new LinkInfo
{
Url = validUri.ToString(),
Text = HtmlEntity.DeEntitize(node.InnerText?.Trim() ?? ""),
Title = HtmlEntity.DeEntitize(node.GetAttributeValue("title", "")),
Target = node.GetAttributeValue("target", "")
});
}
}
catch (Exception ex)
{
Console.WriteLine($"Error processing link: {ex.Message}");
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Error loading page: {ex.Message}");
}
return links;
}
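Usage is a single call; the URL below is a placeholder:
var links = SafeExtractLinks("https://example.com", timeoutSeconds: 15);
Console.WriteLine($"Extracted {links.Count} valid links");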
Advanced Link Processing
For complex scenarios, you might need to process links found within specific containers or with certain patterns:
public static Dictionary<string, List<string>> ExtractLinksBySection(HtmlDocument doc)
{
var linksBySection = new Dictionary<string, List<string>>();
// Extract links from navigation
var navLinks = doc.DocumentNode
.SelectNodes("//nav//a[@href] | //*[@class='navigation']//a[@href] | //*[@id='nav']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Navigation"] = navLinks;
// Extract links from main content
var contentLinks = doc.DocumentNode
.SelectNodes("//main//a[@href] | //*[@class='content']//a[@href] | //*[@id='content']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Content"] = contentLinks;
// Extract links from footer
var footerLinks = doc.DocumentNode
.SelectNodes("//footer//a[@href] | //*[@class='footer']//a[@href]")
?.Select(node => node.GetAttributeValue("href", ""))
.Where(href => !string.IsNullOrEmpty(href))
.ToList() ?? new List<string>();
linksBySection["Footer"] = footerLinks;
return linksBySection;
}
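For example, to print a per-section summary (assuming doc was loaded as in the earlier examples):
var sections = ExtractLinksBySection(doc);
foreach (var kvp in sections)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value.Count} links");
}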
Installation and Setup
To get started with Html Agility Pack, install it via NuGet:
# Install Html Agility Pack
Install-Package HtmlAgilityPack
# Optional: Install Fizzler for CSS selector support
Install-Package Fizzler.Systems.HtmlAgilityPack
<!-- In your .csproj file -->
<PackageReference Include="HtmlAgilityPack" Version="1.11.54" />
<PackageReference Include="Fizzler.Systems.HtmlAgilityPack" Version="1.2.1" />
Best Practices and Performance Tips
- Reuse HtmlWeb instances for multiple requests to reduce setup overhead (enable UseCookies if you need cookies preserved between requests)
- Set appropriate timeouts to prevent hanging requests
- Use specific XPath expressions instead of selecting all elements when possible
- Handle HTML entities properly using HtmlEntity.DeEntitize()
- Validate URLs before processing to avoid exceptions
- Implement retry logic for failed requests in production scenarios (see the sketch below)
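To illustrate the last point, here is a minimal retry sketch; the attempt count and linear backoff delay are illustrative assumptions, not library defaults:
public static HtmlDocument LoadWithRetry(string url, int maxAttempts = 3)
{
    var web = new HtmlWeb();
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return web.Load(url);
        }
        catch (Exception) when (attempt < maxAttempts)
        {
            // Wait a little longer after each failed attempt before retrying
            System.Threading.Thread.Sleep(1000 * attempt);
        }
    }
}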
Comparison with Other Tools
While Html Agility Pack is excellent for server-side C# applications, you might also consider:
- Puppeteer: For JavaScript-rendered content where handling dynamic page navigation is crucial
- Selenium: When you need full browser automation capabilities
- AngleSharp: A pure C# alternative with W3C compliance
For scenarios requiring JavaScript execution or complex user interactions, tools like Puppeteer become essential for handling AJAX requests and dynamic content.
Conclusion
Html Agility Pack provides a robust and efficient way to extract links from HTML documents in C# applications. Whether you need simple link extraction or complex filtering and categorization, the library offers the flexibility to handle various scenarios. Remember to implement proper error handling and validation to ensure your link extraction code is production-ready.
The examples provided cover the most common use cases, from basic link extraction to advanced filtering and categorization. Choose the approach that best fits your specific requirements and always test with real-world HTML documents to ensure reliability.