How can I extract all links from a webpage using Nokogiri?
Extracting links from web pages is one of the most common web scraping tasks, and Nokogiri provides powerful tools to accomplish this efficiently. Whether you're building a web crawler, analyzing site structure, or collecting URLs for further processing, Nokogiri offers multiple approaches to extract links with precision and flexibility.
Basic Link Extraction with CSS Selectors
The simplest way to extract all links from a webpage using Nokogiri is to use a CSS selector that targets <a> tags with an href attribute:
require 'nokogiri'
require 'open-uri'
# Fetch and parse the webpage
url = 'https://example.com'
doc = Nokogiri::HTML(URI.open(url))
# Extract all links using CSS selector
links = doc.css('a[href]').map { |link| link['href'] }
# Display the results
links.each_with_index do |link, index|
puts "#{index + 1}. #{link}"
end
This approach selects all anchor tags that have an href attribute and extracts their URL values. The css('a[href]') selector ensures you only get links that actually have destinations.
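To see the difference, here's a small self-contained sketch that parses an inline HTML fragment (made up for illustration) containing one anchor without an href:
require 'nokogiri'
html = <<~HTML
  <p>
    <a href="https://example.com/about">About</a>
    <a name="top">Top</a>
    <a href="/contact">Contact</a>
  </p>
HTML
doc = Nokogiri::HTML(html)
puts doc.css('a').size        # => 3 (every anchor tag)
puts doc.css('a[href]').size  # => 2 (only anchors with a destination)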
Advanced Link Extraction with Detailed Information
For more comprehensive link analysis, you might want to extract additional information alongside the URLs:
require 'nokogiri'
require 'open-uri'
def extract_detailed_links(url)
doc = Nokogiri::HTML(URI.open(url))
links_data = []
doc.css('a[href]').each do |link|
link_info = {
url: link['href'],
text: link.text.strip,
title: link['title'],
target: link['target'],
rel: link['rel'],
class: link['class']
}
links_data << link_info
end
links_data
end
# Usage
url = 'https://example.com'
detailed_links = extract_detailed_links(url)
detailed_links.each do |link|
puts "URL: #{link[:url]}"
puts "Text: #{link[:text]}"
puts "Title: #{link[:title]}" if link[:title]
puts "---"
end
Using XPath for Link Extraction
XPath provides an alternative method for selecting links, offering more complex selection capabilities:
require 'nokogiri'
require 'open-uri'
# Parse the webpage
doc = Nokogiri::HTML(URI.open('https://example.com'))
# Extract links using XPath
links = doc.xpath('//a[@href]').map { |link| link['href'] }
# More specific XPath examples
external_links = doc.xpath('//a[starts-with(@href, "http")]/@href').map(&:value)
internal_links = doc.xpath('//a[starts-with(@href, "/")]/@href').map(&:value)
email_links = doc.xpath('//a[starts-with(@href, "mailto:")]/@href').map(&:value)
puts "External links: #{external_links.count}"
puts "Internal links: #{internal_links.count}"
puts "Email links: #{email_links.count}"
Filtering and Categorizing Links
Often, you'll need to filter links based on specific criteria. Here's how to categorize different types of links:
require 'nokogiri'
require 'open-uri'
require 'uri'
def categorize_links(url)
doc = Nokogiri::HTML(URI.open(url))
base_uri = URI.parse(url)
categories = {
external: [],
internal: [],
email: [],
phone: [],
anchor: [],
file_downloads: []
}
doc.css('a[href]').each do |link|
href = link['href'].strip
case href
when /^mailto:/
categories[:email] << href
when /^tel:/
categories[:phone] << href
when /^#/
categories[:anchor] << href
when /\.(pdf|doc|docx|xls|xlsx|zip|rar)$/i
categories[:file_downloads] << href
when /^https?:\/\//
link_uri = URI.parse(href)
if link_uri.host == base_uri.host
categories[:internal] << href
else
categories[:external] << href
end
when /^\//
categories[:internal] << href
else
# Relative links
categories[:internal] << href
end
end
categories
end
# Usage
categorized = categorize_links('https://example.com')
categorized.each do |category, links|
puts "#{category.to_s.capitalize}: #{links.count} links"
end
Handling Relative URLs
When extracting links, you'll often encounter relative URLs that need to be converted to absolute URLs:
require 'nokogiri'
require 'open-uri'
require 'uri'
def extract_absolute_links(url)
doc = Nokogiri::HTML(URI.open(url))
base_uri = URI.parse(url)
absolute_links = []
doc.css('a[href]').each do |link|
href = link['href']
begin
# Convert relative URLs to absolute
absolute_url = URI.join(base_uri, href).to_s
absolute_links << absolute_url
rescue URI::InvalidURIError
# Skip invalid URLs
puts "Skipping invalid URL: #{href}"
end
end
absolute_links.uniq
end
# Usage
absolute_links = extract_absolute_links('https://example.com')
puts "Found #{absolute_links.count} unique absolute links"
Advanced Filtering with Custom Methods
For complex link extraction scenarios, you can create custom filtering methods:
require 'nokogiri'
require 'open-uri'
class LinkExtractor
def initialize(url)
@doc = Nokogiri::HTML(URI.open(url))
@base_url = url
end
def extract_links_by_text(pattern)
@doc.css('a[href]').select do |link|
link.text.match?(pattern)
end.map { |link| link['href'] }
end
def extract_links_by_domain(domain)
@doc.css('a[href]').select do |link|
href = link['href']
href.include?(domain) if href
end.map { |link| link['href'] }
end
def extract_navigation_links
@doc.css('nav a[href], .navigation a[href], .menu a[href]').map do |link|
{
url: link['href'],
text: link.text.strip
}
end
end
def extract_content_links(exclude_nav: true)
  links = @doc.css('a[href]')
  if exclude_nav
    # Nokogiri's :not() only accepts simple selectors, so filter out
    # links inside navigation containers with Node#ancestors instead
    links = links.reject do |link|
      link.ancestors('nav').any? || link.ancestors('.navigation').any? || link.ancestors('.menu').any?
    end
  end
  links.map do |link|
    {
      url: link['href'],
      text: link.text.strip,
      context: link.parent.name
    }
  end
end
end
# Usage
extractor = LinkExtractor.new('https://example.com')
# Extract links containing specific text
blog_links = extractor.extract_links_by_text(/blog|article|post/i)
puts "Blog-related links: #{blog_links.count}"
# Extract navigation links
nav_links = extractor.extract_navigation_links
puts "Navigation links found: #{nav_links.count}"
Error Handling and Robust Extraction
When extracting links from real-world websites, it's important to handle errors gracefully:
require 'nokogiri'
require 'open-uri'
require 'timeout'
def robust_link_extraction(url, timeout_seconds: 30)
links = []
begin
Timeout.timeout(timeout_seconds) do
doc = Nokogiri::HTML(URI.open(url, {
'User-Agent' => 'Mozilla/5.0 (compatible; LinkExtractor/1.0)'
}))
doc.css('a[href]').each do |link|
href = link['href']
next if href.nil? || href.empty?
# Clean and validate the link
cleaned_href = href.strip
next if cleaned_href.start_with?('javascript:', 'data:')
links << {
url: cleaned_href,
text: link.text.strip.gsub(/\s+/, ' '),
anchor_text: link.text.strip
}
end
end
rescue Timeout::Error
puts "Timeout error: Request took longer than #{timeout_seconds} seconds"
rescue OpenURI::HTTPError => e
puts "HTTP error: #{e.message}"
rescue SocketError => e
puts "Network error: #{e.message}"
rescue StandardError => e
puts "Unexpected error: #{e.message}"
end
links.uniq { |link| link[:url] }
end
# Usage with error handling
links = robust_link_extraction('https://example.com')
puts "Successfully extracted #{links.count} links"
Performance Optimization for Large Pages
For pages with many links, you can optimize extraction performance:
require 'nokogiri'
require 'open-uri'
# SAX handler that collects anchor URLs and text without building a
# full DOM tree, which keeps memory usage low on very large documents
class LinkHandler < Nokogiri::XML::SAX::Document
  attr_reader :links

  def initialize
    @links = []
    @current_href = nil
    @current_text = ''
  end

  def start_element(name, attributes = [])
    return unless name == 'a'
    @current_href = Hash[attributes]['href']
    @current_text = ''
  end

  def characters(string)
    @current_text += string if @current_href
  end

  def end_element(name)
    return unless name == 'a' && @current_href
    @links << { url: @current_href, text: @current_text.strip }
    @current_href = nil
  end
end

def optimized_link_extraction(url)
  html = URI.open(url).read

  # Stream the document through the SAX handler instead of building a DOM
  handler = LinkHandler.new
  Nokogiri::HTML::SAX::Parser.new(handler).parse(html)
  handler.links
end

# Alternative: standard DOM parsing, discarding blank text nodes,
# then extracting links with a single XPath query
def dom_link_extraction(url)
  doc = Nokogiri::HTML(URI.open(url)) { |config| config.noblanks }
  doc.xpath('//a[@href]').map do |link|
    {
      url: link['href'],
      text: link.content.strip
    }
  end
end
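As with the earlier examples, a short usage sketch (using the same placeholder URL as above):
# Usage
links = optimized_link_extraction('https://example.com')
puts "Extracted #{links.count} links via SAX parsing"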
Integration with Web Scraping Workflows
When building larger web scraping applications, link extraction often serves as the foundation for crawling multiple pages. While Nokogiri excels at parsing static HTML content, you might need to combine it with other tools for JavaScript-heavy sites. For handling dynamic content that loads after the initial page load, consider how to handle AJAX requests using Puppeteer or explore navigating to different pages using Puppeteer for comprehensive crawling solutions.
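As a starting point, a minimal breadth-first crawler can reuse the extract_absolute_links helper defined earlier to walk same-domain pages up to a fixed limit (the max_pages value below is an arbitrary illustrative choice, and the sketch deliberately ignores robots.txt, rate limiting, and retries):
require 'set'
def crawl(start_url, max_pages: 10)
  host    = URI.parse(start_url).host
  queue   = [start_url]
  visited = Set.new
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url)
    visited << url
    begin
      links = extract_absolute_links(url)
    rescue OpenURI::HTTPError, SocketError
      next # skip pages that fail to load
    end
    # Enqueue unvisited links that stay on the same host
    links.each do |link|
      queue << link if URI.parse(link).host == host && !visited.include?(link)
    end
  end
  visited.to_a
end
crawled = crawl('https://example.com', max_pages: 5)
puts "Crawled #{crawled.count} pages"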
Conclusion
Extracting links with Nokogiri is straightforward and powerful, offering multiple approaches from simple CSS selectors to complex XPath expressions. The key to successful link extraction lies in understanding your specific requirements: whether you need all links, specific types of links, or detailed metadata about each link.
Remember to handle errors gracefully, respect website policies, and consider the performance implications when working with large pages. With these techniques, you'll be able to efficiently extract and process links for any Ruby-based web scraping project.
The combination of Nokogiri's parsing capabilities with Ruby's string manipulation and URI handling makes it an excellent choice for link extraction tasks, whether you're building a simple link checker or a complex web crawler.