How can I extract all links from a webpage using Nokogiri?

Extracting links from web pages is one of the most common web scraping tasks, and Nokogiri provides powerful tools to accomplish this efficiently. Whether you're building a web crawler, analyzing site structure, or collecting URLs for further processing, Nokogiri offers multiple approaches to extract links with precision and flexibility.

Basic Link Extraction with CSS Selectors

The simplest way to extract all links from a webpage using Nokogiri is to use CSS selectors to target <a> tags with href attributes:

require 'nokogiri'
require 'open-uri'

# Fetch and parse the webpage
url = 'https://example.com'
doc = Nokogiri::HTML(URI.open(url))

# Extract all links using CSS selector
links = doc.css('a[href]').map { |link| link['href'] }

# Display the results
links.each_with_index do |link, index|
  puts "#{index + 1}. #{link}"
end

This approach selects all anchor tags that have an href attribute and extracts the URL values. The css('a[href]') selector ensures you only get links that actually have destinations.
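
As a quick illustration on a small inline fragment (the two-link snippet below is just an example), plain a also matches anchors that have no href and returns nil for them, while a[href] skips them entirely:

require 'nokogiri'

html = '<a href="/about">About</a> <a name="top">Top</a>'
doc = Nokogiri::HTML(html)

doc.css('a').map { |a| a['href'] }       # => ["/about", nil]
doc.css('a[href]').map { |a| a['href'] } # => ["/about"]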

Advanced Link Extraction with Detailed Information

For more comprehensive link analysis, you might want to extract additional information alongside the URLs:

require 'nokogiri'
require 'open-uri'

def extract_detailed_links(url)
  doc = Nokogiri::HTML(URI.open(url))

  links_data = []

  doc.css('a[href]').each do |link|
    link_info = {
      url: link['href'],
      text: link.text.strip,
      title: link['title'],
      target: link['target'],
      rel: link['rel'],
      class: link['class']
    }

    links_data << link_info
  end

  links_data
end

# Usage
url = 'https://example.com'
detailed_links = extract_detailed_links(url)

detailed_links.each do |link|
  puts "URL: #{link[:url]}"
  puts "Text: #{link[:text]}"
  puts "Title: #{link[:title]}" if link[:title]
  puts "---"
end

Using XPath for Link Extraction

XPath provides an alternative way to select links, with functions such as starts-with() that make it easy to filter by URL prefix:

require 'nokogiri'
require 'open-uri'

# Parse the webpage
doc = Nokogiri::HTML(URI.open('https://example.com'))

# Extract links using XPath
links = doc.xpath('//a[@href]').map { |link| link['href'] }

# More specific XPath examples
external_links = doc.xpath('//a[starts-with(@href, "http")]/@href').map(&:value)
internal_links = doc.xpath('//a[starts-with(@href, "/")]/@href').map(&:value)
email_links = doc.xpath('//a[starts-with(@href, "mailto:")]/@href').map(&:value)

puts "External links: #{external_links.count}"
puts "Internal links: #{internal_links.count}"
puts "Email links: #{email_links.count}"

Filtering and Categorizing Links

Often, you'll need to filter links based on specific criteria. Here's how to categorize different types of links:

require 'nokogiri'
require 'open-uri'
require 'uri'

def categorize_links(url)
  doc = Nokogiri::HTML(URI.open(url))
  base_uri = URI.parse(url)

  categories = {
    external: [],
    internal: [],
    email: [],
    phone: [],
    anchor: [],
    file_downloads: []
  }

  doc.css('a[href]').each do |link|
    href = link['href'].strip

    case href
    when /^mailto:/
      categories[:email] << href
    when /^tel:/
      categories[:phone] << href
    when /^#/
      categories[:anchor] << href
    when /\.(pdf|doc|docx|xls|xlsx|zip|rar)$/i
      categories[:file_downloads] << href
    when /^https?:\/\//
      link_uri = URI.parse(href)
      if link_uri.host == base_uri.host
        categories[:internal] << href
      else
        categories[:external] << href
      end
    when /^\//
      categories[:internal] << href
    else
      # Relative links
      categories[:internal] << href
    end
  end

  categories
end

# Usage
categorized = categorize_links('https://example.com')
categorized.each do |category, links|
  puts "#{category.to_s.capitalize}: #{links.count} links"
end

Handling Relative URLs

When extracting links, you'll often encounter relative URLs that need to be converted to absolute URLs:

require 'nokogiri'
require 'open-uri'
require 'uri'

def extract_absolute_links(url)
  doc = Nokogiri::HTML(URI.open(url))
  base_uri = URI.parse(url)

  absolute_links = []

  doc.css('a[href]').each do |link|
    href = link['href']

    begin
      # Convert relative URLs to absolute
      absolute_url = URI.join(base_uri, href).to_s
      absolute_links << absolute_url
    rescue URI::InvalidURIError
      # Skip invalid URLs
      puts "Skipping invalid URL: #{href}"
    end
  end

  absolute_links.uniq
end

# Usage
absolute_links = extract_absolute_links('https://example.com')
puts "Found #{absolute_links.count} unique absolute links"

Advanced Filtering with Custom Methods

For complex link extraction scenarios, you can create custom filtering methods:

require 'nokogiri'
require 'open-uri'

class LinkExtractor
  def initialize(url)
    @doc = Nokogiri::HTML(URI.open(url))
    @base_url = url
  end

  def extract_links_by_text(pattern)
    @doc.css('a[href]').select do |link|
      link.text.match?(pattern)
    end.map { |link| link['href'] }
  end

  def extract_links_by_domain(domain)
    @doc.css('a[href]')
        .select { |link| link['href'].include?(domain) }
        .map { |link| link['href'] }
  end

  def extract_navigation_links
    @doc.css('nav a[href], .navigation a[href], .menu a[href]').map do |link|
      {
        url: link['href'],
        text: link.text.strip
      }
    end
  end

  def extract_content_links(exclude_nav: true)
    links = @doc.css('a[href]')

    if exclude_nav
      # CSS :not() can't reliably express "not inside nav" in Nokogiri,
      # so filter out links with a nav/.navigation/.menu ancestor in Ruby
      links = links.reject do |link|
        link.ancestors.any? do |node|
          node.name == 'nav' ||
            node['class'].to_s.split.any? { |c| %w[navigation menu].include?(c) }
        end
      end
    end

    links.map do |link|
      {
        url: link['href'],
        text: link.text.strip,
        context: link.parent.name
      }
    end
  end
end

# Usage
extractor = LinkExtractor.new('https://example.com')

# Extract links containing specific text
blog_links = extractor.extract_links_by_text(/blog|article|post/i)
puts "Blog-related links: #{blog_links.count}"

# Extract navigation links
nav_links = extractor.extract_navigation_links
puts "Navigation links found: #{nav_links.count}"

Error Handling and Robust Extraction

When extracting links from real-world websites, it's important to handle errors gracefully:

require 'nokogiri'
require 'open-uri'
require 'timeout'

def robust_link_extraction(url, timeout_seconds: 30)
  links = []

  begin
    Timeout::timeout(timeout_seconds) do
      doc = Nokogiri::HTML(URI.open(url, {
        'User-Agent' => 'Mozilla/5.0 (compatible; LinkExtractor/1.0)'
      }))

      doc.css('a[href]').each do |link|
        href = link['href']
        next if href.nil? || href.empty?

        # Clean and validate the link
        cleaned_href = href.strip
        next if cleaned_href.start_with?('javascript:', 'data:')

        links << {
          url: cleaned_href,
          text: link.text.strip.gsub(/\s+/, ' '),
          anchor_text: link.text.strip
        }
      end
    end

  rescue Timeout::Error
    puts "Timeout error: Request took longer than #{timeout_seconds} seconds"
  rescue OpenURI::HTTPError => e
    puts "HTTP error: #{e.message}"
  rescue SocketError => e
    puts "Network error: #{e.message}"
  rescue StandardError => e
    puts "Unexpected error: #{e.message}"
  end

  links.uniq { |link| link[:url] }
end

# Usage with error handling
links = robust_link_extraction('https://example.com')
puts "Successfully extracted #{links.count} links"

Performance Optimization for Large Pages

For pages with many links, a streaming SAX parser keeps memory usage low because it never builds the full DOM, while a single XPath pass over a conventionally parsed document is usually fast enough:

require 'nokogiri'
require 'open-uri'

# SAX handler that collects links without building the full DOM tree,
# which keeps memory usage low on very large documents
class LinkHandler < Nokogiri::XML::SAX::Document
  attr_reader :links

  def initialize
    @links = []
    @current_element = nil
    @current_text = ''
  end

  def start_element(name, attributes = [])
    if name == 'a'
      @current_element = Hash[attributes]
      @current_text = ''
    end
  end

  def characters(string)
    @current_text += string if @current_element
  end

  def end_element(name)
    if name == 'a' && @current_element && @current_element['href']
      @links << {
        url: @current_element['href'],
        text: @current_text.strip
      }
      @current_element = nil
    end
  end
end

def optimized_link_extraction(url)
  handler = LinkHandler.new
  Nokogiri::HTML::SAX::Parser.new(handler).parse(URI.open(url))
  handler.links
end

# Alternative: standard DOM parsing with blank text nodes stripped,
# extracting the links in a single XPath pass
def dom_link_extraction(url)
  doc = Nokogiri::HTML(URI.open(url)) { |config| config.noblanks }

  doc.xpath('//a[@href]').map do |link|
    {
      url: link['href'],
      text: link.content.strip
    }
  end
end
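
As in the earlier sections, a short usage sketch (the URL is just a placeholder):

# Usage
links = optimized_link_extraction('https://example.com')
puts "SAX parser found #{links.count} links"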

Integration with Web Scraping Workflows

When building larger web scraping applications, link extraction often serves as the foundation for crawling multiple pages. Nokogiri excels at parsing static HTML, but JavaScript-heavy sites may require pairing it with a headless browser; for dynamic content that loads after the initial page load, see how to handle AJAX requests using Puppeteer and navigating to different pages using Puppeteer for comprehensive crawling solutions.
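
For static sites, here is a minimal sketch of such a crawl loop, reusing the extract_absolute_links helper defined earlier; the page limit and same-host restriction are illustrative choices, not requirements:

require 'set'
require 'uri'

# Breadth-first crawl built on extract_absolute_links from the earlier section.
# Illustrative only: a production crawler also needs politeness delays and
# robots.txt handling.
def crawl(start_url, max_pages: 10)
  host = URI.parse(start_url).host
  visited = Set.new
  queue = [start_url]

  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url)

    visited << url

    begin
      extract_absolute_links(url).each do |link|
        # Stay on the same host and skip pages already seen
        queue << link if URI.parse(link).host == host && !visited.include?(link)
      end
    rescue StandardError => e
      puts "Skipping #{url}: #{e.message}"
    end
  end

  visited.to_a
end

# Usage
pages = crawl('https://example.com', max_pages: 5)
puts "Crawled #{pages.count} pages"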

Conclusion

Extracting links with Nokogiri is straightforward and powerful, offering multiple approaches from simple CSS selectors to complex XPath expressions. The key to successful link extraction lies in understanding your specific requirements: whether you need all links, specific types of links, or detailed metadata about each link.

Remember to handle errors gracefully, respect website policies, and consider the performance implications when working with large pages. With these techniques, you'll be able to efficiently extract and process links for any Ruby-based web scraping project.

The combination of Nokogiri's parsing capabilities with Ruby's string manipulation and URI handling makes it an excellent choice for link extraction tasks, whether you're building a simple link checker or a complex web crawler.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
