How do I handle relative URLs when scraping with Ruby?

When web scraping with Ruby, you'll frequently encounter relative URLs that need to be converted to absolute URLs for proper navigation and data extraction. Relative URLs are incomplete web addresses that depend on the current page's context to form a complete URL. This guide provides comprehensive techniques for handling relative URLs effectively in Ruby scraping projects.

Understanding Relative URLs

Relative URLs come in several forms:

  • Root-relative: /about (starts with /)
  • Document-relative: about.html (no leading /)
  • Protocol-relative: //example.com/page (starts with //)
  • Fragment-relative: #section (starts with #)
  • Query-relative: ?page=2 (starts with ?)

Using Ruby's URI Module

Ruby's built-in URI module provides the most reliable way to resolve relative URLs:

require 'uri'
require 'net/http'

# Base URL of the current page
base_url = 'https://example.com/products/category'

# Relative URLs found on the page
relative_urls = [
  '/about',
  'item1.html',
  '../contact',
  '//cdn.example.com/image.jpg',
  '?page=2'
]

# Convert relative URLs to absolute URLs
relative_urls.each do |relative_url|
  absolute_url = URI.join(base_url, relative_url)
  puts "#{relative_url} -> #{absolute_url}"
end

Output:

/about -> https://example.com/about
item1.html -> https://example.com/products/item1.html
../contact -> https://example.com/contact
//cdn.example.com/image.jpg -> https://cdn.example.com/image.jpg
?page=2 -> https://example.com/products/category?page=2
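
Note that URI.join follows RFC 3986 resolution rules, so the result depends on whether the base path ends with a slash: without one, the last path segment is treated as a document and replaced. A minimal illustration using the same base as above:

require 'uri'

# Without a trailing slash, 'category' is treated as a document and replaced
URI.join('https://example.com/products/category', 'item1.html').to_s
# => "https://example.com/products/item1.html"

# With a trailing slash, 'category/' is treated as a directory and kept
URI.join('https://example.com/products/category/', 'item1.html').to_s
# => "https://example.com/products/category/item1.html"

When scraping, use the exact URL of the page you fetched as the base so resolution matches what a browser would do.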

Integrating with Nokogiri for HTML Parsing

When scraping with Nokogiri, you can automatically resolve relative URLs while parsing:

require 'nokogiri'
require 'open-uri'
require 'uri'

class RelativeURLScraper
  def initialize(base_url)
    @base_url = base_url
    @base_uri = URI.parse(base_url)
  end

  def scrape_links
    doc = Nokogiri::HTML(URI.open(@base_url))
    links = []

    doc.css('a[href]').each do |link|
      href = link['href']
      absolute_url = resolve_url(href)

      links << {
        text: link.text.strip,
        url: absolute_url.to_s,
        original_href: href
      }
    end

    links
  end

  def scrape_images
    doc = Nokogiri::HTML(URI.open(@base_url))
    images = []

    doc.css('img[src]').each do |img|
      src = img['src']
      absolute_url = resolve_url(src)

      images << {
        alt: img['alt'],
        url: absolute_url.to_s,
        original_src: src
      }
    end

    images
  end

  private

  def resolve_url(relative_url)
    return URI.parse(relative_url) if absolute_url?(relative_url)
    URI.join(@base_uri, relative_url)
  end

  def absolute_url?(url)
    uri = URI.parse(url)
    uri.absolute?
  rescue URI::InvalidURIError
    false
  end
end

# Usage example
scraper = RelativeURLScraper.new('https://example.com/blog/post-1')
links = scraper.scrape_links
images = scraper.scrape_images

puts "Found #{links.length} links and #{images.length} images"

Advanced URL Resolution with Error Handling

For production scraping, implement robust error handling and validation:

require 'uri'
require 'nokogiri'
require 'net/http'

class RobustURLResolver
  def self.resolve(base_url, relative_url)
    return nil if relative_url.nil? || relative_url.empty?

    # Clean the relative URL
    cleaned_url = relative_url.strip

    # Skip javascript: and mailto: URLs
    return nil if cleaned_url.match?(/^(javascript|mailto|tel):/i)

    # Skip fragment-only URLs if not needed
    return nil if cleaned_url.start_with?('#')

    begin
      base_uri = URI.parse(base_url)
      resolved_uri = URI.join(base_uri, cleaned_url)

      # Validate the resolved URL
      return resolved_uri.to_s if valid_http_url?(resolved_uri)
    rescue URI::InvalidURIError => e
      puts "Invalid URL: #{relative_url} - #{e.message}"
      return nil
    end

    nil
  end

  def self.valid_http_url?(uri)
    uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
  end
end

# Usage with error handling
base_url = 'https://example.com/page'
relative_urls = [
  '/valid-path',
  'javascript:void(0)',
  'mailto:test@example.com',
  '#fragment',
  'relative-page.html',
  '//cdn.example.com/asset.js'
]

relative_urls.each do |url|
  resolved = RobustURLResolver.resolve(base_url, url)
  if resolved
    puts "✓ #{url} -> #{resolved}"
  else
    puts "✗ Skipped: #{url}"
  end
end
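
With the base URL above, the output should look something like this:

✓ /valid-path -> https://example.com/valid-path
✗ Skipped: javascript:void(0)
✗ Skipped: mailto:test@example.com
✗ Skipped: #fragment
✓ relative-page.html -> https://example.com/relative-page.html
✓ //cdn.example.com/asset.js -> https://cdn.example.com/asset.js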

Handling Base Tags in HTML

Some websites use <base> tags that affect relative URL resolution:

require 'nokogiri'
require 'uri'

class BaseAwareURLResolver
  def initialize(html_content, current_url)
    @doc = Nokogiri::HTML(html_content)
    @current_url = current_url
    @base_url = extract_base_url || current_url
  end

  def resolve_url(relative_url)
    URI.join(@base_url, relative_url).to_s
  end

  def extract_all_links
    @doc.css('a[href]').map do |link|
      {
        text: link.text.strip,
        href: link['href'],
        absolute_url: resolve_url(link['href'])
      }
    end
  end

  private

  def extract_base_url
    base_tag = @doc.at_css('base[href]')
    return nil unless base_tag

    base_href = base_tag['href']

    # Base href can also be relative
    if base_href.start_with?('http')
      base_href
    else
      URI.join(@current_url, base_href).to_s
    end
  end
end

# Example HTML with base tag
html = <<~HTML
  <html>
    <head>
      <base href="/api/v1/">
    </head>
    <body>
      <a href="users">Users</a>
      <a href="../docs">Documentation</a>
    </body>
  </html>
HTML

resolver = BaseAwareURLResolver.new(html, 'https://api.example.com/admin/panel')
links = resolver.extract_all_links

links.each do |link|
  puts "#{link[:href]} -> #{link[:absolute_url]}"
end
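
Because the <base> tag points to /api/v1/, the links resolve against it rather than against the current page URL, so the output should be:

users -> https://api.example.com/api/v1/users
../docs -> https://api.example.com/api/docs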

Working with Forms and Action URLs

Handle form action URLs which are commonly relative:

def extract_form_actions(html_content, base_url)
  doc = Nokogiri::HTML(html_content)
  forms = []

  doc.css('form').each do |form|
    action = form['action'] || ''
    method = (form['method'] || 'GET').upcase

    # Resolve relative action URL
    absolute_action = if action.empty?
      base_url  # Default to current page
    else
      URI.join(base_url, action).to_s
    end

    forms << {
      action: absolute_action,
      method: method,
      original_action: action
    }
  end

  forms
end
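
A quick usage sketch (the HTML snippet and URLs are illustrative):

html = <<~HTML
  <form action="/search" method="get"></form>
  <form action="login" method="post"></form>
  <form></form>
HTML

extract_form_actions(html, 'https://example.com/account/').each do |form|
  puts "#{form[:method]} #{form[:action]}"
end
# GET https://example.com/search
# POST https://example.com/account/login
# GET https://example.com/account/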

Best Practices for Production Scraping

1. URL Normalization

def normalize_url(url)
  uri = URI.parse(url)

  # Remove fragment
  uri.fragment = nil

  # Collapse duplicate slashes in the path (note: this does not resolve '.' or '..' segments)
  uri.path = uri.path.gsub(/\/+/, '/')

  # Sort query parameters for consistency
  if uri.query
    params = URI.decode_www_form(uri.query).sort
    uri.query = URI.encode_www_form(params)
  end

  uri.to_s
end
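
For example, two links that differ only by fragment and query-parameter order normalize to the same string, which makes deduplicating crawled URLs straightforward:

a = normalize_url('https://example.com/page?b=2&a=1#top')
b = normalize_url('https://example.com/page?a=1&b=2')
puts a       # => https://example.com/page?a=1&b=2
puts a == b  # => true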

2. URL Filtering and Validation

class URLFilter
  EXCLUDED_EXTENSIONS = %w[.pdf .jpg .jpeg .png .gif .svg .css .js .ico].freeze

  def self.should_crawl?(url)
    uri = URI.parse(url)

    # Check for excluded file extensions
    return false if EXCLUDED_EXTENSIONS.any? { |ext| uri.path.downcase.end_with?(ext) }

    # Check for external domains (if crawling single domain)
    # return false unless uri.host == allowed_host

    # Check for common non-content URLs
    return false if uri.path.match?(/\/(api|ajax|json)\//i)

    true
  rescue URI::InvalidURIError
    false
  end
end
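
For instance:

puts URLFilter.should_crawl?('https://example.com/articles/ruby-scraping')  # => true
puts URLFilter.should_crawl?('https://example.com/images/logo.png')         # => false (excluded extension)
puts URLFilter.should_crawl?('https://example.com/api/v2/users')            # => false (API path)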

3. Complete Scraping Example

require 'nokogiri'
require 'uri'
require 'net/http'
require 'set'

class WebScraper
  def initialize(start_url)
    @start_url = start_url
    @visited_urls = Set.new
    @queue = [start_url]
  end

  def crawl(max_pages: 10)
    scraped_data = []

    while @queue.any? && scraped_data.length < max_pages
      current_url = @queue.shift
      next if @visited_urls.include?(current_url)

      puts "Scraping: #{current_url}"
      @visited_urls.add(current_url)

      begin
        page_data = scrape_page(current_url)
        scraped_data << page_data

        # Add new URLs to queue
        page_data[:links].each do |link|
          url = link[:absolute_url]
          @queue << url unless @visited_urls.include?(url)
        end

      rescue => e
        puts "Error scraping #{current_url}: #{e.message}"
      end

      sleep 1  # Be respectful to the server
    end

    scraped_data
  end

  private

  def scrape_page(url)
    html = fetch_html(url)
    doc = Nokogiri::HTML(html)

    {
      url: url,
      title: doc.at_css('title')&.text&.strip,
      links: extract_links(doc, url),
      content: doc.at_css('body')&.text&.strip
    }
  end

  def extract_links(doc, base_url)
    doc.css('a[href]').map do |link|
      href = link['href']
      absolute_url = URI.join(base_url, href).to_s

      {
        text: link.text.strip,
        absolute_url: absolute_url,
        original_href: href
      }
    end.select { |link| valid_url?(link[:absolute_url]) }
  end

  def fetch_html(url)
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri.request_uri)  # request_uri keeps the query string and defaults to "/"
    response = http.request(request)

    response.body
  end

  def valid_url?(url)
    uri = URI.parse(url)
    uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
  rescue URI::InvalidURIError
    false
  end
end

# Usage
scraper = WebScraper.new('https://example.com')
results = scraper.crawl(max_pages: 5)
puts "Scraped #{results.length} pages"

Conclusion

Handling relative URLs properly is crucial for robust web scraping in Ruby. The URI module provides excellent built-in functionality for URL resolution, while proper error handling and validation ensure your scraper can handle real-world websites reliably. Whether you're building a simple link extractor or a comprehensive web crawler, these techniques will help you manage URLs effectively and avoid common pitfalls in web scraping projects.

For more advanced scraping scenarios, consider using specialized tools that can handle dynamic content and complex navigation patterns or implement proper session management for authenticated scraping workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
