How do I handle relative URLs when scraping with Ruby?
When web scraping with Ruby, you'll frequently encounter relative URLs that need to be converted to absolute URLs for proper navigation and data extraction. Relative URLs are incomplete web addresses that depend on the current page's context to form a complete URL. This guide provides comprehensive techniques for handling relative URLs effectively in Ruby scraping projects.
Understanding Relative URLs
Relative URLs come in several forms:

- Root-relative: `/about` (starts with `/`)
- Document-relative: `about.html` (no leading `/`)
- Protocol-relative: `//example.com/page` (starts with `//`)
- Fragment-only: `#section` (starts with `#`)
- Query-only: `?page=2` (starts with `?`)
Using Ruby's URI Module
Ruby's built-in `URI` module provides the most reliable way to resolve relative URLs:
require 'uri'
require 'net/http'
# Base URL of the current page
base_url = 'https://example.com/products/category'
# Relative URLs found on the page
relative_urls = [
'/about',
'item1.html',
'../contact',
'//cdn.example.com/image.jpg',
'?page=2'
]
# Convert relative URLs to absolute URLs
relative_urls.each do |relative_url|
absolute_url = URI.join(base_url, relative_url)
puts "#{relative_url} -> #{absolute_url}"
end
Output:
/about -> https://example.com/about
item1.html -> https://example.com/products/item1.html
../contact -> https://example.com/contact
//cdn.example.com/image.jpg -> https://cdn.example.com/image.jpg
?page=2 -> https://example.com/products/category?page=2
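If you already have a parsed base `URI` object, `URI#merge` (aliased as `+`) performs the same resolution for a single reference, which can be convenient inside a loop:

```ruby
require 'uri'

base = URI.parse('https://example.com/products/category')

# URI#merge resolves one reference against a parsed base (same rules as URI.join)
puts base + 'item1.html'       # => https://example.com/products/item1.html
puts base + '/about'           # => https://example.com/about
puts base.merge('../contact')  # => https://example.com/contact
```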
Integrating with Nokogiri for HTML Parsing
When scraping with Nokogiri, you can automatically resolve relative URLs while parsing:
require 'nokogiri'
require 'open-uri'
require 'uri'
class RelativeURLScraper
def initialize(base_url)
@base_url = base_url
@base_uri = URI.parse(base_url)
end
def scrape_links
doc = Nokogiri::HTML(URI.open(@base_url))
links = []
doc.css('a[href]').each do |link|
href = link['href']
absolute_url = resolve_url(href)
links << {
text: link.text.strip,
url: absolute_url.to_s,
original_href: href
}
end
links
end
def scrape_images
doc = Nokogiri::HTML(URI.open(@base_url))
images = []
doc.css('img[src]').each do |img|
src = img['src']
absolute_url = resolve_url(src)
images << {
alt: img['alt'],
url: absolute_url.to_s,
original_src: src
}
end
images
end
private
def resolve_url(relative_url)
return URI.parse(relative_url) if absolute_url?(relative_url)
URI.join(@base_uri, relative_url)
end
def absolute_url?(url)
uri = URI.parse(url)
uri.absolute?
rescue URI::InvalidURIError
false
end
end
# Usage example
scraper = RelativeURLScraper.new('https://example.com/blog/post-1')
links = scraper.scrape_links
images = scraper.scrape_images
puts "Found #{links.length} links and #{images.length} images"
Advanced URL Resolution with Error Handling
For production scraping, implement robust error handling and validation:
require 'uri'
require 'nokogiri'
require 'net/http'
class RobustURLResolver
def self.resolve(base_url, relative_url)
return nil if relative_url.nil? || relative_url.empty?
# Clean the relative URL
cleaned_url = relative_url.strip
# Skip javascript: and mailto: URLs
return nil if cleaned_url.match?(/^(javascript|mailto|tel):/i)
# Skip fragment-only URLs if not needed
return nil if cleaned_url.start_with?('#')
begin
base_uri = URI.parse(base_url)
resolved_uri = URI.join(base_uri, cleaned_url)
# Validate the resolved URL
return resolved_uri.to_s if valid_http_url?(resolved_uri)
rescue URI::InvalidURIError => e
puts "Invalid URL: #{relative_url} - #{e.message}"
return nil
end
nil
end
def self.valid_http_url?(uri)
uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
end
end
# Usage with error handling
base_url = 'https://example.com/page'
relative_urls = [
'/valid-path',
'javascript:void(0)',
'mailto:test@example.com',
'#fragment',
'relative-page.html',
'//cdn.example.com/asset.js'
]
relative_urls.each do |url|
resolved = RobustURLResolver.resolve(base_url, url)
if resolved
puts "✓ #{url} -> #{resolved}"
else
puts "✗ Skipped: #{url}"
end
end
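Run against the list above, the output should look roughly like this:

```
✓ /valid-path -> https://example.com/valid-path
✗ Skipped: javascript:void(0)
✗ Skipped: mailto:test@example.com
✗ Skipped: #fragment
✓ relative-page.html -> https://example.com/relative-page.html
✓ //cdn.example.com/asset.js -> https://cdn.example.com/asset.js
```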
Handling Base Tags in HTML
Some websites use `<base>` tags that affect how relative URLs on the page are resolved:
require 'nokogiri'
require 'uri'
class BaseAwareURLResolver
def initialize(html_content, current_url)
@doc = Nokogiri::HTML(html_content)
@current_url = current_url
@base_url = extract_base_url || current_url
end
def resolve_url(relative_url)
URI.join(@base_url, relative_url).to_s
end
def extract_all_links
@doc.css('a[href]').map do |link|
{
text: link.text.strip,
href: link['href'],
absolute_url: resolve_url(link['href'])
}
end
end
private
def extract_base_url
base_tag = @doc.at_css('base[href]')
return nil unless base_tag
base_href = base_tag['href']
# Base href can also be relative
if base_href.start_with?('http')
base_href
else
URI.join(@current_url, base_href).to_s
end
end
end
# Example HTML with base tag
html = <<~HTML
<html>
<head>
<base href="/api/v1/">
</head>
<body>
<a href="users">Users</a>
<a href="../docs">Documentation</a>
</body>
</html>
HTML
resolver = BaseAwareURLResolver.new(html, 'https://api.example.com/admin/panel')
links = resolver.extract_all_links
links.each do |link|
puts "#{link[:href]} -> #{link[:absolute_url]}"
end
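Because the `<base href="/api/v1/">` tag takes precedence over the page's own URL (`/admin/panel`), the links resolve against `https://api.example.com/api/v1/`:

```
users -> https://api.example.com/api/v1/users
../docs -> https://api.example.com/api/docs
```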
Working with Forms and Action URLs
Handle form action URLs which are commonly relative:
def extract_form_actions(html_content, base_url)
doc = Nokogiri::HTML(html_content)
forms = []
doc.css('form').each do |form|
action = form['action'] || ''
method = (form['method'] || 'GET').upcase
# Resolve relative action URL
absolute_action = if action.empty?
base_url # Default to current page
else
URI.join(base_url, action).to_s
end
forms << {
action: absolute_action,
method: method,
original_action: action
}
end
forms
end
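A quick usage sketch, assuming the `extract_form_actions` method above is loaded (the HTML snippet and URLs here are made up for illustration):

```ruby
html = <<~HTML
  <form action="/search" method="get"></form>
  <form action="login" method="post"></form>
  <form></form>
HTML

forms = extract_form_actions(html, 'https://example.com/account/settings')
forms.each { |form| puts "#{form[:method]} #{form[:action]}" }
# GET https://example.com/search
# POST https://example.com/account/login
# GET https://example.com/account/settings   (empty action defaults to the current page)
```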
Best Practices for Production Scraping
1. URL Normalization
def normalize_url(url)
uri = URI.parse(url)
# Remove fragment
uri.fragment = nil
# Collapse duplicate slashes in the path (note: this does not resolve '.' or '..' segments)
uri.path = uri.path.gsub(/\/+/, '/')
# Sort query parameters for consistency
if uri.query
params = URI.decode_www_form(uri.query).sort
uri.query = URI.encode_www_form(params)
end
uri.to_s
end
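As a quick check, the variants below all reduce to the same canonical form (the function above removes fragments, collapses duplicate slashes, and sorts query parameters):

```ruby
urls = [
  'https://example.com/products//list?b=2&a=1#reviews',
  'https://example.com/products/list?a=1&b=2',
  'https://example.com/products/list?b=2&a=1'
]

puts urls.map { |url| normalize_url(url) }.uniq
# => https://example.com/products/list?a=1&b=2
```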
2. URL Filtering and Validation
class URLFilter
EXCLUDED_EXTENSIONS = %w[.pdf .jpg .jpeg .png .gif .svg .css .js .ico].freeze
def self.should_crawl?(url)
uri = URI.parse(url)
# Check for excluded file extensions
return false if EXCLUDED_EXTENSIONS.any? { |ext| uri.path.downcase.end_with?(ext) }
# Check for external domains (if crawling single domain)
# return false unless uri.host == allowed_host
# Check for common non-content URLs
return false if uri.path.match?(/\/(api|ajax|json)\//i)
true
rescue URI::InvalidURIError
false
end
end
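Applied to a few candidate URLs, the filter behaves like this (the URLs are illustrative):

```ruby
candidates = [
  'https://example.com/blog/post-1',
  'https://example.com/assets/logo.png',
  'https://example.com/api/v2/users',
  'not a url'
]

candidates.each do |url|
  puts "#{URLFilter.should_crawl?(url) ? 'crawl' : 'skip'}: #{url}"
end
# crawl: https://example.com/blog/post-1
# skip: https://example.com/assets/logo.png   (excluded extension)
# skip: https://example.com/api/v2/users      (matches the /api/ pattern)
# skip: not a url                             (invalid URI)
```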
3. Complete Scraping Example
require 'nokogiri'
require 'uri'
require 'net/http'
require 'set'
class WebScraper
def initialize(start_url)
@start_url = start_url
@visited_urls = Set.new
@queue = [start_url]
end
def crawl(max_pages: 10)
scraped_data = []
while @queue.any? && scraped_data.length < max_pages
current_url = @queue.shift
next if @visited_urls.include?(current_url)
puts "Scraping: #{current_url}"
@visited_urls.add(current_url)
begin
page_data = scrape_page(current_url)
scraped_data << page_data
# Add new URLs to queue
page_data[:links].each do |link|
url = link[:absolute_url]
@queue << url unless @visited_urls.include?(url)
end
rescue => e
puts "Error scraping #{current_url}: #{e.message}"
end
sleep 1 # Be respectful to the server
end
scraped_data
end
private
def scrape_page(url)
html = fetch_html(url)
doc = Nokogiri::HTML(html)
{
url: url,
title: doc.at_css('title')&.text&.strip,
links: extract_links(doc, url),
content: doc.at_css('body')&.text&.strip
}
end
def extract_links(doc, base_url)
doc.css('a[href]').map do |link|
href = link['href']
absolute_url = URI.join(base_url, href).to_s
{
text: link.text.strip,
absolute_url: absolute_url,
original_href: href
}
end.select { |link| valid_url?(link[:absolute_url]) }
end
def fetch_html(url)
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = uri.scheme == 'https'
request = Net::HTTP::Get.new(uri.request_uri) # request_uri keeps any query string and falls back to "/" for an empty path
response = http.request(request)
response.body
end
def valid_url?(url)
uri = URI.parse(url)
uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
rescue URI::InvalidURIError
false
end
end
# Usage
scraper = WebScraper.new('https://example.com')
results = scraper.crawl(max_pages: 5)
puts "Scraped #{results.length} pages"
Conclusion
Handling relative URLs properly is crucial for robust web scraping in Ruby. The `URI` module provides excellent built-in functionality for URL resolution, while proper error handling and validation ensure your scraper can handle real-world websites reliably. Whether you're building a simple link extractor or a comprehensive web crawler, these techniques will help you manage URLs effectively and avoid common pitfalls in web scraping projects.
For more advanced scraping scenarios, consider using specialized tools that can handle dynamic content and complex navigation patterns or implement proper session management for authenticated scraping workflows.