What are the common anti-scraping measures and how do I handle them in Ruby?
Web scraping has become an essential tool for data extraction, but websites increasingly implement anti-scraping measures to protect their content and server resources. Understanding these measures and knowing how to handle them responsibly in Ruby is crucial for successful web scraping projects.
Common Anti-Scraping Measures
1. Rate Limiting and Request Throttling
Rate limiting is one of the most common anti-scraping measures. Websites monitor request frequency from individual IP addresses and block or slow down requests that exceed certain thresholds.
Ruby Solution:
require 'net/http'
require 'uri'

class RateLimitedScraper
  def initialize(delay: 1)
    @delay = delay
    @last_request_time = Time.now - delay
  end

  def fetch(url)
    # Ensure minimum delay between requests
    sleep_time = @delay - (Time.now - @last_request_time)
    sleep(sleep_time) if sleep_time > 0

    uri = URI(url)
    response = Net::HTTP.get_response(uri)
    @last_request_time = Time.now
    response
  end
end
# Usage
scraper = RateLimitedScraper.new(delay: 2) # 2 seconds between requests
response = scraper.fetch('https://example.com')
For more advanced rate limiting with exponential backoff:
require 'httparty'

class SmartScraper
  include HTTParty

  def initialize
    @retry_count = 0
    @max_retries = 3
  end

  def fetch_with_retry(url)
    response = self.class.get(url)
    if response.code == 429 # Too Many Requests
      handle_rate_limit(url)
    else
      @retry_count = 0
      response
    end
  rescue Net::OpenTimeout, Net::ReadTimeout
    retry_request(url)
  end

  private

  def handle_rate_limit(url)
    if @retry_count < @max_retries
      wait_time = 2 ** @retry_count # Exponential backoff
      puts "Rate limited. Waiting #{wait_time} seconds..."
      sleep(wait_time)
      @retry_count += 1
      fetch_with_retry(url)
    else
      raise "Max retries exceeded for #{url}"
    end
  end

  def retry_request(url)
    if @retry_count < @max_retries
      @retry_count += 1
      puts "Retrying request #{@retry_count}/#{@max_retries}"
      sleep(1)
      fetch_with_retry(url)
    else
      raise "Max retries exceeded"
    end
  end
end
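A quick usage sketch for the retrying client above; the URL is a placeholder:
# Usage
scraper = SmartScraper.new
response = scraper.fetch_with_retry('https://example.com')
puts response.code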
2. User Agent Detection
Many websites block requests with default or suspicious user agent strings. Ruby's HTTP clients send easily identifiable defaults; Net::HTTP, for example, identifies itself with the user agent "Ruby".
Ruby Solution:
require 'httparty'

class UserAgentRotator
  USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
  ].freeze

  def self.random_user_agent
    USER_AGENTS.sample
  end
end

# Using with HTTParty
class WebScraper
  include HTTParty

  def fetch(url)
    options = {
      headers: {
        'User-Agent' => UserAgentRotator.random_user_agent,
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.5',
        'Accept-Encoding' => 'gzip, deflate',
        'Connection' => 'keep-alive'
      }
    }

    self.class.get(url, options)
  end
end
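A quick usage sketch; the URL is a placeholder:
# Usage
scraper = WebScraper.new
response = scraper.fetch('https://example.com')
puts response.code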
3. Session and Cookie Management
Some websites require maintaining sessions and handling cookies properly to access content.
Ruby Solution:
require 'httparty'
require 'http/cookie_jar' # provides HTTP::CookieJar (http-cookie gem)
require 'nokogiri'

class SessionScraper
  include HTTParty

  def initialize
    @jar = HTTP::CookieJar.new
    self.class.maintain_method_across_redirects true
  end

  def login(login_url, username, password)
    # Get the login page to extract the CSRF token
    login_page = self.class.get(login_url, headers: default_headers)

    # Parse the CSRF token with Nokogiri
    doc = Nokogiri::HTML(login_page.body)
    csrf_token = doc.css('input[name="authenticity_token"]').first&.attr('value')

    # Store cookies from the login page
    store_cookies(login_page)

    # Submit the login form
    login_data = {
      'username' => username,
      'password' => password,
      'authenticity_token' => csrf_token
    }

    response = self.class.post(login_url, {
      body: login_data,
      headers: default_headers.merge('Cookie' => cookie_string),
      follow_redirects: true
    })

    store_cookies(response)
    response
  end

  def fetch(url)
    self.class.get(url, headers: default_headers.merge('Cookie' => cookie_string))
  end

  private

  def default_headers
    {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language' => 'en-US,en;q=0.5',
      'Accept-Encoding' => 'gzip, deflate'
    }
  end

  def store_cookies(response)
    # Use get_fields so multiple Set-Cookie headers are not folded into one string
    Array(response.headers.get_fields('set-cookie')).each do |cookie|
      @jar.parse(cookie, response.request.last_uri)
    end
  end

  def cookie_string
    @jar.cookies.map { |cookie| "#{cookie.name}=#{cookie.value}" }.join('; ')
  end
end
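A usage sketch; the login endpoint, credentials, and account page are hypothetical placeholders:
# Usage
scraper = SessionScraper.new
scraper.login('https://example.com/login', 'user', 'secret')
profile_page = scraper.fetch('https://example.com/account')
puts profile_page.code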
4. JavaScript-Rendered Content
Modern websites often load content dynamically with JavaScript, making it invisible to traditional HTTP-based scrapers.
Ruby Solution with Selenium:
require 'selenium-webdriver'
require 'nokogiri'

class JavaScriptScraper
  def initialize(headless: true)
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless') if headless
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-gpu')
    options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

    @driver = Selenium::WebDriver.for :chrome, options: options
    @driver.manage.timeouts.implicit_wait = 10
  end

  def fetch_with_js(url, wait_for_element: nil)
    @driver.get(url)

    # Wait for a specific element if one was given
    if wait_for_element
      wait = Selenium::WebDriver::Wait.new(timeout: 20)
      wait.until { @driver.find_element(css: wait_for_element) }
    else
      # Default wait for page load
      sleep(3)
    end

    # Handle infinite scroll if needed
    handle_infinite_scroll if infinite_scroll_page?

    @driver.page_source
  end

  def close
    @driver.quit
  end

  private

  def infinite_scroll_page?
    # Check if the page has infinite scroll elements
    @driver.find_elements(css: '[data-infinite-scroll], .infinite-scroll').any?
  end

  def handle_infinite_scroll
    last_height = @driver.execute_script("return document.body.scrollHeight")

    loop do
      @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
      sleep(2)

      new_height = @driver.execute_script("return document.body.scrollHeight")
      break if new_height == last_height

      last_height = new_height
    end
  end
end
# Usage
scraper = JavaScriptScraper.new(headless: true)
html = scraper.fetch_with_js('https://example.com', wait_for_element: '.content')
doc = Nokogiri::HTML(html)
scraper.close
5. IP Blocking and Geographic Restrictions
Websites may block IP addresses or restrict access based on geographic location.
Ruby Solution with Proxy Support:
require 'httparty'

class ProxyScraper
  include HTTParty

  # Placeholder proxies -- replace with real endpoints and credentials
  PROXY_LIST = [
    { host: 'proxy1.example.com', port: 8080, user: 'username', pass: 'password' },
    { host: 'proxy2.example.com', port: 8080, user: 'username', pass: 'password' }
  ].freeze

  def initialize
    @current_proxy_index = 0
    @retry_count = 0
  end

  def fetch_with_proxy(url)
    proxy = PROXY_LIST[@current_proxy_index]

    options = {
      http_proxyaddr: proxy[:host],
      http_proxyport: proxy[:port],
      http_proxyuser: proxy[:user],
      http_proxypass: proxy[:pass],
      headers: {
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    }

    response = self.class.get(url, options)
    raise "Unexpected response code: #{response.code}" unless response.code == 200

    @retry_count = 0
    response
  rescue => e
    puts "Proxy error: #{e.message}"
    rotate_proxy
    retry if @retry_count < 3
    raise
  end

  private

  def rotate_proxy
    @current_proxy_index = (@current_proxy_index + 1) % PROXY_LIST.length
    @retry_count += 1
    puts "Rotating to proxy #{@current_proxy_index + 1}"
  end
end
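Usage mirrors the other scrapers; the proxies in PROXY_LIST above are placeholders you would swap for real endpoints:
# Usage
scraper = ProxyScraper.new
response = scraper.fetch_with_proxy('https://example.com')
puts response.code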
6. CAPTCHA Challenges
CAPTCHAs are designed to distinguish between humans and bots, requiring manual intervention or third-party services.
Ruby Solution:
require 'httparty'
require 'base64'

class CaptchaSolver
  def initialize(api_key)
    @api_key = api_key
    @base_url = 'http://2captcha.com'
  end

  def solve_captcha(captcha_image_url, site_key = nil)
    if site_key
      solve_recaptcha(site_key)
    else
      solve_image_captcha(captcha_image_url)
    end
  end

  private

  def solve_image_captcha(image_url)
    # Download the CAPTCHA image
    image_data = HTTParty.get(image_url).body
    image_base64 = Base64.encode64(image_data)

    # Submit to 2captcha
    submit_response = HTTParty.post("#{@base_url}/in.php", {
      body: {
        key: @api_key,
        method: 'base64',
        body: image_base64
      }
    })

    if submit_response.body.include?('OK|')
      captcha_id = submit_response.body.split('|')[1]
      get_captcha_result(captcha_id)
    else
      raise "Captcha submission failed: #{submit_response.body}"
    end
  end

  def solve_recaptcha(site_key)
    # Implementation for reCAPTCHA solving
    # This requires the page URL where the captcha appears
  end

  def get_captcha_result(captcha_id)
    30.times do
      sleep(5)
      result = HTTParty.get("#{@base_url}/res.php", {
        query: { key: @api_key, action: 'get', id: captcha_id }
      })

      if result.body.include?('OK|')
        return result.body.split('|')[1]
      elsif result.body == 'CAPCHA_NOT_READY'
        next
      else
        raise "Captcha solving failed: #{result.body}"
      end
    end
    raise "Captcha solving timeout"
  end
end
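A usage sketch, assuming you have a 2captcha account; the API key and image URL are placeholders:
# Usage
solver = CaptchaSolver.new('YOUR_2CAPTCHA_API_KEY')
captcha_text = solver.solve_captcha('https://example.com/captcha.png')
puts "Solved: #{captcha_text}"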
Best Practices for Ethical Scraping
1. Respect robots.txt
Always check and respect the website's robots.txt file:
require 'httparty'
require 'uri'

class RobotsTxtChecker
  def self.can_fetch?(url, user_agent = '*')
    uri = URI(url)
    robots_url = "#{uri.scheme}://#{uri.host}/robots.txt"

    begin
      robots_content = HTTParty.get(robots_url).body
      parse_robots_txt(robots_content, uri.path, user_agent)
    rescue
      # If robots.txt is not accessible, assume scraping is allowed
      true
    end
  end

  def self.parse_robots_txt(content, path, user_agent)
    current_user_agent = nil
    disallowed_paths = []

    content.each_line do |line|
      line = line.strip

      if line.downcase.start_with?('user-agent:')
        current_user_agent = line.split(':', 2)[1].strip.downcase
      elsif line.downcase.start_with?('disallow:') &&
            (current_user_agent == user_agent.downcase || current_user_agent == '*')
        disallowed_path = line.split(':', 2)[1].strip
        disallowed_paths << disallowed_path unless disallowed_path.empty?
      end
    end

    # The path is allowed only if it matches none of the Disallow rules
    disallowed_paths.none? { |disallowed| path.start_with?(disallowed) }
  end
  private_class_method :parse_robots_txt
end

# Usage
if RobotsTxtChecker.can_fetch?('https://example.com/products')
  # Proceed with scraping
else
  puts "Scraping not allowed according to robots.txt"
end
2. Implement Comprehensive Error Handling
require 'httparty'

class RobustScraper
  def initialize
    @max_retries = 3
    @retry_delay = 2
  end

  def fetch_with_error_handling(url, retries = 0)
    response = HTTParty.get(url, timeout: 30)

    case response.code
    when 200
      response
    when 403, 429
      handle_blocked_request(url, retries)
    when 404
      raise "Page not found: #{url}"
    when 500..599
      raise "Server error: #{response.code}"
    else
      raise "Unexpected response code: #{response.code}"
    end
  rescue Net::OpenTimeout, Net::ReadTimeout
    retries += 1
    if retries <= @max_retries
      puts "Timeout error. Retrying #{retries}/#{@max_retries}..."
      sleep(@retry_delay * retries)
      retry
    else
      raise "Max retries exceeded due to timeout"
    end
  rescue => e
    puts "Error fetching #{url}: #{e.message}"
    raise e
  end

  private

  def handle_blocked_request(url, retries)
    if retries < @max_retries
      wait_time = @retry_delay * (2 ** retries) # Exponential backoff
      puts "Request blocked. Waiting #{wait_time} seconds..."
      sleep(wait_time)
      # Pass the incremented retry count along so it is not reset on recursion
      fetch_with_error_handling(url, retries + 1)
    else
      raise "Access blocked after #{@max_retries} retries"
    end
  end
end
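Example usage; the URL is a placeholder:
# Usage
scraper = RobustScraper.new
response = scraper.fetch_with_error_handling('https://example.com')
puts response.code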
Advanced Anti-Scraping Countermeasures
For complex scenarios, consider integrating headless browser automation tools. While this guide focuses on Ruby solutions, it also helps to understand how to handle dynamic content that loads after the initial page load on JavaScript-heavy websites.
Browser Fingerprinting Defense
require 'selenium-webdriver'

class StealthScraper
  def initialize
    options = setup_stealth_options
    @driver = Selenium::WebDriver.for :chrome, options: options
  end

  def fetch(url)
    @driver.get(url)
    # execute_script only patches the current page, so re-apply after every navigation
    modify_navigator_properties
    @driver.page_source
  end

  def close
    @driver.quit
  end

  private

  def setup_stealth_options
    options = Selenium::WebDriver::Chrome::Options.new

    # Disable automation indicators. The Ruby bindings use add_option for Chrome's
    # experimental options (Python's add_experimental_option); depending on your
    # selenium-webdriver version these may need to be set differently.
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_option('excludeSwitches', ['enable-automation'])
    options.add_option('useAutomationExtension', false)

    # Randomize window size
    window_sizes = ['1366,768', '1920,1080', '1440,900', '1536,864']
    options.add_argument("--window-size=#{window_sizes.sample}")

    # Additional stealth options
    options.add_argument('--disable-web-security')
    options.add_argument('--disable-features=VizDisplayCompositor')
    options
  end

  def modify_navigator_properties
    # Remove the webdriver property and spoof a few common fingerprinting targets
    @driver.execute_script(<<~JS)
      Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
      Object.defineProperty(navigator, 'plugins', {
        get: () => [{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' }]
      });
      Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en']
      });
    JS
  end
end
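A minimal usage sketch for the stealth scraper above; the URL is a placeholder:
# Usage
scraper = StealthScraper.new
html = scraper.fetch('https://example.com')
puts html.length
scraper.close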
Monitoring and Maintenance
Regular monitoring is essential for maintaining successful scraping operations:
class ScrapingMonitor
  def initialize
    @success_count = 0
    @error_count = 0
    @start_time = Time.now
  end

  def log_success(url)
    @success_count += 1
    puts "[#{Time.now}] SUCCESS: #{url}"
  end

  def log_error(url, error)
    @error_count += 1
    puts "[#{Time.now}] ERROR: #{url} - #{error.message}"
  end

  def report_stats
    runtime = Time.now - @start_time
    success_rate = (@success_count.to_f / (@success_count + @error_count)) * 100

    puts "\n=== Scraping Statistics ==="
    puts "Runtime: #{runtime.to_i} seconds"
    puts "Successful requests: #{@success_count}"
    puts "Failed requests: #{@error_count}"
    puts "Success rate: #{success_rate.round(2)}%"
    puts "Requests per minute: #{((@success_count + @error_count) / (runtime / 60)).round(2)}"
  end
end
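A sketch of how the monitor might wrap a simple HTTParty fetch loop; the URLs are placeholders:
require 'httparty'

# Usage
monitor = ScrapingMonitor.new
['https://example.com/page1', 'https://example.com/page2'].each do |url|
  begin
    HTTParty.get(url)
    monitor.log_success(url)
  rescue => e
    monitor.log_error(url, e)
  end
end
monitor.report_stats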
Conclusion
Successfully handling anti-scraping measures in Ruby requires a combination of technical techniques and ethical considerations. Always prioritize respectful scraping practices, implement robust error handling, and stay updated with the latest anti-scraping trends. For more complex scenarios involving JavaScript-heavy websites, consider exploring browser automation techniques that can complement your Ruby-based scraping strategies.
Remember that while these techniques can help overcome technical challenges, always ensure your scraping activities comply with the website's terms of service and applicable laws. Consider reaching out to website owners for API access when possible, as this is often a more sustainable and reliable approach than web scraping.