How can I scrape Google Search results using Ruby and Nokogiri?
Scraping Google Search results with Ruby and Nokogiri is a common requirement for SEO analysis, competitive research, and data collection projects. While Google has sophisticated anti-bot measures, you can successfully extract search results using proper techniques and best practices.
## Understanding Google Search Structure
Google Search results follow a consistent HTML structure that Nokogiri can parse:

- **Search results container:** the `.g` class wraps each individual result
- **Title links:** `h3` tags within the result containers
- **Descriptions:** snippets live in `.VwiC3b` (or the older `.s`) classes
- **URLs:** citation elements carry the `.iUh30` class
## Basic Ruby Setup

First, install Nokogiri. The other libraries used below (`net/http`, `uri`, `cgi`) ship with Ruby's standard library, so no extra gems are needed:

```bash
gem install nokogiri
```

Or add it to your Gemfile:

```ruby
gem 'nokogiri'
```
## Simple Google Search Scraper

Here's a basic implementation to get you started:

```ruby
require 'nokogiri'
require 'net/http'
require 'uri'
require 'cgi'

class GoogleScraper
  BASE_URL = 'https://www.google.com/search'

  def initialize
    @headers = {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
  end

  def search(query, num_results = 10)
    url = build_search_url(query, num_results)
    html = fetch_page(url)
    parse_results(html)
  end

  private

  def build_search_url(query, num_results)
    params = {
      'q' => query,
      'num' => num_results,
      'hl' => 'en'
    }
    query_string = params.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
    "#{BASE_URL}?#{query_string}"
  end

  def fetch_page(url)
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true

    request = Net::HTTP::Get.new(uri)
    @headers.each { |key, value| request[key] = value }

    response = http.request(request)
    response.body
  end

  def parse_results(html)
    doc = Nokogiri::HTML(html)
    results = []

    doc.css('.g').each do |result|
      title_element = result.css('h3').first
      next unless title_element

      title = title_element.text.strip
      link_element = result.css('a').first
      url = link_element['href'] if link_element

      description_element = result.css('.VwiC3b, .s').first
      description = description_element ? description_element.text.strip : ''

      results << {
        title: title,
        url: url,
        description: description
      }
    end

    results
  end
end

# Usage example
scraper = GoogleScraper.new
results = scraper.search('ruby programming', 20)

results.each_with_index do |result, index|
  puts "#{index + 1}. #{result[:title]}"
  puts "   URL: #{result[:url]}"
  puts "   Description: #{result[:description][0..100]}..."
  puts
end
```
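One caveat with the `url` field: depending on how Google serves the page, result hrefs can come back wrapped as `/url?q=<target>&...` redirect links rather than direct URLs. A small stdlib-only helper (the name `clean_result_url` is my own) can unwrap them:

```ruby
require 'uri'
require 'cgi'

# Unwrap Google's "/url?q=<target>&..." redirect links;
# pass direct URLs (or nil) through unchanged
def clean_result_url(href)
  return href unless href&.start_with?('/url?')

  params = CGI.parse(URI(href).query.to_s)
  params['q']&.first || href
end

puts clean_result_url('/url?q=https://example.com/page&sa=U')
# => "https://example.com/page"
puts clean_result_url('https://direct.example.com')
# => "https://direct.example.com"
```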
## Advanced Features and Improvements

### Adding Request Delays and Randomization

To avoid being blocked, implement delays between requests:

```ruby
class GoogleScraper
  def initialize
    @headers = {
      'User-Agent' => random_user_agent,
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language' => 'en-US,en;q=0.5',
      'Accept-Encoding' => 'gzip, deflate',
      'Connection' => 'keep-alive'
    }
    @delay_range = (1..3)
  end

  def search_with_delay(query, num_results = 10)
    sleep(rand(@delay_range))
    search(query, num_results)
  end

  private

  def random_user_agent
    agents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ]
    agents.sample
  end
end
```
### Handling Pagination

Extract results from multiple pages:

```ruby
def search_multiple_pages(query, total_results = 100)
  all_results = []
  results_per_page = 10
  pages_needed = (total_results / results_per_page.to_f).ceil

  (0...pages_needed).each do |page|
    start_index = page * results_per_page
    url = build_search_url_with_start(query, results_per_page, start_index)
    html = fetch_page(url)
    page_results = parse_results(html)
    all_results.concat(page_results)

    # Add delay between pages
    sleep(rand(2..4)) unless page == pages_needed - 1
  end

  all_results[0...total_results]
end

private

def build_search_url_with_start(query, num_results, start)
  params = {
    'q' => query,
    'num' => num_results,
    'start' => start,
    'hl' => 'en'
  }
  query_string = params.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
  "#{BASE_URL}?#{query_string}"
end
```
### Error Handling and Retry Logic

Implement robust error handling:

```ruby
def fetch_page_with_retry(url, max_retries = 3)
  retries = 0

  begin
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.read_timeout = 10
    http.open_timeout = 10

    request = Net::HTTP::Get.new(uri)
    @headers.each { |key, value| request[key] = value }

    response = http.request(request)

    case response.code.to_i
    when 200
      response.body
    when 429, 503
      raise "Rate limited or service unavailable"
    else
      raise "HTTP Error: #{response.code}"
    end
  rescue => e
    retries += 1
    if retries <= max_retries
      sleep(2 ** retries) # Exponential backoff: 2, 4, 8 seconds
      retry
    else
      raise "Failed after #{max_retries} retries: #{e.message}"
    end
  end
end
```
## Extracting Additional Data

### Featured Snippets and Knowledge Panels

```ruby
def parse_enhanced_results(html)
  doc = Nokogiri::HTML(html)
  results = {
    organic_results: [],
    featured_snippet: nil,
    knowledge_panel: nil
  }

  # Extract featured snippet
  featured_snippet = doc.css('.hgKElc, .LGOjhe').first
  if featured_snippet
    results[:featured_snippet] = {
      text: featured_snippet.text.strip,
      source: featured_snippet.css('cite').first&.text
    }
  end

  # Extract knowledge panel
  knowledge_panel = doc.css('.kp-blk').first
  if knowledge_panel
    results[:knowledge_panel] = {
      title: knowledge_panel.css('h2').first&.text,
      description: knowledge_panel.css('.kno-rdesc span').first&.text
    }
  end

  # Extract organic results
  doc.css('.g').each do |result|
    title_element = result.css('h3').first
    next unless title_element

    title = title_element.text.strip
    link_element = result.css('a').first
    url = link_element['href'] if link_element
    description_element = result.css('.VwiC3b, .s').first
    description = description_element ? description_element.text.strip : ''

    results[:organic_results] << {
      title: title,
      url: url,
      description: description
    }
  end

  results
end
```
### Images and Rich Results

```ruby
def extract_image_results(html)
  doc = Nokogiri::HTML(html)
  images = []

  # .rg_i has historically appeared on the <img> tag itself, so handle
  # both that case and a wrapping container
  doc.css('.rg_i').each do |node|
    img_element = node.name == 'img' ? node : node.css('img').first
    next unless img_element

    images << {
      src: img_element['src'] || img_element['data-src'],
      alt: img_element['alt'],
      title: img_element['title']
    }
  end

  images
end
```
## Handling Anti-Bot Measures

Google employs several techniques to detect and block automated scraping. Here are strategies to mitigate these measures:

### Proxy Rotation

```ruby
class ProxyRotator
  def initialize(proxy_list)
    @proxies = proxy_list
    @current_index = 0
  end

  def next_proxy
    proxy = @proxies[@current_index]
    @current_index = (@current_index + 1) % @proxies.length
    proxy
  end
end

# Usage in scraper
def fetch_with_proxy(url)
  proxy = @proxy_rotator.next_proxy
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port, proxy[:host], proxy[:port])
  http.use_ssl = true
  # ... rest of request logic
end
```
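The round-robin logic in `ProxyRotator` is easy to verify in isolation. A standalone sketch with placeholder proxy hosts:

```ruby
# Minimal round-robin rotation, mirroring the ProxyRotator class above
proxies = [
  { host: 'proxy1.example.com', port: 8080 },
  { host: 'proxy2.example.com', port: 8080 }
]

index = 0
next_proxy = lambda do
  proxy = proxies[index]
  index = (index + 1) % proxies.length
  proxy
end

picks = Array.new(3) { next_proxy.call[:host] }
puts picks.inspect
# => ["proxy1.example.com", "proxy2.example.com", "proxy1.example.com"]
```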
### Session Management

Reusing a single connection looks more like a normal browser session. The sketch below uses the `net-http-persistent` gem (`gem install net-http-persistent`) and subclasses `GoogleScraper` so the private helpers defined earlier (`build_search_url`, `parse_results`, `random_user_agent`) are available:

```ruby
require 'net/http/persistent'

class PersistentGoogleScraper < GoogleScraper
  def initialize
    super
    @http = Net::HTTP::Persistent.new(name: 'google_scraper')
    @http.headers['User-Agent'] = random_user_agent
  end

  def search(query, num_results = 10)
    uri = URI(build_search_url(query, num_results))
    response = @http.request(uri)
    parse_results(response.body)
  end

  def close
    @http.shutdown
  end
end
```
## Best Practices and Legal Considerations

### Rate Limiting

Always implement proper rate limiting to avoid overwhelming Google's servers:

```ruby
class RateLimiter
  def initialize(requests_per_minute = 10)
    @requests_per_minute = requests_per_minute
    @request_times = []
  end

  def wait_if_needed
    now = Time.now
    # Drop timestamps older than the 60-second window
    @request_times.reject! { |time| now - time > 60 }

    if @request_times.length >= @requests_per_minute
      sleep_time = 60 - (now - @request_times.first)
      sleep(sleep_time) if sleep_time > 0
    end

    @request_times << now
  end
end
```
### Caching Results

Implement caching to reduce unnecessary requests:

```ruby
require 'digest'
require 'json'

class CachedGoogleScraper < GoogleScraper
  def initialize(cache_dir = './cache')
    super()
    @cache_dir = cache_dir
    Dir.mkdir(@cache_dir) unless Dir.exist?(@cache_dir)
  end

  def search(query, num_results = 10)
    cache_key = generate_cache_key(query, num_results)
    cache_file = File.join(@cache_dir, "#{cache_key}.json")

    if File.exist?(cache_file) && file_fresh?(cache_file)
      JSON.parse(File.read(cache_file), symbolize_names: true)
    else
      results = super(query, num_results)
      File.write(cache_file, results.to_json)
      results
    end
  end

  private

  def generate_cache_key(query, num_results)
    Digest::MD5.hexdigest("#{query}-#{num_results}")
  end

  def file_fresh?(file_path, max_age_hours = 24)
    File.mtime(file_path) > Time.now - (max_age_hours * 3600)
  end
end
```
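The cache key scheme is worth a quick sanity check on its own: identical query/count pairs must map to the same file, different ones to different files, and the hex digest doubles as a filesystem-safe filename (stdlib `Digest` only):

```ruby
require 'digest'

# Mirrors generate_cache_key from CachedGoogleScraper above
def cache_key(query, num_results)
  Digest::MD5.hexdigest("#{query}-#{num_results}")
end

a = cache_key('ruby programming', 10)
b = cache_key('ruby programming', 10)
c = cache_key('ruby programming', 20)

puts a == b   # => true  (same inputs, same key)
puts a == c   # => false (different num_results)
puts a.length # => 32    (fixed-width hex, safe as a filename)
```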
## Alternative Approaches

While Nokogiri is excellent for parsing HTML, consider these alternatives for more complex scenarios:

### Using Watir for JavaScript-Heavy Pages

For pages that require JavaScript execution, you can drive a real browser with Watir:

```ruby
require 'watir'
require 'cgi'

query = 'ruby programming'

browser = Watir::Browser.new :chrome, headless: true
browser.goto "https://www.google.com/search?q=#{CGI.escape(query)}"
browser.wait_until { browser.divs(class: 'g').any? }

results = browser.divs(class: 'g').map do |result|
  {
    title: result.h3.text,
    url: result.link.href,
    description: result.span(class: 'VwiC3b').text
  }
end

browser.close
```
### Handling Dynamic Content Loading

Nokogiri parses a static snapshot, so "waiting" for content only makes sense if you re-fetch the page on each attempt. One way is to poll with a block that returns fresh HTML:

```ruby
# Yields to a block that fetches fresh HTML on each attempt
def wait_for_results(timeout = 10)
  start_time = Time.now

  while Time.now - start_time < timeout
    doc = Nokogiri::HTML(yield)
    results = doc.css('.g')
    return results if results.any?

    sleep(0.5)
  end

  raise "Timeout waiting for search results"
end

# Usage: wait_for_results { fetch_page(url) }
```
## Testing Your Scraper

Create comprehensive tests for your scraper:

```ruby
require 'rspec'
require 'webmock/rspec'

RSpec.describe GoogleScraper do
  let(:scraper) { GoogleScraper.new }
  let(:sample_html) { File.read('spec/fixtures/google_results.html') }

  before do
    stub_request(:get, /google\.com/)
      .to_return(status: 200, body: sample_html)
  end

  it 'extracts search results correctly' do
    results = scraper.search('test query')
    expect(results).to be_an(Array)
    expect(results.first).to have_key(:title)
    expect(results.first).to have_key(:url)
    expect(results.first).to have_key(:description)
  end

  it 'handles empty results gracefully' do
    stub_request(:get, /google\.com/)
      .to_return(status: 200, body: '<html><body></body></html>')

    results = scraper.search('nonexistent query')
    expect(results).to be_empty
  end

  it 'respects rate limiting' do
    rate_limiter = RateLimiter.new(2)
    start_time = Time.now

    3.times do
      rate_limiter.wait_if_needed
      scraper.search('test')
    end

    elapsed = Time.now - start_time
    # With a 2-requests-per-minute limit, the third call must wait until
    # a full minute has passed since the first, so well over 30 seconds
    expect(elapsed).to be > 30
  end
end
```
## Performance Optimization

### Concurrent Processing

For large-scale scraping, implement concurrent processing:

```ruby
require 'concurrent' # from the concurrent-ruby gem

class ConcurrentGoogleScraper < GoogleScraper
  def search_multiple_queries(queries, max_threads = 5)
    executor = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: max_threads,
      max_queue: queries.length
    )

    promises = queries.map do |query|
      Concurrent::Promise.execute(executor: executor) do
        search_with_delay(query)
      end
    end

    results = promises.map(&:value!)

    executor.shutdown
    executor.wait_for_termination
    results
  end
end
```
### Memory Management

For processing large datasets, work in batches and hand each result to a block instead of accumulating everything in memory:

```ruby
def process_large_result_set(queries)
  queries.each_slice(10) do |query_batch|
    batch_results = search_multiple_queries(query_batch)

    # Process results immediately instead of accumulating them
    batch_results.each do |result|
      yield(result) if block_given?
    end

    # Release references and encourage garbage collection
    batch_results.clear
    GC.start
  end
end

# Usage
scraper.process_large_result_set(large_query_list) do |result|
  database.save(result) # process each result immediately
end
```
## Conclusion
Scraping Google Search results with Ruby and Nokogiri requires careful consideration of anti-bot measures, proper rate limiting, and adherence to legal guidelines. The examples provided offer a solid foundation for building robust scrapers, but remember that Google's structure and anti-bot measures evolve constantly.
For production applications, consider using dedicated APIs or services like WebScraping.AI that handle these complexities while ensuring compliance and reliability. Always respect robots.txt files and implement proper error handling and retry mechanisms in your scraping applications.
Key takeaways for successful Google scraping:
- Implement proper rate limiting to avoid being blocked
- Use realistic headers and user agents to appear more legitimate
- Handle errors gracefully with retry logic and exponential backoff
- Cache results to minimize redundant requests
- Test thoroughly with comprehensive test suites
- Stay compliant with legal requirements and terms of service
Remember that while these techniques work for educational and research purposes, commercial scraping of Google Search results should be done in compliance with Google's Terms of Service and applicable laws in your jurisdiction.