Simulating human activity in web scraping is a technique for avoiding detection by web servers that would otherwise block or rate-limit automated clients. In Ruby, you can make a scraper behave more like a person through a variety of strategies, such as randomizing request intervals, rotating user agents, and using headless browsers to mimic a real user's browsing patterns.
Here are some tips and techniques for simulating human activity in Ruby web scraping:
1. Randomize Request Intervals
Instead of making requests at a constant interval, you can randomize the delay between requests to make them appear less robotic.
require 'nokogiri'
require 'httparty'
require 'faker'

def human_like_sleep
  sleep(rand(1.0..5.0)) # Random sleep between 1 and 5 seconds
end

def fetch_data(url)
  human_like_sleep
  response = HTTParty.get(url, headers: { "User-Agent" => Faker::Internet.user_agent })
  Nokogiri::HTML(response.body)
end

# Usage
data = fetch_data('https://example.com')
2. Rotate User Agents
Websites track User-Agent strings to identify bots. By changing the User-Agent string for each request, you can reduce the likelihood of being flagged as a bot.
require 'httparty'
require 'faker'

def get_rotating_user_agent
  Faker::Internet.user_agent
end

# Usage
options = {
  headers: { "User-Agent" => get_rotating_user_agent }
}
response = HTTParty.get('https://example.com', options)
3. Use Headless Browsers
Headless browsers can simulate a full browsing experience, including JavaScript execution and cookie handling, which is more akin to human activity.
require 'watir'
browser = Watir::Browser.new :chrome, headless: true
# Mimic human browsing patterns
browser.goto 'https://example.com'
browser.wait_until { |b| b.title.include?('Example') }
browser.links.to_a.sample.click # Randomly click a link (convert the collection to an Array so #sample is available)
sleep(rand(2..10)) # Random sleep to mimic reading time
browser.close
4. Mimic Human Cursor Movements and Clicks
You can emulate mouse movements and clicks with Selenium WebDriver.
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
driver.navigate.to 'https://example.com'
element = driver.find_element(:css, 'a.some-link')
driver.action.move_to(element).click.perform # Move the cursor to the element, then click it
sleep(rand(1..5))
driver.quit
5. Use Proxies or VPNs
Rotating IP addresses using proxies or VPNs can prevent your scraper from getting blocked by IP-based rate limits.
require 'httparty'
# Replace the placeholder values with your proxy's details (the port should be an Integer, e.g. 8080)
proxy_options = {
  http_proxyaddr: 'proxy_ip',
  http_proxyport: 'proxy_port',
  http_proxyuser: 'username',
  http_proxypass: 'password'
}
response = HTTParty.get('https://example.com', proxy_options)
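The call above routes every request through a single proxy. To actually rotate IPs, you can keep a pool of proxies and pick one at random per request; this is a minimal sketch, and the addresses below are placeholders for proxies you would supply yourself:

require 'httparty'

# Hypothetical proxy pool; replace with proxies you actually control or rent.
PROXY_POOL = [
  { http_proxyaddr: '203.0.113.10', http_proxyport: 8080 },
  { http_proxyaddr: '203.0.113.11', http_proxyport: 3128 },
  { http_proxyaddr: '203.0.113.12', http_proxyport: 8080 }
].freeze

def fetch_through_random_proxy(url)
  proxy = PROXY_POOL.sample # Different exit IP on each call
  HTTParty.get(url, proxy)
end

response = fetch_through_random_proxy('https://example.com')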
6. Obey robots.txt
Respect the robots.txt file of the websites you scrape. If scraping is disallowed for certain paths, avoid them to prevent detection.
require 'robots'

# The user agent is given once in the constructor; allowed? only needs the URL.
robots = Robots.new("My Ruby Scraper/1.0")

if robots.allowed?("http://www.example.com/")
  # Proceed with scraping
end
Additional Tips:
- Do Not Overload Servers: Making too many requests in a short amount of time can lead to detection. Always throttle your requests (a minimal throttle sketch follows this list).
- Use Real Browser Headers: Collect headers from real browsers and send them with your requests (see the headers sketch after this list).
- Handle CAPTCHAs: Some sites might present CAPTCHAs to verify you're human; consider services that can solve CAPTCHAs for you.
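For the throttling tip, one simple approach is to enforce a minimum gap between consecutive requests on top of the random sleeps shown earlier. This is a minimal sketch, not a full rate limiter; MIN_GAP and throttled_get are illustrative names:

require 'httparty'

# Minimal throttle: guarantee at least MIN_GAP seconds between requests.
MIN_GAP = 2.0

def throttled_get(url)
  @last_request_at ||= Time.at(0)
  wait = MIN_GAP - (Time.now - @last_request_at)
  sleep(wait) if wait > 0
  @last_request_at = Time.now
  HTTParty.get(url)
end

response = throttled_get('https://example.com')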
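As a sketch of the real-browser-headers tip, you can capture the headers a desktop browser sends and replay them. The values below are examples modeled on a Chrome session, so substitute whatever the browsers you imitate actually send:

require 'httparty'

# Example headers modeled on a desktop Chrome session; adjust to match real traffic.
BROWSER_HEADERS = {
  "User-Agent"      => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Accept"          => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-US,en;q=0.9",
  "Referer"         => "https://www.google.com/"
}

response = HTTParty.get('https://example.com', headers: BROWSER_HEADERS)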
Keep in mind that scraping can be a legal and ethical gray area. Always review the terms of service of the website you are scraping, and consider reaching out for permission to scrape. The techniques above are to be used responsibly and not for circumventing legal or ethical boundaries.