Ruby offers several libraries designed specifically for web scraping, parsing HTML, and automating interaction with websites. Here are some of the most common:
Nokogiri: Nokogiri is probably the most popular Ruby library for parsing HTML and XML. It provides a simple way to search and manipulate the DOM of a page using CSS selectors or XPath expressions.
```ruby
require 'nokogiri'
require 'open-uri'

url = 'http://example.com'
doc = Nokogiri::HTML(URI.open(url))

doc.css('h1').each do |h1|
  puts h1.content
end
```
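Since Nokogiri supports XPath as well as CSS selectors, the same traversal can be written either way. A minimal sketch; the `a`/`href` attribute extraction is illustrative:

```ruby
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://example.com'))

# XPath equivalent of the CSS selector 'h1'
doc.xpath('//h1').each { |h1| puts h1.content }

# Attribute values are read with []; links and 'href' are just an example
doc.css('a').each { |link| puts link['href'] }
```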
Mechanize: Mechanize is a library that automates web interaction. It can fill out forms, manage cookies and sessions, and follow links, much like a browser would. It is built on top of Nokogiri, which it uses for parsing HTML.
```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com')

form = page.forms.first
form['field'] = 'value'
page = agent.submit(form)

puts page.search('div.content').text.strip
```
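Mechanize can also follow links directly, which is handy for crawling paginated listings. A minimal sketch; the link text 'Next' is an assumption about the page:

```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com')

# Follow the first link whose text matches 'Next' (hypothetical link text)
next_link = page.link_with(text: 'Next')
page = next_link.click if next_link

puts page.title
```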
HTTParty: HTTParty is a library that simplifies making HTTP requests. It's not specifically for web scraping, but it's often used in conjunction with Nokogiri for fetching HTML content before parsing.
```ruby
require 'httparty'
require 'nokogiri'

response = HTTParty.get('http://example.com')
doc = Nokogiri::HTML(response.body)

doc.css('p').each do |p|
  puts p.text
end
```
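HTTParty also lets you set request headers, which many sites expect from well-behaved scrapers. A minimal sketch; the User-Agent string is just an example value:

```ruby
require 'httparty'

# Identify your scraper with a custom User-Agent (example value)
response = HTTParty.get(
  'http://example.com',
  headers: { 'User-Agent' => 'MyScraper/1.0 (contact@example.com)' }
)

puts response.code                      # HTTP status, e.g. 200
puts response.headers['content-type']   # response header lookup
```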
Watir: Watir (Web Application Testing in Ruby) is a library for automating web browsers. It allows you to control a browser (such as Chrome, Firefox, or Edge) and interact with web pages as a user would.
```ruby
require 'watir'

browser = Watir::Browser.new :firefox
browser.goto 'http://example.com'
puts browser.h1.text
browser.close
```
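Because Watir drives a real browser, it can wait for content that JavaScript renders after the initial page load. A sketch assuming a hypothetical element with id 'results':

```ruby
require 'watir'

browser = Watir::Browser.new :firefox
browser.goto 'http://example.com'

# Wait until a (hypothetical) JavaScript-rendered element is present
results = browser.div(id: 'results')
results.wait_until(&:present?)
puts results.text

browser.close
```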
Capybara: While Capybara is primarily used for integration testing web applications, it can also be used for web scraping, especially for JavaScript-heavy websites where you need to interact with the page or wait for elements to become available.
```ruby
require 'capybara'
require 'capybara/dsl'

Capybara.current_driver = :selenium_chrome_headless
include Capybara::DSL

visit 'http://example.com'
puts page.find('h1').text
```
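Capybara's finders retry automatically until an element appears, which is what makes it useful on JavaScript-heavy pages; the wait budget is configurable. A sketch where the selector '#dynamic-content' is hypothetical:

```ruby
require 'capybara'
require 'capybara/dsl'

Capybara.current_driver = :selenium_chrome_headless
Capybara.default_max_wait_time = 10 # seconds finders retry before failing
include Capybara::DSL

visit 'http://example.com'

# find retries until the (hypothetical) element appears or the wait elapses
puts find('#dynamic-content').text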
When using these libraries for web scraping, it's important to be mindful of the terms of service of the website you're scraping, as well as any legal considerations. Always respect robots.txt files and consider rate limiting your requests to avoid causing issues for the website you're scraping.
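Rate limiting can be as simple as pausing between requests. A minimal sketch in plain Ruby; the one-second delay and URL list are arbitrary:

```ruby
require 'httparty'

urls = ['http://example.com/page1', 'http://example.com/page2'] # example URLs

urls.each do |url|
  response = HTTParty.get(url)
  puts "#{url}: #{response.code}"
  sleep 1 # arbitrary pause between requests to avoid hammering the server
end
```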