Scraping websites in Ruby without a dedicated scraping library like Nokogiri is a bit more cumbersome, but it can be done with the built-in net/http library for making HTTP requests and a parser such as the built-in rexml or the third-party oga gem for working with the HTML content. Below is a step-by-step guide on how you can achieve this:
Step 1: Making an HTTP Request
You need to make an HTTP request to the website from which you want to scrape data. Ruby's built-in net/http library can be used for this purpose.
require 'net/http'
require 'uri'
# Parse the URI for the website you want to scrape
uri = URI.parse("http://example.com")
# Make an HTTP GET request
response = Net::HTTP.get_response(uri)
# Check if the request was successful
if response.is_a?(Net::HTTPSuccess)
  # Proceed with scraping the content
  html_content = response.body
else
  puts "Failed to retrieve web page"
end
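One thing to keep in mind is that Net::HTTP.get_response does not follow redirects for you. If you need that, you can follow the Location header manually; below is a minimal sketch (the fetch helper and the limit of 5 hops are my own choices, and it assumes the Location header contains an absolute URL):
require 'net/http'
require 'uri'
# Follow up to 'limit' redirects by re-requesting the Location header.
# Illustrative helper: adjust the limit and error handling to your needs.
def fetch(url, limit = 5)
  raise "Too many redirects" if limit.zero?
  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess     then response.body
  when Net::HTTPRedirection then fetch(response['location'], limit - 1) # assumes an absolute URL
  else
    raise "Request failed: #{response.code} #{response.message}"
  end
end
html_content = fetch("http://example.com")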
Step 2: Parsing HTML Content
Once you have the HTML content, you can parse it using Ruby's built-in XML parser rexml, or using oga, a third-party gem that is much easier to use for parsing HTML. Note that rexml is strictly an XML parser: it only accepts well-formed markup (such as XHTML) and will raise a parse error on much real-world HTML. If you choose to use oga, you will need to install it first with gem install oga.
Using rexml (built-in):
require 'rexml/document'
include REXML
# Assuming 'html_content' is the HTML you fetched earlier
document = Document.new(html_content)
# XPath to find elements, for example, extracting all the links
document.elements.each('//a') do |element|
  puts element.attributes['href']
end
Using oga (third-party):
First, install oga if you haven't already:
gem install oga
Then you can use it as follows:
require 'oga'
# Assuming 'html_content' is the HTML you fetched earlier
document = Oga.parse_html(html_content)
# Use CSS selectors to find elements, for example, extracting all the links
document.css('a').each do |link|
  puts link.get('href')
end
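oga also supports XPath if you prefer it over CSS selectors; the same link extraction could be written as:
# Equivalent query using XPath instead of a CSS selector
document.xpath('//a').each do |link|
  puts link.get('href')
end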
Step 3: Extracting Data
After parsing the HTML content, you can navigate through the elements to extract the data you need. rexml lets you query the document with XPath, while oga supports both XPath and CSS selectors.
Step 4: Handling Errors and Rate Limiting
When scraping websites, it's important to handle errors gracefully and respect the website's terms of service, including rate limiting. Make sure to add error handling to your code and consider adding delays between requests if necessary.
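As a rough sketch, you could wrap each request in a begin/rescue block and pause between requests. The urls list, the one-second delay, and the exact exception classes rescued below are illustrative choices, not requirements:
require 'net/http'
require 'uri'
# Hypothetical list of pages to fetch
urls = ["http://example.com/page1", "http://example.com/page2"]
urls.each do |url|
  begin
    response = Net::HTTP.get_response(URI.parse(url))
    if response.is_a?(Net::HTTPSuccess)
      # ... parse response.body here ...
    else
      puts "Skipping #{url}: #{response.code} #{response.message}"
    end
  rescue SocketError, Net::OpenTimeout, Net::ReadTimeout => e
    puts "Request to #{url} failed: #{e.message}"
  end
  sleep 1 # simple rate limiting: pause between requests
end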
Conclusion
Using built-in libraries for scraping in Ruby is more challenging and may be less efficient than using a dedicated scraping library like Nokogiri. However, it is possible and sometimes necessary, especially in environments where you cannot install additional gems or when dealing with very simple scraping tasks. Always make sure to scrape ethically by respecting robots.txt and the website's terms of service.