When scraping with Ruby, you'll encounter several content types, such as HTML, JSON, XML, and CSV. Each one needs the right parser if you want to extract data reliably. Here's how you can handle different content types when scraping with Ruby:
HTML
For HTML content, you can use libraries like Nokogiri, which provides an easy way to navigate and search the HTML document.
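require 'nokogiri'
require 'open-uri'
url = 'http://example.com'
# Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
html = URI.open(url)
doc = Nokogiri::HTML(html)
# Select the <title> element with a CSS selector and print its text
title = doc.css('title').text
puts title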
JSON
When dealing with JSON data, you can use Ruby's built-in JSON library to parse the content.
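require 'json'
require 'open-uri'
url = 'http://example.com/data.json'
# Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
json_data = URI.open(url).read
# JSON.parse returns a Hash or Array, depending on the top-level JSON value
data = JSON.parse(json_data)
puts data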
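Endpoints sometimes return an error page or a truncated body instead of valid JSON, so it can help to guard the parse. The following is a minimal sketch, not a definitive pattern: `parse_json_safely` is a hypothetical helper name, and the inline sample string stands in for a fetched response body.

```ruby
require 'json'

# Hypothetical helper: returns the parsed data, or nil if the body is not valid JSON.
def parse_json_safely(body)
  # symbolize_names: true turns object keys into symbols (:name instead of "name").
  JSON.parse(body, symbolize_names: true)
rescue JSON::ParserError => e
  warn "Could not parse JSON: #{e.message}"
  nil
end

# Assumed sample data in place of a real response body.
sample = '{"name": "Widget", "tags": ["a", "b"], "price": 9.99}'
data = parse_json_safely(sample)
puts data[:name]
puts data[:tags].length

# An HTML error page served in place of JSON yields nil rather than an exception.
p parse_json_safely('<html>Service unavailable</html>')
```

Returning nil keeps the calling code simple, but it also hides the reason for the failure, so in a larger scraper you may prefer to log the error or re-raise it.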
XML
For XML content, you can also use Nokogiri, which supports XML parsing in addition to HTML.
require 'nokogiri'
require 'open-uri'
url = 'http://example.com/data.xml'
# Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
xml_data = URI.open(url)
doc = Nokogiri::XML(xml_data)
# Assume the XML has <item> elements
items = doc.xpath('//item')
items.each do |item|
  puts item.text
end
CSV
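Ruby's csv library can be used to handle CSV content. It ships with Ruby, although since Ruby 3.4 it is a bundled gem rather than a default gem, so projects managed with Bundler need to list it in their Gemfile.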
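require 'csv'
require 'open-uri'
url = 'http://example.com/data.csv'
# Use URI.open (Ruby 3.0+) and read the body into a String for CSV.parse
csv_data = URI.open(url).read
# headers: true treats the first row as column names
CSV.parse(csv_data, headers: true) do |row|
  puts row.to_h
end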
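Fields parsed from CSV arrive as strings by default, which matters as soon as you want to do arithmetic on scraped values. The sketch below uses the `converters: :numeric` option of the csv library to convert numeric-looking fields; the inline sample data and its column names are assumptions made for illustration.

```ruby
require 'csv'

# Inline sample standing in for a downloaded CSV body (assumed data).
csv_text = <<~DATA
  name,price,in_stock
  Widget,9.99,true
  Gadget,24.50,false
DATA

# headers: true treats the first line as column names;
# converters: :numeric turns numeric-looking fields into Integer or Float.
table = CSV.parse(csv_text, headers: true, converters: :numeric)

# Without a block, CSV.parse returns a CSV::Table; convert each row to a Hash.
rows = table.map(&:to_h)
rows.each { |row| puts "#{row['name']}: #{row['price']}" }

# Prices are now Floats, so they can be summed directly.
total = rows.sum { |row| row['price'] }
puts total
```

Non-numeric fields such as `in_stock` stay as strings, so boolean-like columns still need their own conversion.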
Handling Content-Type Header
Sometimes you need to determine the content type dynamically by inspecting the Content-Type header of the HTTP response. You can use open-uri to get the content type and then decide how to parse the response body.
require 'open-uri'
require 'nokogiri'
require 'json'
url = 'http://example.com/data'
# Use URI.open: since Ruby 3.0, Kernel#open no longer accepts URLs
URI.open(url) do |response|
  # open-uri returns the media type without parameters such as charset
  content_type = response.content_type
  case content_type
  when 'application/json'
    data = JSON.parse(response.read)
    puts data
  when 'text/html'
    doc = Nokogiri::HTML(response.read)
    puts doc.css('title').text
  when 'text/xml', 'application/xml'
    doc = Nokogiri::XML(response.read)
    puts doc.xpath('//item').text
  # Add additional content type handling as needed
  else
    puts "Unknown content type: #{content_type}"
  end
end
In the above example, we're using the open-uri library to fetch the content and then switch our parsing strategy based on the Content-Type header received in the HTTP response.
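If you read the header with a different HTTP client, the raw value may carry parameters or mixed case, for example `Application/JSON; charset=utf-8`, and an exact string comparison would then miss it. The following is a rough sketch of normalizing the header before dispatching. `parser_for` and the returned symbols are hypothetical names chosen for this example, and the list of media types is illustrative rather than complete.

```ruby
# Hypothetical helper: maps a raw Content-Type header value to a parser name.
def parser_for(content_type_header)
  # Drop parameters such as "; charset=utf-8" and normalize case and whitespace.
  media_type = content_type_header.to_s.split(';').first.to_s.strip.downcase

  case media_type
  when 'application/json', /\+json\z/
    :json
  when 'text/html', 'application/xhtml+xml'
    :html
  when 'text/xml', 'application/xml', /\+xml\z/
    :xml
  when 'text/csv'
    :csv
  else
    :unknown
  end
end

p parser_for('Application/JSON; charset=utf-8')
p parser_for('application/atom+xml')
p parser_for(nil)
```

The regular expressions cover structured suffixes such as `application/ld+json` or `application/atom+xml`. The XHTML check sits before the generic `+xml` pattern so that XHTML is still routed to the HTML parser.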
When scraping, it's also important to respect the website's robots.txt and terms of service, to avoid legal issues and to be a good web citizen. Always scrape responsibly and consider the impact on the website's resources.