How do you handle different content types when scraping with Ruby?

Scraped responses arrive in several content types, most commonly HTML, JSON, XML, and CSV, and each one needs its own parser before you can extract data reliably. Here's how you can handle each of them in Ruby:

HTML

For HTML content, you can use libraries like Nokogiri, which provides an easy way to navigate and search the HTML document.

require 'nokogiri'
require 'open-uri'

url = 'http://example.com'
# URI.open is provided by open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
html = URI.open(url)

doc = Nokogiri::HTML(html)
title = doc.css('title').text
puts title

JSON

When dealing with JSON data, you can use Ruby's built-in JSON library to parse the content.

require 'json'
require 'open-uri'

url = 'http://example.com/data.json'
# URI.open is provided by open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
json_data = URI.open(url).read

data = JSON.parse(json_data)
puts data
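
An endpoint that normally returns JSON can occasionally send back something else, such as an HTML error page, so it may be worth guarding the parse. The following is a minimal sketch, not a definitive implementation: the helper name parse_json_body and the sample bodies are assumptions made up for illustration, while JSON.parse, its symbolize_names option, and JSON::ParserError come from Ruby's standard JSON library.

```ruby
require 'json'

# Hypothetical helper: returns the parsed data, or nil when the body is not valid JSON.
def parse_json_body(body)
  # symbolize_names: true turns object keys into symbols (:name instead of "name").
  JSON.parse(body, symbolize_names: true)
rescue JSON::ParserError => e
  # Report the problem on stderr and let the caller decide what to do with nil.
  warn "Could not parse JSON: #{e.message}"
  nil
end

# Made-up sample bodies standing in for real responses.
good = parse_json_body('{"name": "Widget", "tags": ["a", "b"], "price": 9.5}')
puts good[:name]
puts good[:tags].length

bad = parse_json_body('<html>Service unavailable</html>')
puts bad.inspect
```

Returning nil keeps the calling code simple, but re-raising or logging with more context may suit a larger scraper better.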

XML

For XML content, you can also use Nokogiri, which supports XML parsing in addition to HTML.

require 'nokogiri'
require 'open-uri'

url = 'http://example.com/data.xml'
# URI.open is provided by open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
xml_data = URI.open(url)

doc = Nokogiri::XML(xml_data)
# Assume the XML has <item> elements
items = doc.xpath('//item')
items.each do |item|
  puts item.text
end

CSV

Ruby's built-in CSV library can be used to handle CSV content.

require 'csv'
require 'open-uri'

url = 'http://example.com/data.csv'
# URI.open is provided by open-uri; read the body into a String for CSV.parse
csv_data = URI.open(url).read

CSV.parse(csv_data, headers: true) do |row|
  puts row.to_hash
end
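
CSV fields arrive as strings, so numeric columns usually need converting before you can do arithmetic on them. Here is a small sketch of one way to do that with the standard CSV library's converters option. The sample data and column names are invented for illustration, and the body is an inline string rather than a download so the example runs without network access.

```ruby
require 'csv'

# Made-up sample body standing in for a downloaded CSV file.
csv_body = <<~TEXT
  name,price,in_stock
  Widget,9.50,true
  Gadget,12,false
TEXT

# headers: true uses the first row as column names;
# converters: :numeric turns numeric-looking fields into Integer or Float values.
table = CSV.parse(csv_body, headers: true, converters: :numeric)

# Convert each CSV::Row into a plain Hash keyed by column name.
rows = table.map(&:to_h)

rows.each do |row|
  puts "#{row['name']}: #{row['price']} (#{row['price'].class})"
end
```

Fields that do not look numeric, such as the in_stock column here, are left as strings.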

Handling Content-Type Header

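Sometimes you don't know in advance what a URL will return, and you need to pick a parser by inspecting the Content-Type header of the HTTP response. open-uri exposes this as content_type on the response object, downcased and with parameters such as charset already stripped, so you can compare it directly against media types and then decide how to parse the response body.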

require 'open-uri'
require 'nokogiri'
require 'json'

url = 'http://example.com/data'

# URI.open is provided by open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
URI.open(url) do |response|
  content_type = response.content_type

  case content_type
  when 'application/json'
    data = JSON.parse(response.read)
    puts data
  when 'text/html'
    doc = Nokogiri::HTML(response.read)
    puts doc.css('title').text
  when 'text/xml', 'application/xml'
    doc = Nokogiri::XML(response.read)
    # Print each <item> on its own line instead of one concatenated string
    doc.xpath('//item').each { |item| puts item.text }
  # Add additional content type handling as needed
  else
    puts "Unknown content type: #{content_type}"
  end
end

In the above example, we're using the open-uri library to fetch the content and then switch our parsing strategy based on the Content-Type header received in the HTTP response.
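
Other HTTP clients may hand you the raw header value instead, for example "application/json; charset=utf-8", which would not match a plain string comparison. The sketch below shows one possible way to normalize such a value before dispatching. The helper names media_type and parse_body are hypothetical, and only JSON and CSV are handled so that the example depends on nothing outside the standard library; HTML and XML branches using Nokogiri could be added in the same way.

```ruby
require 'json'
require 'csv'

# Hypothetical helper: reduces a raw Content-Type header to a bare, lowercase media type.
# "Application/JSON; charset=utf-8" becomes "application/json"; nil becomes "".
def media_type(header)
  header.to_s.split(';').first.to_s.strip.downcase
end

# Hypothetical dispatcher: picks a parser from the media type and
# falls back to returning the body untouched for anything it does not recognize.
def parse_body(content_type_header, body)
  case media_type(content_type_header)
  when 'application/json', /\+json\z/   # also matches types such as application/ld+json
    JSON.parse(body)
  when 'text/csv'
    CSV.parse(body, headers: true).map(&:to_h)
  else
    body
  end
end

# Made-up header values and bodies for illustration.
p parse_body('application/json; charset=utf-8', '{"id": 1}')
p parse_body('text/csv', "id,name\n1,Ann\n")
p parse_body(nil, 'plain text')
```

Returning the raw body for unknown types is only one option; raising an error may be preferable when an unexpected type signals a problem upstream.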

When scraping, it's also important to respect the website's robots.txt and terms of service, to avoid legal issues and to be a good web citizen. Always scrape responsibly and consider the impact on the website's resources.
