Nokogiri is a Ruby gem that is widely used for parsing HTML and XML. When scraping data from web pages or processing XML documents, you may want to transform the data into JSON or XML formats for easier consumption by other services or applications. Below are examples showing how to scrape data using Nokogiri and transform it into JSON and XML formats.
Install Nokogiri
Before you begin, you need to have Nokogiri installed. You can install it using the following command:
gem install nokogiri
Example: Scraping HTML and Converting to JSON
Let's say you want to scrape a list of items from an HTML page and convert it to JSON.
require 'nokogiri'
require 'open-uri'
require 'json'
# Fetch and parse the HTML document
doc = Nokogiri::HTML(URI.open('http://example.com'))
# Suppose the items you want to scrape are in <li> tags within a <ul> with a class 'items-list'
items = []
doc.css('ul.items-list li').each do |li|
  item = {
    name: li.css('.item-name').text.strip,
    description: li.css('.item-description').text.strip,
    price: li.css('.item-price').text.strip
  }
  items << item
end
# Convert the array of items to JSON
json_data = items.to_json
# Output the JSON data
puts json_data
In the example, we're assuming that each list item (<li>) within the unordered list (<ul>) with class items-list contains child elements with classes item-name, item-description, and item-price that hold the relevant data.
Example: Scraping HTML and Converting to XML
If you want to convert the scraped data to XML instead of JSON, you can use Nokogiri's XML builder feature.
require 'nokogiri'
require 'open-uri'
# Fetch and parse the HTML document
doc = Nokogiri::HTML(URI.open('http://example.com'))
# Suppose the items you want to scrape are in <li> tags within a <ul> with a class 'items-list'
builder = Nokogiri::XML::Builder.new do |xml|
  xml.items {
    doc.css('ul.items-list li').each do |li|
      xml.item {
        xml.name li.css('.item-name').text.strip
        xml.description li.css('.item-description').text.strip
        xml.price li.css('.item-price').text.strip
      }
    end
  }
end
# Output the XML data
puts builder.to_xml
This script will generate an XML representation of the items, with each item element containing the name, description, and price sub-elements.
Handling Errors and Edge Cases
When scraping data from web pages, it's important to handle errors and edge cases. The site's structure might change, or the page might be temporarily unavailable. You should account for these possibilities by adding error handling to your code.
For example, when opening a URL, you might want to rescue from OpenURI::HTTPError:
begin
  doc = Nokogiri::HTML(URI.open('http://example.com'))
  # ... rest of the scraping code ...
rescue OpenURI::HTTPError => e
  puts "Error accessing page: #{e.message}"
end
Remember to respect the robots.txt file of any website you scrape and comply with its terms of service. Additionally, make sure you're not making too many requests in a short period, as this can overload the server and may be considered abusive behavior.
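One simple way to space out requests is to pause between them. The helper below is a minimal sketch that uses only Ruby's built-in sleep; the each_with_delay name and the one-second default are assumptions, and an appropriate delay depends on the site being scraped.

```ruby
# Hypothetical helper: calls the block once per URL, pausing between calls,
# and returns the block results in order.
def each_with_delay(urls, delay_seconds: 1.0)
  urls.each_with_index.map do |url, index|
    # No pause is needed before the first request.
    sleep(delay_seconds) if index.positive?
    yield url
  end
end

# Example usage (performs real requests, so it is left commented out):
#   require 'nokogiri'
#   require 'open-uri'
#   docs = each_with_delay(['http://example.com'], delay_seconds: 2) do |url|
#     Nokogiri::HTML(URI.open(url))
#   end
```

A fixed delay is the simplest option; it does not read robots.txt or adapt to server responses, so those checks still need to be handled separately.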