How do I scrape and transform data into JSON or XML using Nokogiri?

Nokogiri is a Ruby gem for parsing HTML and XML. After scraping a web page or processing an XML document, you often need to convert the extracted data to JSON or XML so that other services or applications can consume it. The examples below show how to scrape data with Nokogiri and output it in each format.

Install Nokogiri

Before you begin, you need to have Nokogiri installed. You can install it using the following command:

gem install nokogiri

Example: Scraping HTML and Converting to JSON

Let's say you want to scrape a list of items from an HTML page and convert it to JSON.

require 'nokogiri'
require 'open-uri'
require 'json'

# Fetch and parse the HTML document
doc = Nokogiri::HTML(URI.open('http://example.com'))

# Suppose the items you want to scrape are in <li> tags within a <ul> with a class 'items-list'
items = doc.css('ul.items-list li').map do |li|
  {
    name: li.css('.item-name').text.strip,
    description: li.css('.item-description').text.strip,
    price: li.css('.item-price').text.strip
  }
end

# Convert the array of items to JSON
json_data = items.to_json

# Output the JSON data
puts json_data

This example assumes that each list item (<li>) inside the unordered list (<ul>) with class items-list contains child elements with the classes item-name, item-description, and item-price, and that those elements hold the relevant data.

Example: Scraping HTML and Converting to XML

If you want to convert the scraped data to XML instead of JSON, you can use Nokogiri's XML builder, Nokogiri::XML::Builder.

require 'nokogiri'
require 'open-uri'

# Fetch and parse the HTML document
doc = Nokogiri::HTML(URI.open('http://example.com'))

# Suppose the items you want to scrape are in <li> tags within a <ul> with a class 'items-list'
builder = Nokogiri::XML::Builder.new do |xml|
  xml.items {
    doc.css('ul.items-list li').each do |li|
      xml.item {
        xml.name li.css('.item-name').text.strip
        xml.description li.css('.item-description').text.strip
        xml.price li.css('.item-price').text.strip
      }
    end
  }
end

# Output the XML data
puts builder.to_xml

This script generates an XML document with a root items element. Each item element inside it contains name, description, and price child elements.

Handling Errors and Edge Cases

When scraping data from web pages, it's important to handle errors and edge cases. The site's structure might change, or the page might be temporarily unavailable. You should account for these possibilities by adding error handling to your code.

For example, URI.open raises OpenURI::HTTPError when the server responds with an error status such as 404 or 500, so you can rescue it:

begin
  doc = Nokogiri::HTML(URI.open('http://example.com'))
  # ... rest of the scraping code ...
rescue OpenURI::HTTPError => e
  puts "Error accessing page: #{e.message}"
end

Remember to respect the robots.txt file of any website you scrape and comply with its terms of service. Additionally, make sure you're not making too many requests in a short period, as this can overload the server and may be considered abusive behavior.
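
One simple way to space out requests is to pause between them. The sketch below assumes a fixed delay is acceptable for the target site. The each_with_delay helper, the URLs, and the delay value are all hypothetical, and the fetch itself is left out so the snippet runs without network access.

```ruby
# Hypothetical helper: yields each URL, sleeping between consecutive requests
def each_with_delay(urls, delay_seconds: 1.0)
  urls.each_with_index do |url, index|
    # No pause is needed before the first request
    sleep(delay_seconds) if index.positive?
    yield url
  end
end

# Placeholder URLs for illustration only
urls = %w[http://example.com/page-a http://example.com/page-b]

visited = []
each_with_delay(urls, delay_seconds: 0.1) do |url|
  # A real script would fetch and parse the page here
  visited << url
end

puts visited.length
```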
