How can I use Nokogiri in a command-line application?

Nokogiri is a Ruby library for parsing HTML and XML; it also provides SAX and Reader interfaces for stream-based parsing. It's widely used for web scraping and data extraction. To use Nokogiri in a command-line application, you need Ruby installed on your system along with the Nokogiri gem.

Here's how you can get started using Nokogiri in a command-line application:

Step 1: Install Ruby

Make sure you have Ruby installed on your system. You can check if Ruby is installed and its version by running:

ruby -v

If Ruby is not installed, you can download and install it from the official Ruby website or use a version manager like rbenv or rvm.

Step 2: Install Nokogiri

Once Ruby is installed, you can install Nokogiri using the following command:

gem install nokogiri
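
If you manage dependencies with Bundler, you can declare Nokogiri in a Gemfile instead (a minimal sketch):

# Gemfile
source 'https://rubygems.org'

gem 'nokogiri'

Then run bundle install, and start your script with bundle exec ruby scraper.rb so the bundled gems are used.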

Step 3: Create Your Command-line Application

Create a new Ruby file for your command-line application. For example, you can create a file named scraper.rb.

# scraper.rb
require 'nokogiri'
require 'open-uri'

# Fetch the page and parse its HTML
url = 'http://www.example.com'
doc = Nokogiri::HTML(URI.open(url))

# Print the contents of the page's <title> tag
puts "Title: #{doc.title}"

In this simple example, the script opens a URL, parses the HTML with Nokogiri, and prints the title of the web page.
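
Nokogiri::HTML accepts a string or an IO object, so the same parsing works on HTML you already have on hand; the file name below is hypothetical:

# Parse a local file instead of fetching over the network
doc = Nokogiri::HTML(File.read('page.html'))

# Or parse an HTML string directly
doc = Nokogiri::HTML('<html><head><title>Hello</title></head></html>')
puts doc.title  # => Hello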

Step 4: Run Your Application

You can run your command-line application by executing the Ruby file you just created:

ruby scraper.rb
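
If the request succeeds, the script prints the page's title, e.g. "Title: Example Domain" for example.com. Optionally, you can add a shebang as the first line of scraper.rb:

#!/usr/bin/env ruby

Then make the file executable and run it directly:

chmod +x scraper.rb
./scraper.rb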

Additional Tips

  • Error Handling: Handle errors such as network failures or unexpected changes in a page's HTML structure (demonstrated in the example below).
  • User-Agent: Some websites block requests that don't appear to come from a browser, so you may need to set a User-Agent header that mimics one.
  • Command-line Arguments: Use ARGV to access arguments passed to your Ruby script from the command line; this makes the script adaptable to different URLs or parameters (also shown in the example below).
  • Output Formatting: Depending on your needs, you might want to write the output to a file or format it as JSON or CSV (see the sketch after this list).
  • Advanced Usage: As your needs grow, look into Nokogiri's XPath and CSS selectors for more precise element selection, or its SAX parser for large XML files (the sketch below uses CSS selectors).
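
To illustrate the last two tips, here's a minimal sketch that selects elements with CSS selectors and writes them out as CSV; the selector ('a') and the output file name links.csv are arbitrary choices for this example:

# links_to_csv.rb: extract every link from a page and save it as CSV
require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(URI.open('http://www.example.com'))

# CSS selector: all <a> elements (doc.xpath('//a') would be equivalent)
links = doc.css('a')

# Write each link's text and href attribute to a CSV file
CSV.open('links.csv', 'w') do |csv|
  csv << ['text', 'href']
  links.each do |link|
    csv << [link.text.strip, link['href']]
  end
end

puts "Wrote #{links.size} links to links.csv"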

Example with Command-line Arguments and Error Handling

Here's an example of a more advanced command-line application that takes a URL as an argument and has basic error handling:

# scraper.rb
require 'nokogiri'
require 'open-uri'

# Require a URL argument and exit with a non-zero status if it's missing
if ARGV.empty?
  puts "Usage: ruby scraper.rb <URL>"
  exit 1
end

begin
  url = ARGV[0]
  # Send a User-Agent header; swap in a browser-like string if a site blocks you
  html = URI.open(url, 'User-Agent' => 'ruby')
  doc = Nokogiri::HTML(html)

  puts "Title: #{doc.title}"
rescue OpenURI::HTTPError => e
  # The server responded with an HTTP error status (e.g. 404 or 500)
  puts "Failed to retrieve web page: #{e.message}"
rescue SocketError => e
  # DNS resolution or connection failure
  puts "Failed to connect to the server: #{e.message}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
end

Run the script with:

ruby scraper.rb http://www.example.com

Remember that web scraping can have legal and ethical implications. Always check a website's robots.txt and Terms of Service to ensure you are allowed to scrape it, and never scrape at a high rate that could negatively impact the website's server.
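
As a quick illustration of that first check (the URL is a placeholder, and URI.open raises OpenURI::HTTPError if the file doesn't exist), you can fetch a site's robots.txt with the same open-uri library used above:

# Print a site's robots.txt before scraping it
require 'open-uri'

puts URI.open('http://www.example.com/robots.txt').read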
