How can I parse HTML responses with HTTParty and Nokogiri?

Combining HTTParty and Nokogiri creates a powerful toolchain for web scraping in Ruby. HTTParty handles HTTP requests while Nokogiri excels at parsing and manipulating HTML/XML documents. This combination allows you to fetch web pages and extract structured data efficiently.

Overview of HTTParty and Nokogiri

HTTParty is a Ruby gem that simplifies HTTP requests with an intuitive API. It handles various HTTP methods, headers, authentication, and response parsing.
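
A quick sketch of HTTParty on its own (the URL, credentials, and query values here are placeholders):

require 'httparty'

response = HTTParty.get(
  'https://httpbin.org/get',
  query: { page: 1 },
  headers: { 'Accept' => 'application/json' },
  basic_auth: { username: 'user', password: 'secret' }
)

puts response.code                     # HTTP status as an Integer
puts response.headers['content-type']  # response headers
puts response.parsed_response          # body parsed according to its content type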

Nokogiri is a fast and flexible HTML/XML parser that provides CSS selectors and XPath support for element selection and manipulation.
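
And Nokogiri on its own, parsing a small HTML fragment without any HTTP request:

require 'nokogiri'

html = '<html><body><h1 class="title">Hello</h1><a href="/about">About</a></body></html>'
doc = Nokogiri::HTML(html)

puts doc.css('h1.title').text          # CSS selector  => "Hello"
puts doc.at_xpath('//a/@href').value   # XPath         => "/about"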

Installation and Setup

First, add both gems to your Gemfile:

gem 'httparty'
gem 'nokogiri'

Then install them:

bundle install

Or install individually:

gem install httparty nokogiri

Basic HTML Parsing with HTTParty and Nokogiri

Here's a simple example that fetches a webpage and parses its content:

require 'httparty'
require 'nokogiri'

class WebScraper
  include HTTParty

  def self.scrape_page(url)
    # Fetch the webpage
    response = get(url)

    # Parse the HTML response
    doc = Nokogiri::HTML(response.body)

    # Extract data using CSS selectors
    title = doc.css('title').text
    headings = doc.css('h1, h2, h3').map(&:text)

    {
      title: title,
      headings: headings,
      status: response.code
    }
  end
end

# Usage
result = WebScraper.scrape_page('https://example.com')
puts result[:title]

Advanced HTML Parsing Techniques

Using XPath Selectors

XPath provides more powerful selection capabilities than CSS selectors:

require 'httparty'
require 'nokogiri'

def scrape_with_xpath(url)
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  # Extract links with specific attributes
  external_links = doc.xpath('//a[starts-with(@href, "http")]/@href').map(&:value)

  # Find elements whose text contains a keyword ("." is the element's full text content)
  paragraphs_with_keyword = doc.xpath('//p[contains(., "scraping")]').map(&:text)

  # Get the first paragraph in the document (the parentheses matter:
  # //p[1] would select the first <p> child of every parent instead)
  first_paragraph = doc.xpath('(//p)[1]').text

  {
    external_links: external_links,
    keyword_paragraphs: paragraphs_with_keyword,
    first_paragraph: first_paragraph
  }
end

Handling Different Content Types

def parse_response_by_type(url)
  response = HTTParty.get(url)

  case response.headers['content-type']
  when /html/
    doc = Nokogiri::HTML(response.body)
    return extract_html_data(doc)
  when /xml/
    doc = Nokogiri::XML(response.body)
    return extract_xml_data(doc)
  when /json/
    return JSON.parse(response.body)
  else
    return response.body
  end
end

def extract_html_data(doc)
  {
    title: doc.css('title').text,
    meta_description: doc.css('meta[name="description"]').first&.[]('content'),
    article_content: doc.css('article, .content, .post').map(&:text)
  }
end
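
The XML branch above calls an extract_xml_data helper that isn't defined in this example. A minimal sketch, assuming a feed-like document (the element names are placeholders, not part of the original code):

def extract_xml_data(doc)
  # Drop namespaces so simple XPath queries work on namespaced feeds
  doc.remove_namespaces!

  {
    root: doc.root&.name,
    item_titles: doc.xpath('//item/title | //entry/title').map(&:text)
  }
end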

Error Handling and Robust Parsing

Implement proper error handling to make your scraper more reliable:

require 'httparty'
require 'nokogiri'

class RobustScraper
  include HTTParty

  # Set default options
  default_timeout 30
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; RubyScraper/1.0)'

  def self.safe_scrape(url)
    begin
      response = get(url)

      # Check response status
      unless response.success?
        return { error: "HTTP #{response.code}: #{response.message}" }
      end

      # Attempt to parse HTML
      doc = Nokogiri::HTML(response.body)

      # Nokogiri::HTML is lenient and rarely raises, so check for a usable document
      # rather than relying on doc.errors, which also lists recoverable warnings
      if doc.root.nil?
        return { error: "Failed to parse HTML", parse_errors: doc.errors }
      end

      # Extract data safely
      data = extract_safe_data(doc)
      return { success: true, data: data }

    rescue HTTParty::Error => e
      return { error: "HTTP error: #{e.message}" }
    rescue Nokogiri::XML::SyntaxError => e
      return { error: "Parse error: #{e.message}" }
    rescue => e
      return { error: "Unexpected error: #{e.message}" }
    end
  end

  # NOTE: `private` has no effect on methods defined with `def self.`;
  # private_class_method (after the definition) is used instead
  def self.extract_safe_data(doc)
    {
      title: doc.css('title').first&.text&.strip || 'No title found',
      headings: doc.css('h1, h2, h3').map { |h| h.text.strip }.reject(&:empty?),
      links: doc.css('a[href]').map { |link| 
        { text: link.text.strip, url: link['href'] } 
      }.reject { |link| link[:text].empty? },
      images: doc.css('img[src]').map { |img|
        { alt: img['alt'] || '', src: img['src'] }
      }
    }
  end
  private_class_method :extract_safe_data
end

# Usage with error handling
result = RobustScraper.safe_scrape('https://example.com')
if result[:success]
  puts "Title: #{result[:data][:title]}"
  puts "Found #{result[:data][:links].count} links"
else
  puts "Error: #{result[:error]}"
end

Working with Forms and Form Data

Extract form information for automated form submissions:

def extract_form_data(url)
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  forms = doc.css('form').map do |form|
    {
      action: form['action'],
      method: form['method'] || 'GET',
      fields: form.css('input, select, textarea').map do |field|
        {
          name: field['name'],
          type: field['type'] || field.name,
          value: field['value'],
          required: field.has_attribute?('required')
        }
      end.reject { |f| f[:name].nil? } # skip unnamed fields such as plain buttons
    }
  end

  forms
end
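
For example, listing each form's method, action, and field names (the URL is a placeholder):

extract_form_data('https://example.com/login').each do |form|
  names = form[:fields].map { |f| f[:name] }.compact.join(', ')
  puts "#{form[:method].upcase} #{form[:action]} -> #{names}"
end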

# Example: Auto-fill and submit a form
def submit_form_data(url, form_data)
  # First, get the page to extract CSRF tokens or hidden fields
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  # Extract hidden fields (like CSRF tokens)
  hidden_fields = {}
  doc.css('form input[type="hidden"]').each do |field|
    hidden_fields[field['name']] = field['value']
  end

  # Merge with our form data
  complete_data = hidden_fields.merge(form_data)

  # Submit the form, resolving a relative action against the page URL
  form_action = URI.join(url, doc.css('form').first['action']).to_s
  HTTParty.post(form_action, body: complete_data)
end

Performance Optimization

Streaming Large Documents

For large HTML documents, you can stream the response body and feed it to a SAX parser so the full document never has to sit in memory:

require 'httparty'
require 'nokogiri'

def parse_large_document(url)
  handler = MyDocumentHandler.new
  parser = Nokogiri::HTML::SAX::PushParser.new(handler)

  # stream_body yields the response body in fragments instead of buffering it,
  # and the SAX push parser consumes each fragment without building a DOM tree
  HTTParty.get(url, stream_body: true) do |fragment|
    parser << fragment.to_s
  end
  parser.finish

  handler
end

class MyDocumentHandler < Nokogiri::XML::SAX::Document
  attr_reader :data

  def initialize
    @current_element = nil
    @data = []
  end

  def start_element(name, attrs = [])
    @current_element = name
    if name == 'a' && attrs.find { |attr| attr[0] == 'href' }
      @data << { type: 'link', href: attrs.find { |attr| attr[0] == 'href' }[1] }
    end
  end

  def characters(string)
    if @current_element == 'title'
      @data << { type: 'title', text: string.strip }
    end
  end
end
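
With the handler exposing its collected entries via attr_reader, using the streaming parser looks like this:

handler = parse_large_document('https://example.com')
handler.data.each { |entry| puts entry.inspect }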

Connection Reuse and Pooling

require 'httparty'
require 'nokogiri'
require 'persistent_httparty' # adds persistent_connection_adapter to HTTParty classes

class PooledScraper
  include HTTParty

  # Reuse TCP connections between requests; persistent_httparty wraps
  # net-http-persistent, and the pool options below follow that gem's API
  persistent_connection_adapter pool_size: 10,
                                idle_timeout: 10,
                                keep_alive: 30

  # Per-request timeouts are still plain HTTParty settings
  default_timeout 30
  open_timeout 10
  read_timeout 30

  def self.scrape_multiple_pages(urls)
    results = urls.map do |url|
      response = get(url)
      doc = Nokogiri::HTML(response.body)

      {
        url: url,
        title: doc.css('title').text,
        status: response.code
      }
    end

    results
  end
end
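
Usage is the same as any other HTTParty class; the connections are simply reused behind the scenes:

pages = PooledScraper.scrape_multiple_pages([
  'https://example.com',
  'https://example.org'
])
pages.each { |page| puts "#{page[:url]}: #{page[:title]} (#{page[:status]})" }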

Common Patterns and Best Practices

Data Extraction with CSS Selectors

def extract_structured_data(doc)
  # Extract article metadata
  article_data = {
    title: doc.css('h1').first&.text&.strip,
    author: doc.css('.author, [data-author]').first&.text&.strip,
    date: doc.css('time, .date, [datetime]').first&.[]('datetime'),
    content: doc.css('.content, .article-body, main').first&.text&.strip,
    tags: doc.css('.tag, .category').map { |tag| tag.text.strip },
    images: doc.css('img').map { |img| 
      {
        src: img['src'],
        alt: img['alt'],
        caption: img.parent.css('figcaption').text
      }
    }
  }

  # Drop nil values (Hash#compact removes keys whose value is nil)
  article_data.compact
end

Handling Dynamic Content

While HTTParty and Nokogiri work well for static HTML, some sites render their content with JavaScript. HTTParty only sees the initial server response, so for those pages you need a headless browser or a rendering service to obtain the final DOM before parsing it with Nokogiri.
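
One option on the Ruby side is a headless-browser gem such as ferrum, which drives Chrome and returns the rendered HTML that you can still feed to Nokogiri. A minimal sketch, assuming ferrum is installed and Chrome is available on the machine:

require 'ferrum'
require 'nokogiri'

def scrape_rendered_page(url)
  browser = Ferrum::Browser.new

  # Load the page in headless Chrome so JavaScript runs before we grab the HTML
  browser.go_to(url)
  html = browser.body

  Nokogiri::HTML(html).css('h1').map(&:text)
ensure
  browser&.quit
end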

Rate Limiting and Respectful Scraping

class RespectfulScraper
  include HTTParty

  def self.scrape_with_delays(urls, delay: 1)
    results = []

    urls.each_with_index do |url, index|
      # Add delay between requests
      sleep(delay) unless index.zero?

      begin
        response = get(url)
        doc = Nokogiri::HTML(response.body)

        # extract_data stands in for your own extraction helper
        # (for example, the extract_safe_data method shown earlier)
        results << extract_data(doc)
        puts "Scraped: #{url} (#{response.code})"

      rescue => e
        puts "Failed to scrape #{url}: #{e.message}"
        results << { error: e.message, url: url }
      end
    end

    results
  end
end

Debugging and Troubleshooting

Inspecting HTTP Responses

def debug_response(url)
  response = HTTParty.get(url, debug_output: STDOUT)

  puts "Status: #{response.code}"
  puts "Headers: #{response.headers}"
  puts "Content-Type: #{response.headers['content-type']}"
  puts "Body length: #{response.body.length}"

  # Save response for inspection
  File.write('debug_response.html', response.body)

  doc = Nokogiri::HTML(response.body)
  puts "Document errors: #{doc.errors}" unless doc.errors.empty?
end

Testing Selectors

def test_selectors(html_content)
  doc = Nokogiri::HTML(html_content)

  selectors = [
    'title',
    'h1, h2, h3',
    '.content, .article, main',
    'a[href^="http"]'
  ]

  selectors.each do |selector|
    elements = doc.css(selector)
    puts "#{selector}: Found #{elements.count} elements"
    elements.first(3).each { |el| puts "  - #{el.text.strip[0..50]}..." }
  end
end

Conclusion

HTTParty and Nokogiri provide a robust foundation for web scraping in Ruby. HTTParty handles the HTTP complexity while Nokogiri offers powerful HTML parsing capabilities. This combination allows you to build efficient scrapers that can extract structured data from websites.

Key takeaways:

  • Always implement proper error handling for network and parsing errors
  • Use appropriate selectors (CSS or XPath) based on your extraction needs
  • Respect website resources with proper delays and connection management
  • Consider the website's structure and content type when designing your parsing logic
  • Test your selectors thoroughly and handle edge cases gracefully

For more complex scenarios involving dynamic content or sophisticated browser interactions, you might need to explore headless browser solutions that can handle authentication flows or manage complex browser sessions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
