How do I scrape data from REST APIs using Ruby?

Scraping data from REST APIs using Ruby is a fundamental skill for developers working with web services and data integration. Ruby provides several powerful libraries and approaches for making HTTP requests, handling responses, and processing API data efficiently.

Understanding REST API Scraping vs Web Scraping

REST API scraping differs from traditional web scraping in that you're working with structured data endpoints rather than parsing HTML content. APIs typically return JSON or XML data, making data extraction more straightforward and reliable than parsing HTML markup.
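
For a quick sense of the difference, here is a minimal sketch comparing the two approaches (the endpoints, response shapes, and the .user-name selector are hypothetical, and the HTML side assumes the nokogiri gem is installed):

require 'net/http'
require 'json'
require 'nokogiri'

# API endpoint: structured JSON, parsed directly into Ruby hashes/arrays
api_body = Net::HTTP.get(URI('https://api.example.com/users'))
users = JSON.parse(api_body) # => [{ "id" => 1, "name" => "..." }, ...]

# HTML page: the same data has to be located inside markup with CSS selectors
html_body = Net::HTTP.get(URI('https://example.com/users'))
names = Nokogiri::HTML(html_body).css('.user-name').map(&:text)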

Essential Ruby Libraries for API Scraping

1. Net::HTTP (Built-in)

Ruby's standard library includes Net::HTTP, which provides basic HTTP functionality:

require 'net/http'
require 'json'
require 'uri'

def fetch_api_data(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    JSON.parse(response.body)
  else
    puts "Error: #{response.code} - #{response.message}"
    nil
  end
end

# Example usage
data = fetch_api_data('https://api.example.com/users')
puts data
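
Net::HTTP needs a bit more ceremony for query parameters and custom headers. A minimal sketch, with placeholder endpoint, parameters, and header values:

require 'net/http'
require 'json'
require 'uri'

uri = URI('https://api.example.com/users')
uri.query = URI.encode_www_form(page: 1, per_page: 50) # appends ?page=1&per_page=50

request = Net::HTTP::Get.new(uri)
request['Accept'] = 'application/json'
request['User-Agent'] = 'my-ruby-scraper/1.0' # placeholder identifier

response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request)
end

data = JSON.parse(response.body) if response.is_a?(Net::HTTPSuccess)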

2. HTTParty (Recommended)

HTTParty is a popular gem that simplifies HTTP requests:

require 'httparty'

class ApiScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def self.get_users
    get('/users')
  end

  def self.get_user(id)
    get("/users/#{id}")
  end
end

# Usage
users = ApiScraper.get_users
user = ApiScraper.get_user(1)
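
HTTParty also accepts per-request options such as query parameters, headers, and timeouts, and parses JSON responses automatically. A short sketch with a placeholder endpoint and parameters:

require 'httparty'

response = HTTParty.get(
  'https://api.example.com/users',
  query: { page: 1, per_page: 50 },            # appended as ?page=1&per_page=50
  headers: { 'Accept' => 'application/json' },
  timeout: 10                                  # seconds before the request is aborted
)

puts response.code                    # HTTP status as an Integer
puts response.headers['content-type']
puts response.parsed_response         # JSON already parsed into Ruby hashes/arrays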

3. Faraday (Advanced)

Faraday offers more flexibility and middleware support:

require 'faraday'
require 'json'

class AdvancedApiScraper
  def initialize(base_url)
    @conn = Faraday.new(url: base_url) do |faraday|
      faraday.request :json
      faraday.response :json
      faraday.adapter Faraday.default_adapter
    end
  end

  def get(endpoint)
    response = @conn.get(endpoint)
    response.body if response.success?
  end
end

# Usage
scraper = AdvancedApiScraper.new('https://api.example.com')
data = scraper.get('/users')

Handling Authentication

API Key Authentication

require 'httparty'

class AuthenticatedScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def initialize(api_key)
    @options = {
      headers: {
        'Authorization' => "Bearer #{api_key}",
        'Content-Type' => 'application/json'
      }
    }
  end

  def get_data(endpoint)
    self.class.get(endpoint, @options)
  end
end

# Usage
scraper = AuthenticatedScraper.new('your-api-key')
data = scraper.get_data('/protected-endpoint')

OAuth 2.0 Authentication

require 'httparty'
require 'oauth2'

class OAuthScraper
  def initialize(client_id, client_secret, site)
    @client = OAuth2::Client.new(client_id, client_secret, site: site)
  end

  def get_access_token
    @client.client_credentials.get_token
  end

  def fetch_data(endpoint)
    token = get_access_token
    response = token.get(endpoint)
    JSON.parse(response.body)
  end
end
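
A hedged usage sketch, assuming the provider supports the client credentials grant (the credentials and endpoint are placeholders):

# Usage
scraper = OAuthScraper.new('your-client-id', 'your-client-secret', 'https://api.example.com')
data = scraper.fetch_data('/protected-resource')
puts data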

Error Handling and Retries

Robust API scraping requires proper error handling:

require 'httparty'
require 'retries'

class RobustApiScraper
  include HTTParty

  def self.fetch_with_retry(url, max_retries = 3)
    with_retries(max_tries: max_retries, 
                 base_sleep_seconds: 1, 
                 max_sleep_seconds: 10) do
      response = get(url)

      case response.code
      when 200
        response.parsed_response
      when 429
        # Rate limited - wait longer
        sleep(60)
        raise 'Rate limited'
      when 500..599
        # Server error - retry
        raise "Server error: #{response.code}"
      else
        # Client error - don't retry
        puts "Client error: #{response.code}"
        return nil
      end
    end
  rescue => e
    puts "Failed after #{max_retries} retries: #{e.message}"
    nil
  end
end
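
A quick usage sketch (the URL is a placeholder):

# Usage
data = RobustApiScraper.fetch_with_retry('https://api.example.com/users')
puts data unless data.nil?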

Rate Limiting and Respectful Scraping

Implement rate limiting to avoid overwhelming APIs:

require 'httparty'

class RateLimitedScraper
  def initialize(requests_per_second = 1)
    @min_interval = 1.0 / requests_per_second
    @last_request_time = Time.at(0) # far in the past, so the first request is never delayed
  end

  def make_request(url)
    sleep_time = @min_interval - (Time.now - @last_request_time)
    sleep(sleep_time) if sleep_time > 0

    @last_request_time = Time.now
    HTTParty.get(url)
  end
end

# Usage
scraper = RateLimitedScraper.new(2) # 2 requests per second
response = scraper.make_request('https://api.example.com/data')

Pagination Handling

Many APIs use pagination for large datasets:

class PaginatedScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def fetch_all_pages(endpoint, per_page = 100)
    all_data = []
    page = 1

    loop do
      response = self.class.get(endpoint, query: {
        page: page,
        per_page: per_page
      })

      break unless response.success?

      data = response.parsed_response
      break if data['data'].empty?

      all_data.concat(data['data'])
      page += 1

      # Stop when the API reports no more pages (assumes a 'total_pages' field in the response)
      break if data['total_pages'] && page > data['total_pages']

      # Rate limiting
      sleep(0.5)
    end

    all_data
  end
end
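
A hedged usage sketch (the endpoint is a placeholder, and the response is assumed to wrap results in a 'data' array as above):

# Usage
scraper = PaginatedScraper.new
records = scraper.fetch_all_pages('/users')
puts "Fetched #{records.size} records"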

Data Processing and Storage

Process and store the scraped data efficiently:

require 'csv'
require 'sqlite3'

class DataProcessor
  def initialize(db_path = 'scraped_data.db')
    @db = SQLite3::Database.new(db_path)
    create_tables
  end

  def create_tables
    @db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT,
        created_at DATETIME
      )
    SQL
  end

  def store_users(users_data)
    users_data.each do |user|
      @db.execute(
        "INSERT OR REPLACE INTO users (id, name, email, created_at) VALUES (?, ?, ?, ?)",
        [user['id'], user['name'], user['email'], Time.now]
      )
    end
  end

  def export_to_csv(filename = 'users.csv')
    CSV.open(filename, 'w', write_headers: true, headers: ['ID', 'Name', 'Email', 'Created At']) do |csv|
      @db.execute("SELECT * FROM users") do |row|
        csv << row
      end
    end
  end
end
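
A short usage sketch, assuming the scraped users are an array of hashes with 'id', 'name', and 'email' keys:

# Usage
processor = DataProcessor.new
users_data = [{ 'id' => 1, 'name' => 'Alice', 'email' => 'alice@example.com' }]
processor.store_users(users_data)
processor.export_to_csv('users.csv')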

Concurrent API Scraping

For improved performance, use concurrent requests:

require 'concurrent' # provided by the concurrent-ruby gem
require 'httparty'

class ConcurrentScraper
  def initialize(max_threads = 10)
    @pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: max_threads,
      max_queue: 100
    )
  end

  def scrape_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @pool) do
        HTTParty.get(url)
      end
    end

    futures.map(&:value)
  end

  def shutdown
    @pool.shutdown
    @pool.wait_for_termination
  end
end

# Usage
scraper = ConcurrentScraper.new(5)
urls = ['https://api.example.com/users/1', 'https://api.example.com/users/2']
responses = scraper.scrape_urls(urls)
scraper.shutdown

Complete Example: GitHub API Scraper

Here's a comprehensive example that scrapes GitHub repositories:

require 'httparty'
require 'json'

class GitHubScraper
  include HTTParty
  base_uri 'https://api.github.com'

  def initialize(token = nil)
    @headers = {
      'User-Agent' => 'Ruby-API-Scraper',
      'Accept' => 'application/vnd.github.v3+json'
    }
    @headers['Authorization'] = "token #{token}" if token
  end

  def get_user_repos(username, per_page = 30)
    all_repos = []
    page = 1

    loop do
      response = self.class.get(
        "/users/#{username}/repos",
        headers: @headers,
        query: { per_page: per_page, page: page, sort: 'updated' }
      )

      break unless response.success?

      repos = response.parsed_response
      break if repos.empty?

      all_repos.concat(repos)
      page += 1

      # GitHub rate limiting
      sleep(1)
    end

    all_repos
  end

  def extract_repo_info(repos)
    repos.map do |repo|
      {
        name: repo['name'],
        description: repo['description'],
        language: repo['language'],
        stars: repo['stargazers_count'],
        forks: repo['forks_count'],
        url: repo['html_url'],
        updated_at: repo['updated_at']
      }
    end
  end
end

# Usage
scraper = GitHubScraper.new('your-github-token')
repos = scraper.get_user_repos('octocat')
repo_info = scraper.extract_repo_info(repos)

puts JSON.pretty_generate(repo_info)

Best Practices and Tips

  1. Read API Documentation: Always review the API documentation for rate limits, authentication requirements, and response formats.

  2. Use Appropriate User Agents: Set descriptive User-Agent headers to identify your application.

  3. Implement Caching: Cache responses when appropriate to reduce API calls:

require 'redis'
require 'httparty'
require 'json'

class CachedScraper
  def initialize
    @redis = Redis.new
  end

  def fetch_with_cache(url, ttl = 3600)
    cached = @redis.get(url)
    return JSON.parse(cached) if cached

    response = HTTParty.get(url)
    if response.success?
      @redis.setex(url, ttl, response.body)
      response.parsed_response
    end
  end
end

  4. Monitor API Usage: Track your API usage to stay within limits.

  5. Handle Different Response Formats: Be prepared to handle both JSON and XML responses; see the sketch below.
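
As an illustration of the last point, a hedged sketch that branches on the Content-Type header (the URL is a placeholder; HTTParty's parsed_response can also auto-detect both formats):

require 'httparty'
require 'json'
require 'nokogiri' # for XML parsing

def parse_api_response(url)
  response = HTTParty.get(url)
  return nil unless response.success?

  case response.headers['content-type']
  when %r{application/json}
    JSON.parse(response.body)
  when /xml/
    Nokogiri::XML(response.body)
  else
    response.body # fall back to the raw body
  end
end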

When to Use API Scraping vs Web Scraping

API scraping is preferable when:

  • The service provides a documented API
  • You need structured, reliable data
  • Real-time data updates are important
  • The website's HTML structure changes frequently

However, when APIs are unavailable or limited, traditional web scraping techniques become necessary. For complex scenarios involving JavaScript-heavy applications, tools such as Puppeteer for handling AJAX requests can complement your API scraping strategy.

Command Line Tools

You can also use command line tools for quick API testing and scraping:

# Using curl to test API endpoints
curl -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     https://api.example.com/users

# Using HTTPie for more user-friendly API testing
http GET https://api.example.com/users Authorization:"Bearer YOUR_TOKEN"

# Save response to file
curl -o users.json https://api.example.com/users

Conclusion

Ruby provides excellent tools for REST API scraping, from the built-in Net::HTTP to powerful gems like HTTParty and Faraday. By implementing proper authentication, error handling, rate limiting, and data processing, you can build robust and efficient API scrapers that respect service providers while gathering the data you need.

Remember to always review API terms of service, implement respectful scraping practices, and consider the legal and ethical implications of your data collection activities. With these techniques and best practices, you'll be well-equipped to handle most API scraping scenarios in Ruby.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
