How do I scrape data from REST APIs using Ruby?
Scraping data from REST APIs using Ruby is a fundamental skill for developers working with web services and data integration. Ruby provides several powerful libraries and approaches for making HTTP requests, handling responses, and processing API data efficiently.
Understanding REST API Scraping vs Web Scraping
REST API scraping differs from traditional web scraping in that you're working with structured data endpoints rather than parsing HTML content. APIs typically return JSON or XML data, making data extraction more straightforward and reliable than parsing HTML markup.
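For example, a JSON response parses directly into Ruby hashes and arrays, so extracting a field is a plain lookup rather than a CSS- or XPath-based search. A minimal illustration with a hard-coded payload (the field names are made up):
require 'json'

# Hypothetical JSON payload such as an API might return
payload = '{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}'

parsed = JSON.parse(payload)
parsed['users'].each { |user| puts "#{user['id']}: #{user['name']}" }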
Essential Ruby Libraries for API Scraping
1. Net::HTTP (Built-in)
Ruby's standard library includes Net::HTTP, which provides basic HTTP functionality:
require 'net/http'
require 'json'
require 'uri'
def fetch_api_data(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    JSON.parse(response.body)
  else
    puts "Error: #{response.code} - #{response.message}"
    nil
  end
end
# Example usage
data = fetch_api_data('https://api.example.com/users')
puts data
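Net::HTTP can also send query parameters and custom headers; here is a hedged sketch of the same idea (the endpoint and header values are placeholders):
require 'net/http'
require 'json'
require 'uri'

def fetch_with_params(url, params = {}, headers = {})
  uri = URI(url)
  uri.query = URI.encode_www_form(params)            # append ?key=value pairs
  request = Net::HTTP::Get.new(uri)
  headers.each { |key, value| request[key] = value } # set custom headers

  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body) if response.is_a?(Net::HTTPSuccess)
end

# Example usage (hypothetical endpoint)
data = fetch_with_params('https://api.example.com/users', { page: 1 }, { 'Accept' => 'application/json' })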
2. HTTParty (Recommended)
HTTParty is a popular gem that simplifies HTTP requests:
require 'httparty'
class ApiScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def self.get_users
    get('/users')
  end

  def self.get_user(id)
    get("/users/#{id}")
  end
end
# Usage
users = ApiScraper.get_users
user = ApiScraper.get_user(1)
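HTTParty also accepts per-request options such as query parameters and headers, which is often all you need for one-off calls (the endpoint below is a placeholder):
require 'httparty'

# Pass query parameters and headers per request (hypothetical endpoint)
response = HTTParty.get(
  'https://api.example.com/users',
  query: { page: 1, per_page: 50 },
  headers: { 'Accept' => 'application/json' }
)

puts response.code             # HTTP status as an integer
puts response.parsed_response  # body parsed according to its Content-Type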
3. Faraday (Advanced)
Faraday offers more flexibility and middleware support:
require 'faraday'
require 'json'
class AdvancedApiScraper
  def initialize(base_url)
    @conn = Faraday.new(url: base_url) do |faraday|
      faraday.request :json
      faraday.response :json
      faraday.adapter Faraday.default_adapter
    end
  end

  def get(endpoint)
    response = @conn.get(endpoint)
    response.body if response.success?
  end
end
# Usage
scraper = AdvancedApiScraper.new('https://api.example.com')
data = scraper.get('/users')
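Faraday also lets you attach default headers to the connection and pass query parameters per request. A sketch under the same assumptions as above (placeholder URL and token; the :json middleware ships with Faraday 2.x):
require 'faraday'

# Default headers set once on the connection (hypothetical token)
conn = Faraday.new(url: 'https://api.example.com',
                   headers: { 'Authorization' => 'Bearer your-api-token' }) do |faraday|
  faraday.response :json
  faraday.adapter Faraday.default_adapter
end

# Query parameters passed per request
response = conn.get('/users', { page: 1, per_page: 50 })
puts response.body if response.success?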
Handling Authentication
API Key Authentication
require 'httparty'
class AuthenticatedScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def initialize(api_key)
    @options = {
      headers: {
        'Authorization' => "Bearer #{api_key}",
        'Content-Type' => 'application/json'
      }
    }
  end

  def get_data(endpoint)
    self.class.get(endpoint, @options)
  end
end
# Usage
scraper = AuthenticatedScraper.new('your-api-key')
data = scraper.get_data('/protected-endpoint')
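Some providers expect the key as a query parameter instead of a header; the parameter name varies by API, so the one below is only illustrative:
require 'httparty'

# API key sent as a query parameter (parameter name is provider-specific)
response = HTTParty.get(
  'https://api.example.com/data',
  query: { api_key: 'your-api-key', format: 'json' }
)
puts response.parsed_response if response.success?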
OAuth 2.0 Authentication
require 'oauth2'
require 'json'

class OAuthScraper
  def initialize(client_id, client_secret, site)
    @client = OAuth2::Client.new(client_id, client_secret, site: site)
  end

  def get_access_token
    @client.client_credentials.get_token
  end

  def fetch_data(endpoint)
    token = get_access_token
    response = token.get(endpoint)
    JSON.parse(response.body)
  end
end
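A possible usage of the class above, assuming the provider supports the client-credentials grant (credentials and endpoint are placeholders). Note that fetch_data requests a fresh token on every call; caching the token until it expires would save a round trip:
# Usage (hypothetical credentials and endpoint)
scraper = OAuthScraper.new('client-id', 'client-secret', 'https://auth.example.com')
data = scraper.fetch_data('/api/v1/reports')
puts data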
Error Handling and Retries
Robust API scraping requires proper error handling:
require 'httparty'
require 'retries'
class RobustApiScraper
  include HTTParty

  def self.fetch_with_retry(url, max_retries = 3)
    with_retries(max_tries: max_retries,
                 base_sleep_seconds: 1,
                 max_sleep_seconds: 10) do
      response = get(url)

      case response.code
      when 200
        response.parsed_response
      when 429
        # Rate limited - wait longer
        sleep(60)
        raise 'Rate limited'
      when 500..599
        # Server error - retry
        raise "Server error: #{response.code}"
      else
        # Client error - don't retry
        puts "Client error: #{response.code}"
        return nil
      end
    end
  rescue => e
    puts "Failed after #{max_retries} retries: #{e.message}"
    nil
  end
end
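If you would rather not add the retries gem, Ruby's built-in retry keyword covers the same pattern; a minimal sketch with exponential backoff (the URL is a placeholder):
require 'httparty'

def fetch_with_backoff(url, max_attempts = 3)
  attempts = 0
  begin
    attempts += 1
    response = HTTParty.get(url)
    raise "Server error: #{response.code}" if response.code >= 500
    response.parsed_response
  rescue => e
    if attempts < max_attempts
      sleep(2**attempts)  # exponential backoff: 2, 4, 8 seconds
      retry
    end
    puts "Giving up after #{attempts} attempts: #{e.message}"
    nil
  end
end

# Usage (hypothetical endpoint)
data = fetch_with_backoff('https://api.example.com/flaky-endpoint')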
Rate Limiting and Respectful Scraping
Implement rate limiting to avoid overwhelming APIs:
require 'httparty'

class RateLimitedScraper
  def initialize(requests_per_second = 1)
    @min_interval = 1.0 / requests_per_second
    @last_request_time = Time.now - @min_interval  # allow the first request immediately
  end

  def make_request(url)
    sleep_time = @min_interval - (Time.now - @last_request_time)
    sleep(sleep_time) if sleep_time > 0

    @last_request_time = Time.now
    HTTParty.get(url)
  end
end
# Usage
scraper = RateLimitedScraper.new(2) # 2 requests per second
response = scraper.make_request('https://api.example.com/data')
Pagination Handling
Many APIs use pagination for large datasets:
class PaginatedScraper
  include HTTParty
  base_uri 'https://api.example.com'

  def fetch_all_pages(endpoint, per_page = 100)
    all_data = []
    page = 1

    loop do
      response = self.class.get(endpoint, query: {
        page: page,
        per_page: per_page
      })

      break unless response.success?

      data = response.parsed_response
      break if data['data'].empty?

      all_data.concat(data['data'])
      page += 1

      # Check if there are more pages
      break if page > data['total_pages']

      # Rate limiting
      sleep(0.5)
    end

    all_data
  end
end
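Not every API reports total_pages; some (GitHub among them) advertise the next page through an RFC 5988 Link response header instead. A hedged sketch of following that header, assuming each page's body is a JSON array:
require 'httparty'

# Follow Link headers of the form: <https://...?page=2>; rel="next"
def fetch_all_via_link_header(url)
  results = []
  next_url = url

  while next_url
    response = HTTParty.get(next_url)
    break unless response.success?

    results.concat(response.parsed_response)

    link = response.headers['link']
    match = link&.match(/<([^>]+)>;\s*rel="next"/)
    next_url = match && match[1]

    sleep(0.5) # be polite between pages
  end

  results
end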
Data Processing and Storage
Process and store the scraped data efficiently:
require 'csv'
require 'sqlite3'
class DataProcessor
  def initialize(db_path = 'scraped_data.db')
    @db = SQLite3::Database.new(db_path)
    create_tables
  end

  def create_tables
    @db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT,
        created_at DATETIME
      )
    SQL
  end

  def store_users(users_data)
    users_data.each do |user|
      @db.execute(
        "INSERT OR REPLACE INTO users (id, name, email, created_at) VALUES (?, ?, ?, ?)",
        # SQLite bind parameters need primitive values, so store the timestamp as a string
        [user['id'], user['name'], user['email'], Time.now.to_s]
      )
    end
  end

  def export_to_csv(filename = 'users.csv')
    CSV.open(filename, 'w', write_headers: true, headers: ['ID', 'Name', 'Email', 'Created At']) do |csv|
      @db.execute("SELECT * FROM users") do |row|
        csv << row
      end
    end
  end
end
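One way to wire the processor to a scraper from earlier sections; the sample records below are hard-coded placeholders standing in for real API data:
# Usage sketch with placeholder data
users_data = [
  { 'id' => 1, 'name' => 'Alice', 'email' => 'alice@example.com' },
  { 'id' => 2, 'name' => 'Bob',   'email' => 'bob@example.com' }
]

processor = DataProcessor.new('scraped_data.db')
processor.store_users(users_data)
processor.export_to_csv('users.csv')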
Concurrent API Scraping
For improved performance, use concurrent requests:
require 'concurrent' # provided by the concurrent-ruby gem
require 'httparty'

class ConcurrentScraper
  def initialize(max_threads = 10)
    @pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: max_threads,
      max_queue: 100
    )
  end

  def scrape_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @pool) do
        HTTParty.get(url)
      end
    end

    futures.map(&:value)
  end

  def shutdown
    @pool.shutdown
    @pool.wait_for_termination
  end
end
# Usage
scraper = ConcurrentScraper.new(5)
urls = ['https://api.example.com/users/1', 'https://api.example.com/users/2']
responses = scraper.scrape_urls(urls)
scraper.shutdown
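One caveat: Concurrent::Future swallows exceptions, and value simply returns nil for a failed request. If you need to know what went wrong, inspect rejected? and reason; a small sketch building on the urls array above:
# Inspect failures instead of silently collecting nils
futures = urls.map do |url|
  Concurrent::Future.execute { HTTParty.get(url) }
end

futures.each_with_index do |future, index|
  future.wait # block until the future settles
  if future.rejected?
    puts "#{urls[index]} failed: #{future.reason}"
  else
    puts "#{urls[index]} => HTTP #{future.value.code}"
  end
end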
Complete Example: GitHub API Scraper
Here's a comprehensive example that scrapes GitHub repositories:
require 'httparty'
require 'json'
class GitHubScraper
  include HTTParty
  base_uri 'https://api.github.com'

  def initialize(token = nil)
    @headers = {
      'User-Agent' => 'Ruby-API-Scraper',
      'Accept' => 'application/vnd.github.v3+json'
    }
    @headers['Authorization'] = "token #{token}" if token
  end

  def get_user_repos(username, per_page = 30)
    all_repos = []
    page = 1

    loop do
      response = self.class.get(
        "/users/#{username}/repos",
        headers: @headers,
        query: { per_page: per_page, page: page, sort: 'updated' }
      )

      break unless response.success?

      repos = response.parsed_response
      break if repos.empty?

      all_repos.concat(repos)
      page += 1

      # GitHub rate limiting
      sleep(1)
    end

    all_repos
  end

  def extract_repo_info(repos)
    repos.map do |repo|
      {
        name: repo['name'],
        description: repo['description'],
        language: repo['language'],
        stars: repo['stargazers_count'],
        forks: repo['forks_count'],
        url: repo['html_url'],
        updated_at: repo['updated_at']
      }
    end
  end
end
# Usage
scraper = GitHubScraper.new('your-github-token')
repos = scraper.get_user_repos('octocat')
repo_info = scraper.extract_repo_info(repos)
puts JSON.pretty_generate(repo_info)
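GitHub also exposes a /rate_limit endpoint that reports how much of your quota remains, which is worth checking before a long scrape (the User-Agent value is just the one used above):
require 'httparty'

# Check remaining GitHub API quota
response = HTTParty.get('https://api.github.com/rate_limit',
                        headers: { 'User-Agent' => 'Ruby-API-Scraper' })
core = response.parsed_response['resources']['core']
puts "#{core['remaining']} of #{core['limit']} requests remaining"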
Best Practices and Tips
Read API Documentation: Always review the API documentation for rate limits, authentication requirements, and response formats.
Use Appropriate User Agents: Set descriptive User-Agent headers to identify your application.
Implement Caching: Cache responses when appropriate to reduce API calls:
require 'redis'
require 'httparty'
require 'json'

class CachedScraper
  def initialize
    @redis = Redis.new
  end

  def fetch_with_cache(url, ttl = 3600)
    cached = @redis.get(url)
    return JSON.parse(cached) if cached

    response = HTTParty.get(url)
    if response.success?
      @redis.setex(url, ttl, response.body)
      response.parsed_response
    end
  end
end
Monitor API Usage: Track your API usage to stay within limits.
Handle Different Response Formats: Be prepared to handle both JSON and XML responses.
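For XML responses, a parser such as Nokogiri (or the standard library's REXML) takes the place of JSON.parse; a brief sketch with a hard-coded, made-up payload:
require 'nokogiri'

# Hypothetical XML payload such as an API might return
xml = '<users><user><id>1</id><name>Alice</name></user></users>'

doc = Nokogiri::XML(xml)
doc.xpath('//user').each do |node|
  puts "#{node.at_xpath('id').text}: #{node.at_xpath('name').text}"
end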
When to Use API Scraping vs Web Scraping
API scraping is preferable when:
- The service provides a documented API
- You need structured, reliable data
- Real-time data updates are important
- The website's HTML structure changes frequently
However, when APIs are unavailable or limited, traditional web scraping techniques become necessary. For complex scenarios involving JavaScript-heavy applications, tools like Puppeteer for handling AJAX requests can complement your API scraping strategy.
Command Line Tools
You can also use command line tools for quick API testing and scraping:
# Using curl to test API endpoints
curl -H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
https://api.example.com/users
# Using HTTPie for more user-friendly API testing
http GET https://api.example.com/users Authorization:"Bearer YOUR_TOKEN"
# Save response to file
curl -o users.json https://api.example.com/users
Conclusion
Ruby provides excellent tools for REST API scraping, from the built-in Net::HTTP to powerful gems like HTTParty and Faraday. By implementing proper authentication, error handling, rate limiting, and data processing, you can build robust and efficient API scrapers that respect service providers while gathering the data you need.
Remember to always review API terms of service, implement respectful scraping practices, and consider the legal and ethical implications of your data collection activities. With these techniques and best practices, you'll be well-equipped to handle most API scraping scenarios in Ruby.