What methods are available for parsing JSON responses with Mechanize?
When working with APIs and modern web applications, you'll frequently encounter JSON responses. Mechanize fetches the raw response, and combined with Ruby's built-in JSON library it gives you several effective ways to handle and parse JSON data. This guide covers the most practical approaches for working with JSON responses in your web scraping and automation projects.
Basic JSON Parsing with Mechanize
Mechanize treats JSON responses as `Mechanize::File` objects by default. You can access the raw JSON content through the response body and parse it using Ruby's built-in JSON library.
Simple JSON Response Parsing
```ruby
require 'mechanize'
require 'json'

agent = Mechanize.new
response = agent.get('https://api.example.com/users')

# Parse JSON from the response body
json_data = JSON.parse(response.body)
puts json_data['users'].first['name']
```
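The parsing step itself is plain Ruby and can be tried without any network call. A small sketch (the sample payload is made up for illustration) showing the default string keys and the `symbolize_names` option:

```ruby
require 'json'

# A sample payload standing in for an API response body
raw = '{"users":[{"name":"Ada","id":1}]}'

data = JSON.parse(raw)
puts data['users'].first['name']           # string keys by default

symbolized = JSON.parse(raw, symbolize_names: true)
puts symbolized[:users].first[:name]       # symbol keys instead
```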
Complete Example with Error Handling
```ruby
require 'mechanize'
require 'json'

class JSONScraper
  def initialize
    @agent = Mechanize.new
    configure_agent
  end

  def fetch_json(url)
    response = @agent.get(url)

    # Verify content type
    unless response.response['content-type']&.include?('application/json')
      puts "Warning: Response may not be JSON"
    end

    # Parse JSON safely
    JSON.parse(response.body)
  rescue JSON::ParserError => e
    puts "JSON parsing error: #{e.message}"
    nil
  rescue Mechanize::ResponseCodeError => e
    puts "HTTP error: #{e.response_code}"
    nil
  rescue => e
    puts "Unexpected error: #{e.message}"
    nil
  end

  private

  def configure_agent
    @agent.user_agent = 'Mozilla/5.0 (compatible; Ruby JSON Scraper)'
    @agent.request_headers = {
      'Accept' => 'application/json, text/plain, */*',
      'Content-Type' => 'application/json'
    }
  end
end

# Usage
scraper = JSONScraper.new
data = scraper.fetch_json('https://jsonplaceholder.typicode.com/posts')

if data
  puts "Fetched #{data.length} posts"
  puts "First post title: #{data.first['title']}"
end
```
Advanced JSON Parsing Techniques
Using Custom Content Type Handlers
You can configure Mechanize to automatically handle JSON responses with custom parsers:
```ruby
require 'mechanize'
require 'json'

# Create a custom JSON page class
class JSONPage < Mechanize::File
  def json
    @json ||= JSON.parse(body)
  rescue JSON::ParserError => e
    puts "Failed to parse JSON: #{e.message}"
    nil
  end

  def [](key)
    json&.dig(key)
  end

  def dig(*keys)
    json&.dig(*keys)
  end
end

# Configure Mechanize to use the custom parser for JSON content types
agent = Mechanize.new
agent.pluggable_parser['application/json'] = JSONPage
agent.pluggable_parser['text/json'] = JSONPage

# Now JSON responses are automatically wrapped in JSONPage
response = agent.get('https://api.example.com/data.json')

# Access JSON data directly
puts response.json['status']
puts response['users']
puts response.dig('data', 'items', 0, 'name')
```
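The `dig` delegation above relies on Ruby's `Hash#dig`, which walks nested keys and returns `nil` on a missing key instead of raising. A quick standalone check (the nested payload is illustrative):

```ruby
require 'json'

payload = JSON.parse('{"data":{"items":[{"name":"first"}]}}')

puts payload.dig('data', 'items', 0, 'name')    # walks hash -> array -> hash
puts payload.dig('data', 'missing', 0).inspect  # nil, no exception raised
```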
Handling Different JSON Response Formats
```ruby
require 'mechanize'
require 'json'

class FlexibleJSONHandler
  def initialize
    @agent = Mechanize.new
    setup_headers
  end

  def parse_response(url)
    response = @agent.get(url)
    content_type = response.response['content-type']

    case content_type
    when /application\/json/
      handle_json_response(response)
    when /text\/javascript/, /application\/javascript/
      handle_jsonp_response(response)
    when /text\/html/
      extract_json_from_html(response)
    else
      attempt_json_parse(response)
    end
  end

  private

  def setup_headers
    @agent.request_headers = {
      'Accept' => 'application/json, application/javascript, text/javascript, text/html, */*',
      'X-Requested-With' => 'XMLHttpRequest'
    }
  end

  def handle_json_response(response)
    JSON.parse(response.body)
  rescue JSON::ParserError => e
    puts "JSON parsing failed: #{e.message}"
    nil
  end

  def handle_jsonp_response(response)
    # Remove the JSONP callback wrapper, e.g. callback({...});
    # /m lets the payload span multiple lines
    json_match = response.body.match(/\w+\((.*)\)/m)
    return nil unless json_match

    JSON.parse(json_match[1])
  rescue JSON::ParserError => e
    puts "JSONP parsing failed: #{e.message}"
    nil
  end

  def extract_json_from_html(response)
    # response is a Mechanize::Page here, so Nokogiri's search is available
    # Look for JSON in script tags
    script_tags = response.search('script[type="application/json"]')

    script_tags.each do |script|
      begin
        return JSON.parse(script.text)
      rescue JSON::ParserError
        next
      end
    end

    # Look for JSON in data attributes
    elements_with_data = response.search('[data-json]')

    elements_with_data.each do |element|
      begin
        return JSON.parse(element['data-json'])
      rescue JSON::ParserError
        next
      end
    end

    nil
  end

  def attempt_json_parse(response)
    # Last resort: try to parse as JSON regardless of content type
    JSON.parse(response.body)
  rescue JSON::ParserError
    puts "Could not parse response as JSON"
    nil
  end
end
```
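The JSONP-unwrapping step in `handle_jsonp_response` is plain string work and can be exercised without a network call. A minimal standalone sketch using the same regex approach (the `unwrap_jsonp` helper is hypothetical, not part of Mechanize):

```ruby
require 'json'

# Strip a JSONP callback wrapper such as `cb({...});` and parse the payload
def unwrap_jsonp(body)
  match = body.match(/\w+\((.*)\)/m)
  return nil unless match
  JSON.parse(match[1])
rescue JSON::ParserError
  nil
end

puts unwrap_jsonp('handleData({"count": 2, "ok": true});').inspect
puts unwrap_jsonp('not jsonp at all').inspect  # nil
```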
Working with API Authentication
Bearer Token Authentication
```ruby
require 'mechanize'
require 'json'

class AuthenticatedAPIClient
  def initialize(token)
    @agent = Mechanize.new
    @token = token
    setup_authentication
  end

  def get_json(endpoint)
    response = @agent.get(endpoint)
    JSON.parse(response.body)
  rescue JSON::ParserError => e
    puts "JSON parsing error: #{e.message}"
    nil
  rescue Mechanize::UnauthorizedError
    puts "Authentication failed - check your token"
    nil
  end

  def post_json(endpoint, data)
    # Mechanize#post accepts a raw string body plus custom headers
    @agent.post(endpoint, data.to_json, {
      'Content-Type' => 'application/json'
    })
  end

  private

  def setup_authentication
    @agent.request_headers = {
      'Authorization' => "Bearer #{@token}",
      'Accept' => 'application/json',
      'Content-Type' => 'application/json'
    }
  end
end

# Usage
client = AuthenticatedAPIClient.new('your-api-token')
user_data = client.get_json('https://api.example.com/user/profile')
```
Basic Authentication with JSON APIs
```ruby
require 'mechanize'
require 'json'
require 'base64'

class BasicAuthJSONClient
  def initialize(username, password)
    @agent = Mechanize.new
    setup_basic_auth(username, password)
  end

  def fetch_json_data(url)
    response = @agent.get(url)

    # Validate response is JSON
    content_type = response.response['content-type']
    unless content_type&.include?('json')
      puts "Warning: Expected JSON, got #{content_type}"
    end

    JSON.parse(response.body)
  rescue Mechanize::UnauthorizedError
    puts "Authentication failed"
    nil
  rescue JSON::ParserError => e
    puts "JSON parsing failed: #{e.message}"
    response.body # Return raw body for debugging
  end

  private

  def setup_basic_auth(username, password)
    # Mechanize also supports agent.add_auth(uri, username, password)
    credentials = Base64.strict_encode64("#{username}:#{password}")
    @agent.request_headers = {
      'Authorization' => "Basic #{credentials}",
      'Accept' => 'application/json'
    }
  end
end
```
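The header built in `setup_basic_auth` is standard HTTP Basic auth (RFC 7617): the username and password joined by a colon, Base64-encoded. The encoding step can be checked in isolation (the credentials below are made up):

```ruby
require 'base64'

username = 'alice'
password = 's3cret'

# "user:pass" -> Base64 -> "Basic <token>" header value
credentials = Base64.strict_encode64("#{username}:#{password}")
header = "Basic #{credentials}"

puts header
```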
Handling Large JSON Responses
Streaming JSON Parser for Large Files
```ruby
require 'mechanize'
require 'json'

class StreamingJSONParser
  def initialize
    @agent = Mechanize.new
    configure_for_large_files
  end

  def parse_large_json(url, &block)
    response = @agent.get(url)

    # For very large JSON files, consider streaming
    if response.body.bytesize > 10_000_000 # 10MB
      parse_streaming(response.body, &block)
    else
      parse_standard(response.body, &block)
    end
  end

  private

  def configure_for_large_files
    @agent.read_timeout = 300 # 5 minutes for large files
    @agent.gzip_enabled = true # on by default; shown here for clarity
  end

  def parse_streaming(json_string, &block)
    # For extremely large JSON, you might want a true streaming parser
    # such as Yajl or Oj, but here's a chunked approach
    data = JSON.parse(json_string)

    if data.is_a?(Array)
      data.each_slice(1000) do |chunk|
        yield chunk
      end
    else
      yield data
    end
  rescue JSON::ParserError => e
    puts "Streaming JSON parse error: #{e.message}"
  end

  def parse_standard(json_string, &block)
    data = JSON.parse(json_string)
    yield data
  rescue JSON::ParserError => e
    puts "Standard JSON parse error: #{e.message}"
  end
end

# Usage
parser = StreamingJSONParser.new
parser.parse_large_json('https://api.example.com/large-dataset') do |data_chunk|
  # Process each chunk
  puts "Processing #{data_chunk.length} items"
end
```
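The chunking logic in `parse_streaming` comes down to `Enumerable#each_slice`. Here it is on its own, with a small array standing in for a large parsed payload:

```ruby
require 'json'

# A small array standing in for a large parsed JSON payload
data = JSON.parse('[1, 2, 3, 4, 5, 6, 7]')

chunks = []
data.each_slice(3) { |chunk| chunks << chunk }

chunks.each { |c| puts "Processing #{c.length} items" }
# chunks => [[1, 2, 3], [4, 5, 6], [7]]
```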
Error Handling and Validation
Comprehensive JSON Response Validation
```ruby
require 'mechanize'
require 'json'

class ValidatedJSONFetcher
  def initialize
    @agent = Mechanize.new
    setup_agent
  end

  def fetch_and_validate(url, expected_schema = nil)
    response = fetch_response(url)
    return nil unless response

    json_data = parse_json(response)
    return nil unless json_data

    if expected_schema
      validate_schema(json_data, expected_schema)
    else
      json_data
    end
  end

  private

  def setup_agent
    @agent.user_agent = 'Mozilla/5.0 (compatible; JSON Validator)'
    @agent.open_timeout = 15
    @agent.read_timeout = 30
  end

  def fetch_response(url)
    @agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    log_error("HTTP Error #{e.response_code} for #{url}")
    nil
  rescue Net::OpenTimeout, Net::ReadTimeout
    log_error("Timeout error for #{url}")
    nil
  rescue => e
    log_error("Unexpected error: #{e.message}")
    nil
  end

  def parse_json(response)
    # Validate content type
    content_type = response.response['content-type']
    unless content_type&.match?(/json/)
      log_error("Expected JSON content type, got: #{content_type}")
      return nil
    end

    JSON.parse(response.body)
  rescue JSON::ParserError => e
    log_error("JSON parsing failed: #{e.message}")
    log_error("Response body preview: #{response.body[0..200]}...")
    nil
  end

  def validate_schema(data, schema)
    # Basic schema validation
    schema.each do |key, expected_type|
      unless data.key?(key.to_s)
        log_error("Missing required key: #{key}")
        return nil
      end

      actual_value = data[key.to_s]
      unless actual_value.is_a?(expected_type)
        log_error("Type mismatch for #{key}: expected #{expected_type}, got #{actual_value.class}")
        return nil
      end
    end

    data
  end

  def log_error(message)
    puts "[ERROR] #{Time.now}: #{message}"
  end
end

# Usage with schema validation
fetcher = ValidatedJSONFetcher.new

schema = {
  id: Integer,
  name: String,
  email: String,
  active: TrueClass # or FalseClass
}

user_data = fetcher.fetch_and_validate(
  'https://api.example.com/user/123',
  schema
)
```
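The `validate_schema` check above is ordinary Hash inspection, so the same idea works standalone. A sketch with a hypothetical `valid_against?` helper (not part of Mechanize or the JSON library):

```ruby
# Hypothetical standalone version of the schema check above:
# every schema key must exist and hold a value of the expected type
def valid_against?(data, schema)
  schema.all? do |key, expected_type|
    data.key?(key.to_s) && data[key.to_s].is_a?(expected_type)
  end
end

schema = { id: Integer, name: String }

puts valid_against?({ 'id' => 1, 'name' => 'Ada' }, schema)     # true
puts valid_against?({ 'id' => 'one', 'name' => 'Ada' }, schema) # false: id is a String
```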
Integration with Modern Web Applications
When dealing with single-page applications or complex web interfaces, you might need more sophisticated approaches. For JavaScript-heavy applications, consider complementing Mechanize with a browser automation tool such as Puppeteer, which can handle AJAX requests directly, for comprehensive data extraction.
Working with API Endpoints Behind Authentication
```ruby
require 'mechanize'
require 'json'

class SessionBasedJSONAPI
  def initialize
    @agent = Mechanize.new
    @logged_in = false
  end

  def login_and_fetch_json(login_url, username, password, api_endpoint)
    perform_login(login_url, username, password) unless @logged_in
    fetch_authenticated_json(api_endpoint)
  end

  private

  def perform_login(login_url, username, password)
    login_page = @agent.get(login_url)
    form = login_page.form_with(name: 'login') || login_page.forms.first

    # Assumes fields named "username" and "password"; adjust to the actual form
    form.username = username
    form.password = password

    result = @agent.submit(form)

    # Crude success check: a successful login usually redirects away
    @logged_in = result.uri.path != login_page.uri.path
    raise "Login failed" unless @logged_in
  end

  def fetch_authenticated_json(endpoint)
    response = @agent.get(endpoint)
    JSON.parse(response.body)
  rescue JSON::ParserError => e
    puts "JSON parsing error: #{e.message}"
    nil
  end
end
```
Performance Optimization for JSON Processing
Efficient JSON Parsing with Memory Management
```ruby
require 'mechanize'
require 'json'

class OptimizedJSONProcessor
  def initialize
    @agent = Mechanize.new
    optimize_agent
  end

  def process_multiple_endpoints(urls)
    results = []

    urls.each_with_index do |url, index|
      puts "Processing #{index + 1}/#{urls.length}: #{url}"

      json_data = fetch_and_parse(url)

      if json_data
        # Process immediately to reduce memory usage
        processed_data = extract_essential_data(json_data)
        results << processed_data

        # Clear references to help GC
        json_data = nil
      end

      # Periodic garbage collection for long-running processes
      GC.start if (index + 1) % 10 == 0
    end

    results
  end

  private

  def optimize_agent
    @agent.gzip_enabled = true
    @agent.keep_alive = true
    @agent.user_agent = 'Mozilla/5.0 (compatible; Optimized JSON Processor)'
  end

  def fetch_and_parse(url)
    response = @agent.get(url)
    JSON.parse(response.body)
  rescue => e
    puts "Error processing #{url}: #{e.message}"
    nil
  end

  def extract_essential_data(json_data)
    # Extract only what you need to reduce memory usage
    {
      id: json_data['id'],
      title: json_data['title'],
      created_at: json_data['created_at']
      # Don't store the entire JSON object
    }
  end
end
```
Conclusion
Mechanize provides robust methods for parsing JSON responses, from simple API calls to complex authentication scenarios. The key approaches include:
- Basic parsing with Ruby's JSON library
- Custom content type handlers for automatic JSON processing
- Authentication integration for protected APIs
- Error handling and validation for robust applications
- Performance optimization for large-scale processing
While Mechanize excels at JSON API interactions, remember that for JavaScript-heavy applications requiring dynamic content loading, you may need to combine it with a browser automation tool such as Puppeteer, which can monitor network requests directly.
For complex web scraping projects requiring both JSON API access and JavaScript execution, consider using the WebScraping.AI API which provides comprehensive scraping capabilities including JavaScript rendering and automatic response parsing.