Yes, you can scrape data from PDFs using Ruby. The ecosystem offers several mature gems for PDF processing, from simple text extraction to pattern-based data parsing and OCR.
Popular Ruby PDF Processing Gems
1. pdf-reader (Most Popular)
The pdf-reader gem is the most widely used library for reading PDF files in Ruby.
2. prawn (For PDF Creation)
prawn is primarily a PDF generation library; it does not read existing PDFs, so it won't help with scraping, but it is worth knowing if your workflow also needs to produce PDFs.
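For context, a minimal prawn example showing what the gem is actually for (the output file name is arbitrary):

require 'prawn'

# Prawn builds new PDFs rather than reading existing ones
Prawn::Document.generate("hello.pdf") do
  text "Hello from Prawn"
end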
3. rtesseract (For OCR)
Used for extracting text from image-based PDFs using Tesseract OCR.
Installation
Add the gems to your Gemfile or install directly:
# Gemfile
gem 'pdf-reader'
gem 'rtesseract'  # Optional, for OCR (needs the Tesseract binary installed on your system)
gem 'mini_magick' # Optional, for converting PDF pages to images before OCR (needs ImageMagick)

# Or install directly
gem install pdf-reader
gem install rtesseract
gem install mini_magick
Basic Text Extraction
Simple Text Extraction
require 'pdf-reader'

def extract_text_from_pdf(file_path)
  reader = PDF::Reader.new(file_path)
  text_content = ""

  reader.pages.each_with_index do |page, index|
    puts "Processing page #{index + 1}..."
    text_content += page.text + "\n"
  end

  text_content
rescue PDF::Reader::MalformedPDFError => e
  puts "Error reading PDF: #{e.message}"
  nil
end

# Usage
pdf_text = extract_text_from_pdf("document.pdf")
puts pdf_text
Extracting Text from Specific Pages
require 'pdf-reader'

def extract_text_from_pages(file_path, page_numbers)
  reader = PDF::Reader.new(file_path)
  extracted_text = {}

  page_numbers.each do |page_num|
    if page_num.between?(1, reader.page_count)
      page = reader.pages[page_num - 1] # reader.pages is a 0-indexed array; PDF pages are numbered from 1
      extracted_text[page_num] = page.text
    end
  end

  extracted_text
end

# Extract text from pages 1, 3, and 5
text_by_page = extract_text_from_pages("document.pdf", [1, 3, 5])

text_by_page.each do |page_num, text|
  puts "Page #{page_num}:"
  puts text
  puts "-" * 50
end
Advanced Data Extraction
Pattern Matching and Data Extraction
require 'pdf-reader'

class PDFDataExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
  end

  def extract_emails
    email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    extract_pattern(email_pattern)
  end

  def extract_phone_numbers
    # Non-capturing groups so String#scan returns the full match rather than group fragments
    phone_pattern = /\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/
    extract_pattern(phone_pattern)
  end

  def extract_dates
    date_pattern = /\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}\b/
    extract_pattern(date_pattern)
  end

  def extract_currency
    currency_pattern = /\$[\d,]+\.?\d*/
    extract_pattern(currency_pattern)
  end

  private

  def extract_pattern(pattern)
    matches = []
    @reader.pages.each_with_index do |page, index|
      page.text.scan(pattern).each do |match|
        matches << { page: index + 1, text: match }
      end
    end
    matches
  end
end

# Usage
extractor = PDFDataExtractor.new("invoice.pdf")

puts "Emails found:"
extractor.extract_emails.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nPhone numbers found:"
extractor.extract_phone_numbers.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nDates found:"
extractor.extract_dates.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }
Structured Data Extraction
require 'pdf-reader'

class InvoiceExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
    @full_text = @reader.pages.map(&:text).join("\n")
  end

  def extract_invoice_data
    {
      invoice_number: extract_invoice_number,
      date: extract_date,
      total_amount: extract_total_amount,
      vendor: extract_vendor,
      line_items: extract_line_items
    }
  end

  private

  def extract_invoice_number
    match = @full_text.match(/Invoice\s*#?\s*:?\s*(\w+)/i)
    match ? match[1] : nil
  end

  def extract_date
    match = @full_text.match(/Date\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i)
    match ? match[1] : nil
  end

  def extract_total_amount
    match = @full_text.match(/Total\s*:?\s*\$?([\d,]+\.?\d*)/i)
    match ? match[1].gsub(',', '').to_f : nil
  end

  def extract_vendor
    # This depends on the specific format of your invoices:
    # identify the vendor line by position or keywords
    lines = @full_text.split("\n")
    lines.find { |line| line.match(/vendor|company|from/i) }
  end

  def extract_line_items
    # Extract table-like data -- this is format-specific
    items = []
    lines = @full_text.split("\n")

    lines.each do |line|
      # Example pattern for line items: Description Qty Price Total
      if (match = line.match(/(.+?)\s+(\d+)\s+\$?([\d,]+\.?\d*)\s+\$?([\d,]+\.?\d*)/))
        items << {
          description: match[1].strip,
          quantity: match[2].to_i,
          price: match[3].gsub(',', '').to_f,
          total: match[4].gsub(',', '').to_f
        }
      end
    end

    items
  end
end

# Usage
invoice = InvoiceExtractor.new("invoice.pdf")
data = invoice.extract_invoice_data
puts data.inspect
Handling Different PDF Types
Text-Based PDFs vs Image-Based PDFs
require 'pdf-reader'
require 'rtesseract'

class PDFProcessor
  def initialize(file_path)
    @file_path = file_path
    @reader = PDF::Reader.new(file_path)
  end

  def extract_text
    # First, try to extract text directly
    text = extract_text_directly

    # If there is little or no embedded text, the PDF is probably image-based
    if text.strip.empty? || text.length < 50
      puts "PDF appears to be image-based, attempting OCR..."
      extract_text_with_ocr
    else
      text
    end
  end

  private

  def extract_text_directly
    @reader.pages.map(&:text).join("\n")
  end

  def extract_text_with_ocr
    # Simplified example. For image-based PDFs you typically:
    #   1. Convert PDF pages to images (e.g. with ImageMagick via mini_magick)
    #   2. Run Tesseract OCR on each image
    # See the OCRPDFExtractor class further below for a fuller version.
    image_path = convert_pdf_to_image(@file_path)
    RTesseract.new(image_path).to_s
  rescue => e
    puts "OCR failed: #{e.message}"
    ""
  end

  def convert_pdf_to_image(pdf_path)
    # Placeholder -- a real implementation needs ImageMagick/mini_magick
    # (see OCRPDFExtractor#convert_pdf_to_images below)
    "converted_page.png"
  end
end
Error Handling and Best Practices
require 'pdf-reader'

class RobustPDFExtractor
  def initialize(file_path)
    @file_path = file_path
    validate_file
  end

  def extract_text_safely
    reader = PDF::Reader.new(@file_path)
    raise "Invalid PDF format" unless reader.respond_to?(:pages)

    text_content = ""

    reader.pages.each_with_index do |page, index|
      begin
        page_text = page.text
        text_content += "=== Page #{index + 1} ===\n#{page_text}\n\n"
      rescue => page_error
        puts "Error processing page #{index + 1}: #{page_error.message}"
        text_content += "=== Page #{index + 1} ===\n[Error reading page]\n\n"
      end
    end

    text_content
  rescue PDF::Reader::MalformedPDFError => e
    handle_malformed_pdf(e)
  rescue PDF::Reader::UnsupportedFeatureError => e
    handle_unsupported_feature(e)
  rescue => e
    handle_general_error(e)
  end

  def get_pdf_info
    reader = PDF::Reader.new(@file_path)
    {
      page_count: reader.page_count,
      pdf_version: reader.pdf_version,
      info: reader.info,
      metadata: reader.metadata
    }
  rescue => e
    puts "Error getting PDF info: #{e.message}"
    {}
  end

  private

  def validate_file
    unless File.exist?(@file_path)
      raise "File not found: #{@file_path}"
    end

    unless File.extname(@file_path).downcase == '.pdf'
      puts "Warning: File doesn't have .pdf extension"
    end

    if File.size(@file_path) == 0
      raise "File is empty: #{@file_path}"
    end
  end

  def handle_malformed_pdf(error)
    puts "PDF is malformed or corrupted: #{error.message}"
    puts "Try using a PDF repair tool or contact the document source."
    nil
  end

  def handle_unsupported_feature(error)
    puts "PDF contains unsupported features: #{error.message}"
    puts "This PDF may use advanced features not supported by pdf-reader."
    nil
  end

  def handle_general_error(error)
    puts "Unexpected error: #{error.message}"
    puts "Backtrace: #{error.backtrace.join("\n")}"
    nil
  end
end

# Usage
extractor = RobustPDFExtractor.new("document.pdf")
puts extractor.get_pdf_info

text = extractor.extract_text_safely
puts text if text
OCR for Image-Based PDFs
For PDFs that are scanned images (i.e. they have no embedded text layer), you'll need OCR:
require 'rtesseract'
require 'mini_magick' # For converting PDF pages to images
require 'tmpdir'      # For Dir.mktmpdir

class OCRPDFExtractor
  def initialize(file_path)
    @file_path = file_path
  end

  def extract_with_ocr
    # Convert PDF pages to images first (requires ImageMagick)
    image_files = convert_pdf_to_images
    extracted_text = ""

    image_files.each_with_index do |image_file, index|
      puts "Processing page #{index + 1} with OCR..."

      # The language (and other Tesseract options, such as the page segmentation
      # mode) can be tuned for better accuracy -- check your rtesseract version's
      # documentation for the exact option names.
      rtesseract = RTesseract.new(image_file, lang: 'eng')
      page_text = rtesseract.to_s

      extracted_text += "=== Page #{index + 1} ===\n#{page_text}\n\n"

      # Clean up the temporary image file
      File.delete(image_file) if File.exist?(image_file)
    end

    extracted_text
  rescue => e
    puts "OCR extraction failed: #{e.message}"
    nil
  end

  private

  def convert_pdf_to_images
    # Requires ImageMagick (plus Ghostscript for PDF support) to be installed.
    # Rasterize every page to a PNG at 300 DPI -- a high DPI improves OCR accuracy.
    temp_dir = Dir.mktmpdir
    output_pattern = File.join(temp_dir, "page-%03d.png")

    # MiniMagick's convert tool wrapper (MiniMagick::Tool::Convert in MiniMagick 4.x)
    MiniMagick::Tool::Convert.new do |convert|
      convert.density(300)
      convert << @file_path
      convert << output_pattern
    end

    # The %03d pattern produces one numbered PNG per page
    Dir[File.join(temp_dir, "page-*.png")].sort
  rescue => e
    puts "Error converting PDF to images: #{e.message}"
    []
  end
end
Performance Considerations
- Large PDFs: Process pages in batches for memory efficiency
- OCR: Very CPU-intensive, consider background processing
- Caching: Cache extracted text to avoid re-processing (see the sketch after this list)
- File validation: Always validate PDF files before processing
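For example, caching can be as simple as keying extracted text by the file's SHA-256 digest so unchanged PDFs are never re-parsed. A minimal sketch, assuming an on-disk cache; the CachedExtractor class and cache directory layout are illustrative, not part of pdf-reader:

require 'pdf-reader'
require 'digest'

class CachedExtractor
  def initialize(cache_dir: "pdf_text_cache")
    @cache_dir = cache_dir
    Dir.mkdir(@cache_dir) unless Dir.exist?(@cache_dir)
  end

  def extract(file_path)
    # Key the cache entry on the file contents, not the file name
    key = Digest::SHA256.file(file_path).hexdigest
    cache_path = File.join(@cache_dir, "#{key}.txt")
    return File.read(cache_path) if File.exist?(cache_path)

    text = PDF::Reader.new(file_path).pages.map(&:text).join("\n")
    File.write(cache_path, text)
    text
  end
end

# Usage: the second call reads from the cache instead of re-parsing the PDF
extractor = CachedExtractor.new
puts extractor.extract("document.pdf").length
puts extractor.extract("document.pdf").length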
Limitations and Considerations
- PDF Structure: Text extraction works best with text-based PDFs
- Complex Layouts: Tables and multi-column layouts may not preserve formatting
- Encrypted PDFs: Password-protected PDFs require additional handling (see the example after this list)
- OCR Accuracy: Image-based text extraction can be error-prone
- Performance: Large PDFs can consume significant memory and processing time
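For the encrypted-PDF case, pdf-reader accepts a :password option when opening a document. A rough sketch; the file name and password below are placeholders:

require 'pdf-reader'

begin
  reader = PDF::Reader.new("protected.pdf", password: "secret")
  puts reader.pages.first.text
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError => e
  puts "Could not read protected PDF: #{e.message}"
end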
Ruby provides excellent tools for PDF data extraction, but success depends on the PDF's structure and content format. Start with pdf-reader for text-based PDFs and add OCR capabilities when needed.