Yes, you can scrape data from PDFs using Ruby. The ecosystem offers several mature gems for PDF processing, from simple text extraction to pattern-based data parsing and OCR.
Popular Ruby PDF Processing Gems
1. pdf-reader (Most Popular)
The pdf-reader gem is the most widely used library for reading PDF files in Ruby.
2. prawn (For PDF Creation)
prawn is primarily a PDF generation library; it does not read existing PDFs, so it won't help with scraping, but it is worth knowing if your workflow also needs to produce PDFs.
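For context, a minimal prawn example showing what the gem is actually for (the output file name is arbitrary):

require 'prawn'

# Prawn builds new PDFs rather than reading existing ones
Prawn::Document.generate("hello.pdf") do
  text "Hello from Prawn"
end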
3. rtesseract (For OCR)
Used for extracting text from image-based PDFs using Tesseract OCR.
Installation
Add the gems to your Gemfile or install directly:
# Gemfile
gem 'pdf-reader'
gem 'rtesseract'  # Optional, for OCR (needs the Tesseract binary installed on your system)
gem 'mini_magick' # Optional, for converting PDF pages to images before OCR (needs ImageMagick)

# Or install directly
gem install pdf-reader
gem install rtesseract
gem install mini_magick
Basic Text Extraction
Simple Text Extraction
require 'pdf-reader'

def extract_text_from_pdf(file_path)
  reader = PDF::Reader.new(file_path)
  text_content = ""

  reader.pages.each_with_index do |page, index|
    puts "Processing page #{index + 1}..."
    text_content += page.text + "\n"
  end

  text_content
rescue PDF::Reader::MalformedPDFError => e
  puts "Error reading PDF: #{e.message}"
  nil
end

# Usage
pdf_text = extract_text_from_pdf("document.pdf")
puts pdf_text
Extracting Text from Specific Pages
require 'pdf-reader'

def extract_text_from_pages(file_path, page_numbers)
  reader = PDF::Reader.new(file_path)
  extracted_text = {}

  page_numbers.each do |page_num|
    if page_num.between?(1, reader.page_count)
      page = reader.pages[page_num - 1] # reader.pages is a 0-indexed array; PDF pages are numbered from 1
      extracted_text[page_num] = page.text
    end
  end

  extracted_text
end

# Extract text from pages 1, 3, and 5
text_by_page = extract_text_from_pages("document.pdf", [1, 3, 5])

text_by_page.each do |page_num, text|
  puts "Page #{page_num}:"
  puts text
  puts "-" * 50
end
Advanced Data Extraction
Pattern Matching and Data Extraction
require 'pdf-reader'

class PDFDataExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
  end

  def extract_emails
    email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    extract_pattern(email_pattern)
  end

  def extract_phone_numbers
    # Non-capturing groups so String#scan returns the full match rather than group fragments
    phone_pattern = /\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/
    extract_pattern(phone_pattern)
  end

  def extract_dates
    date_pattern = /\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}\b/
    extract_pattern(date_pattern)
  end

  def extract_currency
    currency_pattern = /\$[\d,]+\.?\d*/
    extract_pattern(currency_pattern)
  end

  private

  def extract_pattern(pattern)
    matches = []
    @reader.pages.each_with_index do |page, index|
      page.text.scan(pattern).each do |match|
        matches << { page: index + 1, text: match }
      end
    end
    matches
  end
end

# Usage
extractor = PDFDataExtractor.new("invoice.pdf")

puts "Emails found:"
extractor.extract_emails.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nPhone numbers found:"
extractor.extract_phone_numbers.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nDates found:"
extractor.extract_dates.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }
Structured Data Extraction
require 'pdf-reader'

class InvoiceExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
    @full_text = @reader.pages.map(&:text).join("\n")
  end

  def extract_invoice_data
    {
      invoice_number: extract_invoice_number,
      date: extract_date,
      total_amount: extract_total_amount,
      vendor: extract_vendor,
      line_items: extract_line_items
    }
  end

  private

  def extract_invoice_number
    match = @full_text.match(/Invoice\s*#?\s*:?\s*(\w+)/i)
    match ? match[1] : nil
  end

  def extract_date
    match = @full_text.match(/Date\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i)
    match ? match[1] : nil
  end

  def extract_total_amount
    match = @full_text.match(/Total\s*:?\s*\$?([\d,]+\.?\d*)/i)
    match ? match[1].gsub(',', '').to_f : nil
  end

  def extract_vendor
    # This depends on the specific format of your invoices:
    # identify the vendor line by position or keywords
    lines = @full_text.split("\n")
    lines.find { |line| line.match(/vendor|company|from/i) }
  end

  def extract_line_items
    # Extract table-like data -- this is format-specific
    items = []
    lines = @full_text.split("\n")

    lines.each do |line|
      # Example pattern for line items: Description Qty Price Total
      if (match = line.match(/(.+?)\s+(\d+)\s+\$?([\d,]+\.?\d*)\s+\$?([\d,]+\.?\d*)/))
        items << {
          description: match[1].strip,
          quantity: match[2].to_i,
          price: match[3].gsub(',', '').to_f,
          total: match[4].gsub(',', '').to_f
        }
      end
    end

    items
  end
end

# Usage
invoice = InvoiceExtractor.new("invoice.pdf")
data = invoice.extract_invoice_data
puts data.inspect
Handling Different PDF Types
Text-Based PDFs vs Image-Based PDFs
require 'pdf-reader'
require 'rtesseract'

class PDFProcessor
  def initialize(file_path)
    @file_path = file_path
    @reader = PDF::Reader.new(file_path)
  end

  def extract_text
    # First, try to extract text directly
    text = extract_text_directly

    # If there is little or no embedded text, the PDF is probably image-based
    if text.strip.empty? || text.length < 50
      puts "PDF appears to be image-based, attempting OCR..."
      extract_text_with_ocr
    else
      text
    end
  end

  private

  def extract_text_directly
    @reader.pages.map(&:text).join("\n")
  end

  def extract_text_with_ocr
    # Simplified example. For image-based PDFs you typically:
    #   1. Convert PDF pages to images (e.g. with ImageMagick via mini_magick)
    #   2. Run Tesseract OCR on each image
    # See the OCRPDFExtractor class further below for a fuller version.
    image_path = convert_pdf_to_image(@file_path)
    RTesseract.new(image_path).to_s
  rescue => e
    puts "OCR failed: #{e.message}"
    ""
  end

  def convert_pdf_to_image(pdf_path)
    # Placeholder -- a real implementation needs ImageMagick/mini_magick
    # (see OCRPDFExtractor#convert_pdf_to_images below)
    "converted_page.png"
  end
end
Error Handling and Best Practices
require 'pdf-reader'

class RobustPDFExtractor
  def initialize(file_path)
    @file_path = file_path
    validate_file
  end

  def extract_text_safely
    reader = PDF::Reader.new(@file_path)
    raise "Invalid PDF format" unless reader.respond_to?(:pages)

    text_content = ""

    reader.pages.each_with_index do |page, index|
      begin
        page_text = page.text
        text_content += "=== Page #{index + 1} ===\n#{page_text}\n\n"
      rescue => page_error
        puts "Error processing page #{index + 1}: #{page_error.message}"
        text_content += "=== Page #{index + 1} ===\n[Error reading page]\n\n"
      end
    end

    text_content
  rescue PDF::Reader::MalformedPDFError => e
    handle_malformed_pdf(e)
  rescue PDF::Reader::UnsupportedFeatureError => e
    handle_unsupported_feature(e)
  rescue => e
    handle_general_error(e)
  end

  def get_pdf_info
    reader = PDF::Reader.new(@file_path)
    {
      page_count: reader.page_count,
      pdf_version: reader.pdf_version,
      info: reader.info,
      metadata: reader.metadata
    }
  rescue => e
    puts "Error getting PDF info: #{e.message}"
    {}
  end

  private

  def validate_file
    unless File.exist?(@file_path)
      raise "File not found: #{@file_path}"
    end

    unless File.extname(@file_path).downcase == '.pdf'
      puts "Warning: File doesn't have .pdf extension"
    end

    if File.size(@file_path) == 0
      raise "File is empty: #{@file_path}"
    end
  end

  def handle_malformed_pdf(error)
    puts "PDF is malformed or corrupted: #{error.message}"
    puts "Try using a PDF repair tool or contact the document source."
    nil
  end

  def handle_unsupported_feature(error)
    puts "PDF contains unsupported features: #{error.message}"
    puts "This PDF may use advanced features not supported by pdf-reader."
    nil
  end

  def handle_general_error(error)
    puts "Unexpected error: #{error.message}"
    puts "Backtrace: #{error.backtrace.join("\n")}"
    nil
  end
end

# Usage
extractor = RobustPDFExtractor.new("document.pdf")
puts extractor.get_pdf_info

text = extractor.extract_text_safely
puts text if text
OCR for Image-Based PDFs
For PDFs that are scanned images (i.e. they have no embedded text layer), you'll need OCR:
require 'rtesseract'
require 'mini_magick' # For converting PDF pages to images
require 'tmpdir'      # For Dir.mktmpdir

class OCRPDFExtractor
  def initialize(file_path)
    @file_path = file_path
  end

  def extract_with_ocr
    # Convert PDF pages to images first (requires ImageMagick)
    image_files = convert_pdf_to_images
    extracted_text = ""

    image_files.each_with_index do |image_file, index|
      puts "Processing page #{index + 1} with OCR..."

      # The language (and other Tesseract options, such as the page segmentation
      # mode) can be tuned for better accuracy -- check your rtesseract version's
      # documentation for the exact option names.
      rtesseract = RTesseract.new(image_file, lang: 'eng')
      page_text = rtesseract.to_s

      extracted_text += "=== Page #{index + 1} ===\n#{page_text}\n\n"

      # Clean up the temporary image file
      File.delete(image_file) if File.exist?(image_file)
    end

    extracted_text
  rescue => e
    puts "OCR extraction failed: #{e.message}"
    nil
  end

  private

  def convert_pdf_to_images
    # Requires ImageMagick (plus Ghostscript for PDF support) to be installed.
    # Rasterize every page to a PNG at 300 DPI -- a high DPI improves OCR accuracy.
    temp_dir = Dir.mktmpdir
    output_pattern = File.join(temp_dir, "page-%03d.png")

    # MiniMagick's convert tool wrapper (MiniMagick::Tool::Convert in MiniMagick 4.x)
    MiniMagick::Tool::Convert.new do |convert|
      convert.density(300)
      convert << @file_path
      convert << output_pattern
    end

    # The %03d pattern produces one numbered PNG per page
    Dir[File.join(temp_dir, "page-*.png")].sort
  rescue => e
    puts "Error converting PDF to images: #{e.message}"
    []
  end
end
Performance Considerations
- Large PDFs: Process pages in batches for memory efficiency
- OCR: Very CPU-intensive, consider background processing
- Caching: Cache extracted text to avoid re-processing (see the sketch after this list)
- File validation: Always validate PDF files before processing
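For example, caching can be as simple as keying extracted text by the file's SHA-256 digest so unchanged PDFs are never re-parsed. A minimal sketch, assuming an on-disk cache; the CachedExtractor class and cache directory layout are illustrative, not part of pdf-reader:

require 'pdf-reader'
require 'digest'

class CachedExtractor
  def initialize(cache_dir: "pdf_text_cache")
    @cache_dir = cache_dir
    Dir.mkdir(@cache_dir) unless Dir.exist?(@cache_dir)
  end

  def extract(file_path)
    # Key the cache entry on the file contents, not the file name
    key = Digest::SHA256.file(file_path).hexdigest
    cache_path = File.join(@cache_dir, "#{key}.txt")
    return File.read(cache_path) if File.exist?(cache_path)

    text = PDF::Reader.new(file_path).pages.map(&:text).join("\n")
    File.write(cache_path, text)
    text
  end
end

# Usage: the second call reads from the cache instead of re-parsing the PDF
extractor = CachedExtractor.new
puts extractor.extract("document.pdf").length
puts extractor.extract("document.pdf").length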
Limitations and Considerations
- PDF Structure: Text extraction works best with text-based PDFs
- Complex Layouts: Tables and multi-column layouts may not preserve formatting
- Encrypted PDFs: Password-protected PDFs require additional handling (see the example after this list)
- OCR Accuracy: Image-based text extraction can be error-prone
- Performance: Large PDFs can consume significant memory and processing time
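For the encrypted-PDF case, pdf-reader accepts a :password option when opening a document. A rough sketch; the file name and password below are placeholders:

require 'pdf-reader'

begin
  reader = PDF::Reader.new("protected.pdf", password: "secret")
  puts reader.pages.first.text
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError => e
  puts "Could not read protected PDF: #{e.message}"
end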
Ruby provides excellent tools for PDF data extraction, but success depends on the PDF's structure and content format. Start with pdf-reader for text-based PDFs and add OCR capabilities when needed.