Is it possible to scrape data from PDFs using Ruby?

Yes, it is absolutely possible to scrape data from PDFs using Ruby. Ruby offers several powerful gems for PDF processing, from simple text extraction to advanced data parsing and OCR capabilities.

Popular Ruby PDF Processing Gems

1. pdf-reader (Most Popular)

The pdf-reader gem is the most widely used library for reading PDF files in Ruby.

2. prawn (For PDF Generation)

prawn is primarily a library for creating PDFs rather than reading them, but it pairs well with pdf-reader when you need to generate new documents from the data you extract.

3. rtesseract (For OCR)

Used for extracting text from image-based PDFs using Tesseract OCR.

Installation

Add the gems to your Gemfile or install directly:

# Gemfile
gem 'pdf-reader'
gem 'rtesseract'  # Optional, for OCR functionality

# Or install directly
gem install pdf-reader
gem install rtesseract

Basic Text Extraction

Simple Text Extraction

require 'pdf-reader'

def extract_text_from_pdf(file_path)
  reader = PDF::Reader.new(file_path)
  text_content = ""

  reader.pages.each_with_index do |page, index|
    puts "Processing page #{index + 1}..."
    text_content += page.text + "\n"
  end

  text_content
rescue PDF::Reader::MalformedPDFError => e
  puts "Error reading PDF: #{e.message}"
  nil
end

# Usage
pdf_text = extract_text_from_pdf("document.pdf")
puts pdf_text

Extracting Text from Specific Pages

require 'pdf-reader'

def extract_text_from_pages(file_path, page_numbers)
  reader = PDF::Reader.new(file_path)
  extracted_text = {}

  page_numbers.each do |page_num|
    if page_num.between?(1, reader.page_count)
      page = reader.pages[page_num - 1]  # the pages array is 0-indexed; page numbers are 1-based
      extracted_text[page_num] = page.text
    end
  end

  extracted_text
end

# Extract text from pages 1, 3, and 5
text_by_page = extract_text_from_pages("document.pdf", [1, 3, 5])
text_by_page.each do |page_num, text|
  puts "Page #{page_num}:"
  puts text
  puts "-" * 50
end

Advanced Data Extraction

Pattern Matching and Data Extraction

require 'pdf-reader'

class PDFDataExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
  end

  def extract_emails
    email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    extract_pattern(email_pattern)
  end

  def extract_phone_numbers
    # Non-capturing groups only, so String#scan returns complete numbers
    # instead of arrays of captured fragments
    phone_pattern = /\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/
    extract_pattern(phone_pattern)
  end

  def extract_dates
    date_pattern = /\b\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}\b/
    extract_pattern(date_pattern)
  end

  def extract_currency
    currency_pattern = /\$[\d,]+\.?\d*/
    extract_pattern(currency_pattern)
  end

  private

  def extract_pattern(pattern)
    matches = []
    @reader.pages.each_with_index do |page, index|
      page_matches = page.text.scan(pattern).flatten
      page_matches.each { |match| matches << { page: index + 1, text: match } }
    end
    matches
  end
end

# Usage
extractor = PDFDataExtractor.new("invoice.pdf")

puts "Emails found:"
extractor.extract_emails.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nPhone numbers found:"
extractor.extract_phone_numbers.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

puts "\nDates found:"
extractor.extract_dates.each { |match| puts "Page #{match[:page]}: #{match[:text]}" }

Structured Data Extraction

require 'pdf-reader'

class InvoiceExtractor
  def initialize(file_path)
    @reader = PDF::Reader.new(file_path)
    @full_text = @reader.pages.map(&:text).join("\n")
  end

  def extract_invoice_data
    {
      invoice_number: extract_invoice_number,
      date: extract_date,
      total_amount: extract_total_amount,
      vendor: extract_vendor,
      line_items: extract_line_items
    }
  end

  private

  def extract_invoice_number
    match = @full_text.match(/Invoice\s*#?\s*:?\s*(\w+)/i)
    match ? match[1] : nil
  end

  def extract_date
    match = @full_text.match(/Date\s*:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i)
    match ? match[1] : nil
  end

  def extract_total_amount
    match = @full_text.match(/Total\s*:?\s*\$?([\d,]+\.?\d*)/i)
    match ? match[1].gsub(',', '').to_f : nil
  end

  def extract_vendor
    # This would depend on the specific format of your invoices
    lines = @full_text.split("\n")
    # Logic to identify vendor based on position or keywords
    lines.find { |line| line.match(/vendor|company|from/i) }
  end

  def extract_line_items
    # Extract table-like data - this is format-specific
    items = []
    lines = @full_text.split("\n")

    lines.each do |line|
      # Example pattern for line items: Description Qty Price Total
      if match = line.match(/(.+?)\s+(\d+)\s+\$?([\d,]+\.?\d*)\s+\$?([\d,]+\.?\d*)/)
        items << {
          description: match[1].strip,
          quantity: match[2].to_i,
          price: match[3].gsub(',', '').to_f,
          total: match[4].gsub(',', '').to_f
        }
      end
    end

    items
  end
end

# Usage
invoice = InvoiceExtractor.new("invoice.pdf")
data = invoice.extract_invoice_data
puts data.inspect

Handling Different PDF Types

Text-Based PDFs vs Image-Based PDFs

require 'pdf-reader'
require 'rtesseract'

class PDFProcessor
  def initialize(file_path)
    @file_path = file_path
    @reader = PDF::Reader.new(file_path)
  end

  def extract_text
    # First, try to extract text directly
    text = extract_text_directly

    if text.strip.empty? || text.length < 50
      puts "PDF appears to be image-based, attempting OCR..."
      extract_text_with_ocr
    else
      text
    end
  end

  private

  def extract_text_directly
    @reader.pages.map(&:text).join("\n")
  end

  def extract_text_with_ocr
    # This is a simplified example - you'd need additional gems
    # like mini_magick to convert PDF pages to images first

    # For image-based PDFs, you'd typically:
    # 1. Convert PDF pages to images using ImageMagick
    # 2. Use Tesseract OCR to extract text from images

    begin
      # Assuming you have an image file extracted from the PDF
      image_path = convert_pdf_to_image(@file_path)
      rtesseract = RTesseract.new(image_path)
      rtesseract.to_s
    rescue => e
      puts "OCR failed: #{e.message}"
      ""
    end
  end

  def convert_pdf_to_image(pdf_path)
    # Placeholder - a working MiniMagick-based conversion is shown in the
    # "OCR for Image-Based PDFs" section below
    raise NotImplementedError, "PDF-to-image conversion is not implemented here"
  end
end

Error Handling and Best Practices

require 'pdf-reader'

class RobustPDFExtractor
  def initialize(file_path)
    @file_path = file_path
    validate_file
  end

  def extract_text_safely
    begin
      reader = PDF::Reader.new(@file_path)

      if reader.page_count.zero?
        raise "PDF contains no pages"
      end

      text_content = ""
      reader.pages.each_with_index do |page, index|
        begin
          page_text = page.text
          text_content += "=== Page #{index + 1} ===\n#{page_text}\n\n"
        rescue => page_error
          puts "Error processing page #{index + 1}: #{page_error.message}"
          text_content += "=== Page #{index + 1} ===\n[Error reading page]\n\n"
        end
      end

      text_content

    rescue PDF::Reader::MalformedPDFError => e
      handle_malformed_pdf(e)
    rescue PDF::Reader::UnsupportedFeatureError => e
      handle_unsupported_feature(e)
    rescue => e
      handle_general_error(e)
    end
  end

  def get_pdf_info
    reader = PDF::Reader.new(@file_path)
    {
      page_count: reader.page_count,
      pdf_version: reader.pdf_version,
      info: reader.info,
      metadata: reader.metadata
    }
  rescue => e
    puts "Error getting PDF info: #{e.message}"
    {}
  end

  private

  def validate_file
    unless File.exist?(@file_path)
      raise "File not found: #{@file_path}"
    end

    unless File.extname(@file_path).downcase == '.pdf'
      puts "Warning: File doesn't have .pdf extension"
    end

    if File.size(@file_path) == 0
      raise "File is empty: #{@file_path}"
    end
  end

  def handle_malformed_pdf(error)
    puts "PDF is malformed or corrupted: #{error.message}"
    puts "Try using a PDF repair tool or contact the document source."
    nil
  end

  def handle_unsupported_feature(error)
    puts "PDF contains unsupported features: #{error.message}"
    puts "This PDF may use advanced features not supported by pdf-reader."
    nil
  end

  def handle_general_error(error)
    puts "Unexpected error: #{error.message}"
    puts "Backtrace: #{error.backtrace.join("\n")}"
    nil
  end
end

# Usage
extractor = RobustPDFExtractor.new("document.pdf")
puts extractor.get_pdf_info
text = extractor.extract_text_safely
puts text if text

OCR for Image-Based PDFs

For PDFs that contain scanned images or are image-based, you'll need OCR:

require 'rtesseract'
require 'mini_magick'  # For image processing

class OCRPDFExtractor
  def initialize(file_path)
    @file_path = file_path
  end

  def extract_with_ocr
    # Convert PDF to images first (requires ImageMagick)
    image_files = convert_pdf_to_images

    extracted_text = ""
    image_files.each_with_index do |image_file, index|
      puts "Processing page #{index + 1} with OCR..."

      # Page segmentation mode 6 assumes a single uniform block of text,
      # which tends to work well for scanned documents
      rtesseract = RTesseract.new(image_file, lang: 'eng', psm: 6)

      page_text = rtesseract.to_s
      extracted_text += "=== Page #{index + 1} ===\n#{page_text}\n\n"

      # Clean up temporary image file
      File.delete(image_file) if File.exist?(image_file)
    end

    extracted_text
  rescue => e
    puts "OCR extraction failed: #{e.message}"
    nil
  end

  private

  def convert_pdf_to_images
    # Requires ImageMagick (plus Ghostscript for PDF support) to be installed
    base_name = File.basename(@file_path, '.pdf')
    temp_dir = Dir.mktmpdir

    # -density must come before the input file so each page is rasterized
    # at 300 DPI, which noticeably improves OCR accuracy
    MiniMagick::Tool::Convert.new do |convert|
      convert.density(300)
      convert << @file_path
      convert << File.join(temp_dir, "#{base_name}-%d.png")  # one PNG per page
    end

    Dir[File.join(temp_dir, "#{base_name}-*.png")]
      .sort_by { |path| path[/-(\d+)\.png\z/, 1].to_i }
  rescue => e
    puts "Error converting PDF to images: #{e.message}"
    []
  end
end

Performance Considerations

  • Large PDFs: Process pages in batches for memory efficiency
  • OCR: Very CPU-intensive, consider background processing
  • Caching: Cache extracted text to avoid re-processing
  • File validation: Always validate PDF files before processing
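The caching point can be sketched with a small content-keyed cache. This is an illustrative wrapper, not part of pdf-reader: the extraction step is passed in as a block, so the sketch stays library-agnostic and a changed file is automatically re-processed.

```ruby
require 'digest'

# A minimal extraction cache keyed on the file's content digest: an
# unchanged PDF is served from memory, a modified one is re-processed.
class ExtractionCache
  def initialize
    @store = {}
  end

  def fetch(file_path)
    key = Digest::SHA256.file(file_path).hexdigest
    @store[key] ||= yield(file_path)
  end
end

cache = ExtractionCache.new
# First call runs the block; later calls on the same content hit the cache:
# text = cache.fetch("document.pdf") do |path|
#   PDF::Reader.new(path).pages.map(&:text).join("\n")
# end
```

For long-lived processes you would persist the store to disk or a key-value store, but the digest-keyed lookup is the core idea.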

Limitations and Considerations

  1. PDF Structure: Text extraction works best with text-based PDFs
  2. Complex Layouts: Tables and multi-column layouts may not preserve formatting
  3. Encrypted PDFs: Password-protected PDFs require additional handling
  4. OCR Accuracy: Image-based text extraction can be error-prone
  5. Performance: Large PDFs can consume significant memory and processing time
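On the OCR-accuracy point, a lightweight post-processing pass can catch common recognition errors before you parse the text. The substitutions below are illustrative assumptions; tune them to the mistakes your scans actually produce.

```ruby
# A minimal OCR cleanup pass: collapses stray whitespace and fixes a few
# common character confusions that occur inside digit runs.
def clean_ocr_text(text)
  cleaned = text.gsub(/[ \t]+/, ' ')      # collapse runs of spaces/tabs
                .gsub(/\n{3,}/, "\n\n")   # collapse runs of blank lines
  # Between digits, OCR often reads 0 as O/o and 1 as l/I
  cleaned.gsub(/(?<=\d)[OolI](?=\d)/) do |c|
    { 'O' => '0', 'o' => '0', 'l' => '1', 'I' => '1' }[c]
  end
end

puts clean_ocr_text("Invoice  total:   $1,2O3.45\n\n\n\nPaid")
```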

Ruby provides excellent tools for PDF data extraction, but success depends on the PDF structure and content format. Start with pdf-reader for text-based PDFs and add OCR capabilities when needed.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
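The same call can be made from Ruby with the standard library. This sketch only builds the request URI from the parameters shown in the curl examples above (YOUR_API_KEY remains a placeholder); the actual HTTP request is left commented out.

```ruby
require 'net/http'
require 'uri'

# Builds the request URL for the /ai/question endpoint shown above;
# api_key is a placeholder for your real key.
def question_uri(url:, question:, api_key:)
  uri = URI('https://api.webscraping.ai/ai/question')
  uri.query = URI.encode_www_form(url: url, question: question, api_key: api_key)
  uri
end

uri = question_uri(url: 'https://example.com',
                   question: 'What is the main topic?',
                   api_key: 'YOUR_API_KEY')
# response = Net::HTTP.get_response(uri)  # performs the actual request
# puts response.body
```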
