How do you extract data from a PDF using Rust?

Extracting data from PDFs in Rust can be accomplished with several libraries designed for PDF parsing and text extraction. Two commonly used options are lopdf for low-level PDF manipulation and pdf-extract for simpler text extraction.

Method 1: Using pdf-extract (Recommended for Text Extraction)

The pdf-extract crate provides a simple interface for extracting text from PDF files.

Setup

Add the dependency to your Cargo.toml:

[dependencies]
pdf-extract = "0.7.0"

Basic Text Extraction

use pdf_extract::extract_text_from_mem;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read PDF file as bytes
    let bytes = fs::read("document.pdf")?;

    // Extract text from the in-memory PDF
    // (extract_text takes a file path, extract_text_from_mem takes bytes)
    let text = extract_text_from_mem(&bytes)?;

    println!("Extracted text:\n{}", text);
    Ok(())
}

Advanced Text Extraction with Error Handling

use pdf_extract::extract_text_from_mem;
use std::fs;
use std::path::Path;

fn extract_pdf_text<P: AsRef<Path>>(pdf_path: P) -> Result<String, Box<dyn std::error::Error>> {
    // Read the whole file, then parse it from memory
    let bytes = fs::read(pdf_path)?;
    let text = extract_text_from_mem(&bytes)?;
    Ok(text)
}

fn main() {
    match extract_pdf_text("document.pdf") {
        Ok(text) => {
            println!("Successfully extracted {} characters", text.len());

            // Process the text (split into lines, search for patterns, etc.)
            let lines: Vec<&str> = text.lines().collect();
            println!("Number of lines: {}", lines.len());

            // Find lines containing specific keywords
            let keywords = ["invoice", "total", "amount"];
            for line in lines {
                for keyword in keywords {
                    if line.to_lowercase().contains(keyword) {
                        println!("Found '{}': {}", keyword, line.trim());
                    }
                }
            }
        }
        Err(e) => eprintln!("Error extracting PDF: {}", e),
    }
}
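
The keyword scan above can also be pulled into a small pure function, which makes it testable without a PDF on disk. The following is a minimal sketch using only the standard library; the name find_keyword_lines and its return shape are illustrative choices, not part of pdf-extract.

```rust
/// Returns (1-based line number, matched keyword, trimmed line) for every
/// line of `text` that contains one of `keywords`, ignoring case.
/// `find_keyword_lines` is a hypothetical helper, not a pdf-extract API.
fn find_keyword_lines<'a>(
    text: &'a str,
    keywords: &[&'a str],
) -> Vec<(usize, &'a str, &'a str)> {
    let mut hits = Vec::new();
    for (index, line) in text.lines().enumerate() {
        // Lowercase once per line instead of once per keyword.
        let lowered = line.to_lowercase();
        for &keyword in keywords {
            if lowered.contains(&keyword.to_lowercase()) {
                hits.push((index + 1, keyword, line.trim()));
            }
        }
    }
    hits
}
```

Passing the extracted text and a keyword slice such as &["invoice", "total"] returns one tuple per match, so the caller decides whether to print, count, or store the hits.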

Method 2: Using lopdf (For Advanced PDF Manipulation)

The lopdf crate provides lower-level access to PDF structure and is better suited to complex PDF operations.

Setup

[dependencies]
lopdf = "0.30.0"

Basic Document Information

use lopdf::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load PDF document
    let doc = Document::load("document.pdf")?;

    // Get document info
    println!("PDF Version: {}", doc.version);
    println!("Number of pages: {}", doc.get_pages().len());

    // Get document metadata
    if let Ok(info) = doc.trailer.get(b"Info") {
        println!("Document info: {:?}", info);
    }

    Ok(())
}

Page-by-Page Text Extraction

use lopdf::Document;

fn extract_text_from_page(
    doc: &Document,
    page_number: u32,
) -> Result<String, Box<dyn std::error::Error>> {
    // Document::extract_text takes 1-based page numbers and decodes the
    // text operators in each page's content stream. Layout and spacing
    // are not preserved, so treat the result as an approximation.
    let text = doc.extract_text(&[page_number])?;
    Ok(text)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::load("document.pdf")?;

    // get_pages() maps 1-based page numbers to object IDs
    for (page_number, _page_id) in doc.get_pages() {
        println!("Processing page {}", page_number);

        if let Ok(text) = extract_text_from_page(&doc, page_number) {
            if !text.is_empty() {
                println!("Page {} text: {}", page_number, text);
            }
        }
    }

    Ok(())
}
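
When some pages fail to parse, it can help to record the failure and continue with the remaining pages rather than drop the error silently. The sketch below shows one way to do that with only the standard library. The per-page extractor is passed in as a closure, so extract_pages and PageResults are hypothetical names, and wiring in lopdf (for example a closure that calls the document's text extraction for one page and converts the error to a String) is an assumption left to the caller.

```rust
use std::collections::BTreeMap;

/// Outcome of a page-by-page run: text for pages that worked,
/// error messages for pages that did not.
#[derive(Debug, Default)]
struct PageResults {
    text_by_page: BTreeMap<u32, String>,
    failures: Vec<(u32, String)>,
}

/// Runs `extract_page` on each page number and keeps going after failures.
/// `extract_page` stands in for a real per-page extraction call.
fn extract_pages<F>(page_numbers: &[u32], mut extract_page: F) -> PageResults
where
    F: FnMut(u32) -> Result<String, String>,
{
    let mut results = PageResults::default();
    for &page_number in page_numbers {
        match extract_page(page_number) {
            // Skip pages that yield only whitespace (often image-only pages).
            Ok(text) if text.trim().is_empty() => {}
            Ok(text) => {
                results.text_by_page.insert(page_number, text);
            }
            Err(message) => results.failures.push((page_number, message)),
        }
    }
    results
}
```

Keeping failures alongside the successful pages lets a caller log or retry the bad pages after the run instead of losing them.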

Structured Data Extraction Example

Here's a practical example that extracts structured fields from an invoice. In addition to pdf-extract, it needs the regex crate in Cargo.toml (for example, regex = "1"):

use pdf_extract::extract_text_from_mem;
use regex::Regex;
use std::fs;

#[derive(Debug)]
struct InvoiceData {
    invoice_number: Option<String>,
    total_amount: Option<String>,
    date: Option<String>,
}

fn extract_invoice_data(pdf_path: &str) -> Result<InvoiceData, Box<dyn std::error::Error>> {
    // Read the file and parse it from memory
    let bytes = fs::read(pdf_path)?;
    let text = extract_text_from_mem(&bytes)?;

    // Define regex patterns for common invoice fields
    let invoice_regex = Regex::new(r"Invoice\s*#?:?\s*([A-Z0-9-]+)")?;
    let amount_regex = Regex::new(r"Total:?\s*\$?(\d+\.?\d*)")?;
    let date_regex = Regex::new(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{4})")?;

    let invoice_number = invoice_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());

    let total_amount = amount_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| format!("${}", m.as_str()));

    let date = date_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());

    Ok(InvoiceData {
        invoice_number,
        total_amount,
        date,
    })
}

fn main() {
    match extract_invoice_data("invoice.pdf") {
        Ok(data) => println!("Invoice data: {:?}", data),
        Err(e) => eprintln!("Error: {}", e),
    }
}
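
The total above is kept as a string. If the amount is needed for arithmetic, one option is to convert it to integer cents. The following sketch uses only the standard library and assumes US-style formatting (comma as thousands separator, period as decimal point, at most two decimal places); parse_amount_cents is a hypothetical helper, not part of any crate used here.

```rust
/// Converts an amount such as "$1,234.5" into integer cents (123450).
/// Returns None for empty input, non-digit characters, or more than two
/// decimal places. Assumes "," for thousands and "." for decimals.
fn parse_amount_cents(raw: &str) -> Option<u64> {
    // Drop surrounding whitespace, a leading "$", and thousands separators.
    let cleaned: String = raw
        .trim()
        .trim_start_matches('$')
        .chars()
        .filter(|c| *c != ',')
        .collect();

    // Split into whole and fractional parts; no "." means no cents.
    let (whole, fraction) = match cleaned.split_once('.') {
        Some((w, f)) => (w, f),
        None => (cleaned.as_str(), ""),
    };
    let all_digits = |s: &str| s.chars().all(|c| c.is_ascii_digit());
    if whole.is_empty() || !all_digits(whole) || !all_digits(fraction) || fraction.len() > 2 {
        return None;
    }

    // Right-pad the fraction with zeros so "5" is read as 50 cents.
    let cents: u64 = format!("{:0<2}", fraction).parse().ok()?;
    let dollars: u64 = whole.parse().ok()?;
    dollars.checked_mul(100)?.checked_add(cents)
}
```

Integer cents avoid the rounding surprises of floating-point values when totals are summed or compared.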

Alternative Libraries

For specialized use cases, consider these additional libraries:

pdfium-render: Uses Google's PDFium for high-quality rendering and text extraction
poppler-rs: Rust bindings for the Poppler PDF library
mupdf-rs: Bindings for MuPDF (published on crates.io as mupdf), well suited to complex documents

[dependencies]
pdfium-render = "0.8.0"
# or
poppler-rs = "0.23.0"

Important Considerations

PDF Complexity: PDFs can contain text as images, vector graphics, or encoded text. Simple text extraction may not work for all documents.
OCR Requirements: For image-based PDFs, you'll need OCR libraries like tesseract-rs.
Performance: For large PDF files, consider streaming approaches or processing pages individually.
Error Handling: Always implement robust error handling as PDF parsing can fail for malformed or encrypted documents.
Memory Usage: Large PDFs can consume significant memory. Monitor usage in production applications.
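
Two of the points above lend themselves to cheap checks around the parser: rejecting files that are clearly not PDFs before parsing, and flagging documents whose extracted text is too sparse to be real text. The sketch below uses only the standard library. Both function names are hypothetical, and the per-page character threshold is an arbitrary assumption that should be tuned on real documents.

```rust
/// Returns true if `bytes` starts with the "%PDF-" marker that opens a PDF.
/// This is a strict check: files with junk before the header are rejected.
fn has_pdf_header(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Heuristic: if extraction produced fewer than `min_chars_per_page`
/// non-whitespace characters per page on average, the document is probably
/// image-based and a candidate for OCR.
fn likely_needs_ocr(extracted_text: &str, page_count: usize, min_chars_per_page: usize) -> bool {
    if page_count == 0 {
        return false; // no pages, nothing to OCR
    }
    let visible_chars = extracted_text
        .chars()
        .filter(|c| !c.is_whitespace())
        .count();
    visible_chars < page_count.saturating_mul(min_chars_per_page)
}
```

A caller might run has_pdf_header on the raw bytes first, extract text, and then use likely_needs_ocr to decide whether to hand the file to an OCR step.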

The pdf-extract library is generally recommended for straightforward text extraction, while lopdf provides more control for complex PDF manipulation tasks.
