Extracting data from PDFs in Rust can be accomplished using several libraries designed for PDF parsing and text extraction. Two widely used options are lopdf for low-level PDF manipulation and pdf-extract for simpler text extraction.
Method 1: Using pdf-extract (Recommended for Text Extraction)
The pdf-extract crate provides a simple interface for extracting text from PDF files:
Setup
Add the dependency to your Cargo.toml:
[dependencies]
pdf-extract = "0.7.0"
Basic Text Extraction
use pdf_extract::extract_text_from_mem;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read PDF file as bytes
    let bytes = fs::read("document.pdf")?;
    // Extract text from the in-memory PDF
    // (extract_text takes a file path; extract_text_from_mem takes a byte slice)
    let text = extract_text_from_mem(&bytes)?;
    println!("Extracted text:\n{}", text);
    Ok(())
}
Advanced Text Extraction with Error Handling
use pdf_extract::extract_text_from_mem;
use std::fs;
use std::path::Path;

// Reads the file into memory and extracts its text, propagating I/O and parse errors
fn extract_pdf_text<P: AsRef<Path>>(pdf_path: P) -> Result<String, Box<dyn std::error::Error>> {
    let bytes = fs::read(pdf_path)?;
    let text = extract_text_from_mem(&bytes)?;
    Ok(text)
}

fn main() {
    match extract_pdf_text("document.pdf") {
        Ok(text) => {
            println!("Successfully extracted {} bytes of text", text.len());
            // Process the text (split into lines, search for patterns, etc.)
            let lines: Vec<&str> = text.lines().collect();
            println!("Number of lines: {}", lines.len());
            // Find lines containing specific keywords (case-insensitive)
            let keywords = ["invoice", "total", "amount"];
            for line in lines {
                let lowered = line.to_lowercase();
                for keyword in keywords {
                    if lowered.contains(keyword) {
                        println!("Found '{}': {}", keyword, line.trim());
                    }
                }
            }
        }
        Err(e) => eprintln!("Error extracting PDF: {}", e),
    }
}
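The keyword scan above can also be pulled into a small pure function, which makes it testable without a PDF on disk. The following is a minimal sketch using only the standard library; `find_keyword_lines` is a hypothetical helper name (not part of pdf-extract), and it assumes the keywords are already lowercase.

```rust
/// Returns (1-based line number, matched keyword, trimmed line) for every
/// line containing one of the keywords, ignoring the case of the line.
/// Hypothetical helper: keywords are assumed to be lowercase already.
fn find_keyword_lines<'t, 'k>(
    text: &'t str,
    keywords: &[&'k str],
) -> Vec<(usize, &'k str, &'t str)> {
    let mut hits = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        // Lowercase each line once, not once per keyword
        let lowered = line.to_lowercase();
        for &keyword in keywords {
            if lowered.contains(keyword) {
                hits.push((idx + 1, keyword, line.trim()));
            }
        }
    }
    hits
}

fn main() {
    // Stand-in for text returned by a PDF extraction call
    let text = "Invoice #A-100\nThank you\n  Total: $42.00  ";
    for (line_no, keyword, line) in find_keyword_lines(text, &["invoice", "total", "amount"]) {
        println!("line {}: found '{}' in: {}", line_no, keyword, line);
    }
}
```

Returning the matches instead of printing them lets the caller decide whether to log, store, or further parse each line.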
Method 2: Using lopdf (For Advanced PDF Manipulation)
The lopdf crate provides lower-level access to PDF structure and is better suited to complex PDF operations:
Setup
[dependencies]
lopdf = "0.30.0"
Basic Document Information
use lopdf::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load PDF document
    let doc = Document::load("document.pdf")?;
    // Get document info
    println!("PDF Version: {}", doc.version);
    println!("Number of pages: {}", doc.get_pages().len());
    // Get document metadata (the trailer's Info entry, usually an object reference)
    if let Ok(info) = doc.trailer.get(b"Info") {
        println!("Document info: {:?}", info);
    }
    Ok(())
}
Page-by-Page Text Extraction
use lopdf::Document;

// Extracts the text of a single page via lopdf's built-in Document::extract_text,
// which takes 1-based page numbers. This is a simplified approach: it decodes
// common text encodings only, so complex fonts or layouts may yield partial text.
fn extract_text_from_page(
    doc: &Document,
    page_number: u32,
) -> Result<String, Box<dyn std::error::Error>> {
    let text = doc.extract_text(&[page_number])?;
    Ok(text)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::load("document.pdf")?;
    // get_pages() maps 1-based page numbers to page object IDs
    for (page_number, _page_id) in doc.get_pages() {
        println!("Processing page {}", page_number);
        match extract_text_from_page(&doc, page_number) {
            Ok(text) if !text.is_empty() => {
                println!("Page {} text: {}", page_number, text);
            }
            Ok(_) => {}
            // Keep going so one bad page does not abort the whole document
            Err(e) => eprintln!("Page {} failed: {}", page_number, e),
        }
    }
    Ok(())
}
Structured Data Extraction Example
Here's a practical example for extracting structured data like invoices:
use pdf_extract::extract_text_from_mem;
use regex::Regex;
use std::fs;

#[derive(Debug)]
struct InvoiceData {
    invoice_number: Option<String>,
    total_amount: Option<String>,
    date: Option<String>,
}

fn extract_invoice_data(pdf_path: &str) -> Result<InvoiceData, Box<dyn std::error::Error>> {
    let bytes = fs::read(pdf_path)?;
    let text = extract_text_from_mem(&bytes)?;
    // Define regex patterns for common invoice fields
    let invoice_regex = Regex::new(r"Invoice\s*#?:?\s*([A-Z0-9-]+)")?;
    let amount_regex = Regex::new(r"Total:?\s*\$?(\d+\.?\d*)")?;
    let date_regex = Regex::new(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{4})")?;
    // Each field is None when its pattern does not match
    let invoice_number = invoice_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());
    let total_amount = amount_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| format!("${}", m.as_str()));
    let date = date_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());
    Ok(InvoiceData {
        invoice_number,
        total_amount,
        date,
    })
}

fn main() {
    match extract_invoice_data("invoice.pdf") {
        Ok(data) => println!("Invoice data: {:?}", data),
        Err(e) => eprintln!("Error: {}", e),
    }
}
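Captured fields are still strings, and totals usually need to become numbers before they can be summed or compared. The sketch below, using only the standard library, converts an amount string into integer cents to avoid floating-point rounding. `parse_amount_cents` is a hypothetical helper, and it assumes US-style formatting (comma thousands separators, a period before at most two decimal digits). Note that the amount pattern above captures digits and one period only, so a total written as 1,234.50 would need a pattern that also admits commas before this helper sees the full value.

```rust
/// Parses an amount such as "$1,234.50", "42" or "7.5" into integer cents.
/// Returns None for anything other than digits with an optional fractional
/// part of at most two digits. Hypothetical helper, US-style formatting assumed.
fn parse_amount_cents(raw: &str) -> Option<u64> {
    // Drop surrounding whitespace, a leading dollar sign, and thousands separators
    let cleaned: String = raw
        .trim()
        .trim_start_matches('$')
        .chars()
        .filter(|c| *c != ',')
        .collect();
    let (whole, frac) = match cleaned.split_once('.') {
        Some((w, f)) => (w, f),
        None => (cleaned.as_str(), ""),
    };
    if whole.is_empty() || frac.len() > 2 {
        return None;
    }
    let all_digits = |s: &str| s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(whole) || !all_digits(frac) {
        return None;
    }
    let whole_cents = whole.parse::<u64>().ok()?.checked_mul(100)?;
    let frac_cents = match frac.len() {
        0 => 0,
        1 => frac.parse::<u64>().ok()? * 10, // "7.5" means 50 cents
        _ => frac.parse::<u64>().ok()?,
    };
    whole_cents.checked_add(frac_cents)
}

fn main() {
    for raw in ["$1,234.50", "42", "7.5", "12.345", "n/a"] {
        println!("{:>10} -> {:?}", raw, parse_amount_cents(raw));
    }
}
```

Returning `Option` keeps the "field not found" and "field found but malformed" cases explicit for the caller.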
Alternative Libraries
For specialized use cases, consider these additional libraries:
pdfium-render: Uses Google's PDFium for high-quality rendering and text extraction
poppler-rs: Rust bindings for the Poppler PDF library
mupdf-rs: Bindings for MuPDF, excellent for complex documents
[dependencies]
pdfium-render = "0.8.0"
# or
poppler-rs = "0.23.0"
Important Considerations
PDF Complexity: PDFs can contain text as images, vector graphics, or encoded text. Simple text extraction may not work for all documents.
OCR Requirements: For image-based PDFs, you'll need OCR libraries like tesseract-rs.
Performance: For large PDF files, consider streaming approaches or processing pages individually.
Error Handling: Always implement robust error handling as PDF parsing can fail for malformed or encrypted documents.
Memory Usage: Large PDFs can consume significant memory. Monitor usage in production applications.
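One way to act on the first two points is to check whether extraction returned suspiciously little text before deciding to fall back to OCR. The following is a rough heuristic sketch, not a feature of any of the crates above: `probably_needs_ocr` and its threshold parameter are hypothetical, and a sensible threshold depends on the documents being processed.

```rust
/// Heuristic: treats a document as probably image-based when the extracted
/// text averages fewer than `min_chars_per_page` non-whitespace characters
/// per page. Hypothetical helper; the threshold is an assumption to tune.
fn probably_needs_ocr(extracted_text: &str, page_count: usize, min_chars_per_page: usize) -> bool {
    if page_count == 0 {
        return false; // no pages, nothing to OCR
    }
    let visible = extracted_text
        .chars()
        .filter(|c| !c.is_whitespace())
        .count();
    visible < page_count.saturating_mul(min_chars_per_page)
}

fn main() {
    // Stand-ins for the output of a text extraction call
    let scanned = "  \n\n ";
    let digital = "Invoice 123 Total 42.00 due on receipt";
    println!("scanned-looking: {}", probably_needs_ocr(scanned, 3, 20));
    println!("digital-looking: {}", probably_needs_ocr(digital, 1, 20));
}
```

A check like this only flags candidates; documents that mix scanned and digital pages would need the same test applied per page.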
The pdf-extract library is generally recommended for straightforward text extraction, while lopdf provides more control for complex PDF manipulation tasks.