How do you extract data from a PDF using Rust?

Rust has several libraries for PDF parsing and text extraction. The two most popular options are lopdf for low-level PDF manipulation and pdf-extract for straightforward text extraction.

Method 1: Using pdf-extract (Recommended for Text Extraction)

The pdf-extract crate provides a simple interface for extracting text from PDF files:

Setup

Add the dependency to your Cargo.toml:

[dependencies]
pdf-extract = "0.7.0"

Basic Text Extraction

use pdf_extract::extract_text_from_mem;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read PDF file as bytes
    let bytes = fs::read("document.pdf")?;

    // Extract text from the in-memory bytes
    // (extract_text takes a file path instead, if that is more convenient)
    let text = extract_text_from_mem(&bytes)?;

    println!("Extracted text:\n{}", text);
    Ok(())
}

Advanced Text Extraction with Error Handling

use pdf_extract::extract_text;
use std::path::Path;

fn extract_pdf_text<P: AsRef<Path>>(pdf_path: P) -> Result<String, Box<dyn std::error::Error>> {
    // extract_text opens the file at the given path and extracts its text
    let text = extract_text(pdf_path)?;
    Ok(text)
}

fn main() {
    match extract_pdf_text("document.pdf") {
        Ok(text) => {
            println!("Successfully extracted {} characters", text.len());

            // Process the text (split into lines, search for patterns, etc.)
            let lines: Vec<&str> = text.lines().collect();
            println!("Number of lines: {}", lines.len());

            // Find lines containing specific keywords
            let keywords = ["invoice", "total", "amount"];
            for line in lines {
                for keyword in keywords {
                    if line.to_lowercase().contains(keyword) {
                        println!("Found '{}': {}", keyword, line.trim());
                    }
                }
            }
        }
        Err(e) => eprintln!("Error extracting PDF: {}", e),
    }
}
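Extracted text often keeps awkward line breaks and runs of whitespace from the PDF's layout. Before keyword matching, it can help to normalize it first. Here is a std-only sketch; the `normalize_whitespace` helper is illustrative and not part of pdf-extract:

```rust
/// Collapse runs of whitespace into single spaces and drop blank lines.
/// Hypothetical helper for post-processing extracted PDF text.
fn normalize_whitespace(text: &str) -> Vec<String> {
    text.lines()
        .map(|line| line.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|line| !line.is_empty())
        .collect()
}

fn main() {
    let raw = "Invoice   #INV-001\n\n   Total:\t$99.50  \n";
    let lines = normalize_whitespace(raw);
    assert_eq!(lines, vec!["Invoice #INV-001", "Total: $99.50"]);
    println!("{:?}", lines);
}
```

Running the keyword search on normalized lines avoids misses caused by tabs or double spaces inside a line.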

Method 2: Using lopdf (For Advanced PDF Manipulation)

The lopdf crate provides lower-level access to PDF structure and is better for complex PDF operations:

Setup

[dependencies]
lopdf = "0.30.0"

Basic Document Information

use lopdf::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load PDF document
    let doc = Document::load("document.pdf")?;

    // Get document info
    println!("PDF Version: {}", doc.version);
    println!("Number of pages: {}", doc.get_pages().len());

    // Get document metadata
    if let Ok(info) = doc.trailer.get(b"Info") {
        println!("Document info: {:?}", info);
    }

    Ok(())
}

Page-by-Page Text Extraction

use lopdf::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::load("document.pdf")?;

    // get_pages() returns a BTreeMap mapping 1-based page numbers
    // to internal object IDs
    for (&page_number, _object_id) in doc.get_pages().iter() {
        // lopdf can extract the text of the given page numbers directly
        match doc.extract_text(&[page_number]) {
            Ok(text) if !text.trim().is_empty() => {
                println!("Page {} text: {}", page_number, text);
            }
            Ok(_) => println!("Page {} has no extractable text", page_number),
            Err(e) => eprintln!("Failed to extract page {}: {}", page_number, e),
        }
    }

    Ok(())
}

Structured Data Extraction Example

Here's a practical example for extracting structured data like invoices. It uses pdf-extract together with the regex crate, so add regex = "1" to your Cargo.toml as well:

use pdf_extract::extract_text;
use regex::Regex;

#[derive(Debug)]
struct InvoiceData {
    invoice_number: Option<String>,
    total_amount: Option<String>,
    date: Option<String>,
}

fn extract_invoice_data(pdf_path: &str) -> Result<InvoiceData, Box<dyn std::error::Error>> {
    // extract_text takes a file path and returns the document's text
    let text = extract_text(pdf_path)?;

    // Define regex patterns for common invoice fields
    let invoice_regex = Regex::new(r"Invoice\s*#?:?\s*([A-Z0-9-]+)")?;
    let amount_regex = Regex::new(r"Total:?\s*\$?(\d+\.?\d*)")?;
    let date_regex = Regex::new(r"Date:?\s*(\d{1,2}/\d{1,2}/\d{4})")?;

    let invoice_number = invoice_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());

    let total_amount = amount_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| format!("${}", m.as_str()));

    let date = date_regex
        .captures(&text)
        .and_then(|cap| cap.get(1))
        .map(|m| m.as_str().to_string());

    Ok(InvoiceData {
        invoice_number,
        total_amount,
        date,
    })
}

fn main() {
    match extract_invoice_data("invoice.pdf") {
        Ok(data) => println!("Invoice data: {:?}", data),
        Err(e) => eprintln!("Error: {}", e),
    }
}
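The regex captures come back as strings; if you need to do arithmetic on the total, convert it to an integer number of cents to avoid floating-point rounding. A std-only sketch (the `parse_amount_cents` helper is hypothetical, not from any crate):

```rust
/// Parse an amount string such as "1234.5" or "99" into integer cents.
/// Returns None for malformed input. Hypothetical helper for the
/// captured group of the Total regex above.
fn parse_amount_cents(s: &str) -> Option<i64> {
    let mut parts = s.splitn(2, '.');
    let whole: i64 = parts.next()?.parse().ok()?;
    let cents = match parts.next() {
        None => 0,
        Some(frac) => {
            // Pad the fractional part to two digits ("5" -> "50"),
            // then keep only the first two digits
            let padded = format!("{:0<2}", frac);
            padded.get(..2)?.parse::<i64>().ok()?
        }
    };
    Some(whole * 100 + cents)
}

fn main() {
    assert_eq!(parse_amount_cents("1234.5"), Some(123450));
    assert_eq!(parse_amount_cents("99"), Some(9900));
    assert_eq!(parse_amount_cents("12.34"), Some(1234));
    assert_eq!(parse_amount_cents("abc"), None);
    println!("ok");
}
```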

Alternative Libraries

For specialized use cases, consider these additional libraries:

  • pdfium-render: Uses Google's PDFium for high-quality rendering and text extraction
  • poppler-rs: Rust bindings for the Poppler PDF library
  • mupdf-rs: Bindings for MuPDF, excellent for complex documents

[dependencies]
pdfium-render = "0.8.0"
# or
poppler-rs = "0.23.0"

Important Considerations

  1. PDF Complexity: PDFs can contain text as images, vector graphics, or encoded text. Simple text extraction may not work for all documents.

  2. OCR Requirements: For image-based (scanned) PDFs, plain text extraction returns little or nothing; you'll need OCR, for example via Rust bindings to Tesseract.

  3. Performance: For large PDF files, consider streaming approaches or processing pages individually.

  4. Error Handling: Always implement robust error handling as PDF parsing can fail for malformed or encrypted documents.

  5. Memory Usage: Large PDFs can consume significant memory. Monitor usage in production applications.
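On point 4, it often pays to map lower-level failures onto a small domain error type so callers can react differently to encrypted documents versus parse failures. A std-only sketch; the `PdfError` enum and `classify_error` helper are illustrative, not part of pdf-extract or lopdf:

```rust
use std::fmt;

// Illustrative error type for a PDF-processing application.
#[derive(Debug, PartialEq)]
enum PdfError {
    Encrypted,         // password-protected document
    Malformed(String), // parser-level failure
    Io(String),        // file system failure
}

impl fmt::Display for PdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfError::Encrypted => write!(f, "document is encrypted"),
            PdfError::Malformed(msg) => write!(f, "malformed PDF: {}", msg),
            PdfError::Io(msg) => write!(f, "I/O error: {}", msg),
        }
    }
}

impl std::error::Error for PdfError {}

/// Map a lower-level error message onto the domain error.
/// Hypothetical classification based on message contents.
fn classify_error(msg: &str) -> PdfError {
    let lower = msg.to_lowercase();
    if lower.contains("encrypt") || lower.contains("password") {
        PdfError::Encrypted
    } else if lower.contains("no such file") || lower.contains("permission") {
        PdfError::Io(msg.to_string())
    } else {
        PdfError::Malformed(msg.to_string())
    }
}

fn main() {
    assert_eq!(classify_error("file is encrypted"), PdfError::Encrypted);
    assert!(matches!(classify_error("unexpected token"), PdfError::Malformed(_)));
    println!("ok");
}
```

With a type like this, the top level can, for example, prompt for a password on `Encrypted` but skip the file on `Malformed`.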

The pdf-extract library is generally recommended for straightforward text extraction, while lopdf provides more control for complex PDF manipulation tasks.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
