How Do I Store Scraped Data in a Database Using PHP?

Storing scraped data in a database is a crucial step in any web scraping workflow. PHP provides several robust methods to connect to databases and insert scraped data efficiently. This guide covers the complete process of storing scraped data using PHP with popular database systems like MySQL and PostgreSQL.

Database Connection Methods

Using PDO (PHP Data Objects)

PDO is the recommended approach for database operations in PHP due to its security features and database portability:

<?php
class DatabaseConnection {
    private $host = 'localhost';
    private $dbname = 'scraped_data';
    private $username = 'your_username';
    private $password = 'your_password';
    private $pdo;

    public function __construct() {
        try {
            $dsn = "mysql:host={$this->host};dbname={$this->dbname};charset=utf8mb4";
            $this->pdo = new PDO($dsn, $this->username, $this->password, [
                PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
                PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
                PDO::ATTR_EMULATE_PREPARES => false
            ]);
        } catch (PDOException $e) {
            throw new Exception("Database connection failed: " . $e->getMessage());
        }
    }

    public function getConnection() {
        return $this->pdo;
    }
}
?>

Using MySQLi

For MySQL-specific applications, MySQLi offers both procedural and object-oriented interfaces:

<?php
$mysqli = new mysqli("localhost", "username", "password", "scraped_data");

if ($mysqli->connect_error) {
    die("Connection failed: " . $mysqli->connect_error);
}

// Set charset for proper handling of special characters
$mysqli->set_charset("utf8mb4");
?>
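Beyond connecting, MySQLi inserts should also use prepared statements. The following sketch assumes the connection above and the `products` table defined later in this guide; the values are placeholders:

```php
<?php
// Hypothetical example: assumes a reachable MySQL server and the
// `products` table from the schema section below.
$mysqli = new mysqli("localhost", "username", "password", "scraped_data");

if ($mysqli->connect_error) {
    die("Connection failed: " . $mysqli->connect_error);
}
$mysqli->set_charset("utf8mb4");

$name = "Wireless Headphones";
$price = 99.99;
$sourceUrl = "https://example.com/product/123";

// "sds" declares the bound types: string, double, string
$stmt = $mysqli->prepare(
    "INSERT INTO products (name, price, source_url) VALUES (?, ?, ?)"
);
$stmt->bind_param("sds", $name, $price, $sourceUrl);
$stmt->execute();

echo "Inserted row ID: " . $mysqli->insert_id;
$stmt->close();
?>
```

`bind_param` binds by reference, so the same statement can be re-executed in a loop with new variable values without re-preparing.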

Database Schema Design

Before storing data, create appropriate database tables to match your scraped data structure:

CREATE TABLE products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    price DECIMAL(10, 2),
    description TEXT,
    image_url VARCHAR(500),
    category VARCHAR(100),
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    source_url VARCHAR(500),
    INDEX idx_category (category),
    INDEX idx_scraped_at (scraped_at)
);

CREATE TABLE product_reviews (
    id INT AUTO_INCREMENT PRIMARY KEY,
    product_id INT,
    reviewer_name VARCHAR(255),
    rating INT,
    review_text TEXT,
    review_date DATE,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (product_id) REFERENCES products(id) ON DELETE CASCADE
);

Storing Single Records

Basic Insert with Prepared Statements

<?php
require_once 'database.php';

class DataStore {
    protected $db; // protected so subclasses (BatchDataStore, SmartDataStore, etc.) can reuse it

    public function __construct() {
        $dbConnection = new DatabaseConnection();
        $this->db = $dbConnection->getConnection();
    }

    public function insertProduct($data) {
        $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) 
                VALUES (:name, :price, :description, :image_url, :category, :source_url)";

        try {
            $stmt = $this->db->prepare($sql);
            $stmt->execute([
                ':name' => $data['name'],
                ':price' => $data['price'],
                ':description' => $data['description'],
                ':image_url' => $data['image_url'],
                ':category' => $data['category'],
                ':source_url' => $data['source_url']
            ]);

            return $this->db->lastInsertId();
        } catch (PDOException $e) {
            error_log("Database insert error: " . $e->getMessage());
            throw new Exception("Failed to insert product data");
        }
    }
}

// Example usage
$scraper = new DataStore();
$productData = [
    'name' => 'Wireless Headphones',
    'price' => 99.99,
    'description' => 'High-quality wireless headphones with noise cancellation',
    'image_url' => 'https://example.com/image.jpg',
    'category' => 'Electronics',
    'source_url' => 'https://example.com/product/123'
];

$productId = $scraper->insertProduct($productData);
echo "Product inserted with ID: " . $productId;
?>

Batch Inserting Multiple Records

For large datasets, batch inserts significantly improve performance:

<?php
class BatchDataStore extends DataStore {

    public function insertProductsBatch($products) {
        if (empty($products)) {
            return 0;
        }

        $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) VALUES ";
        $placeholders = [];
        $values = [];

        foreach ($products as $index => $product) {
            $placeholders[] = "(:name{$index}, :price{$index}, :description{$index}, :image_url{$index}, :category{$index}, :source_url{$index})";

            $values[":name{$index}"] = $product['name'];
            $values[":price{$index}"] = $product['price'];
            $values[":description{$index}"] = $product['description'];
            $values[":image_url{$index}"] = $product['image_url'];
            $values[":category{$index}"] = $product['category'];
            $values[":source_url{$index}"] = $product['source_url'];
        }

        $sql .= implode(', ', $placeholders);

        try {
            $this->db->beginTransaction();
            $stmt = $this->db->prepare($sql);
            $stmt->execute($values);
            $this->db->commit();

            return $stmt->rowCount();
        } catch (PDOException $e) {
            $this->db->rollBack();
            error_log("Batch insert error: " . $e->getMessage());
            throw new Exception("Failed to insert batch data");
        }
    }
}

// Example usage
$batchStore = new BatchDataStore();
$products = [
    ['name' => 'Product 1', 'price' => 29.99, /* ... other fields */],
    ['name' => 'Product 2', 'price' => 39.99, /* ... other fields */],
    // ... more products
];

$insertedCount = $batchStore->insertProductsBatch($products);
echo "Inserted {$insertedCount} products";
?>

Handling Duplicate Data

Implement duplicate detection and handling strategies:

<?php
class SmartDataStore extends DataStore {

    public function insertOrUpdateProduct($data) {
        // Check if product already exists
        $existing = $this->findProductByUrl($data['source_url']);

        if ($existing) {
            return $this->updateProduct($existing['id'], $data);
        } else {
            return $this->insertProduct($data);
        }
    }

    private function findProductByUrl($url) {
        $sql = "SELECT * FROM products WHERE source_url = :url LIMIT 1";
        $stmt = $this->db->prepare($sql);
        $stmt->execute([':url' => $url]);
        return $stmt->fetch();
    }

    private function updateProduct($id, $data) {
        $sql = "UPDATE products SET 
                name = :name, 
                price = :price, 
                description = :description, 
                image_url = :image_url, 
                category = :category,
                scraped_at = CURRENT_TIMESTAMP
                WHERE id = :id";

        $stmt = $this->db->prepare($sql);
        return $stmt->execute([
            ':id' => $id,
            ':name' => $data['name'],
            ':price' => $data['price'],
            ':description' => $data['description'],
            ':image_url' => $data['image_url'],
            ':category' => $data['category']
        ]);
    }

    // Alternative: INSERT ... ON DUPLICATE KEY UPDATE (MySQL).
    // Note: requires a UNIQUE index on source_url to detect duplicates.
    public function upsertProduct($data) {
        $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) 
                VALUES (:name, :price, :description, :image_url, :category, :source_url)
                ON DUPLICATE KEY UPDATE 
                name = VALUES(name),
                price = VALUES(price),
                description = VALUES(description),
                image_url = VALUES(image_url),
                category = VALUES(category),
                scraped_at = CURRENT_TIMESTAMP";

        $stmt = $this->db->prepare($sql);
        return $stmt->execute([
            ':name' => $data['name'],
            ':price' => $data['price'],
            ':description' => $data['description'],
            ':image_url' => $data['image_url'],
            ':category' => $data['category'],
            ':source_url' => $data['source_url']
        ]);
    }
}
?>

Data Validation and Sanitization

Always validate and sanitize data before database insertion:

<?php
class DataValidator {

    public static function validateProduct($data) {
        $errors = [];

        // Required fields validation
        if (empty($data['name'])) {
            $errors[] = "Product name is required";
        }

        if (empty($data['source_url']) || !filter_var($data['source_url'], FILTER_VALIDATE_URL)) {
            $errors[] = "Invalid source URL";
        }

        // Price validation
        if (isset($data['price']) && !is_numeric($data['price'])) {
            $errors[] = "Price must be numeric";
        }

        // Image URL validation
        if (!empty($data['image_url']) && !filter_var($data['image_url'], FILTER_VALIDATE_URL)) {
            $errors[] = "Invalid image URL";
        }

        return $errors;
    }

    public static function sanitizeProduct($data) {
        return [
            'name' => trim(strip_tags($data['name'])),
            'price' => isset($data['price']) ? (float)$data['price'] : null,
            'description' => isset($data['description']) ? trim($data['description']) : null,
            'image_url' => isset($data['image_url']) ? filter_var($data['image_url'], FILTER_SANITIZE_URL) : null,
            'category' => isset($data['category']) ? trim(strip_tags($data['category'])) : null,
            'source_url' => filter_var($data['source_url'], FILTER_SANITIZE_URL)
        ];
    }
}

// Usage with validation
$rawData = [
    'name' => '  <script>Wireless Headphones</script>  ',
    'price' => '99.99',
    'source_url' => 'https://example.com/product/123'
];

$errors = DataValidator::validateProduct($rawData);
if (!empty($errors)) {
    foreach ($errors as $error) {
        echo "Validation error: {$error}\n";
    }
} else {
    $cleanData = DataValidator::sanitizeProduct($rawData);
    $dataStore = new DataStore();
    $dataStore->insertProduct($cleanData);
}
?>

Working with Different Database Systems

PostgreSQL Configuration

<?php
// PostgreSQL connection
$dsn = "pgsql:host=localhost;dbname=scraped_data;port=5432";
$pdo = new PDO($dsn, $username, $password, [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC
]);

// PostgreSQL-specific features
class PostgreSQLDataStore {
    private $pdo;

    public function __construct(PDO $pdo) {
        $this->pdo = $pdo;
    }

    public function insertProductWithReturning($data) {
        // PDO uses ? placeholders, not PostgreSQL's native $1, $2 syntax
        $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) 
                VALUES (?, ?, ?, ?, ?, ?) 
                RETURNING id, scraped_at";

        $stmt = $this->pdo->prepare($sql);
        // array_values() assumes $data keys are in column order
        $stmt->execute(array_values($data));
        return $stmt->fetch();
    }

    // Use UPSERT with PostgreSQL's ON CONFLICT
    // (requires a unique constraint on source_url)
    public function upsertProductPostgreSQL($data) {
        $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) 
                VALUES (?, ?, ?, ?, ?, ?)
                ON CONFLICT (source_url) DO UPDATE SET 
                name = EXCLUDED.name,
                price = EXCLUDED.price,
                description = EXCLUDED.description,
                image_url = EXCLUDED.image_url,
                category = EXCLUDED.category,
                scraped_at = CURRENT_TIMESTAMP";

        $stmt = $this->pdo->prepare($sql);
        return $stmt->execute(array_values($data));
    }
}
?>

Performance Optimization

Connection Pooling and Transactions

<?php
class OptimizedDataStore {
    private $pdo;
    private $batchSize = 1000;

    public function __construct(PDO $pdo) {
        $this->pdo = $pdo;
    }

    public function insertLargeDataset($products) {
        $totalInserted = 0;
        $batches = array_chunk($products, $this->batchSize);

        foreach ($batches as $batch) {
            try {
                $this->pdo->beginTransaction();

                $sql = "INSERT INTO products (name, price, description, image_url, category, source_url) VALUES ";
                $placeholders = [];
                $values = [];

                foreach ($batch as $product) {
                    $placeholders[] = "(?, ?, ?, ?, ?, ?)";
                    // array_values() assumes each product's keys are in column order;
                    // array_push with spread avoids array_merge's O(n^2) cost in a loop
                    array_push($values, ...array_values($product));
                }

                $sql .= implode(', ', $placeholders);
                $stmt = $this->pdo->prepare($sql);
                $stmt->execute($values);

                $this->pdo->commit();
                $totalInserted += count($batch);

                // Progress tracking
                echo "Inserted batch of " . count($batch) . " products. Total: {$totalInserted}\n";

            } catch (PDOException $e) {
                $this->pdo->rollBack();
                error_log("Batch insert failed: " . $e->getMessage());
                continue; // Skip this batch and continue with the next
            }
        }

        return $totalInserted;
    }
}
?>
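PHP itself has no built-in connection pool; the closest equivalent is PDO's persistent connections, which keep a connection open and reuse it across requests handled by the same worker process. A minimal sketch, with placeholder credentials:

```php
<?php
// PDO::ATTR_PERSISTENT reuses an existing connection instead of
// opening a new one on every request, cutting reconnection overhead
// for high-frequency scraping jobs.
$pdo = new PDO(
    "mysql:host=localhost;dbname=scraped_data;charset=utf8mb4",
    "your_username",
    "your_password",
    [
        PDO::ATTR_PERSISTENT => true,
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]
);
?>
```

Persistent connections hold server resources between requests, so keep the pool of PHP workers sized to your database's connection limit.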

Error Handling and Logging

<?php
// Requires Monolog: composer require monolog/monolog
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

class RobustDataStore extends DataStore {
    private $logger;

    public function __construct() {
        parent::__construct();
        $this->logger = new Logger('scraper');
        $this->logger->pushHandler(new StreamHandler('scraper.log', Logger::INFO));
    }

    public function safeInsertProduct($data) {
        try {
            $this->logger->info('Attempting to insert product', ['name' => $data['name']]);

            $productId = $this->insertProduct($data);

            $this->logger->info('Product inserted successfully', [
                'id' => $productId,
                'name' => $data['name']
            ]);

            return $productId;

        } catch (Exception $e) {
            $this->logger->error('Failed to insert product', [
                'error' => $e->getMessage(),
                'data' => $data
            ]);

            // Could implement retry logic here
            return false;
        }
    }
}
?>
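The retry logic mentioned in the comment above can be factored into a small helper. This is a sketch: `withRetries()` is a hypothetical function, not part of any library; it re-runs a callable until it succeeds or the attempt budget is exhausted.

```php
<?php
// Retry a callable up to $maxAttempts times, sleeping between failures.
// Rethrows the last exception if every attempt fails.
function withRetries(callable $operation, int $maxAttempts = 3, int $delaySeconds = 1) {
    $lastException = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation();
        } catch (Exception $e) {
            $lastException = $e;
            if ($attempt < $maxAttempts) {
                sleep($delaySeconds);
            }
        }
    }
    throw $lastException;
}

// Usage: this operation fails twice, then succeeds on the third attempt.
$calls = 0;
$result = withRetries(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new Exception("Transient failure");
    }
    return "inserted";
}, 3, 0);

echo $result . " after " . $calls . " attempts\n";
?>
```

In practice the callable would wrap `safeInsertProduct()` (or a raw `execute()`), so transient failures like deadlocks or dropped connections get a second chance before the record is abandoned.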

Best Practices

  1. Use Prepared Statements: Always use prepared statements to prevent SQL injection attacks
  2. Validate Input Data: Implement comprehensive validation before database operations
  3. Handle Duplicates: Plan for duplicate data scenarios in your scraping workflow
  4. Optimize Batch Operations: Use batch inserts for large datasets to improve performance
  5. Implement Error Handling: Use try-catch blocks and proper logging for debugging
  6. Use Transactions: Wrap related operations in database transactions for data consistency
  7. Index Strategic Columns: Create database indexes on frequently queried columns
  8. Monitor Performance: Track insertion rates and optimize queries as needed
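One concrete example of strategic indexing: the upsert examples in this guide match duplicates on `source_url`, which only works if that column carries a unique constraint. Assuming the `products` schema from earlier:

```sql
-- Required for INSERT ... ON DUPLICATE KEY UPDATE (MySQL) and
-- ON CONFLICT (source_url) (PostgreSQL) to detect duplicate rows.
ALTER TABLE products ADD CONSTRAINT uq_source_url UNIQUE (source_url);
```

The same constraint also speeds up the `findProductByUrl()` lookup, since it doubles as an index on the column.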

When scraping large amounts of data, combine these storage strategies with sound scraping techniques for handling dynamic content and extracting data efficiently.

By following these practices and examples, you can efficiently store scraped data in databases using PHP while maintaining data integrity and application performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
