How do I implement data deduplication in Go scraping?

Data deduplication is a crucial aspect of web scraping that prevents processing the same content multiple times, reduces storage requirements, and improves scraping efficiency. In Go, there are several effective approaches to implement deduplication, ranging from simple in-memory solutions to sophisticated database-backed systems.

Understanding Data Deduplication

Data deduplication in web scraping involves identifying and eliminating duplicate records based on specific criteria such as URLs, content hashes, or unique identifiers. This process is essential when scraping large datasets, crawling multiple pages, or running periodic scraping jobs.

Method 1: Hash Map-Based Deduplication

The simplest approach uses Go's built-in map data structure to track processed items using unique identifiers or content hashes.

URL-Based Deduplication

package main

import (
    "fmt"
    "sync"
)

type URLDeduplicator struct {
    visited map[string]bool
    mutex   sync.RWMutex
}

func NewURLDeduplicator() *URLDeduplicator {
    return &URLDeduplicator{
        visited: make(map[string]bool),
    }
}

func (d *URLDeduplicator) IsVisited(url string) bool {
    d.mutex.RLock()
    defer d.mutex.RUnlock()
    return d.visited[url]
}

func (d *URLDeduplicator) MarkVisited(url string) {
    d.mutex.Lock()
    defer d.mutex.Unlock()
    d.visited[url] = true
}

func (d *URLDeduplicator) ShouldProcess(url string) bool {
    // Check and mark under a single write lock so two goroutines cannot
    // both claim the same URL between the check and the update
    d.mutex.Lock()
    defer d.mutex.Unlock()

    if d.visited[url] {
        return false
    }
    d.visited[url] = true
    return true
}

// Usage example
func main() {
    deduplicator := NewURLDeduplicator()
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page1", // Duplicate
    }

    for _, url := range urls {
        if deduplicator.ShouldProcess(url) {
            fmt.Printf("Processing: %s\n", url)
            // Perform scraping logic here
        } else {
            fmt.Printf("Skipping duplicate: %s\n", url)
        }
    }
}

Content Hash-Based Deduplication

For deduplication based on the page content itself rather than the URL (an exact match after whitespace normalization):

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "net/http"
    "strings"
    "sync"
)

type ContentDeduplicator struct {
    hashes map[string]bool
    mutex  sync.RWMutex
}

func NewContentDeduplicator() *ContentDeduplicator {
    return &ContentDeduplicator{
        hashes: make(map[string]bool),
    }
}

func (d *ContentDeduplicator) GenerateHash(content string) string {
    // Collapse all whitespace so trivially reformatted pages hash identically
    normalized := strings.Join(strings.Fields(content), " ")
    hash := sha256.Sum256([]byte(normalized))
    return fmt.Sprintf("%x", hash)
}

func (d *ContentDeduplicator) IsDuplicate(content string) bool {
    hash := d.GenerateHash(content)
    d.mutex.RLock()
    defer d.mutex.RUnlock()
    return d.hashes[hash]
}

func (d *ContentDeduplicator) AddContent(content string) bool {
    hash := d.GenerateHash(content)
    d.mutex.Lock()
    defer d.mutex.Unlock()

    if d.hashes[hash] {
        return false // Duplicate found
    }

    d.hashes[hash] = true
    return true // New content added
}

// Scraping function with content deduplication
func scrapeWithDeduplication(url string, deduplicator *ContentDeduplicator) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }

    content := string(body)
    if deduplicator.AddContent(content) {
        fmt.Printf("New content found at %s\n", url)
        // Process unique content
        return processContent(content)
    } else {
        fmt.Printf("Duplicate content at %s\n", url)
        return nil
    }
}

func processContent(content string) error {
    // Implementation for processing unique content
    fmt.Printf("Processing %d bytes of unique content\n", len(content))
    return nil
}
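
// Usage example (added sketch): ties the pieces above together.
// The URLs are placeholders; any two responses with identical content
// will be reported as duplicates.
func main() {
    deduplicator := NewContentDeduplicator()
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page1", // Fetched twice; the second fetch is a duplicate
    }

    for _, url := range urls {
        if err := scrapeWithDeduplication(url, deduplicator); err != nil {
            fmt.Printf("Error scraping %s: %v\n", url, err)
        }
    }
}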

Method 2: Bloom Filter Implementation

Bloom filters provide memory-efficient probabilistic deduplication with configurable false positive rates:

package main

import (
    "crypto/sha256"
    "fmt"
    "math"
    "sync"
)

type BloomFilter struct {
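    // NOTE: []bool spends a whole byte per bit; a production filter would
    // pack bits into uint64 words to realize the memory savings.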
    bits     []bool
    size     uint
    hashFunc uint
    mutex    sync.RWMutex
}

func NewBloomFilter(expectedItems uint, falsePositiveRate float64) *BloomFilter {
    size := uint(-float64(expectedItems) * math.Log(falsePositiveRate) / (math.Log(2) * math.Log(2)))
    hashFunc := uint(float64(size) / float64(expectedItems) * math.Log(2))

    return &BloomFilter{
        bits:     make([]bool, size),
        size:     size,
        hashFunc: hashFunc,
    }
}

func (bf *BloomFilter) hash(data []byte, seed uint) uint {
    hash := sha256.Sum256(append(data, byte(seed)))
    result := uint(0)
    for i := 0; i < 4; i++ {
        result = result<<8 + uint(hash[i])
    }
    return result % bf.size
}

func (bf *BloomFilter) Add(data string) {
    bf.mutex.Lock()
    defer bf.mutex.Unlock()

    dataBytes := []byte(data)
    for i := uint(0); i < bf.hashFunc; i++ {
        index := bf.hash(dataBytes, i)
        bf.bits[index] = true
    }
}

func (bf *BloomFilter) Contains(data string) bool {
    bf.mutex.RLock()
    defer bf.mutex.RUnlock()

    dataBytes := []byte(data)
    for i := uint(0); i < bf.hashFunc; i++ {
        index := bf.hash(dataBytes, i)
        if !bf.bits[index] {
            return false
        }
    }
    return true
}

// Deduplicator using Bloom Filter
type BloomDeduplicator struct {
    filter *BloomFilter
    mutex  sync.Mutex
}

func NewBloomDeduplicator(expectedItems uint, falsePositiveRate float64) *BloomDeduplicator {
    return &BloomDeduplicator{
        filter: NewBloomFilter(expectedItems, falsePositiveRate),
    }
}

func (bd *BloomDeduplicator) ShouldProcess(url string) bool {
    // Check and add under one lock so concurrent callers cannot both
    // claim the same URL between Contains and Add
    bd.mutex.Lock()
    defer bd.mutex.Unlock()

    if bd.filter.Contains(url) {
        return false // Probably seen before (false positives are possible)
    }
    bd.filter.Add(url)
    return true
}

// Usage example
func main() {
    // Create bloom filter for 10,000 items with 1% false positive rate
    deduplicator := NewBloomDeduplicator(10000, 0.01)

    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page1", // Duplicate
    }

    for _, url := range urls {
        if deduplicator.ShouldProcess(url) {
            fmt.Printf("Processing: %s\n", url)
        } else {
            fmt.Printf("Likely duplicate: %s\n", url)
        }
    }
}

Method 3: Database-Backed Deduplication

For persistent deduplication across scraping sessions, use a database solution:

package main

import (
    "crypto/sha256"
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq"
)

type DatabaseDeduplicator struct {
    db *sql.DB
}

func NewDatabaseDeduplicator(dbURL string) (*DatabaseDeduplicator, error) {
    db, err := sql.Open("postgres", dbURL)
    if err != nil {
        return nil, err
    }

    // Create table if not exists
    _, err = db.Exec(`
        CREATE TABLE IF NOT EXISTS scraped_urls (
            id SERIAL PRIMARY KEY,
            url_hash VARCHAR(64) UNIQUE NOT NULL,
            url TEXT NOT NULL,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    `)
    if err != nil {
        return nil, err
    }

    return &DatabaseDeduplicator{db: db}, nil
}

func (dd *DatabaseDeduplicator) IsProcessed(url string) (bool, error) {
    hash := generateURLHash(url)
    var exists bool

    err := dd.db.QueryRow(
        "SELECT EXISTS(SELECT 1 FROM scraped_urls WHERE url_hash = $1)",
        hash,
    ).Scan(&exists)

    return exists, err
}

func (dd *DatabaseDeduplicator) MarkProcessed(url string) error {
    hash := generateURLHash(url)

    _, err := dd.db.Exec(
        "INSERT INTO scraped_urls (url_hash, url) VALUES ($1, $2) ON CONFLICT (url_hash) DO NOTHING",
        hash, url,
    )

    return err
}

func (dd *DatabaseDeduplicator) ShouldProcess(url string) (bool, error) {
    processed, err := dd.IsProcessed(url)
    if err != nil {
        return false, err
    }

    if processed {
        return false, nil
    }

    return true, dd.MarkProcessed(url)
}

func generateURLHash(url string) string {
    hash := sha256.Sum256([]byte(url))
    return fmt.Sprintf("%x", hash)
}

// Usage example
func main() {
    deduplicator, err := NewDatabaseDeduplicator("postgres://user:password@localhost/scraping_db?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }

    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page1", // Duplicate
    }

    for _, url := range urls {
        shouldProcess, err := deduplicator.ShouldProcess(url)
        if err != nil {
            log.Printf("Error checking URL %s: %v", url, err)
            continue
        }

        if shouldProcess {
            fmt.Printf("Processing: %s\n", url)
            // Perform scraping logic here
        } else {
            fmt.Printf("Skipping duplicate: %s\n", url)
        }
    }
}

Method 4: Advanced Deduplication with Custom Criteria

For complex deduplication requirements, implement custom comparison logic:

package main

import (
    "fmt"
    "sync"
)

type Product struct {
    Name        string  `json:"name"`
    Price       float64 `json:"price"`
    Description string  `json:"description"`
    SKU         string  `json:"sku"`
}

type ProductDeduplicator struct {
    products map[string]Product
    mutex    sync.RWMutex
}

func NewProductDeduplicator() *ProductDeduplicator {
    return &ProductDeduplicator{
        products: make(map[string]Product),
    }
}

func (pd *ProductDeduplicator) generateKey(product Product) string {
    // Use SKU as primary key, fallback to name+price combination
    if product.SKU != "" {
        return fmt.Sprintf("sku:%s", product.SKU)
    }
    return fmt.Sprintf("name_price:%s_%.2f", product.Name, product.Price)
}

func (pd *ProductDeduplicator) IsDuplicate(product Product) bool {
    key := pd.generateKey(product)
    pd.mutex.RLock()
    defer pd.mutex.RUnlock()

    existing, exists := pd.products[key]
    if !exists {
        return false
    }

    // Additional similarity check
    return pd.isSimilar(existing, product)
}

func (pd *ProductDeduplicator) isSimilar(p1, p2 Product) bool {
    // Custom similarity logic
    if p1.SKU != "" && p2.SKU != "" {
        return p1.SKU == p2.SKU
    }

    // Name and price-based comparison with tolerance
    nameMatch := p1.Name == p2.Name
    priceMatch := abs(p1.Price-p2.Price) < 0.01

    return nameMatch && priceMatch
}

func abs(x float64) float64 {
    if x < 0 {
        return -x
    }
    return x
}

func (pd *ProductDeduplicator) AddProduct(product Product) bool {
    key := pd.generateKey(product)

    // Check and insert under a single write lock; IsDuplicate is not
    // reused here because unlocking between the check and the insert
    // would allow two goroutines to add the same product
    pd.mutex.Lock()
    defer pd.mutex.Unlock()

    if existing, exists := pd.products[key]; exists && pd.isSimilar(existing, product) {
        return false
    }

    pd.products[key] = product
    return true
}

// Usage example
func main() {
    deduplicator := NewProductDeduplicator()

    products := []Product{
        {Name: "Laptop", Price: 999.99, SKU: "LP001"},
        {Name: "Mouse", Price: 29.99, Description: "Wireless mouse"},
        {Name: "Laptop", Price: 999.99, SKU: "LP001"}, // Duplicate by SKU
        {Name: "Mouse", Price: 29.99, Description: "Different description"}, // Duplicate by name+price
    }

    for i, product := range products {
        if deduplicator.AddProduct(product) {
            fmt.Printf("Added product %d: %s\n", i+1, product.Name)
        } else {
            fmt.Printf("Skipped duplicate product %d: %s\n", i+1, product.Name)
        }
    }
}

Performance Considerations

Memory Usage

  • Hash Maps: O(n) memory usage, fast lookups
  • Bloom Filters: Fixed memory usage, probabilistic results
  • Database: Persistent storage, slower but scalable

Concurrency Safety

All examples include proper mutex usage for concurrent access. For high-concurrency scenarios, consider using sync.Map or partitioned maps to reduce lock contention.
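
For example, here is a minimal sketch (illustrative names, not part of the examples above) that relies on sync.Map; LoadOrStore performs the check-and-insert as a single atomic step, so no explicit mutex is needed:

package main

import (
    "fmt"
    "sync"
)

// SyncMapDeduplicator sketches deduplication with sync.Map, which reduces
// lock contention when many goroutines check URLs concurrently.
type SyncMapDeduplicator struct {
    seen sync.Map
}

// ShouldProcess returns true only for the first caller to store a URL;
// LoadOrStore performs the check-and-insert atomically.
func (d *SyncMapDeduplicator) ShouldProcess(url string) bool {
    _, loaded := d.seen.LoadOrStore(url, struct{}{})
    return !loaded
}

func main() {
    var d SyncMapDeduplicator
    fmt.Println(d.ShouldProcess("https://example.com/page1")) // true
    fmt.Println(d.ShouldProcess("https://example.com/page1")) // false
}

sync.Map is optimized for keys that are written once and read many times, which matches the deduplication access pattern well.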

Optimization Tips

  1. Batch Database Operations: Group multiple deduplication checks into a single query (see the first sketch after this list)
  2. Use Appropriate Hash Functions: Choose hash functions that minimize collisions
  3. Implement TTL: Add time-based expiration for temporary deduplication (see the second sketch after this list)
  4. Monitor Memory Usage: Implement periodic cleanup for long-running scrapers
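
To illustrate tip 1, here is a sketch (not part of the examples above) that checks a whole batch of URLs against the scraped_urls table from Method 3 in one query. It assumes the same table schema and connection string, and imports lib/pq non-blank so that pq.Array can bind the slice parameter:

package main

import (
    "crypto/sha256"
    "database/sql"
    "fmt"
    "log"

    "github.com/lib/pq" // non-blank import: pq.Array is used below
)

func generateURLHash(url string) string {
    hash := sha256.Sum256([]byte(url))
    return fmt.Sprintf("%x", hash)
}

// FilterNew returns only the URLs whose hashes are not yet stored,
// using one round trip instead of one query per URL.
func FilterNew(db *sql.DB, urls []string) ([]string, error) {
    hashes := make([]string, 0, len(urls))
    byHash := make(map[string]string, len(urls))
    for _, u := range urls {
        h := generateURLHash(u)
        hashes = append(hashes, h)
        byHash[h] = u
    }

    rows, err := db.Query(
        "SELECT url_hash FROM scraped_urls WHERE url_hash = ANY($1)",
        pq.Array(hashes),
    )
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    seen := make(map[string]bool)
    for rows.Next() {
        var h string
        if err := rows.Scan(&h); err != nil {
            return nil, err
        }
        seen[h] = true
    }
    if err := rows.Err(); err != nil {
        return nil, err
    }

    fresh := make([]string, 0, len(byHash))
    for h, u := range byHash {
        if !seen[h] {
            fresh = append(fresh, u)
        }
    }
    return fresh, nil
}

func main() {
    db, err := sql.Open("postgres", "postgres://user:password@localhost/scraping_db?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    fresh, err := FilterNew(db, []string{
        "https://example.com/page1",
        "https://example.com/page2",
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("URLs to scrape:", fresh)
}

Tips 3 and 4 can be combined in a TTL-based deduplicator that forgets entries after a configurable duration; the following is a minimal in-memory sketch with illustrative names:

package main

import (
    "fmt"
    "sync"
    "time"
)

// TTLDeduplicator remembers a URL only for a fixed duration, so entries
// expire and memory stays bounded in long-running scrapers.
type TTLDeduplicator struct {
    seen map[string]time.Time
    ttl  time.Duration
    mu   sync.Mutex
}

func NewTTLDeduplicator(ttl time.Duration) *TTLDeduplicator {
    return &TTLDeduplicator{
        seen: make(map[string]time.Time),
        ttl:  ttl,
    }
}

// ShouldProcess returns true if the URL is unseen or its entry has expired.
func (d *TTLDeduplicator) ShouldProcess(url string) bool {
    d.mu.Lock()
    defer d.mu.Unlock()

    if t, ok := d.seen[url]; ok && time.Since(t) < d.ttl {
        return false
    }
    d.seen[url] = time.Now()
    return true
}

// Cleanup drops expired entries; call it periodically (tip 4).
func (d *TTLDeduplicator) Cleanup() {
    d.mu.Lock()
    defer d.mu.Unlock()
    for url, t := range d.seen {
        if time.Since(t) >= d.ttl {
            delete(d.seen, url)
        }
    }
}

func main() {
    d := NewTTLDeduplicator(24 * time.Hour)
    fmt.Println(d.ShouldProcess("https://example.com/page1")) // true
    fmt.Println(d.ShouldProcess("https://example.com/page1")) // false within the TTL
}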

Best Practices

  1. Choose the Right Method: Use hash maps for small datasets, bloom filters for memory-constrained environments, and databases for persistent deduplication
  2. Normalize Data: Standardize URLs and content before comparison (see the sketch after this list)
  3. Handle Edge Cases: Account for redirects, query parameters, and content variations
  4. Monitor Performance: Track deduplication effectiveness and adjust parameters as needed
  5. Implement Fallbacks: Have backup deduplication strategies for critical applications
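
As a sketch of practices 2 and 3 (an illustrative helper, not part of the examples above), the function below canonicalizes URLs before they reach a deduplicator: it lowercases the scheme and host, drops the fragment, strips common tracking parameters, and re-encodes the query string in sorted order so equivalent URLs produce the same key:

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// normalizeURL canonicalizes a URL so equivalent variants deduplicate
// to the same key.
func normalizeURL(raw string) (string, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return "", err
    }

    u.Scheme = strings.ToLower(u.Scheme)
    u.Host = strings.ToLower(u.Host)
    u.Fragment = "" // fragments never change the fetched content

    q := u.Query()
    for _, param := range []string{"utm_source", "utm_medium", "utm_campaign"} {
        q.Del(param)
    }
    // url.Values.Encode sorts keys, giving a stable parameter order.
    u.RawQuery = q.Encode()

    return u.String(), nil
}

func main() {
    a, _ := normalizeURL("https://Example.com/page1?utm_source=feed&b=2&a=1#top")
    b, _ := normalizeURL("https://example.com/page1?a=1&b=2")
    fmt.Println(a == b) // true: both normalize to the same string
}

Run the normalized string, rather than the raw URL, through whichever deduplicator you choose.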

Conclusion

Implementing effective data deduplication in Go scraping applications requires choosing the appropriate strategy based on your specific requirements. Whether using simple hash maps for small-scale operations or sophisticated database solutions for enterprise applications, proper deduplication significantly improves scraping efficiency and data quality. Consider factors like memory constraints, persistence requirements, and accuracy needs when selecting your approach.

For complex scraping scenarios involving JavaScript-heavy sites, you might want to explore how similar deduplication principles apply when handling dynamic content that loads after page load or when implementing concurrent scraping strategies.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
