How can I avoid scraping duplicate reviews from Trustpilot?

When scraping reviews from Trustpilot or any similar website, it's important to avoid duplicates: they inflate your dataset, waste storage, and skew any analysis built on the data. Here are some strategies you can use to avoid scraping duplicate reviews:

1. Check for Unique Identifiers

Most reviews on websites like Trustpilot have unique identifiers, such as a review ID. Always check for and record these identifiers. Before saving or processing a new review, compare its identifier against those you've already saved to ensure it's not a duplicate.
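
For example, here's a minimal sketch of ID-based deduplication that persists seen IDs between runs (the seen_ids.txt filename is just an assumption for illustration):

import os

SEEN_IDS_FILE = 'seen_ids.txt'  # hypothetical filename for persisted IDs

def load_seen_ids():
    # Load previously recorded review IDs, one per line
    if not os.path.exists(SEEN_IDS_FILE):
        return set()
    with open(SEEN_IDS_FILE) as f:
        return {line.strip() for line in f}

def is_new_review(review_id, seen_ids):
    # Record the ID and return True only if it hasn't been seen before
    if review_id in seen_ids:
        return False
    seen_ids.add(review_id)
    with open(SEEN_IDS_FILE, 'a') as f:
        f.write(review_id + '\n')
    return True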

2. Use URL Parameters

Trustpilot and similar sites often use pagination or filters in their URLs. Make sure you're correctly handling these parameters to avoid revisiting the same page.
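
For example, here's a sketch that walks Trustpilot-style ?page=N pagination while tracking visited URLs (the page parameter and the five-page limit are assumptions for illustration):

# Sketch of pagination handling; assumes a ?page=N query parameter
visited_urls = set()

def page_urls(base_url, max_pages):
    # Generate one URL per results page
    for page in range(1, max_pages + 1):
        yield f'{base_url}?page={page}'

for url in page_urls('https://www.trustpilot.com/review/example.com', 5):
    if url in visited_urls:
        continue  # Never fetch the same page twice
    visited_urls.add(url)
    # Fetch and parse the page here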

3. Hashing Content

You can create a hash of the review content (and any other unique fields) and compare it with hashes of reviews you've already scraped. This ensures that even if the unique identifier is missing or unreliable, you can still detect duplicates.
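
For example, here's a sketch that normalizes a few fields before hashing, so cosmetic differences (extra whitespace, letter case) don't defeat duplicate detection; the author/date/text fields are assumptions about what you extract:

import hashlib

def review_fingerprint(author, date, text):
    # Collapse whitespace and lowercase each field so cosmetic
    # differences don't produce different hashes for the same review
    canonical = '|'.join(' '.join(f.lower().split()) for f in (author, date, text))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

# Two cosmetically different copies of the same review collide, as intended
assert review_fingerprint('Jane', '2024-01-01', 'Great  service!') == \
       review_fingerprint('jane', '2024-01-01', 'Great service!')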

4. Timestamps and Sorting

If the website sorts reviews (e.g., by date), keep track of the timestamps of the reviews you've scraped. On subsequent scrapes, you can ignore reviews older than the most recent timestamp you've recorded.
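
For example, assuming reviews are sorted newest-first and expose an ISO 8601 timestamp (both assumptions about the page), a cutoff check might look like this:

from datetime import datetime, timezone

# Timestamp of the newest review recorded on the previous run
last_seen = datetime(2024, 1, 1, tzinfo=timezone.utc)

def already_scraped(review_date_iso):
    # Assumes an ISO 8601 timestamp such as '2024-03-05T10:15:00Z'
    review_date = datetime.fromisoformat(review_date_iso.replace('Z', '+00:00'))
    return review_date <= last_seen

# With newest-first sorting, you can stop paginating as soon as
# already_scraped() returns True for the first review on a page.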

5. Use a Database with Uniqueness Constraints

If you're storing scraped reviews in a database, you can use database features like uniqueness constraints or unique indexes on fields that should be unique (like review IDs or hashes).
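
For example, with SQLite a PRIMARY KEY on the review ID plus INSERT OR IGNORE makes the database itself reject duplicates:

import sqlite3

conn = sqlite3.connect('reviews.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        review_id TEXT PRIMARY KEY,  -- uniqueness enforced by the database
        content   TEXT NOT NULL
    )
''')

def store_review(review_id, content):
    # INSERT OR IGNORE silently skips rows that would violate the constraint
    cur = conn.execute(
        'INSERT OR IGNORE INTO reviews (review_id, content) VALUES (?, ?)',
        (review_id, content),
    )
    conn.commit()
    return cur.rowcount == 1  # True only if the review was actually new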

Here's a simplified Python example using requests and BeautifulSoup that deduplicates on the review ID when it's available and falls back to a content hash when it isn't:

import requests
from bs4 import BeautifulSoup
import hashlib

# Keys (review IDs or content hashes) of reviews already seen
seen_reviews = set()

def get_review_hash(review):
    return hashlib.md5(review.encode('utf-8')).hexdigest()

def scrape_trustpilot_page(url):
    response = requests.get(url)
    response.raise_for_status()  # Fail loudly on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    reviews = soup.find_all('div', class_='review')  # Adjust the selector based on the actual page

    for review in reviews:
        review_id = review.get('data-review-id')  # Adjust based on how the review ID is stored
        review_content = review.get_text(strip=True)
        # Prefer the stable review ID; fall back to a content hash if it's missing
        dedupe_key = review_id or get_review_hash(review_content)

        if dedupe_key not in seen_reviews:
            seen_reviews.add(dedupe_key)
            # Process and/or store the review here
            print(f'New review found: {review_content}')
        else:
            print('Duplicate review detected, skipping.')

# Example usage
page_url = 'https://www.trustpilot.com/review/example.com'
scrape_trustpilot_page(page_url)

In JavaScript (Node.js), you could use a similar approach with libraries like axios and cheerio. Here's an example:

const axios = require('axios');
const cheerio = require('cheerio');
const crypto = require('crypto');

// Keys (review IDs or content hashes) of reviews already seen
const seenReviews = new Set();

function getReviewHash(review) {
  return crypto.createHash('md5').update(review).digest('hex');
}

async function scrapeTrustpilotPage(url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);
  const reviews = $('.review'); // Adjust the selector based on the actual page

  reviews.each((_, element) => {
    const reviewId = $(element).data('review-id'); // Adjust based on how the review ID is stored
    const reviewContent = $(element).text().trim();
    // Prefer the stable review ID; fall back to a content hash if it's missing
    const dedupeKey = reviewId || getReviewHash(reviewContent);

    if (!seenReviews.has(dedupeKey)) {
      seenReviews.add(dedupeKey);
      // Process and/or store the review here
      console.log(`New review found: ${reviewContent}`);
    } else {
      console.log('Duplicate review detected, skipping.');
    }
  });
}

// Example usage
const pageUrl = 'https://www.trustpilot.com/review/example.com';
scrapeTrustpilotPage(pageUrl).catch(console.error);

Remember to respect Trustpilot's terms of service and robots.txt file when scraping. Excessive scraping can lead to IP bans or legal action, and it's generally good practice to scrape responsibly and ethically.
