When scraping reviews from Trustpilot or any similar website, it's important to avoid duplicates so the quality of your data doesn't degrade. Here are some strategies you can use to avoid scraping the same review twice:
1. Check for Unique Identifiers
Most reviews on websites like Trustpilot have unique identifiers, such as a review ID. Always check for and record these identifiers. Before saving or processing a new review, compare its identifier against those you've already saved to ensure it's not a duplicate.
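As a minimal sketch, ID-based deduplication boils down to keeping a set of identifiers you've already seen (the full examples further down build on this idea):

```python
seen_ids = set()

def is_new_review(review_id):
    """Return True the first time an ID is seen, False on repeats."""
    if review_id in seen_ids:
        return False
    seen_ids.add(review_id)
    return True
```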
2. Use URL Parameters
Trustpilot and similar sites often use pagination or filters in their URLs. Make sure you're correctly handling these parameters to avoid revisiting the same page.
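One simple safeguard is to track which URLs you've already fetched. Here's a sketch; the `page` query parameter is an assumption about how pagination might work, not Trustpilot's documented URL scheme:

```python
from urllib.parse import urlencode

visited_urls = set()

def paginated_urls(base_url, max_pages=10):
    """Yield each page URL exactly once, skipping any already visited."""
    for page in range(1, max_pages + 1):
        url = f'{base_url}?{urlencode({"page": page})}'  # assumed pagination scheme
        if url not in visited_urls:
            visited_urls.add(url)
            yield url
```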
3. Hashing Content
You can create a hash of the review content (and any other unique fields) and compare it with hashes of reviews you've already scraped. This ensures that even if the unique identifier is missing or unreliable, you can still detect duplicates.
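Normalizing the text before hashing makes the comparison more robust, since trivial whitespace or casing differences would otherwise produce different hashes. A sketch:

```python
import hashlib
import re

def content_fingerprint(text):
    """Hash a normalized form of the review so trivial formatting
    differences don't defeat duplicate detection."""
    normalized = re.sub(r'\s+', ' ', text).strip().lower()
    # MD5 is fine as a dedup fingerprint (this is not a security use)
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()
```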
4. Timestamps and Sorting
If the website sorts reviews (e.g., by date), keep track of the timestamps of the reviews you've scraped. On subsequent scrapes, you can ignore reviews older than the most recent timestamp you've recorded.
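Here's a sketch of that idea, assuming each scraped review is a dict with a naive ISO 8601 `date` string (e.g. `'2024-05-01T12:00:00'`); the real field name and format will depend on the page:

```python
from datetime import datetime

last_seen = datetime.min  # persist this between runs (e.g., in a file or DB)

def filter_new_reviews(reviews):
    """Keep only reviews newer than the latest timestamp already recorded."""
    global last_seen
    fresh = [r for r in reviews if datetime.fromisoformat(r['date']) > last_seen]
    if fresh:
        last_seen = max(datetime.fromisoformat(r['date']) for r in fresh)
    return fresh
```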
5. Use a Database with Uniqueness Constraints
If you're storing scraped reviews in a database, you can use database features like uniqueness constraints or unique indexes on fields that should be unique (like review IDs or hashes).
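With SQLite, for example, a `PRIMARY KEY` (or `UNIQUE`) constraint plus `INSERT OR IGNORE` lets the database itself reject duplicates; the table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect('reviews.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS reviews (
        review_id TEXT PRIMARY KEY,  -- uniqueness enforced by the database
        content   TEXT
    )
''')

def store_review(review_id, content):
    """Insert a review; duplicates on review_id are silently ignored."""
    with conn:  # commits on success
        cur = conn.execute(
            'INSERT OR IGNORE INTO reviews (review_id, content) VALUES (?, ?)',
            (review_id, content),
        )
    return cur.rowcount == 1  # True only if the row was actually inserted
```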
Here's a simplified Python example using `requests` and `BeautifulSoup` for scraping reviews and avoiding duplicates via unique identifiers, falling back to content hashes when an ID is missing:
```python
import requests
from bs4 import BeautifulSoup
import hashlib

# Track identifiers (and content hashes as a fallback) of reviews already seen
seen_review_ids = set()
review_hashes = set()

def get_review_hash(review):
    return hashlib.md5(review.encode('utf-8')).hexdigest()

def scrape_trustpilot_page(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    reviews = soup.find_all('div', class_='review')  # Adjust the selector based on the actual page

    for review in reviews:
        review_id = review.get('data-review-id')  # Adjust based on how the review ID is stored
        review_content = review.get_text(strip=True)
        review_hash = get_review_hash(review_content)

        # Prefer the unique identifier; fall back to the content hash if it's missing
        if review_id:
            is_duplicate = review_id in seen_review_ids
        else:
            is_duplicate = review_hash in review_hashes

        if not is_duplicate:
            if review_id:
                seen_review_ids.add(review_id)
            review_hashes.add(review_hash)
            # Process and/or store the review here
            print(f'New review found: {review_content}')
        else:
            print('Duplicate review detected, skipping.')

# Example usage
page_url = 'https://www.trustpilot.com/review/example.com'
scrape_trustpilot_page(page_url)
```
In JavaScript (Node.js), you could take a similar approach with libraries like `axios` and `cheerio`. Here's an example:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');
const crypto = require('crypto');

// Track identifiers (and content hashes as a fallback) of reviews already seen
const seenReviewIds = new Set();
const reviewHashes = new Set();

function getReviewHash(review) {
  return crypto.createHash('md5').update(review).digest('hex');
}

async function scrapeTrustpilotPage(url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);
  const reviews = $('.review'); // Adjust the selector based on the actual page

  reviews.each((_, element) => {
    const reviewId = $(element).data('review-id'); // Adjust based on how the review ID is stored
    const reviewContent = $(element).text().trim();
    const reviewHash = getReviewHash(reviewContent);

    // Prefer the unique identifier; fall back to the content hash if it's missing
    const isDuplicate = reviewId
      ? seenReviewIds.has(String(reviewId))
      : reviewHashes.has(reviewHash);

    if (!isDuplicate) {
      if (reviewId) seenReviewIds.add(String(reviewId));
      reviewHashes.add(reviewHash);
      // Process and/or store the review here
      console.log(`New review found: ${reviewContent}`);
    } else {
      console.log('Duplicate review detected, skipping.');
    }
  });
}

// Example usage
const pageUrl = 'https://www.trustpilot.com/review/example.com';
scrapeTrustpilotPage(pageUrl).catch(console.error);
```
Remember to respect Trustpilot's terms of service and robots.txt file when scraping. Excessive scraping can lead to IP bans or legal action, and it's generally good practice to scrape responsibly and ethically.