Web scraping should always be performed ethically and legally. Before scraping Trustpilot reviews or any website, carefully review the site's Terms of Service and robots.txt file to determine whether scraping is permitted. If the website explicitly forbids scraping, respect those rules and seek an alternative, such as an official API if one is available.
Here are some general guidelines for scraping Trustpilot reviews without degrading the website's performance; short Python sketches illustrating several of these points follow the list:
Rate Limiting: Do not send requests too quickly; space them out. For example, send a request every few seconds rather than several per second (see the combined sketch after this list).
Respect robots.txt: This file, typically located at the root of a website (e.g., https://www.trustpilot.com/robots.txt), tells you which parts of the site the owner would prefer not to be accessed by bots (see the robotparser sketch after this list).
User-Agent String: Identify your scraper as a bot with a proper user-agent string. This lets site owners identify bot traffic and treat it accordingly.
Session Maintenance: Keep the session open rather than opening a new connection for each request, as this can reduce the load on the server.
Error Handling: Respect the HTTP status codes returned by the server. If you receive a 429 (Too Many Requests) or 503 (Service Unavailable), you should back off and retry after a respectful amount of time.
Caching: If you need to scrape the same pages multiple times, consider caching the responses locally so you do not repeat requests (a simple cache sketch follows the list).
Headless Browsers: Be cautious with headless browsers like Puppeteer or Selenium; they are more resource-intensive for both the client and the server. Use them only when necessary, and close them properly after use to free up resources (see the cleanup sketch below).
Concurrent Requests: Limit the number of concurrent requests to avoid overloading the server (see the thread-pool sketch below).
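For the robots.txt check, Python's standard urllib.robotparser module can do the work. A minimal sketch, assuming the placeholder user-agent string used throughout this answer:

from urllib.robotparser import RobotFileParser

USER_AGENT = 'MyScraperBot/1.0 (+http://mywebsite.com)'  # placeholder identity

# Download and parse the site's robots.txt rules
rp = RobotFileParser()
rp.set_url('https://www.trustpilot.com/robots.txt')
rp.read()

# Only proceed if the rules allow this user agent to fetch the target URL
url = 'https://www.trustpilot.com/review/example.com'
if rp.can_fetch(USER_AGENT, url):
    print('Allowed to fetch', url)
else:
    print('robots.txt disallows fetching', url)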
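The rate-limiting, session-maintenance, and error-handling guidelines combine naturally into one sketch. This one uses requests.Session together with urllib3's Retry helper, which backs off automatically on 429 and 503 responses and honors the Retry-After header by default; the page URLs and the two-second delay are illustrative assumptions:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A single session reuses keep-alive connections instead of reconnecting per request
session = requests.Session()
session.headers.update({'User-Agent': 'MyScraperBot/1.0 (+http://mywebsite.com)'})

# Retry with exponential backoff when the server answers 429 or 503
retries = Retry(total=3, backoff_factor=2, status_forcelist=[429, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))

pages = [f'https://www.trustpilot.com/review/example.com?page={n}' for n in range(1, 4)]
for page_url in pages:
    response = session.get(page_url, timeout=10)
    print(page_url, response.status_code)
    time.sleep(2)  # space requests out rather than hammering the server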
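For caching, even a simple on-disk store keyed by URL avoids repeating identical requests. A minimal sketch; the cache directory name is a hypothetical choice for illustration:

import hashlib
from pathlib import Path

import requests

CACHE_DIR = Path('.scrape_cache')  # hypothetical local cache directory
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url, headers=None):
    """Return the page body, reusing a cached copy when one exists."""
    key = hashlib.sha256(url.encode()).hexdigest()
    cache_file = CACHE_DIR / key
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding='utf-8')
    return response.text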
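If you do need a headless browser, make sure it is always shut down, even when an error occurs mid-run. A minimal cleanup sketch with Selenium, assuming a local Chrome driver is installed:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('https://www.trustpilot.com/review/example.com')
    print(driver.title)
finally:
    driver.quit()  # always release the browser's resources, even after a failure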
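And if you parallelize at all, cap the number of workers. A minimal thread-pool sketch limited to two concurrent requests; the page URLs are placeholders:

from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f'https://www.trustpilot.com/review/example.com?page={n}' for n in range(1, 6)]

def fetch(url):
    return requests.get(url, timeout=10).status_code

# max_workers caps how many requests are in flight at once
with ThreadPoolExecutor(max_workers=2) as pool:
    for url, status in zip(urls, pool.map(fetch, urls)):
        print(url, status)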
Here's an example of how you might scrape a website like Trustpilot in Python using the requests module and BeautifulSoup for parsing HTML. This example is purely educational; working code would depend on Trustpilot's actual page structure and terms:
import time
import requests
from bs4 import BeautifulSoup

# Define the base URL of the page you want to scrape
base_url = 'https://www.trustpilot.com/review/example.com'
headers = {
    'User-Agent': 'MyScraperBot/1.0 (+http://mywebsite.com)'
}

try:
    # Make the request (with a timeout so a stalled connection cannot hang the script)
    response = requests.get(base_url, headers=headers, timeout=10)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the elements containing reviews and loop through them to extract data
        # (these class names are illustrative; inspect the live page for the real ones)
        reviews = soup.find_all('div', class_='review-content')
        for review in reviews:
            # Extract and print the review text, guarding against missing elements
            text_elem = review.find('p', class_='review-text')
            if text_elem is not None:
                print(text_elem.get_text(strip=True))

        # Be respectful and pause before requesting the next page
        time.sleep(2)
    else:
        print(f'Error: {response.status_code}')
except requests.RequestException as e:
    print(f'An error occurred: {e}')
In JavaScript (Node.js), you can apply the same logic with axios for HTTP requests and cheerio for parsing HTML:
const axios = require('axios');
const cheerio = require('cheerio');

const baseURL = 'https://www.trustpilot.com/review/example.com';
const headers = {
    'User-Agent': 'MyScraperBot/1.0 (+http://mywebsite.com)'
};

axios.get(baseURL, { headers })
    .then(response => {
        const html = response.data;
        const $ = cheerio.load(html);

        // Select the review elements and extract data
        // (these selectors are illustrative; inspect the live page for the real ones)
        $('.review-content').each((i, elem) => {
            const reviewText = $(elem).find('.review-text').text();
            console.log(reviewText);
        });

        // Be respectful: before fetching another page, wait, e.g.
        // await new Promise(resolve => setTimeout(resolve, 2000));
    })
    .catch(error => {
        console.error(`An error occurred: ${error.message}`);
    });
Remember, this example code may not work if Trustpilot has anti-scraping measures in place or if the HTML structure differs from what the selectors assume. Always test your code and make sure you comply with the website's policies and legal requirements. If scraping is not permitted, contact Trustpilot or the review provider to ask whether there's an official API or data feed you can use.