When scraping Trustpilot reviews, handling pagination is crucial because reviews are often spread across multiple pages. Here's how to approach this in both Python and JavaScript.
Python (with BeautifulSoup and Requests)
Python is a popular choice for web scraping thanks to its simplicity and powerful libraries. BeautifulSoup is a library that makes it easy to parse HTML and extract information from web pages.
- Install the libraries if you haven't done so already.
```bash
pip install requests beautifulsoup4
```
- Write the script to scrape multiple pages. Trustpilot may load reviews dynamically with JavaScript, so you may need to use your browser's dev tools to find the URL pattern behind the AJAX requests, or use `selenium` to simulate a real browser (see the sketch just below).
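If the reviews are rendered client-side, a headless browser can give you the fully rendered HTML. Here's a minimal Selenium sketch, assuming Selenium 4 (which can manage the Chrome driver for you); the `.review_class` selector is a placeholder you'd replace after inspecting the page:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://www.trustpilot.com/review/example.com?page=1')
    # Placeholder selector: inspect the page to find the real one
    for element in driver.find_elements(By.CSS_SELECTOR, '.review_class'):
        print(element.text)
finally:
    driver.quit()
```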
If the page is static, here's a simpler example using `requests` and `BeautifulSoup`:
```python
import requests
from bs4 import BeautifulSoup

def get_reviews(page):
    url = f'https://www.trustpilot.com/review/example.com?page={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Inspect the page to find the correct class or id for review elements
    reviews = soup.find_all(class_='review_class')
    return reviews

def scrape_trustpilot_reviews():
    for page_number in range(1, 5):  # scrape the first 4 pages
        reviews = get_reviews(page_number)
        for review in reviews:
            # Extract the data you need from each review
            print(review.text)

scrape_trustpilot_reviews()
```
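The fixed `range(1, 5)` assumes you already know how many pages exist. A more robust pattern, sketched below under the assumption that an empty result list reliably signals the last page, is to keep requesting pages until one comes back without reviews (it reuses `get_reviews` from above):

```python
import time

def scrape_all_pages():
    page_number = 1
    while True:
        reviews = get_reviews(page_number)
        if not reviews:  # assumption: an empty page means no more reviews
            break
        for review in reviews:
            print(review.text)
        page_number += 1
        time.sleep(1)  # polite delay between requests; see the considerations below

scrape_all_pages()
```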
JavaScript (with Puppeteer or Cheerio)
JavaScript can be a good choice when you need to interact with a page that requires JavaScript to load its content. Puppeteer is a library that provides a high-level API over the Chrome DevTools Protocol.
- Install Puppeteer if you haven't done so already.
```bash
npm install puppeteer
```
- Write the script to scrape multiple pages using Puppeteer:
```javascript
const puppeteer = require('puppeteer');

async function getReviews(page, pageNumber) {
  const url = `https://www.trustpilot.com/review/example.com?page=${pageNumber}`;
  await page.goto(url);
  // Inspect the page to find the correct selector for review elements
  const reviews = await page.$$eval('.review_selector', nodes => nodes.map(n => n.innerText));
  return reviews;
}

async function scrapeTrustpilotReviews() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (let pageNumber = 1; pageNumber <= 4; pageNumber++) { // scrape the first 4 pages
    const reviews = await getReviews(page, pageNumber);
    reviews.forEach(review => {
      console.log(review);
    });
  }
  await browser.close();
}

scrapeTrustpilotReviews();
```
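Save the script as, say, `scrape.js` (the filename is arbitrary) and run it with `node scrape.js`. Installing Puppeteer also downloads a compatible browser build, so the script should run headless out of the box; pass `{ headless: false }` to `puppeteer.launch()` if you want to watch the browser while debugging.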
Important Considerations
- Respect robots.txt: Check Trustpilot's `robots.txt` file to understand which parts of the site it allows crawlers to access.
- Rate Limiting: Be courteous with the number of requests you send to avoid overwhelming Trustpilot's servers. Implement delays between requests and respect any rate limits that are provided (see the sketch after this list).
- User-Agent: Set a User-Agent header to mimic a browser request; many sites reject requests that carry no user agent.
- Legal and Ethical Considerations: Ensure that your scraping activities comply with Trustpilot's terms of service and local laws regarding data privacy and protection.
- Headless Browsers: When using tools like Puppeteer or Selenium, be aware that they can be detected by some sites, and you might be blocked. Use them responsibly.
- Dynamic Data Loading: Trustpilot might use JavaScript to load data dynamically. You might need to simulate clicks or scrolls using Puppeteer or Selenium to load all the reviews.
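To make the robots.txt, rate-limiting, and User-Agent points concrete, here's a minimal Python sketch. It uses the standard library's `urllib.robotparser` together with `requests`; the User-Agent string and the one-second delay are illustrative assumptions, not values Trustpilot prescribes:

```python
import time
from urllib import robotparser

import requests

USER_AGENT = 'Mozilla/5.0 (compatible; my-review-scraper/1.0)'  # illustrative value

# Parse robots.txt once before crawling
rp = robotparser.RobotFileParser()
rp.set_url('https://www.trustpilot.com/robots.txt')
rp.read()

session = requests.Session()
session.headers.update({'User-Agent': USER_AGENT})

for page in range(1, 5):
    url = f'https://www.trustpilot.com/review/example.com?page={page}'
    if not rp.can_fetch(USER_AGENT, url):
        print(f'robots.txt disallows {url}; skipping')
        continue
    response = session.get(url)
    print(url, response.status_code)
    time.sleep(1)  # polite delay; tune to the site's rate limits
```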
Remember, web scraping can be a legal gray area, and websites often change their layout and mechanisms to load content, which can break your scraper. Always keep your scripts updated and consider using official APIs if available for a more reliable solution.