When scraping websites like Yellow Pages, handling pagination is crucial for navigating through multiple pages of listings. Below are the general steps, along with code examples in Python using the `requests` library and `BeautifulSoup` for parsing HTML. Note that scraping should be done in accordance with a site's terms of service, and excessive requests can get your IP blocked.
Steps to Handle Pagination:
1. Identify the Pagination Pattern: Inspect the URL structure as you navigate through the pages. Pagination may be driven by a query parameter (e.g., `?page=2`) or by part of the path (e.g., `/page/2/`).
2. Scrape the First Page: Write code to fetch and parse the first page and extract the necessary information.
3. Find the Next Page Link: Look for the HTML element that links to the next page. This could be a button or a simple link with text like "Next" or an arrow.
4. Loop Through Pages: Build a loop that navigates from page to page by updating the URL or the parameter that controls the pagination.
5. Handle Edge Cases: Make sure your code can handle the last page, where there may be no "Next" link.
Python Example with BeautifulSoup:
```python
import time

import requests
from bs4 import BeautifulSoup

# Base URL of the site
base_url = "https://www.yellowpages.com/search"

# Parameters for the query
params = {
    "search_terms": "plumber",
    "geo_location_terms": "New York, NY",
    "page": 1,  # Pagination parameter
}

# Loop through pages
while True:
    response = requests.get(base_url, params=params)
    soup = BeautifulSoup(response.content, "html.parser")

    # Process listings on the current page
    listings = soup.find_all('div', class_='some-listing-class')  # Update the class based on actual listings
    for listing in listings:
        # Extract and print data from each listing
        print(listing.text)

    # Find the 'Next' page link or button; update the class or id based on actual pagination
    next_page = soup.find('a', class_='next')
    if next_page:
        # Update the page parameter to the next page number
        params["page"] += 1
    else:
        # No more pages, break the loop
        break

    # Delay between requests to avoid overloading the server
    time.sleep(1)
```
Remember, the `class_` values and the `params` should be based on the actual HTML structure and URL parameters of Yellow Pages, which can change over time.
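The examples here assume a `page` query parameter. If step 1 instead reveals a path-based pattern like `/page/2/`, the same loop works with the page number formatted into the URL. The sketch below is illustrative only: the base URL, path pattern, and listing class are hypothetical, and the actual end-of-results signal (a 404, an empty page, a redirect) depends on the site.

```python
import time

import requests
from bs4 import BeautifulSoup


def page_url(base, page):
    # Hypothetical path-based pattern, e.g. https://.../plumbers/page/2/
    return f"{base}/page/{page}/"


def scrape_all(base):
    page = 1
    while True:
        response = requests.get(page_url(base, page))
        if response.status_code == 404:
            # Many path-paginated sites return 404 past the last page
            break
        soup = BeautifulSoup(response.content, "html.parser")
        listings = soup.find_all('div', class_='some-listing-class')
        if not listings:
            # An empty page is another common end-of-results signal
            break
        for listing in listings:
            print(listing.text)
        page += 1
        time.sleep(1)  # be polite between requests


# Example (hypothetical URL):
# scrape_all("https://www.example.com/new-york-ny/plumbers")
```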
JavaScript Example with Node.js (using axios and cheerio):
For a Node.js environment, you can use `axios` to make HTTP requests and `cheerio` to parse HTML.
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

let currentPage = 1;
const baseUrl = 'https://www.yellowpages.com/search';

async function scrapePage(page) {
  const response = await axios.get(baseUrl, {
    params: {
      search_terms: 'plumber',
      geo_location_terms: 'New York, NY',
      page: page
    }
  });
  const $ = cheerio.load(response.data);

  // Process listings on the current page
  $('.some-listing-class').each((index, element) => {
    console.log($(element).text());
  });

  // Check if the 'Next' page link exists
  const hasNextPage = $('.next').length > 0;
  return hasNextPage;
}

async function scrapeAllPages() {
  while (await scrapePage(currentPage)) {
    currentPage++;
    // Optional: delay between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}

scrapeAllPages();
```
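Either example's fetch can fail transiently (timeouts, 5xx responses). One common pattern is a small retry wrapper with exponential backoff; this is a sketch in Python with illustrative defaults, not tuned values:

```python
import time

import requests


def backoff_delays(retries=3, base=2.0):
    # Delays between attempts: base, base*2, base*4, ... (illustrative defaults)
    return [base * (2 ** i) for i in range(retries - 1)]


def fetch_with_retries(url, params=None, retries=3, base=2.0):
    """Fetch a URL, retrying transient failures with exponential backoff."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx status codes
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries; let the caller handle it
            time.sleep(delays[attempt])
```

Dropping this in place of the bare `requests.get` call in the pagination loop keeps a single flaky response from aborting the whole scrape.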
In both examples, proper error handling should be implemented and the website's `robots.txt` respected. Additionally, if you plan to scrape a large amount of data, consider using proxies to avoid IP bans, and adhere to Yellow Pages' scraping policies. Always check the terms of service and legal requirements before scraping a website.
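Checking `robots.txt` can be automated with Python's built-in `urllib.robotparser`. The sketch below parses sample rules inline for illustration; the rules and `example.com` URLs are placeholders, not Yellow Pages' actual policy. Against a live site you would call `set_url()` with the site's `/robots.txt` and `read()` to fetch it.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only; for a live site you would use:
#   rp.set_url("https://www.example.com/robots.txt"); rp.read()
sample_rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(sample_rules.splitlines())

# Check whether a given URL may be fetched by your crawler
print(rp.can_fetch("*", "https://www.example.com/search"))     # allowed
print(rp.can_fetch("*", "https://www.example.com/private/x"))  # disallowed

# Honor the site's requested delay between requests, if it declares one
print(rp.crawl_delay("*"))
```

The `crawl_delay` value, when present, is a good replacement for the hard-coded one-second sleep in the pagination loops above.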