How do I handle pagination when scraping Yellow Pages?

When scraping websites like Yellow Pages, handling pagination is crucial for navigating through multiple pages of listings. Below are general steps plus code examples in Python (using the requests library and BeautifulSoup) and Node.js (using axios and cheerio). Note that scraping should be done in accordance with a site's terms of service, and excessive requests can lead to your IP being blocked.

Steps to Handle Pagination:

  1. Identify the Pagination Pattern: Inspect the URL structure as you navigate through the pages. Pagination could be based on a query parameter (e.g., ?page=2) or part of the path (e.g., /page/2/).

  2. Scrape the First Page: Write code to fetch and parse the first page to extract the necessary information.

  3. Find the Next Page Link: Look for the HTML element that links to the next page. This could be a button or a simple link with text like "Next" or an arrow.

  4. Loop Through Pages: Build a loop that navigates from page to page by updating the URL or the parameter that controls the pagination.

  5. Handle Edge Cases: Make sure your code can handle the last page where there may not be a 'Next' link.

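Before writing the scraper, it helps to confirm which pattern the site uses. Here is a tiny sketch of the two URL styles from step 1 (the path-based form is illustrative, not Yellow Pages' real scheme):

```python
def page_url(base, page, style="query"):
    """Build a page URL for either pagination style.

    'query' -> https://example.com/search?page=N
    'path'  -> https://example.com/search/page/N/
    """
    if style == "query":
        return f"{base}?page={page}"
    return f"{base}/page/{page}/"

print(page_url("https://www.yellowpages.com/search", 2))
# https://www.yellowpages.com/search?page=2
print(page_url("https://www.yellowpages.com/search", 2, style="path"))
# https://www.yellowpages.com/search/page/2/
```
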
Python Example with BeautifulSoup:

import requests
import time
from bs4 import BeautifulSoup

# Base URL of the site
base_url = "https://www.yellowpages.com/search"

# Parameters for the query
params = {
    "search_terms": "plumber",
    "geo_location_terms": "New York, NY",
    "page": 1  # Pagination parameter
}

# Loop through pages
while True:
    response = requests.get(base_url, params=params)
    response.raise_for_status()  # Stop on HTTP errors (4xx/5xx)
    soup = BeautifulSoup(response.content, "html.parser")

    # Process listings on the current page
    listings = soup.find_all('div', class_='some-listing-class')  # Update the class based on actual listings
    for listing in listings:
        # Extract and print data from each listing
        print(listing.text)

    # Find the 'Next' page link or button, update the class or id based on actual pagination
    next_page = soup.find('a', class_='next')
    if next_page:
        # Update the page parameter to the next page number
        params["page"] += 1
    else:
        # No more pages, break the loop
        break

    # Optional: Delay between requests to avoid overloading the server
    time.sleep(1)

Remember, the CSS classes and query parameters shown here must match Yellow Pages' actual HTML structure and URL scheme, both of which can change over time.
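
An alternative to incrementing a page counter is to follow the 'Next' link's href directly. Since such hrefs are often relative, resolve them against the current page URL; a minimal stdlib-only sketch (the URLs are illustrative):

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_href):
    """Resolve a (possibly relative) 'Next' href against the current page URL."""
    return urljoin(current_url, next_href)

print(next_page_url("https://www.yellowpages.com/search?page=2", "/search?page=3"))
# https://www.yellowpages.com/search?page=3
```

Following the link the site itself emits is more robust than guessing the next page number, because it keeps working even if the site changes its pagination scheme.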

JavaScript Example with Node.js (using axios and cheerio):

For a Node.js environment, you can use axios to make HTTP requests and cheerio to parse HTML.

const axios = require('axios');
const cheerio = require('cheerio');

let currentPage = 1;
const baseUrl = 'https://www.yellowpages.com/search';

async function scrapePage(page) {
  const response = await axios.get(baseUrl, {
    params: {
      search_terms: 'plumber',
      geo_location_terms: 'New York, NY',
      page: page
    }
  });

  const $ = cheerio.load(response.data);

  // Process listings on the current page
  $('.some-listing-class').each((index, element) => {
    console.log($(element).text());
  });

  // Check if the 'Next' page link exists
  const hasNextPage = $('.next').length > 0;
  return hasNextPage;
}

async function scrapeAllPages() {
  while (await scrapePage(currentPage)) {
    currentPage++;
    // Optional: Delay between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
}

scrapeAllPages().catch(console.error);

In both examples, you should implement proper error handling and respect the website's robots.txt. If you plan to scrape a large amount of data, consider using proxies to avoid IP bans, and always check the terms of service and applicable legal requirements before scraping a website.
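
Checking robots.txt can be automated with Python's standard library. This sketch parses an inline, hypothetical robots.txt for illustration; in practice you would fetch the site's real file (e.g., https://www.yellowpages.com/robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://www.yellowpages.com/search"))        # allowed
print(rp.can_fetch("*", "https://www.yellowpages.com/private/info"))  # disallowed
print(rp.crawl_delay("*"))  # seconds to wait between requests, if specified
```

If the file declares a Crawl-delay, use that value instead of a hard-coded sleep in your pagination loop.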
