When scraping Indeed job listings or any other website with pagination, you need to be able to navigate through multiple pages and extract the data from each page. Handling pagination typically involves finding the link to the next page or incrementing the page number in the URL and then making requests to each subsequent page until there are no more pages left. Below are Python and JavaScript examples illustrating how to handle pagination on Indeed job listings.
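As a minimal sketch of the "increment the page number in the URL" approach described above, the snippet below builds one search URL per results page without making any requests. The helper name `page_urls` is hypothetical, and the page size of 10 is an assumption taken from the `start` increments used in the examples that follow; the real step size may differ.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def page_urls(query, location, pages, page_size=10):
    """Yield one search URL per results page by stepping the 'start' offset."""
    for page in range(pages):
        # page_size=10 is an assumption about how many results each page holds
        params = {"q": query, "l": location, "start": page * page_size}
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(page_urls("software engineer", "New York", pages=3))
for url in urls:
    print(url)
```

Generating URLs up front only works when you know how many pages to expect; when you do not, checking each page for a "Next" link is the safer stopping condition.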
Python Example using requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.indeed.com/jobs"
PARAMS = {
    'q': 'software engineer',  # Your search query
    'l': 'New York',           # Location
    'start': 0                 # Pagination start
}

def get_job_listings(base_url, params):
    while True:
        response = requests.get(base_url, params=params, timeout=10)
        response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Process the page
        job_listings = soup.find_all('div', class_='jobsearch-SerpJobCard')
        for job in job_listings:
            # Extract job data; skip cards that lack the expected elements
            title = job.find('h2', class_='title')
            company = job.find('span', class_='company')
            if title is None or company is None:
                continue
            print(f"Job Title: {title.text.strip()}, Company: {company.text.strip()}")

        # Check for the 'Next' button - this may vary depending on Indeed's page structure
        next_button = soup.find('a', {'aria-label': 'Next'})
        if next_button and 'href' in next_button.attrs:
            # Indeed uses a 'start' parameter to paginate
            params['start'] += 10
        else:
            break  # No more pages

# Start scraping
get_job_listings(BASE_URL, PARAMS)
In this Python example, we use requests to make HTTP requests and BeautifulSoup to parse the HTML content. We keep updating the 'start' parameter to move to the next set of listings until the 'Next' button is no longer found.
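The other strategy mentioned in the introduction, following the link to the next page rather than incrementing a counter, can be sketched as follows. This is an illustration under assumptions: the sample HTML is made up, the `aria-label="Next"` attribute is the same guess about Indeed's markup used in the example above, and `NextLinkFinder` and `next_page_url` are hypothetical names. It uses only the standard library so it runs without extra packages; with BeautifulSoup, the lookup would be the same `soup.find('a', {'aria-label': 'Next'})` call shown earlier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a aria-label="Next"> element seen."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("aria-label") == "Next" and self.href is None:
            self.href = attributes.get("href")


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None if no next link exists."""
    finder = NextLinkFinder()
    finder.feed(html)
    if not finder.href:
        return None
    # Relative hrefs are resolved against the page they were found on
    return urljoin(current_url, finder.href)


# Made-up markup standing in for a results page (an assumption, not Indeed's real HTML)
sample_html = '<nav><a aria-label="Next" href="/jobs?q=software+engineer&amp;start=10">Next</a></nav>'
next_url = next_page_url(sample_html, "https://www.indeed.com/jobs?q=software+engineer")
last_url = next_page_url("<nav></nav>", "https://www.indeed.com/jobs?q=software+engineer&start=10")
print(next_url)
print(last_url)
```

Following the href has the advantage that the scraper does not need to know the page size or the name of the offset parameter; it only depends on the next link being present and identifiable.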
JavaScript Example using axios and cheerio
If you want to scrape Indeed job listings in a Node.js environment, you can use axios to make HTTP requests and cheerio for DOM parsing.
First, install the necessary packages:
npm install axios cheerio
Here's a Node.js example:
const axios = require('axios');
const cheerio = require('cheerio');

const BASE_URL = 'https://www.indeed.com/jobs';
let params = new URLSearchParams({
  q: 'software engineer', // Your search query
  l: 'New York',          // Location
  start: 0                // Pagination start
});

async function getJobListings(baseUrl, params) {
  while (true) {
    const response = await axios.get(baseUrl, { params });
    const $ = cheerio.load(response.data);

    // Process the page
    $('.jobsearch-SerpJobCard').each((index, element) => {
      const title = $(element).find('h2.title').text().trim();
      const company = $(element).find('span.company').text().trim();
      console.log(`Job Title: ${title}, Company: ${company}`);
    });

    // Check for the 'Next' button - this may vary depending on Indeed's page structure
    const nextButton = $('a[aria-label="Next"]');
    if (nextButton.length > 0) {
      // Indeed uses a 'start' parameter to paginate
      params.set('start', parseInt(params.get('start'), 10) + 10);
    } else {
      break; // No more pages
    }
  }
}

// Start scraping; log request failures instead of leaving the rejection unhandled
getJobListings(BASE_URL, params).catch((error) => console.error(error.message));
In this JavaScript example, we use axios to make HTTP requests and cheerio for jQuery-like syntax to parse the HTML. The start parameter is incremented to navigate through the pages.
Things to Keep in Mind
- Always respect the website's robots.txt file and terms of service. Make sure that scraping is allowed and that you're not violating any terms.
- Be mindful of the number of requests you make to avoid overwhelming the server. Consider adding delays between requests.
- Indeed may change its HTML structure, so you may need to update your selectors.
- Indeed's URL parameters or pagination system may change, so be prepared to adapt your script accordingly.
- Consider using Indeed's API if one is available, as it may be a more reliable and legally sound way to access job listing data.
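The first two points above (checking robots.txt and pausing between requests) can be combined in a small helper. The following is a sketch under stated assumptions: the user agent name, the two-second delay, the sample robots.txt rules, and the example.com URLs are all illustrative, and `build_robots_parser` and `allowed_urls` are hypothetical helper names. It relies on Python's standard `urllib.robotparser` module.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent name
DELAY_SECONDS = 2.0             # assumed pause between requests; tune as needed


def build_robots_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, delay=DELAY_SECONDS):
    """Yield only the URLs robots.txt permits, sleeping between consecutive ones."""
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # Skip anything the site asks crawlers not to fetch
        if not first:
            time.sleep(delay)
        first = False
        yield url


# Illustrative rules, not Indeed's actual robots.txt
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""
parser = build_robots_parser(SAMPLE_ROBOTS)
candidates = ["https://example.com/jobs?start=0", "https://example.com/private/data"]
permitted = list(allowed_urls(parser, candidates, delay=0))
print(permitted)
```

In a real scraper you would load the live file with `parser.set_url(...)` followed by `parser.read()` instead of parsing a hard-coded string, and make each page request inside the loop that consumes `allowed_urls`.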