Scraping job listings from Indeed while excluding sponsored posts requires you to identify the distinguishing HTML elements or attributes that separate regular listings from sponsored ones. Note that web scraping can be against the terms of service of the website, so always check Indeed's terms and conditions before you proceed. Also, keep in mind that web page structures change over time, so the solution might need adjustments in the future.
Here's a Python example using the requests and BeautifulSoup libraries to scrape Indeed job listings while excluding sponsored posts:
import requests
from bs4 import BeautifulSoup

def text_or_na(parent, tag, class_name):
    """Return the stripped text of the first matching child, or 'N/A' if absent."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element else 'N/A'

# Base URL of the Indeed search results
url = 'https://www.indeed.com/jobs?q=software+developer&l='

# Perform the HTTP request to Indeed (the timeout avoids hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all job listings, assuming they are contained within <div> elements
    # with a specific class name (this class name may change over time)
    job_listings = soup.find_all('div', class_='jobsearch-SerpJobCard')

    # Filter out sponsored posts by looking for a distinguishing attribute or class.
    # This example assumes sponsored posts contain 'sponsored' in a class name.
    # job['class'] is a list of class names, so each name is checked for the substring.
    non_sponsored_jobs = [
        job for job in job_listings
        if not any('sponsored' in name.lower() for name in job.get('class', []))
    ]

    # Process the non-sponsored job listings
    for job in non_sponsored_jobs:
        # Extract job information (e.g., title, company, location);
        # missing elements yield 'N/A' instead of raising AttributeError
        title = text_or_na(job, 'h2', 'title')
        company = text_or_na(job, 'span', 'company')
        location = text_or_na(job, 'div', 'location')

        # Print the job information
        print(f'Job Title: {title}')
        print(f'Company: {company}')
        print(f'Location: {location}')
        print('---')
else:
    print(f'Failed to retrieve job listings (HTTP {response.status_code})')
Keep in mind that the class names (jobsearch-SerpJobCard, title, company, etc.) reflect the Indeed page structure at the time of writing, which may change. Always inspect the page source to determine the correct class names or attributes.
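One detail of the sponsored-post filter is worth isolating. BeautifulSoup exposes a tag's class attribute as a list of class names, so a plain membership test against that list matches only an exact name such as 'sponsored', not a name that merely contains it. The following is a minimal sketch of a substring-based check; the function name is_sponsored, the markers parameter, and the class name sponsoredJob are hypothetical and only illustrate the idea.

```python
def is_sponsored(class_names, markers=("sponsored",)):
    """Return True if any class name contains one of the marker substrings.

    class_names is the list a BeautifulSoup tag returns for tag.get('class');
    None (no class attribute) is treated as an empty list.
    The default marker 'sponsored' is an assumption about the page markup.
    """
    return any(
        marker in name.lower()
        for name in (class_names or [])
        for marker in markers
    )


# Hypothetical class lists, as tag.get('class') might return them
print(is_sponsored(["jobsearch-SerpJobCard", "sponsoredJob"]))
print(is_sponsored(["jobsearch-SerpJobCard", "row"]))
```

With a parsed listing, this could be called as is_sponsored(job.get('class')), and the markers tuple adjusted once you have inspected how the live page actually labels sponsored results.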
In JavaScript, you could use a headless browser like Puppeteer to navigate the website and scrape content. The following example is more complex and requires a Node.js environment:
const puppeteer = require('puppeteer');

(async () => {
  // Launch the browser
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // Navigate to the Indeed search results
    await page.goto('https://www.indeed.com/jobs?q=software+developer&l=');

    // Evaluate the page's content to scrape job listings
    const jobListings = await page.evaluate(() => {
      // Return the trimmed text of the first match, or 'N/A' if it is missing
      const textOrNA = (parent, selector) => {
        const element = parent.querySelector(selector);
        return element ? element.innerText.trim() : 'N/A';
      };

      // Function to scrape individual job details
      const scrapeJob = (job) => ({
        title: textOrNA(job, 'h2.title'),
        company: textOrNA(job, 'span.company'),
        location: textOrNA(job, 'div.location'),
      });

      // Get all job listings
      const listings = Array.from(document.querySelectorAll('div.jobsearch-SerpJobCard'));

      // Filter out sponsored posts
      const nonSponsoredJobs = listings.filter(job => !job.className.includes('sponsored'));

      // Map over non-sponsored jobs and scrape details
      return nonSponsoredJobs.map(scrapeJob);
    });

    // Output the job listings
    console.log(jobListings);
  } finally {
    // Close the browser even if scraping fails
    await browser.close();
  }
})();
This JavaScript code assumes that you have Puppeteer installed (npm install puppeteer) and that Indeed's job listings are structured as in the Python example. The evaluate function allows you to run JavaScript on the page to scrape and filter content.
Remember, always use web scraping responsibly, respect robots.txt, and consider using official APIs if available.
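As one way to act on the robots.txt advice, Python's standard library includes urllib.robotparser, which can evaluate a URL against a set of robots.txt rules. The sketch below parses rules from a string so it runs offline; the helper name is_allowed, the user agent my-bot, the example.com URLs, and the sample rules are all made up for illustration and do not reflect any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /jobs
Allow: /about
"""

print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://www.example.com/jobs?q=dev"))
print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://www.example.com/about"))
```

Against a live site, the same parser can load the real file with set_url() followed by read() before calling can_fetch(). Passing this check does not replace reading the site's terms of service.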