How do I scrape Indeed job listings without including sponsored posts?

To scrape job listings from Indeed while excluding sponsored posts, you need to identify the HTML elements or attributes that distinguish regular listings from sponsored ones. Web scraping may violate a website's terms of service, so check Indeed's terms and conditions before you proceed. Page structures also change over time, so the selectors shown here may need adjusting.

Here's a Python example using the requests and BeautifulSoup libraries to scrape Indeed job listings while excluding sponsored posts:

import requests
from bs4 import BeautifulSoup

# Base URL of the Indeed search results
url = 'https://www.indeed.com/jobs?q=software+developer&l='

# Perform the HTTP request to Indeed (the timeout prevents the script from hanging)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all job listings, assuming they are contained within <div> elements
    # with a specific class name (this class name may change over time)
    job_listings = soup.find_all('div', class_='jobsearch-SerpJobCard')

    # Filter out sponsored posts by looking for a distinguishing attribute or class.
    # BeautifulSoup returns 'class' as a list of tokens, so check each token for the
    # substring 'sponsored' (an assumption about how sponsored cards are marked)
    non_sponsored_jobs = [
        job for job in job_listings
        if not any('sponsored' in cls.lower() for cls in job.get('class', []))
    ]

    # Process the non-sponsored job listings
    for job in non_sponsored_jobs:
        # Extract job information, falling back to 'N/A' when an element is missing
        title_el = job.find('h2', class_='title')
        company_el = job.find('span', class_='company')
        location_el = job.find('div', class_='location')
        title = title_el.text.strip() if title_el else 'N/A'
        company = company_el.text.strip() if company_el else 'N/A'
        location = location_el.text.strip() if location_el else 'N/A'

        # Print the job information
        print(f'Job Title: {title}')
        print(f'Company: {company}')
        print(f'Location: {location}')
        print('---')

else:
    print(f'Failed to retrieve job listings (HTTP {response.status_code})')

Keep in mind that the class names (jobsearch-SerpJobCard, title, company, etc.) and the 'sponsored' marker reflect Indeed's page structure at the time of writing and may change. Always inspect the page source to determine the correct class names or attributes before running the script.
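The filtering step depends on how the class attribute is represented. BeautifulSoup exposes it as a list of tokens, so a plain membership test only matches a token that is exactly 'sponsored', while a per-token substring check also catches variants. The sketch below illustrates the difference on plain Python data; the class names (jobCard, organicJob, sponsoredJob) and the is_sponsored helper are hypothetical and not taken from Indeed's markup.

```python
def is_sponsored(class_tokens, marker="sponsored"):
    """Return True if any class token contains the marker (case-insensitive)."""
    return any(marker in token.lower() for token in class_tokens)


# Hypothetical cards: each one carries the list of class tokens a parser would return
cards = [
    {"title": "Backend Developer", "class": ["jobCard", "organicJob"]},
    {"title": "Frontend Developer", "class": ["jobCard", "sponsoredJob"]},
]

# An exact-token test misses 'sponsoredJob' because no token equals 'sponsored'
exact_match = "sponsored" in cards[1]["class"]

# The per-token substring check keeps only the organic card
organic = [card["title"] for card in cards if not is_sponsored(card["class"])]
print(organic)
```

If the site marks sponsored listings with a data attribute or a nested label instead of a class, the same idea applies: isolate the check in one small function so it is easy to update when the markup changes.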

In JavaScript, you can use a headless browser such as Puppeteer to load the page and scrape its content. The following example is more involved than the Python one and requires a Node.js environment:

const puppeteer = require('puppeteer');

(async () => {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to the Indeed search results
    await page.goto('https://www.indeed.com/jobs?q=software+developer&l=');

    // Evaluate the page's content to scrape job listings
    const jobListings = await page.evaluate(() => {
        // Function to scrape individual job details
        const scrapeJob = (job) => {
            // Return the trimmed text for a selector, or 'N/A' if the element is missing
            const text = (selector) => {
                const el = job.querySelector(selector);
                return el ? el.innerText.trim() : 'N/A';
            };
            return {
                title: text('h2.title'),
                company: text('span.company'),
                location: text('div.location'),
            };
        };

        // Get all job listings
        const listings = Array.from(document.querySelectorAll('div.jobsearch-SerpJobCard'));

        // Filter out sponsored posts
        const nonSponsoredJobs = listings.filter(job => !job.className.includes('sponsored'));

        // Map over non-sponsored jobs and scrape details
        return nonSponsoredJobs.map(scrapeJob);
    });

    // Output the job listings
    console.log(jobListings);

    // Close the browser
    await browser.close();
})();

This JavaScript code assumes that you have Puppeteer installed (npm install puppeteer) and that Indeed's job listings are structured as in the Python example. The page.evaluate function runs the supplied callback inside the page itself, so the filtering and extraction happen in the browser and only the resulting plain objects are returned to Node.js.

Remember, always use web scraping responsibly, respect robots.txt, and consider using official APIs if available.
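As a rough illustration of respecting robots.txt, the sketch below uses Python's standard urllib.robotparser module to test whether a URL may be fetched. The rules in SAMPLE_ROBOTS_TXT, the example.com URLs, and the my-scraper user agent are made up for the example and do not describe Indeed's actual robots.txt; in practice you would load the site's real file with set_url() and read() before checking.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not Indeed's real robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs?q=dev"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```

A scraper could call a check like this before each request and skip any URL the rules disallow.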
