Web Scraping with Python and Selenium

Web scraping is the process of automatically extracting data from websites. While traditional scraping tools work well with static content, modern web applications often rely heavily on JavaScript, AJAX, and dynamic content loading. This is where Selenium becomes essential.

Selenium is a powerful automation framework that controls web browsers programmatically. Originally designed for testing web applications, Selenium has become the go-to tool for scraping JavaScript-heavy websites. By automating a real browser, Selenium can interact with dynamic content, handle user interactions, and extract data that traditional scraping tools miss.

In this comprehensive guide, we'll explore how to use Python with Selenium for web scraping, covering everything from basic setup to advanced techniques for handling real-world scenarios.

Prerequisites

Before diving into Selenium web scraping, you should have:

  • Basic Python knowledge: Variables, functions, loops, and object-oriented programming concepts
  • HTML fundamentals: Understanding of HTML structure, tags, attributes, and CSS selectors
  • Web development basics: How browsers work, HTTP requests, and JavaScript's role in web pages
  • Command line familiarity: Installing packages and running Python scripts

Why Choose Selenium for Web Scraping?

While libraries like BeautifulSoup and Requests are excellent for simple scraping tasks, Selenium offers unique advantages:

  • JavaScript execution: Handles dynamic content that loads after page rendering
  • User interaction simulation: Can click buttons, fill forms, and scroll pages
  • Browser automation: Works with any website as a real user would
  • Multiple browser support: Chrome, Firefox, Safari, and Edge compatibility
  • Debugging capabilities: Visual feedback and screenshot capture

Installation and Setup

Installing Selenium

The first step is installing Selenium for Python using pip:

pip install selenium

To avoid downloading and managing browser drivers by hand, also install:

pip install selenium webdriver-manager

The webdriver-manager package automatically downloads and manages browser drivers, eliminating manual setup.

Modern WebDriver Setup

In 2025, a convenient approach uses WebDriver Manager for automatic driver management (recent Selenium releases also bundle Selenium Manager, which can resolve drivers on its own when no explicit Service is supplied):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Configure Chrome options
options = Options()
options.add_argument('--headless')  # Run in background
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Initialize driver with automatic driver management
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

WebDriver Architecture

Selenium uses a client-server architecture:

  1. Selenium Client: Your Python script with Selenium commands
  2. WebDriver: Browser-specific driver that translates commands
  3. Browser: The actual browser (Chrome, Firefox, etc.) that executes actions

This separation allows Selenium to work with different browsers using the same Python code.
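
As a quick illustration of this portability, the same scraping logic can run in Firefox just by swapping the driver setup. Here is a minimal sketch, assuming webdriver-manager is installed for GeckoDriver management:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from webdriver_manager.firefox import GeckoDriverManager

# Only the setup changes; the scraping commands stay the same
firefox_options = FirefoxOptions()
firefox_options.add_argument('--headless')

firefox_service = FirefoxService(GeckoDriverManager().install())
driver = webdriver.Firefox(service=firefox_service, options=firefox_options)

try:
    driver.get("https://example.com")
    print(f"Page title: {driver.title}")
finally:
    driver.quit()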

Basic Browser Interaction

Loading a Web Page

The fundamental operation in web scraping is loading a web page. Here's how to do it with modern Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Setup Chrome options
options = Options()
options.add_argument('--headless')  # Run without GUI
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Initialize driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

try:
    # Load a webpage
    driver.get("https://example.com")

    # Wait for page to load and get title
    print(f"Page title: {driver.title}")

    # Get current URL
    print(f"Current URL: {driver.current_url}")

finally:
    # Always close the driver
    driver.quit()

Essential Browser Configuration

Here are important Chrome options for web scraping:

options = Options()

# Performance optimizations
options.add_argument('--headless')  # No GUI
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')

# Anti-detection measures
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# Set user agent
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

# Window size for consistent rendering
options.add_argument('--window-size=1920,1080')

Element Location and Interaction

Finding elements on a web page is fundamental to web scraping. Selenium provides multiple strategies to locate elements, each suited for different scenarios.

Element Location Strategies

from selenium.webdriver.common.by import By

# Find single element (returns first match)
element = driver.find_element(By.ID, "element-id")

# Find multiple elements (returns list)
elements = driver.find_elements(By.CLASS_NAME, "item")

Here are all the element location strategies:

1. By ID (Most Reliable)

# HTML: <input id="username" type="text">
element = driver.find_element(By.ID, "username")

2. By Name

# HTML: <input name="email" type="email">
element = driver.find_element(By.NAME, "email")

3. By Class Name

# HTML: <div class="product-item">
elements = driver.find_elements(By.CLASS_NAME, "product-item")

4. By CSS Selector (Very Flexible)

# Complex selectors possible
price_elements = driver.find_elements(By.CSS_SELECTOR, ".product .price")
button = driver.find_element(By.CSS_SELECTOR, "button[data-action='submit']")

5. By XPath (Most Powerful)

# Absolute path
element = driver.find_element(By.XPATH, "/html/body/div[1]/form/input[2]")

# Relative path (preferred)
element = driver.find_element(By.XPATH, "//input[@placeholder='Search...']")

# Text-based selection
link = driver.find_element(By.XPATH, "//a[contains(text(), 'Click here')]")

6. By Tag Name

# Get all paragraphs
paragraphs = driver.find_elements(By.TAG_NAME, "p")

7. By Link Text

# Exact text match
link = driver.find_element(By.LINK_TEXT, "Contact Us")

# Partial text match
link = driver.find_element(By.PARTIAL_LINK_TEXT, "Contact")

Best Practices for Element Selection

  1. Prefer IDs when available - They're unique and fast
  2. Use CSS selectors for styling-based selection - More readable than XPath
  3. Use XPath for complex logic - When you need text content or complex relationships
  4. Avoid absolute XPaths - They break when page structure changes
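
To make these guidelines concrete, here is a short contrast between a brittle absolute XPath and more resilient locators, sketched against a hypothetical login form:

# Brittle: breaks as soon as the page layout changes
login_button = driver.find_element(By.XPATH, "/html/body/div[2]/div/form/div[3]/button")

# More resilient: anchored to a stable ID or attribute
login_button = driver.find_element(By.ID, "login-submit")
login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")

# XPath only where text content or structural logic is required
login_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Log in')]")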

Extracting Data from Elements

Once you've located elements, you need to extract useful information:

# Get element text
title = driver.find_element(By.TAG_NAME, "h1").text

# Get attribute values
link_url = driver.find_element(By.TAG_NAME, "a").get_attribute("href")
image_src = driver.find_element(By.TAG_NAME, "img").get_attribute("src")

# Get form input values
input_value = driver.find_element(By.ID, "search").get_attribute("value")

# Check element properties
is_displayed = element.is_displayed()
is_enabled = element.is_enabled()
is_selected = element.is_selected()  # For checkboxes/radio buttons

Waiting for Elements

Modern web applications load content dynamically. Proper waiting is crucial:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for element to be clickable
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit-button")))

# Wait for element to be present
element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "results")))

# Wait for text to be present in element
wait.until(EC.text_to_be_present_in_element((By.ID, "status"), "Complete"))

Interacting with Elements

# Click elements
button = driver.find_element(By.ID, "submit")
button.click()

# Type into input fields
search_box = driver.find_element(By.NAME, "q")
search_box.clear()  # Clear existing text
search_box.send_keys("web scraping")

# Handle dropdowns
from selenium.webdriver.support.ui import Select
dropdown = Select(driver.find_element(By.ID, "country"))
dropdown.select_by_visible_text("United States")
dropdown.select_by_value("us")
dropdown.select_by_index(0)

Advanced Selenium Techniques

JavaScript Execution

Selenium can execute JavaScript directly in the browser, enabling powerful interactions:

# Execute JavaScript and get return value
page_height = driver.execute_script("return document.body.scrollHeight")

# Scroll to bottom of page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Click element using JavaScript (useful when normal click fails)
element = driver.find_element(By.ID, "hidden-button")
driver.execute_script("arguments[0].click();", element)

# Modify element properties
driver.execute_script("arguments[0].style.border='3px solid red'", element)

# Get data from complex JavaScript objects
data = driver.execute_script("""
    return {
        userAgent: navigator.userAgent,
        cookies: document.cookie,
        localStorage: Object.keys(localStorage).reduce((obj, key) => {
            obj[key] = localStorage.getItem(key);
            return obj;
        }, {})
    };
""")

Handling Dynamic Content

Many modern websites load content dynamically. Here's how to handle various scenarios:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

# Wait for AJAX content to load
def wait_for_ajax(driver, timeout=10):
    wait = WebDriverWait(driver, timeout)
    try:
        wait.until(lambda driver: driver.execute_script("return jQuery.active == 0"))
    except:
        pass  # jQuery might not be available

# Wait for specific element to appear
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
    )
except TimeoutException:
    print("Element didn't appear within 10 seconds")

# Infinite scroll handling
def scroll_to_load_all_content(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load
        time.sleep(2)

        # Check if new content loaded
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

Screenshots and Debugging

import time
from datetime import datetime

# Take full page screenshot
driver.save_screenshot(f"screenshot_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png")

# Take element screenshot (Selenium 4+)
element = driver.find_element(By.ID, "content")
element.screenshot("element_screenshot.png")

# Debug helper: save a screenshot when the browser console reports errors
def debug_screenshot(driver, step_name):
    if driver.get_log('browser'):  # Any console errors? (Chrome-only log type)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        driver.save_screenshot(f"debug_{step_name}_{timestamp}.png")

Real-World Examples

Let's apply everything we've learned to practical web scraping scenarios.

Example 1: Form Automation

Here's a complete example of automating a web form:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

def automate_form():
    # Setup Chrome options
    options = Options()
    options.add_argument('--headless')  # Run in background
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    options.add_experimental_option('useAutomationExtension', False)

    # Initialize driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)

    try:
        # Navigate to form page
        driver.get("https://httpbin.org/forms/post")

        # Wait for page to load
        wait = WebDriverWait(driver, 10)

        # Fill out form fields
        email_field = wait.until(EC.presence_of_element_located((By.NAME, "email")))
        email_field.clear()
        email_field.send_keys("test@example.com")

        password_field = driver.find_element(By.NAME, "password")
        password_field.clear()
        password_field.send_keys("securepassword123")

        # Select from dropdown (if present)
        try:
            from selenium.webdriver.support.ui import Select
            dropdown = Select(driver.find_element(By.NAME, "size"))
            dropdown.select_by_visible_text("Medium")
        except:
            pass  # Dropdown might not exist

        # Submit form
        submit_button = driver.find_element(By.CSS_SELECTOR, "input[type='submit']")
        submit_button.click()

        # Wait for response and capture result
        time.sleep(2)
        page_title = driver.title
        print(f"Form submitted successfully. Page title: {page_title}")

        # Take screenshot of result
        driver.save_screenshot("form_result.png")

    except Exception as e:
        print(f"Error during form automation: {e}")
        driver.save_screenshot("error_screenshot.png")

    finally:
        driver.quit()

if __name__ == "__main__":
    automate_form()

Example 2: E-commerce Product Scraping

This example demonstrates scraping product information from an e-commerce website:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import json
import time

def scrape_products(search_term, max_products=10):
    """Scrape product information from a mock e-commerce site"""

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)

    products = []

    try:
        # Navigate to a demo category page (this test site ignores the search term)
        search_url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets"
        driver.get(search_url)

        # Wait for products to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "thumbnail")))

        # Find all product containers
        product_elements = driver.find_elements(By.CLASS_NAME, "thumbnail")

        for i, product in enumerate(product_elements[:max_products]):
            try:
                # Extract product information
                product_data = {}

                # Product name
                name_element = product.find_element(By.CLASS_NAME, "title")
                product_data['name'] = name_element.text.strip()

                # Product price
                price_element = product.find_element(By.CLASS_NAME, "price")
                product_data['price'] = price_element.text.strip()

                # Product description
                try:
                    desc_element = product.find_element(By.CLASS_NAME, "description")
                    product_data['description'] = desc_element.text.strip()
                except:
                    product_data['description'] = "No description available"

                # Product rating (if available)
                try:
                    rating_elements = product.find_elements(By.CSS_SELECTOR, ".ratings .glyphicon-star")
                    product_data['rating'] = len(rating_elements)
                except:
                    product_data['rating'] = "No rating"

                # Product link
                try:
                    link_element = product.find_element(By.TAG_NAME, "a")
                    product_data['url'] = link_element.get_attribute('href')
                except:
                    product_data['url'] = "No URL available"

                products.append(product_data)
                print(f"Scraped product {i+1}: {product_data['name']}")

            except Exception as e:
                print(f"Error scraping product {i+1}: {e}")
                continue

        # Scroll to load more products (if infinite scroll)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

    except Exception as e:
        print(f"Error during scraping: {e}")

    finally:
        driver.quit()

    return products

def save_products_to_json(products, filename="products.json"):
    """Save scraped products to JSON file"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(products, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(products)} products to {filename}")

if __name__ == "__main__":
    # Scrape products
    products = scrape_products("tablets", max_products=5)

    # Display results
    for product in products:
        print(f"\nName: {product['name']}")
        print(f"Price: {product['price']}")
        print(f"Rating: {product['rating']}")
        print(f"Description: {product['description'][:100]}...")

    # Save to file
    save_products_to_json(products)

Example 3: Table Data Extraction

Here's an improved version of table scraping with proper error handling:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

def scrape_table_data(url, table_selector="table"):
    """Scrape table data and convert to pandas DataFrame"""

    options = Options()
    options.add_argument('--headless')

    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(url)

        # Wait for table to load
        wait = WebDriverWait(driver, 10)
        table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, table_selector)))

        # Extract headers
        headers = []
        try:
            header_elements = table.find_elements(By.CSS_SELECTOR, "thead tr th, tr:first-child td")
            headers = [header.text.strip() for header in header_elements]
        except:
            print("No headers found, using generic column names")

        # Extract rows
        rows = []
        row_elements = table.find_elements(By.CSS_SELECTOR, "tbody tr, tr")

        for row_element in row_elements:
            cells = row_element.find_elements(By.TAG_NAME, "td")
            if cells:  # Skip empty rows
                row_data = [cell.text.strip() for cell in cells]
                rows.append(row_data)

        # Create DataFrame (fall back to default column names if headers don't match)
        if headers and rows and len(headers) == len(rows[0]):
            df = pd.DataFrame(rows, columns=headers)
        else:
            df = pd.DataFrame(rows)

        return df

    except Exception as e:
        print(f"Error scraping table: {e}")
        return pd.DataFrame()

    finally:
        driver.quit()

# Example usage
if __name__ == "__main__":
    # Example with a public data table
    url = "https://webscraper.io/test-sites/tables"
    df = scrape_table_data(url)

    if not df.empty:
        print("Scraped table data:")
        print(df.head())

        # Save to CSV
        df.to_csv("scraped_table.csv", index=False)
        print("Data saved to scraped_table.csv")
    else:
        print("No data scraped")

This improved example includes:

  • Modern WebDriver setup with automatic driver management
  • Proper error handling and timeouts
  • Data export to CSV format
  • Flexible table selector options
  • Better code organization with functions

Best Practices and Tips

Performance Optimization

# Use headless mode for faster execution
options.add_argument('--headless')

# Disable images and CSS for faster loading
prefs = {
    "profile.managed_default_content_settings.images": 2,
    "profile.default_content_setting_values.notifications": 2,
    "profile.managed_default_content_settings.stylesheets": 2,
}
options.add_experimental_option("prefs", prefs)

# Set page load timeout
driver.set_page_load_timeout(30)

# Use implicit waits globally
driver.implicitly_wait(10)

Error Handling and Robustness

from selenium.common.exceptions import (
    TimeoutException, NoSuchElementException,
    WebDriverException, ElementClickInterceptedException
)
import time

def robust_element_interaction(driver, locator, action="click", value=None, timeout=10):
    """Robust element interaction with retry logic"""
    wait = WebDriverWait(driver, timeout)

    for attempt in range(3):  # Retry up to 3 times
        try:
            element = wait.until(EC.element_to_be_clickable(locator))

            if action == "click":
                element.click()
            elif action == "text":
                return element.text
            elif action == "send_keys":
                element.clear()
                element.send_keys(value)

            return element

        except (TimeoutException, ElementClickInterceptedException) as e:
            if attempt == 2:  # Last attempt
                raise e

            # Wait before retry
            time.sleep(1)

            # Try scrolling to element
            try:
                element = driver.find_element(*locator)
                driver.execute_script("arguments[0].scrollIntoView();", element)
            except:
                pass

Respecting Websites

import random
import time
import requests

def random_delay(min_seconds=1, max_seconds=3):
    """Add random delays to mimic human behavior"""
    time.sleep(random.uniform(min_seconds, max_seconds))

def check_robots_txt(base_url):
    """Check robots.txt before scraping"""
    try:
        robots_url = f"{base_url}/robots.txt"
        response = requests.get(robots_url)
        if response.status_code == 200:
            print(f"Robots.txt found: {robots_url}")
            print("Please respect robots.txt guidelines")
    except requests.RequestException:
        pass  # Ignore network errors when fetching robots.txt
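
The helper above only reports whether a robots.txt file exists. To actually honor its rules, Python's standard urllib.robotparser can check whether a specific URL may be fetched; the sketch below uses placeholder URLs and a generic user agent:

from urllib.robotparser import RobotFileParser

def is_allowed(base_url, path, user_agent="*"):
    """Return True if robots.txt permits fetching the given path"""
    parser = RobotFileParser()
    parser.set_url(f"{base_url}/robots.txt")
    parser.read()  # Download and parse robots.txt
    return parser.can_fetch(user_agent, f"{base_url}{path}")

# Example usage with a placeholder site
if is_allowed("https://example.com", "/products"):
    driver.get("https://example.com/products")
else:
    print("robots.txt disallows scraping this path")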

Troubleshooting Common Issues

Issue 1: WebDriver Not Found

Solution: Use WebDriver Manager for automatic driver management:

pip install webdriver-manager

Issue 2: Element Not Found

Solution: Use explicit waits and verify selectors:

# Instead of this
element = driver.find_element(By.ID, "button")

# Use this
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "button")))

Issue 3: Stale Element Reference

Solution: Re-locate elements after page changes:

# Refresh element reference after DOM changes
def get_fresh_element(driver, locator):
    return driver.find_element(*locator)
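
When the DOM is rebuilt frequently (for example after AJAX updates), it can also help to wrap interactions in a retry that re-locates the element whenever the reference goes stale. A small sketch, assuming the element reappears after the update:

import time
from selenium.common.exceptions import StaleElementReferenceException

def click_with_stale_retry(driver, locator, attempts=3):
    """Re-locate and click an element, retrying if the reference goes stale"""
    for attempt in range(attempts):
        try:
            driver.find_element(*locator).click()
            return
        except StaleElementReferenceException:
            if attempt == attempts - 1:
                raise
            time.sleep(0.5)  # Give the DOM a moment to settle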

Issue 4: Bot Detection

Solution: Use stealth techniques:

options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
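
Note that the execute_script call above only patches navigator.webdriver after the current page has loaded. On Chromium-based drivers, the same override can be injected before any page script runs via the Chrome DevTools Protocol; this is a sketch and relies on execute_cdp_cmd, which is only available for Chromium browsers:

# Inject the override before every document loads (Chromium drivers only)
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
)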

Conclusion

Selenium is an indispensable tool for modern web scraping, especially when dealing with JavaScript-heavy websites. In 2025, it remains the gold standard for browser automation and dynamic content extraction.

Key takeaways from this guide:

  • Modern setup: Use WebDriver Manager for hassle-free driver management
  • Robust scripting: Implement proper waits, error handling, and retry logic
  • Performance: Optimize browser settings for faster execution
  • Ethical scraping: Respect robots.txt and implement reasonable delays
  • Debugging: Use screenshots and logging for troubleshooting

When to use Selenium:

  • Websites with heavy JavaScript and AJAX content
  • Single Page Applications (SPAs)
  • Sites requiring user interactions (clicking, scrolling, form filling)
  • When you need to capture screenshots or visual elements

When to consider alternatives:

  • Simple static websites (use Requests + BeautifulSoup)
  • High-volume scraping where speed is critical (consider Scrapy)
  • API endpoints are available (always prefer APIs when possible)

The combination of Python and Selenium provides a powerful foundation for web automation and data extraction. As websites become increasingly complex, mastering these tools becomes essential for any data professional or automation engineer.

For the latest Selenium features and best practices, always refer to the official Selenium documentation.
