How to Scrape Data from Websites with Infinite Scroll Using Python

Infinite scroll websites present unique challenges for web scraping because content loads dynamically as users scroll down the page. Unlike traditional pagination, these sites use JavaScript to fetch and append new content without page refreshes. This comprehensive guide covers multiple Python approaches to effectively scrape infinite scroll websites.

Understanding Infinite Scroll Mechanisms

Infinite scroll websites typically use one of these methods to load content; a rough probe for telling them apart follows the list:

  1. Scroll-triggered loading: New content loads when the user scrolls near the bottom
  2. Click-to-load: A "Load More" button triggers additional content
  3. Intersection Observer API: Modern approach that detects when certain elements become visible
  4. AJAX requests: Background HTTP requests fetch new data and update the DOM
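
To figure out which mechanism a given page uses, watch the Network tab in your browser's developer tools while scrolling, or run a rough Selenium probe such as the sketch below (the button text and wait times are assumptions you should adapt to the target site):

from selenium.webdriver.common.by import By
import time

def probe_loading_mechanism(driver, url):
    """Rough heuristic: is the page click-to-load or scroll-triggered?"""
    driver.get(url)
    time.sleep(3)  # crude settle time for the initial render

    # Heuristic 1: look for a button whose text suggests click-to-load
    buttons = driver.find_elements(
        By.XPATH, "//button[contains(., 'Load more') or contains(., 'Load More')]"
    )
    if buttons:
        return "click-to-load"

    # Heuristic 2: scroll to the bottom and see whether the document grows
    before = driver.execute_script("return document.body.scrollHeight")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    after = driver.execute_script("return document.body.scrollHeight")
    return "scroll-triggered" if after > before else "unknown (inspect the Network tab for AJAX requests)"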

Method 1: Using Selenium WebDriver

Selenium is the most reliable approach for infinite scroll scraping because it executes JavaScript and simulates real user behavior.

Basic Selenium Setup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time

# Configure Chrome options for headless browsing
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

driver = webdriver.Chrome(options=chrome_options)

Scroll-Based Loading Strategy

def scrape_infinite_scroll_by_scrolling(url, scroll_count=10):
    driver.get(url)

    # Wait for initial content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "content-item"))
    )

    # Get initial page height
    last_height = driver.execute_script("return document.body.scrollHeight")

    items = []

    for i in range(scroll_count):
        # Scroll to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load
        time.sleep(2)

        # Calculate new scroll height and compare with last height
        new_height = driver.execute_script("return document.body.scrollHeight")

        # Extract data from newly loaded content
        content_elements = driver.find_elements(By.CLASS_NAME, "content-item")

        for element in content_elements[len(items):]:
            item_data = {
                'title': element.find_element(By.CLASS_NAME, "title").text,
                'description': element.find_element(By.CLASS_NAME, "description").text,
                'url': element.find_element(By.TAG_NAME, "a").get_attribute("href")
            }
            items.append(item_data)

        # Break if no new content loaded
        if new_height == last_height:
            print("No more content to load")
            break

        last_height = new_height

    driver.quit()
    return items
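
The class names used above (content-item, title, description) are placeholders; substitute the selectors your target site actually uses. A hypothetical invocation:

items = scrape_infinite_scroll_by_scrolling("https://example.com/feed", scroll_count=5)
print(f"Collected {len(items)} items")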

Advanced Scroll Detection

For more robust infinite scroll detection, use this enhanced approach:

def scrape_with_smart_scroll_detection(url):
    driver.get(url)

    # Wait for initial content
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "content-item"))
    )

    items = []
    scroll_attempts = 0
    max_attempts = 3

    while scroll_attempts < max_attempts:
        # Get current items count
        current_items = len(driver.find_elements(By.CLASS_NAME, "content-item"))

        # Scroll down gradually
        for i in range(3):
            driver.execute_script(f"window.scrollBy(0, {500 * (i + 1)});")
            time.sleep(1)

        # Wait for potential new content
        time.sleep(3)

        # Check if new items loaded
        new_items_count = len(driver.find_elements(By.CLASS_NAME, "content-item"))

        if new_items_count > current_items:
            # New content loaded, reset counter
            scroll_attempts = 0

            # Extract new items
            content_elements = driver.find_elements(By.CLASS_NAME, "content-item")
            for element in content_elements[len(items):]:
                try:
                    item_data = extract_item_data(element)
                    items.append(item_data)
                except Exception as e:
                    print(f"Error extracting item: {e}")
                    continue
        else:
            scroll_attempts += 1
            print(f"No new content loaded. Attempt {scroll_attempts}/{max_attempts}")

    driver.quit()
    return items

def extract_item_data(element):
    """Helper function to extract data from individual items"""
    return {
        'title': element.find_element(By.CLASS_NAME, "title").text,
        'description': element.find_element(By.CLASS_NAME, "description").text,
        'url': element.find_element(By.TAG_NAME, "a").get_attribute("href"),
        'image': element.find_element(By.TAG_NAME, "img").get_attribute("src")
    }

Method 2: Load More Button Automation

Some infinite scroll sites use "Load More" buttons instead of automatic scrolling:

def scrape_load_more_button(url):
    driver.get(url)

    items = []

    while True:
        # Wait for content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "content-item"))
        )

        # Extract current page items
        content_elements = driver.find_elements(By.CLASS_NAME, "content-item")

        for element in content_elements[len(items):]:
            item_data = extract_item_data(element)
            items.append(item_data)

        # Look for Load More button
        try:
            load_more_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "load-more-btn"))
            )

            # Scroll to button and click
            driver.execute_script("arguments[0].scrollIntoView(true);", load_more_button)
            time.sleep(1)
            load_more_button.click()

            # Wait for new content to load
            time.sleep(3)

        except Exception:
            print("No more 'Load More' button found or clickable")
            break

    driver.quit()
    return items

Method 3: Intercepting AJAX Requests

For advanced users, intercepting network requests can be more efficient than DOM manipulation:

import json
import requests

def scrape_via_network_interception(url):
    # Selenium 4 removed the desired_capabilities argument; enable Chrome
    # performance logging through the options object instead
    chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)

    # Wait for the initial content and a few scroll-triggered requests
    time.sleep(5)

    # Parse the performance log for JSON API responses
    logs = driver.get_log('performance')

    api_urls = []
    for entry in logs:
        message = json.loads(entry['message'])['message']
        if message['method'] == 'Network.responseReceived':
            response = message['params']['response']
            if 'api' in response['url'] and 'json' in response.get('mimeType', ''):
                api_urls.append(response['url'])

    driver.quit()

    # Re-fetch the discovered endpoints directly with requests
    all_data = []
    for api_url in api_urls:
        response = requests.get(api_url)
        if response.status_code == 200:
            data = response.json()
            all_data.extend(data.get('items', []))

    return all_data

Method 4: Hybrid Approach with Requests

Sometimes you can identify the AJAX endpoints and scrape them directly without a browser:

import requests
import time

def scrape_infinite_scroll_api(base_url, api_endpoint):
    """
    Scrape infinite scroll by directly calling the API endpoint
    """
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
        'Referer': base_url
    })

    all_items = []
    page = 1

    while True:
        # Construct API URL with pagination
        api_url = f"{api_endpoint}?page={page}&limit=20"

        try:
            response = session.get(api_url)
            response.raise_for_status()

            data = response.json()
            items = data.get('items', [])

            if not items:
                print("No more items available")
                break

            all_items.extend(items)
            print(f"Fetched page {page}: {len(items)} items")

            # Stop if the API reports no further pages
            if not data.get('has_more', False):
                break

            page += 1
            time.sleep(1)  # Rate limiting

        except requests.RequestException as e:
            print(f"Error fetching page {page}: {e}")
            break

    return all_items
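
A hypothetical invocation; the endpoint below stands in for whatever JSON URL you spot in the browser's Network tab while the page is scrolling:

all_items = scrape_infinite_scroll_api(
    base_url="https://example.com/products",
    api_endpoint="https://example.com/api/products"
)
print(f"Collected {len(all_items)} items in total")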

Error Handling and Best Practices

Robust Error Handling

from selenium.common.exceptions import TimeoutException, NoSuchElementException
import logging

def scrape_with_error_handling(url):
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    try:
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(url)

        items = []
        retry_count = 0
        max_retries = 3

        while retry_count < max_retries:
            try:
                content_elements = WebDriverWait(driver, 10).until(
                    EC.presence_of_all_elements_located((By.CLASS_NAME, "content-item"))
                )

                new_item_count = 0
                for element in content_elements[len(items):]:
                    try:
                        items.append(extract_item_data(element))
                        new_item_count += 1
                    except NoSuchElementException as e:
                        logger.warning(f"Element not found: {e}")
                        continue

                # Scroll and give new content time to load
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)

                if new_item_count > 0:
                    retry_count = 0  # Progress was made; reset the retry counter
                else:
                    retry_count += 1  # Nothing new loaded; give up after max_retries attempts
                    logger.info(f"No new items. Attempt {retry_count}/{max_retries}")

            except TimeoutException:
                retry_count += 1
                logger.warning(f"Timeout occurred. Retry {retry_count}/{max_retries}")
                time.sleep(5)

        return items

    except Exception as e:
        logger.error(f"Fatal error: {e}")
        return []

    finally:
        if 'driver' in locals():
            driver.quit()

Rate Limiting and Respectful Scraping

import random

def scrape_with_rate_limiting(url, min_delay=1, max_delay=3):
    """
    Add random delays to appear more human-like
    """
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)

    items = []

    try:
        while True:
            # extract_current_items is a placeholder for your own extraction helper,
            # e.g. grab the .content-item elements past index len(items) and run
            # extract_item_data on each of them
            new_items = extract_current_items(driver, len(items))
            items.extend(new_items)

            if not new_items:
                break

            # Scroll with human-like behavior
            scroll_height = random.randint(300, 800)
            driver.execute_script(f"window.scrollBy(0, {scroll_height});")

            # Random delay between actions
            delay = random.uniform(min_delay, max_delay)
            time.sleep(delay)

    finally:
        driver.quit()

    return items

Performance Optimization Tips

  1. Use headless browsing for faster execution
  2. Implement smart waiting strategies instead of fixed delays (see the sketch after this list)
  3. Extract data incrementally to avoid memory issues
  4. Consider browser automation with Playwright, or Puppeteer via its Python port Pyppeteer, for JavaScript-heavy sites
  5. Monitor network requests to identify direct API endpoints
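
As an example of point 2, a count-based explicit wait returns as soon as new items render instead of always sleeping a fixed delay. This sketch assumes the same content-item class used throughout this guide:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def wait_for_more_items(driver, previous_count, timeout=10):
    """Return True once more .content-item elements exist than previous_count, False on timeout."""
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.CLASS_NAME, "content-item")) > previous_count
        )
        return True
    except TimeoutException:
        return False

Call wait_for_more_items(driver, len(items)) right after each scroll and stop scrolling once it returns False.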

Common Challenges and Solutions

Challenge 1: Content Not Loading

Solution: Increase wait times and implement explicit waits for specific elements.

Challenge 2: Anti-Bot Detection

Solution: Rotate user agents, add random delays, and use residential proxies.
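
As a minimal sketch of the user-agent part, Selenium can pick a different agent (and, optionally, an unauthenticated proxy) through Chrome options; the strings below are illustrative placeholders:

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative user-agent strings; in practice keep a larger, up-to-date pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_stealthier_driver(proxy=None):
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    if proxy:  # e.g. "http://host:port"; authenticated proxies need extra setup
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)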

Challenge 3: Memory Issues with Large Datasets

Solution: Process data in batches and write to files incrementally.
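
As a sketch of incremental writing (the file name and batch size are arbitrary), append each batch to a JSON Lines file and clear the in-memory buffer:

import json

def flush_batch(batch, path="items.jsonl"):
    """Append a batch of scraped items to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        for item in batch:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Inside your scraping loop, buffer items instead of keeping one huge list:
#     batch.append(item_data)
#     if len(batch) >= 100:
#         flush_batch(batch)
#         batch.clear()
# and flush whatever remains after the loop ends.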

Browser Alternatives

While this guide focuses on Python and Selenium, browser automation tools such as Playwright or Puppeteer can be better suited to handling dynamic content in more complex scenarios.

Conclusion

Scraping infinite scroll websites requires understanding the underlying loading mechanism and choosing the appropriate technique. Selenium WebDriver provides the most reliable approach for complex sites, while direct API calls offer better performance when possible. Always implement proper error handling, rate limiting, and respect robots.txt guidelines.

Remember to test your scraping scripts thoroughly, as infinite scroll implementations can vary significantly between websites. Start with small-scale tests and gradually scale up while monitoring for any issues or changes in the site's behavior.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
