How do I scrape data from AJAX requests using Python?

AJAX (Asynchronous JavaScript and XML) requests are commonly used by modern websites to load content dynamically without refreshing the entire page. This presents unique challenges for web scraping, as traditional HTTP requests won't capture data that's loaded asynchronously after the initial page load. This guide covers multiple approaches to scrape AJAX data using Python.

Understanding AJAX Requests

AJAX requests are HTTP calls made by JavaScript code running in the browser. Unlike static HTML content, AJAX data is loaded after the initial page request, often triggered by user interactions or timers. When scraping AJAX content, you need to either:

  1. Intercept the actual AJAX requests and call them directly
  2. Use a browser automation tool to wait for content to load
  3. Execute JavaScript to trigger AJAX calls
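
To see why a plain HTTP request falls short, compare the raw HTML with what the browser eventually renders. A minimal sketch (the URL and CSS class are placeholders):

import requests

# Fetch the page exactly as a non-JavaScript client sees it
html = requests.get('https://example.com/products').text

# AJAX-loaded data is absent from the raw markup; usually only an
# empty container or a loading placeholder is present
print('product-card' in html)  # typically False before JavaScript runs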

Method 1: Using Selenium WebDriver

Selenium is the most straightforward approach for scraping AJAX content, as it provides a real browser environment where JavaScript can execute normally.

Basic Selenium Setup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run in background
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize driver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Navigate to the page
    driver.get('https://example.com')

    # Wait for AJAX content to load
    wait = WebDriverWait(driver, 10)

    # Wait for specific element that indicates AJAX loading is complete
    ajax_content = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "ajax-loaded-content"))
    )

    # Extract the data
    data = ajax_content.text
    print(f"AJAX Content: {data}")

finally:
    driver.quit()

Waiting for Dynamic Content

from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def wait_for_ajax_content(driver, timeout=10):
    """Wait for AJAX content using multiple strategies"""
    wait = WebDriverWait(driver, timeout)

    # Strategy 1: Wait for specific element
    try:
        element = wait.until(
            EC.presence_of_element_located((By.ID, "dynamic-content"))
        )
        return True
    except TimeoutException:
        pass

    # Strategy 2: Wait for jQuery to finish (if site uses jQuery)
    try:
        wait.until(lambda driver: driver.execute_script("return jQuery.active == 0"))
        return True
    except Exception:  # jQuery may be absent, or the wait timed out
        pass

    # Strategy 3: Wait for custom loading indicator to disappear
    try:
        wait.until(
            EC.invisibility_of_element_located((By.CLASS_NAME, "loading-spinner"))
        )
        return True
    except TimeoutException:
        pass

    return False

# Usage
driver.get('https://example.com')
if wait_for_ajax_content(driver):
    # Extract data after AJAX loads
    content = driver.find_elements(By.CLASS_NAME, "ajax-item")
    for item in content:
        print(item.text)

Method 2: Intercepting Network Requests

For better performance and more precise control, you can intercept the actual AJAX requests and call them directly using the requests library.

Finding AJAX Endpoints

First, identify the AJAX endpoints, either manually in your browser's developer tools (Network tab) or programmatically by reading Chrome's performance log with Selenium:

import json
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def find_ajax_requests(base_url):
    """
    Use Selenium to monitor network requests and identify AJAX calls
    """
    # Enable Chrome's performance log (Selenium 4 dropped the
    # desired_capabilities argument in favor of Options.set_capability)
    options = Options()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=options)
    driver.get(base_url)

    # Wait for page to load
    time.sleep(5)

    # Get network logs
    logs = driver.get_log('performance')
    ajax_urls = []

    for log in logs:
        message = json.loads(log['message'])
        if message['message']['method'] == 'Network.responseReceived':
            url = message['message']['params']['response']['url']
            content_type = message['message']['params']['response'].get('mimeType', '')

            # Look for JSON responses (common for AJAX)
            if 'json' in content_type or '/api/' in url:
                ajax_urls.append(url)

    driver.quit()
    return ajax_urls
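
A quick usage sketch for the function above (the URL is a placeholder); inspect the printed candidates manually before scripting against them:

endpoints = find_ajax_requests('https://example.com')
for url in endpoints:
    print(f"Candidate AJAX endpoint: {url}")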

Direct AJAX Request Scraping

import requests
import json

class AjaxScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'X-Requested-With': 'XMLHttpRequest',  # Important for AJAX requests
        })

    def get_page_tokens(self, url):
        """Get initial page to extract tokens/cookies needed for AJAX"""
        response = self.session.get(url)
        # Extract CSRF tokens, session cookies, etc.
        return response

    def scrape_ajax_data(self, ajax_url, params=None, headers=None):
        """Make direct AJAX request"""
        if headers:
            self.session.headers.update(headers)

        try:
            response = self.session.get(ajax_url, params=params)
            response.raise_for_status()

            # Parse JSON response
            if 'application/json' in response.headers.get('content-type', ''):
                return response.json()
            else:
                return response.text

        except requests.RequestException as e:
            print(f"Error making AJAX request: {e}")
            return None

# Usage
scraper = AjaxScraper()

# First, visit the main page to get necessary cookies/tokens
scraper.get_page_tokens('https://example.com')

# Then make AJAX requests
ajax_data = scraper.scrape_ajax_data(
    'https://example.com/api/data',
    params={'page': 1, 'limit': 50}
)

if isinstance(ajax_data, dict):  # scrape_ajax_data may also return raw text
    for item in ajax_data.get('results', []):
        print(f"Item: {item}")

Method 3: Using Requests-HTML

The requests-html library combines the simplicity of requests with JavaScript execution capabilities (note that it downloads its own Chromium build the first time render() is called):

from requests_html import HTMLSession

def scrape_with_requests_html():
    session = HTMLSession()

    # Get the page
    r = session.get('https://example.com')

    # Render JavaScript (this will execute AJAX calls)
    r.html.render(wait=2, timeout=20)

    # Now extract data from the rendered content
    ajax_content = r.html.find('.ajax-loaded-content')

    for element in ajax_content:
        print(element.text)

    return ajax_content

# Usage
scrape_with_requests_html()
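
render() also accepts scrolldown and sleep parameters, which help inside the function above when the page loads more content as you scroll:

# Scroll the page 5 times, pausing 1 second between scrolls,
# so scroll-triggered AJAX requests have time to complete
r.html.render(scrolldown=5, sleep=1)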

Method 4: Advanced Selenium Techniques

Monitoring Network Traffic

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import json
import time

def monitor_ajax_requests():
    # Enable performance logging (Selenium 4 style)
    options = Options()
    options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

    driver = webdriver.Chrome(options=options)
    driver.get('https://example.com')

    # Trigger AJAX (e.g., click a button)
    trigger_button = driver.find_element(By.ID, "load-more")
    trigger_button.click()

    # Wait for AJAX to complete
    time.sleep(3)

    # Analyze network logs to find AJAX responses
    logs = driver.get_log('performance')

    for log in logs:
        message = json.loads(log['message'])

        if (message['message']['method'] == 'Network.responseReceived' and
            'json' in message['message']['params']['response'].get('mimeType', '')):

            # Get the request ID
            request_id = message['message']['params']['requestId']

            # Get response body
            try:
                response = driver.execute_cdp_cmd(
                    'Network.getResponseBody',
                    {'requestId': request_id}
                )

                ajax_data = json.loads(response['body'])
                print(f"AJAX Response: {ajax_data}")

            except Exception as e:
                print(f"Could not get response body: {e}")

    driver.quit()

Executing Custom JavaScript

def execute_ajax_with_javascript(driver):
    """Execute AJAX calls directly through JavaScript"""

    # execute_async_script injects a callback as the script's final
    # argument; the script must invoke it to hand the result back to
    # Python (simply returning a Promise is not awaited by Selenium)
    ajax_script = """
    const callback = arguments[arguments.length - 1];
    fetch('/api/data', {
        method: 'GET',
        headers: {
            'Content-Type': 'application/json',
            'X-Requested-With': 'XMLHttpRequest'
        }
    })
    .then(response => response.json())
    .then(data => callback(data))
    .catch(error => callback({error: error.toString()}));
    """

    # Execute the script and get result
    result = driver.execute_async_script(ajax_script)
    return result
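
A usage sketch for the helper above; execute_async_script is bounded by the driver's script timeout, so raise it if the endpoint is slow:

driver = webdriver.Chrome()
driver.get('https://example.com')

# Give the fetch() call up to 20 seconds to finish
driver.set_script_timeout(20)

result = execute_ajax_with_javascript(driver)
print(result)

driver.quit()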

Handling Common AJAX Patterns

Pagination with AJAX

from selenium.common.exceptions import NoSuchElementException

def scrape_ajax_pagination():
    driver = webdriver.Chrome()
    driver.get('https://example.com')

    all_data = []
    page = 1

    while True:
        # Wait for current page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "content-item"))
        )

        # Extract data from current page
        items = driver.find_elements(By.CLASS_NAME, "content-item")
        page_data = [item.text for item in items]
        all_data.extend(page_data)

        # Try to find and click next page
        try:
            next_button = driver.find_element(By.CLASS_NAME, "next-page")
            if not next_button.is_enabled():
                break

            next_button.click()
            page += 1

            # Wait for new content to load
            time.sleep(2)

        except NoSuchElementException:
            print("No more pages")
            break

    driver.quit()
    return all_data
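
If you have already identified the underlying endpoint (Method 2), paginating the API directly avoids the browser entirely and is much faster. A sketch reusing the AjaxScraper class, assuming a hypothetical JSON endpoint that accepts page/limit parameters and returns a results array:

def scrape_api_pagination(scraper, api_url, limit=50):
    """Walk a paginated JSON API until it returns an empty page"""
    all_items = []
    page = 1

    while True:
        data = scraper.scrape_ajax_data(api_url, params={'page': page, 'limit': limit})
        if not isinstance(data, dict) or not data.get('results'):
            break

        all_items.extend(data['results'])
        page += 1

    return all_items

# Usage (endpoint is a placeholder)
items = scrape_api_pagination(AjaxScraper(), 'https://example.com/api/data')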

Infinite Scroll

def scrape_infinite_scroll():
    driver = webdriver.Chrome()
    driver.get('https://example.com')

    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load
        time.sleep(2)

        # Check if new content loaded
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height

    # Extract all loaded content
    items = driver.find_elements(By.CLASS_NAME, "scroll-item")
    data = [item.text for item in items]

    driver.quit()
    return data
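
The fixed time.sleep(2) above trades speed for reliability. A more robust variant (a sketch, reusing the placeholder scroll-item class) waits explicitly until the item count grows and treats a timeout as the end of the feed:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def wait_for_more_items(driver, previous_count, timeout=10):
    """Wait until more .scroll-item elements exist than before, or give up"""
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.CLASS_NAME, "scroll-item")) > previous_count
        )
        return True
    except TimeoutException:
        # No new items appeared: assume we reached the end of the feed
        return False

In the scroll loop, call wait_for_more_items(driver, len(items)) after each scroll and break when it returns False.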

Best Practices and Tips

Performance Optimization

# Use headless browsing for better performance
chrome_options = Options()
chrome_options.add_argument('--headless')
# Skip image loading to speed up page loads
chrome_options.add_argument('--blink-settings=imagesEnabled=false')

# Implement connection pooling for direct AJAX requests
session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20
)
session.mount('http://', adapter)
session.mount('https://', adapter)
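
For flaky endpoints, you can also attach urllib3's Retry policy to the adapter so transient server errors and rate limits are retried automatically:

from urllib3.util.retry import Retry

retry = Retry(
    total=3,
    backoff_factor=1,  # exponential backoff between retries
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = requests.adapters.HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)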

Error Handling

from selenium.common.exceptions import TimeoutException, NoSuchElementException

def robust_ajax_scraping(url, max_retries=3):
    for attempt in range(max_retries):
        driver = None
        try:
            driver = webdriver.Chrome()
            driver.set_page_load_timeout(30)
            driver.get(url)

            # Wait for AJAX content
            wait = WebDriverWait(driver, 15)
            wait.until(
                EC.presence_of_element_located((By.CLASS_NAME, "ajax-content"))
            )

            # Extract data
            data = driver.find_elements(By.CLASS_NAME, "data-item")
            results = [item.text for item in data]

            driver.quit()
            return results

        except (TimeoutException, NoSuchElementException) as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if driver:
                driver.quit()

            if attempt == max_retries - 1:
                raise

            time.sleep(2 ** attempt)  # Exponential backoff

When to Use Each Method

Use Selenium when:

  • The AJAX requests are complex or require user interaction
  • You need to handle JavaScript-heavy single-page applications
  • The site uses complex authentication or session management

Use direct requests when:

  • You can identify the AJAX endpoints easily
  • Performance is critical or you're scraping large amounts of data
  • The AJAX requests follow predictable patterns

Use requests-html when:

  • You need a middle ground between requests and Selenium
  • The JavaScript execution requirements are moderate
  • You want simpler code than a full Selenium setup

For more advanced browser automation techniques, you might also consider how to handle AJAX requests using Puppeteer if you're working in a Node.js environment, or explore how to monitor network requests in Puppeteer for detailed network traffic analysis.

By combining these techniques and choosing the right approach for your specific use case, you can effectively scrape data from even the most complex AJAX-powered websites.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
