How to Handle Dynamic Content in Scrapy

Dynamic content refers to HTML elements that are loaded or modified by JavaScript after the initial page load. While Scrapy excels at scraping static HTML, it doesn't execute JavaScript by default. However, there are several effective strategies to handle dynamic content in Scrapy.

1. Analyze Network Traffic First

Always start by checking the browser's Network tab before implementing complex solutions. Many dynamic elements load data via AJAX/XHR requests that return JSON or XML.

Steps to identify AJAX endpoints:

  1. Open browser Developer Tools (F12)
  2. Go to Network tab
  3. Filter by XHR/Fetch requests
  4. Reload the page and interact with dynamic content
  5. Look for API endpoints returning structured data

Example: Direct API scraping

import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        # Direct request to discovered API endpoint
        api_url = 'https://example.com/api/data?page=1'
        yield scrapy.Request(
            url=api_url,
            headers={'X-Requested-With': 'XMLHttpRequest'},
            callback=self.parse_api
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        for item in data['results']:
            yield {
                'title': item['title'],
                'price': item['price']
            }
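
If the API is paginated, you can usually walk the pages until it stops returning results. Below is a minimal sketch assuming a hypothetical `page` query parameter and the same `results`/`title`/`price` structure as above; adjust the names to whatever the real endpoint returns.

import scrapy

class PaginatedApiSpider(scrapy.Spider):
    name = 'paginated_api_spider'
    api_url = 'https://example.com/api/data?page={page}'  # hypothetical endpoint

    def start_requests(self):
        yield scrapy.Request(
            self.api_url.format(page=1),
            callback=self.parse_api,
            cb_kwargs={'page': 1}
        )

    def parse_api(self, response, page):
        data = response.json()  # Scrapy 2.2+ decodes JSON responses directly
        results = data.get('results', [])

        for item in results:
            yield {'title': item['title'], 'price': item['price']}

        # Keep requesting pages until the API returns an empty result set
        if results:
            yield scrapy.Request(
                self.api_url.format(page=page + 1),
                callback=self.parse_api,
                cb_kwargs={'page': page + 1}
            )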

2. Using Scrapy with Selenium

When JavaScript execution is necessary, integrate Selenium WebDriver with Scrapy for full browser automation.

Installation

pip install selenium scrapy-selenium

Complete Selenium integration example:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

class SeleniumSpider(scrapy.Spider):
    name = 'selenium_spider'
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # Run in background
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=chrome_options)

    def parse(self, response):
        self.driver.get(response.url)

        # Wait for dynamic content to load
        wait = WebDriverWait(self.driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content')))

        # Extract data after JavaScript execution
        items = self.driver.find_elements(By.CSS_SELECTOR, '.product-item')

        for item in items:
            yield {
                'title': item.find_element(By.CSS_SELECTOR, '.title').text,
                'price': item.find_element(By.CSS_SELECTOR, '.price').text
            }

        # Handle pagination
        try:
            next_button = self.driver.find_element(By.CSS_SELECTOR, '.next-page')
            if next_button.is_enabled():
                next_url = next_button.get_attribute('href')
                yield scrapy.Request(url=next_url, callback=self.parse)
        except NoSuchElementException:
            pass  # No next-page link on the last page

    def closed(self, reason):
        self.driver.quit()
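
If you prefer not to manage the WebDriver yourself, the scrapy-selenium package from the install step ships a downloader middleware that renders each request in a browser and hands the rendered HTML back to your callbacks. A minimal sketch, assuming a chromedriver binary is available on your PATH:

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# spider
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class SeleniumMiddlewareSpider(scrapy.Spider):
    name = 'selenium_middleware_spider'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://example.com',
            callback=self.parse,
            wait_time=10,  # wait up to 10 seconds
            wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
        )

    def parse(self, response):
        # response.text is the HTML after JavaScript execution
        for item in response.css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get()
            }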

3. Using Scrapy with Splash

Splash is a lightweight JavaScript rendering service that integrates seamlessly with Scrapy.

Setup Splash with Docker

docker run -p 8050:8050 scrapinghub/splash

Installation and configuration

pip install scrapy-splash

Add to settings.py:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
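
For pages that only need JavaScript a moment to finish, a plain SplashRequest against the default render.html endpoint is often enough; no Lua required. A basic sketch:

import scrapy
from scrapy_splash import SplashRequest

class SimpleSplashSpider(scrapy.Spider):
    name = 'simple_splash_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            # render.html is the default endpoint; 'wait' gives scripts time to run
            yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        for item in response.css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get()
            }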

Advanced Splash example with Lua scripting:

import scrapy
from scrapy_splash import SplashRequest

class SplashSpider(scrapy.Spider):
    name = 'splash_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        # Lua script for complex interactions
        lua_script = """
        function main(splash, args)
          splash:go(args.url)
          splash:wait(2)

          -- Click load more button if present
          local load_more = splash:select('.load-more')
          if load_more then
            load_more:mouse_click()
            splash:wait(3)
          end

          return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
          }
        end
        """

        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',  # required when sending a Lua script
                args={
                    'lua_source': lua_script,
                    'timeout': 60
                }
            )

    def parse(self, response):
        # Process the rendered HTML
        for item in response.css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get(),
                'image_url': item.css('img::attr(src)').get()
            }
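
Because the Lua script also returns a screenshot and an HAR log, scrapy-splash exposes them through response.data alongside the rendered HTML. As a small illustration, a variant of the parse method above could save the (base64-encoded) screenshot to disk; add `import base64` at the top of the module:

    def parse(self, response):
        # response.data holds the table returned by the Lua script
        with open('page.png', 'wb') as f:
            f.write(base64.b64decode(response.data['png']))

        for item in response.css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get(),
                'image_url': item.css('img::attr(src)').get()
            }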

4. Using Scrapy with Playwright (Modern Alternative)

Playwright offers better performance and reliability than Selenium for modern web applications.

Installation

pip install scrapy-playwright
playwright install

Playwright integration example:

import scrapy
from scrapy_playwright.page import PageMethod

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright_spider'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            meta={
                'playwright': True,
                'playwright_page_methods': [
                    # Wait for the dynamic content before returning the page to Scrapy
                    PageMethod('wait_for_selector', '.dynamic-content', timeout=10000),
                ]
            }
        )

    def parse(self, response):
        for item in response.css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get()
            }
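
For infinite-scroll pages you can ask scrapy-playwright to hand you the live page object (playwright_include_page) and drive it from an async callback. The sketch below, assuming a hypothetical /feed URL, scrolls a few times before parsing; remember to close the page when you are done:

import scrapy

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite_scroll_spider'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com/feed',  # hypothetical infinite-scroll page
            meta={
                'playwright': True,
                'playwright_include_page': True,  # expose the live Playwright page
            }
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        try:
            # Scroll a few times to trigger lazy-loaded items
            for _ in range(3):
                await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
                await page.wait_for_timeout(1000)
            html = await page.content()
        finally:
            await page.close()

        for item in scrapy.Selector(text=html).css('.product-item'):
            yield {
                'title': item.css('.title::text').get(),
                'price': item.css('.price::text').get()
            }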

Performance Considerations

| Method | Speed | Resource Usage | JavaScript Support | Complexity |
|--------|-------|----------------|--------------------|------------|
| AJAX Analysis | Fast | Low | N/A | Low |
| Splash | Medium | Medium | Full | Medium |
| Selenium | Slow | High | Full | High |
| Playwright | Fast | Medium | Full | Medium |

Best Practices

  1. Always try AJAX first - It's the most efficient approach
  2. Use headless browsers to reduce resource consumption
  3. Implement proper waits - Don't rely on fixed delays
  4. Handle browser cleanup - Always quit drivers in spider close methods
  5. Monitor memory usage - Browser instances can consume significant resources
  6. Use connection pooling for Splash to improve performance

Troubleshooting Common Issues

Selenium WebDriver not found

Install webdriver-manager so a matching ChromeDriver is downloaded automatically:

pip install webdriver-manager

Then build the driver from the managed executable:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

Splash connection errors

# Check if Splash is running
curl http://localhost:8050/render.html?url=http://example.com

Element not found errors

from selenium.common.exceptions import TimeoutException

try:
    element = wait.until(EC.presence_of_element_located((By.ID, 'content')))
except TimeoutException:
    self.logger.warning("Dynamic content did not load within timeout period")

Dynamic content handling adds complexity and resource overhead to your scraping pipeline. Always evaluate whether the additional data justifies the performance cost, and consider using specialized tools like browser automation APIs when Scrapy integration becomes too complex.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
