Is it possible to stream large files with urllib3?

Yes, urllib3 supports streaming large files efficiently without loading the entire response body into memory. This is essential when handling large downloads, processing data on memory-constrained systems, or building memory-efficient applications.

Basic File Streaming

The key to streaming with urllib3 is using preload_content=False in your request:

import urllib3

# Create a PoolManager instance
http = urllib3.PoolManager()

# Stream a large file
url = "https://example.com/largefile.zip"
response = http.request('GET', url, preload_content=False)

# Download in chunks
chunk_size = 8192  # 8KB chunks
with open('largefile.zip', 'wb') as out:
    while True:
        data = response.read(chunk_size)
        if not data:
            break
        out.write(data)

# Always release the connection
response.release_conn()
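
Alternatively, the same loop can be written with the response's stream() helper, which yields chunks until the body is exhausted:

# Equivalent download loop using HTTPResponse.stream()
response = http.request('GET', url, preload_content=False)
with open('largefile.zip', 'wb') as out:
    for chunk in response.stream(chunk_size):
        out.write(chunk)
response.release_conn()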

Advanced Streaming with Progress Tracking

Here's a more robust example with progress tracking and error handling:

import urllib3
from urllib3.util.retry import Retry

def download_large_file(url, filename, chunk_size=8192):
    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )

    http = urllib3.PoolManager(retries=retry_strategy)
    response = None  # defined up front so the finally block can check it

    try:
        # Get file size for progress tracking
        head_response = http.request('HEAD', url)
        total_size = int(head_response.headers.get('Content-Length', 0))

        # Stream the file
        response = http.request('GET', url, preload_content=False)

        downloaded = 0
        with open(filename, 'wb') as out:
            while True:
                data = response.read(chunk_size)
                if not data:
                    break

                out.write(data)
                downloaded += len(data)

                # Show progress
                if total_size > 0:
                    progress = (downloaded / total_size) * 100
                    print(f"Downloaded: {progress:.1f}%", end='\r')

        print(f"\nDownload completed: {filename}")

    except urllib3.exceptions.HTTPError as e:
        print(f"HTTP error occurred: {e}")
    except Exception as e:
        print(f"Error downloading file: {e}")
    finally:
        # The GET may fail before response is assigned
        if response is not None:
            response.release_conn()

# Usage
download_large_file("https://example.com/largefile.zip", "local_file.zip")

Processing Streaming Data

You can also process streaming data without saving to disk:

import urllib3
import hashlib

def process_stream(url):
    http = urllib3.PoolManager()
    response = http.request('GET', url, preload_content=False)

    # Example: Calculate MD5 hash while streaming
    md5_hash = hashlib.md5()
    total_bytes = 0

    try:
        for chunk in response.stream(1024):
            md5_hash.update(chunk)
            total_bytes += len(chunk)

        print(f"File size: {total_bytes} bytes")
        print(f"MD5 hash: {md5_hash.hexdigest()}")

    finally:
        response.release_conn()
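
# Usage (mirrors the earlier example; the URL is illustrative)
process_stream("https://example.com/largefile.zip")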

Key Parameters and Best Practices

Chunk Size Selection

  • Small files (< 1MB): 1024-4096 bytes
  • Medium files (1-100MB): 8192-65536 bytes
  • Large files (> 100MB): 65536-1048576 bytes
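
As an illustration, these guidelines can be wrapped in a small helper that maps an expected size (for example, from a Content-Length header) to a chunk size. The helper name and thresholds below simply mirror the table above and are not canonical:

# Hypothetical helper: pick a chunk size from the expected file size
def choose_chunk_size(total_size):
    if total_size < 1024 * 1024:           # < 1MB
        return 4096
    elif total_size < 100 * 1024 * 1024:   # 1-100MB
        return 65536
    else:                                  # > 100MB
        return 1048576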

Important Notes

  1. Always use preload_content=False to enable streaming
  2. Always call response.release_conn() to prevent connection leaks
  3. Choose appropriate chunk sizes based on file size and memory constraints
  4. Handle network errors with retry strategies
  5. Use context managers when possible for automatic cleanup (see the sketch after this list)
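
On point 5: urllib3's HTTPResponse subclasses io.IOBase, so it can be used as a context manager that closes the response when the block exits. Here is a minimal sketch (connection-reuse details vary between urllib3 versions, so an explicit release_conn() in a finally block remains the most portable option):

import urllib3

http = urllib3.PoolManager()
url = "https://example.com/largefile.zip"

# The response is closed automatically when the with-block exits,
# even if an exception is raised mid-download
with http.request('GET', url, preload_content=False) as response:
    with open('largefile.zip', 'wb') as out:
        for chunk in response.stream(8192):
            out.write(chunk)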

Memory Benefits

Streaming with urllib3 keeps memory usage roughly constant regardless of file size, making it ideal for:

  • Downloading large datasets
  • Processing log files
  • Handling media files
  • Building file proxy services
  • Working in memory-constrained environments
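
As one example, here is a minimal sketch of the log-file use case: streaming a gzip-compressed log and yielding it line by line, so memory usage is bounded by the chunk size rather than the file size. The URL and function name are illustrative:

import urllib3
import zlib

def stream_gzip_lines(url, chunk_size=65536):
    http = urllib3.PoolManager()
    # decode_content=False keeps the raw bytes so we can
    # decompress incrementally ourselves
    response = http.request('GET', url, preload_content=False,
                            decode_content=False)
    # 16 + MAX_WBITS tells zlib to expect the gzip format
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b''
    try:
        for chunk in response.stream(chunk_size):
            buffer += decompressor.decompress(chunk)
            while b'\n' in buffer:
                line, buffer = buffer.split(b'\n', 1)
                yield line.decode('utf-8', errors='replace')
        if buffer:
            yield buffer.decode('utf-8', errors='replace')
    finally:
        response.release_conn()

# Usage (URL is illustrative)
for line in stream_gzip_lines("https://example.com/access.log.gz"):
    print(line)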

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
