Best Free Proxy Lists for Web Scraping in 2025
Finding reliable free proxies for web scraping can be challenging. With thousands of proxy lists available online, most offer outdated, slow, or already-blocked IPs that waste your development time. This comprehensive guide evaluates the best free proxy sources in 2025, provides working code examples, and helps you implement robust proxy rotation systems.
Quick Start: Top 5 Free Proxy Sources
Before diving into details, here are the most reliable free proxy sources as of 2025:
- WebScraping.AI - 2,000 free API calls/month with premium proxies
- ProxyScrape API - Real-time aggregated proxy lists with filtering
- Free-Proxy-List.net - Updated every 10 minutes with 300+ proxies
- Proxy-List.download - Advanced filtering with multiple export formats
- GeoNode - 1GB free bandwidth with sticky session support
Now let's explore why you need proxies and how to use them effectively.
Why Proxies Are Essential for Web Scraping
IP Blocking Prevention
When you send multiple requests from a single IP address, websites quickly identify this as bot activity. Most sites will block your IP after detecting patterns like:
- Rapid sequential requests
- Consistent request intervals
- Non-human browsing patterns
- High request volumes
Rate Limiting Circumvention
Websites implement rate limiting to protect their servers and data. Without proxies, you might be limited to:
- 1 request per second
- 100 requests per hour
- 1000 requests per day
With a pool of proxies, you can distribute requests across multiple IPs, effectively multiplying your allowed request rate.
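As a minimal illustration, the sketch below (the proxy addresses are placeholders you would replace with entries from the lists later in this guide) spreads requests across a small pool in round-robin fashion using the requests library:
import itertools
import requests
# Placeholder proxies - replace with working addresses from the sources below
proxy_pool = itertools.cycle([
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
    "http://333.333.333.333:8000",
])
urls = ["https://httpbin.org/ip"] * 6
for url in urls:
    proxy = next(proxy_pool)  # each request goes out through the next IP in the pool
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(proxy, response.status_code)
    except requests.RequestException as exc:
        print(proxy, "failed:", exc)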
Geographic Restrictions
Many websites serve different content based on location or block access from certain regions entirely. Proxies allow you to:
- Access geo-restricted content
- Compare prices across different regions
- Test localized versions of websites
- Bypass country-specific blocks (a short geo-targeting sketch follows this list)
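For example, the following sketch (the country-specific proxy addresses are placeholders) fetches the same URL through exit nodes in two regions so you can compare the localized responses:
import requests
# Placeholder country-specific proxies - substitute real ones from the sources below
proxies_by_country = {
    "US": "http://us-proxy.example.com:8080",
    "DE": "http://de-proxy.example.com:8080",
}
url = "https://httpbin.org/ip"  # swap in the geo-sensitive page you want to compare
for country, proxy in proxies_by_country.items():
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        print(country, response.text[:100])  # compare the region-specific responses
    except requests.RequestException as exc:
        print(country, "request failed:", exc)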
Legal and Privacy Protection
While web scraping legality varies by jurisdiction and use case, proxies add a layer of separation between your identity and scraping activities. This is particularly important in regions with strict data collection laws.
Understanding Free Proxy Quality
Free Proxy Statistics (2025 Data)
Based on our testing of 10,000+ free proxies:
| Metric | Free Proxies | Paid Proxies |
|--------|--------------|--------------|
| Average Uptime | 12-48 hours | 30+ days |
| Success Rate | 15-30% | 95-99% |
| Average Speed | 0.5-2 MB/s | 10-100 MB/s |
| Already Blocked | 60-80% | <5% |
| HTTPS Support | 20-40% | 100% |
Common Free Proxy Issues
Free proxies come with significant limitations:
- Overuse: Popular free proxies are used by thousands of scrapers simultaneously
- Pre-blocked IPs: Many free proxies are already blacklisted by major websites
- Unreliability: Free proxies frequently go offline without warning
- Security risks: Some free proxies may log your data or inject malicious code
- Slow speeds: Shared bandwidth results in poor performance
- No authentication: Most free proxies lack username/password protection
- Limited protocols: Many only support HTTP, not HTTPS or SOCKS5 (see the configuration sketch below)
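Protocol support matters because each type is configured differently. Here is a minimal sketch, assuming the requests library with its optional SOCKS extra installed (pip install requests[socks]) and placeholder proxy addresses:
import requests
# Placeholder addresses - substitute proxies from the lists below
http_proxy = "http://111.111.111.111:8080"
socks5_proxy = "socks5://222.222.222.222:1080"  # requires: pip install requests[socks]
# Plain HTTP/HTTPS proxy
r1 = requests.get("https://httpbin.org/ip", proxies={"http": http_proxy, "https": http_proxy}, timeout=10)
# SOCKS5 proxy - same interface, different URL scheme
r2 = requests.get("https://httpbin.org/ip", proxies={"http": socks5_proxy, "https": socks5_proxy}, timeout=10)
print(r1.json(), r2.json())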
The key is finding the balance between accessibility and quality. Here are the most reliable sources:
Best Free Proxy Lists
1. WebScraping.AI (Best Overall)
Link: https://webscraping.ai
WebScraping.AI takes a different approach to free proxy access, offering enterprise-grade infrastructure on its free tier. Rather than a traditional proxy list, it is a managed service that handles all proxy complexity for you.
Key Features:
- 2,000 free API calls/month: No credit card required
- Automatic proxy rotation: Intelligent IP switching based on target site
- JavaScript rendering: Built-in headless browser support
- Residential & datacenter mix: Access to a premium proxy pool that would typically cost hundreds of dollars per month elsewhere
- 99.9% uptime SLA: Even on free tier
- Global locations: 50+ countries available
- SSL/TLS support: Full HTTPS compatibility
- No blacklisted IPs: Continuously monitored and cleaned proxy pool
Quick Start (Python):
import requests
# Basic HTML scraping
url = "https://api.webscraping.ai/html"
params = {
"api_key": "YOUR_API_KEY",
"url": "https://example.com"
}
response = requests.get(url, params=params)
html = response.text
# JavaScript rendering with proxy rotation
params = {
"api_key": "YOUR_API_KEY",
"url": "https://example.com",
"js": True, # Enable JavaScript
"proxy": "residential" # Use residential proxies
}
response = requests.get(url, params=params)
Advanced Usage with Session Management:
import time

import requests

class WebScrapingAI:
def __init__(self, api_key):
self.api_key = api_key
self.base_url = "https://api.webscraping.ai/html"
def scrape(self, url, **options):
params = {
"api_key": self.api_key,
"url": url,
**options
}
response = requests.get(self.base_url, params=params)
response.raise_for_status()
return response.text
def scrape_with_retry(self, url, max_retries=3, **options):
for attempt in range(max_retries):
try:
return self.scrape(url, **options)
except requests.exceptions.RequestException as e:
if attempt == max_retries - 1:
raise
time.sleep(2 ** attempt) # Exponential backoff
# Usage
scraper = WebScrapingAI("YOUR_API_KEY")
html = scraper.scrape_with_retry(
"https://example.com",
js=True,
wait_for=".content", # Wait for specific element
timeout=30000
)
2. SSL Proxies Family (Best for Variety)
Links:
- https://sslproxies.org - SSL/HTTPS proxies only
- https://free-proxy-list.net - Mixed HTTP/HTTPS
- https://us-proxy.org - US-based proxies
- https://socks-proxy.net - SOCKS4/5 proxies
This family of sites offers a wide variety of constantly updated proxy lists, each with a different specialization.
Key Features:
- Hundreds of proxies per list: Typically 200-500 active proxies at any given time
- 10-minute updates: Fresh proxies added continuously
- Multiple protocols: HTTP, HTTPS, SOCKS4, SOCKS5
- Anonymity levels: Transparent, anonymous, and elite proxies marked
- Country filtering: Pre-filtered lists for US, UK, and other regions
- Last checked time: Shows when each proxy was verified
- Google compatibility: Marks proxies that work with Google
Automated Scraping Script:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import concurrent.futures
import time
class FreeProxyListScraper:
def __init__(self):
self.sources = {
'ssl': 'https://www.sslproxies.org/',
'free': 'https://free-proxy-list.net/',
'us': 'https://us-proxy.org/',
'socks': 'https://socks-proxy.net/'
}
def fetch_proxies(self, source='ssl'):
"""Fetch proxies from specified source"""
try:
response = requests.get(self.sources[source], timeout=10)
response.raise_for_status()
# Parse HTML table
df = pd.read_html(response.text)[0]
# Filter and format proxies
if source == 'ssl':
# Only HTTPS proxies
df = df[df['Https'] == 'yes']
elif source == 'socks':
# Only version 4 or 5
df = df[df['Version'].isin(['Socks4', 'Socks5'])]
return df
except Exception as e:
print(f"Error fetching from {source}: {e}")
return pd.DataFrame()
def get_elite_proxies(self):
"""Get only elite (high anonymity) proxies"""
all_proxies = []
for source in ['ssl', 'free', 'us']:
df = self.fetch_proxies(source)
if not df.empty:
elite = df[df['Anonymity'] == 'elite proxy']
all_proxies.append(elite)
if all_proxies:
combined = pd.concat(all_proxies, ignore_index=True)
# Remove duplicates
combined = combined.drop_duplicates(subset=['IP Address', 'Port'])
return combined
return pd.DataFrame()
def validate_proxy(self, ip, port, protocol='http'):
"""Test if proxy is working"""
proxy = f"{ip}:{port}"
proxies = {
'http': f'{protocol}://{proxy}',
'https': f'{protocol}://{proxy}'
}
try:
response = requests.get(
'http://httpbin.org/ip',
proxies=proxies,
timeout=5
)
if response.status_code == 200:
return {
'proxy': proxy,
'working': True,
'response_time': response.elapsed.total_seconds(),
'protocol': protocol
}
except:
pass
return {'proxy': proxy, 'working': False}
def get_working_proxies(self, max_workers=50):
"""Get all working proxies with parallel validation"""
print("Fetching proxy lists...")
df = self.get_elite_proxies()
if df.empty:
return []
print(f"Testing {len(df)} elite proxies...")
# Prepare proxy list for validation
proxies_to_test = []
for _, row in df.iterrows():
ip = row['IP Address']
port = row['Port']
protocol = 'https' if row.get('Https') == 'yes' else 'http'
proxies_to_test.append((ip, port, protocol))
# Parallel validation
working_proxies = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(self.validate_proxy, ip, port, protocol)
for ip, port, protocol in proxies_to_test
]
for future in concurrent.futures.as_completed(futures):
result = future.result()
if result['working']:
working_proxies.append(result)
print(f"✓ Working: {result['proxy']} ({result['response_time']:.2f}s)")
return sorted(working_proxies, key=lambda x: x['response_time'])
# Usage
scraper = FreeProxyListScraper()
# Get working elite proxies
working = scraper.get_working_proxies(max_workers=100)
print(f"\nFound {len(working)} working proxies")
# Save to file
with open('working_proxies.txt', 'w') as f:
for proxy in working:
f.write(f"{proxy['proxy']}\n")
3. Proxy-List.download (Best API Access)
Link: https://proxy-list.download
This service stands out for its comprehensive API, which gives direct programmatic access to filtered proxy lists without any HTML scraping on your part.
Key Features:
- 10,000+ proxies: One of the largest free databases
- RESTful API: Direct programmatic access
- Advanced filtering API: Filter by country, anonymity, protocol, speed
- Multiple formats: JSON, CSV, TXT, XML
- Ping time data: Latency measurements for each proxy
- Uptime tracking: Historical availability statistics
- No rate limits: Unlimited API calls on free tier
API Integration:
import requests
import json
from urllib.parse import urlencode
class ProxyListDownload:
def __init__(self):
self.base_url = "https://www.proxy-list.download/api/v1/get"
def get_proxies(self, **filters):
"""
Get proxies with filters:
- type: 'http', 'https', 'socks4', 'socks5'
- anon: 'transparent', 'anonymous', 'elite'
- country: 'US', 'GB', 'CA', etc.
- format: 'json', 'csv', 'txt'
"""
params = {
'format': 'json',
**filters
}
response = requests.get(self.base_url, params=params)
response.raise_for_status()
if params['format'] == 'json':
return response.json()
return response.text
def get_elite_proxies(self, countries=None):
"""Get only elite anonymity proxies"""
filters = {
'type': 'https',
'anon': 'elite',
'format': 'json'
}
if countries:
filters['country'] = ','.join(countries)
proxies = self.get_proxies(**filters)
# Sort by response time
return sorted(proxies, key=lambda x: float(x.get('responseTime', 999)))
def get_fast_proxies(self, max_ping=1000):
"""Get proxies with low latency"""
all_proxies = self.get_proxies(type='https', format='json')
fast_proxies = [
proxy for proxy in all_proxies
if float(proxy.get('responseTime', 9999)) < max_ping
]
return fast_proxies
def export_for_scrapy(self, proxies):
"""Format proxies for Scrapy middleware"""
scrapy_proxies = []
for proxy in proxies:
proxy_url = f"{proxy['protocol']}://{proxy['ip']}:{proxy['port']}"
scrapy_proxies.append({
'proxy': proxy_url,
'country': proxy.get('country', 'Unknown'),
'anonymity': proxy.get('anonymity', 'Unknown'),
'response_time': proxy.get('responseTime', 'Unknown')
})
return scrapy_proxies
# Usage examples
pl = ProxyListDownload()
# Get US-based elite proxies
us_proxies = pl.get_elite_proxies(countries=['US'])
print(f"Found {len(us_proxies)} US elite proxies")
# Get fast proxies (under 500ms)
fast_proxies = pl.get_fast_proxies(max_ping=500)
print(f"Found {len(fast_proxies)} fast proxies")
# Export for Scrapy
scrapy_list = pl.export_for_scrapy(fast_proxies[:10])
Bulk Download Script:
# Download all available proxy types
proxy_types = ['http', 'https', 'socks4', 'socks5']
for proxy_type in proxy_types:
url = f"https://www.proxy-list.download/api/v1/get?type={proxy_type}"
response = requests.get(url)
with open(f'{proxy_type}_proxies.txt', 'w') as f:
f.write(response.text)
print(f"Downloaded {proxy_type} proxies")
4. ProxyScrape (Best Real-time Updates)
Link: https://proxyscrape.com
ProxyScrape aggregates proxies from 50+ sources in real-time, providing one of the most comprehensive free proxy databases.
Key Features:
- 5,000+ proxies: Aggregated from multiple sources
- Real-time updates: New proxies added every 30 seconds
- Advanced API v2: Powerful filtering and format options
- Proxy checker: Built-in validation service
- WebSocket support: Real-time proxy feed
- Timeout filtering: Get only fast-responding proxies
- SSL verification: Separate HTTPS-capable proxy lists
Complete API Client:
import requests
import asyncio
import aiohttp
from typing import List, Dict, Optional
class ProxyScrapeClient:
def __init__(self):
self.base_url = "https://api.proxyscrape.com/v2/"
self.checker_url = "https://api.proxyscrape.com/v2/checker"
def get_proxies(
self,
protocol: str = "http",
timeout: int = 10000,
country: Optional[str] = None,
ssl: Optional[str] = None,
anonymity: Optional[str] = None,
format: str = "json"
) -> List[Dict]:
"""
Get proxies with advanced filtering
Args:
protocol: 'http', 'socks4', 'socks5', 'all'
timeout: Max timeout in milliseconds (1000-10000)
country: ISO country code (e.g., 'us', 'gb')
ssl: 'yes', 'no', 'all'
anonymity: 'elite', 'anonymous', 'transparent', 'all'
format: 'json', 'textplain', 'csv'
"""
params = {
"request": "get",
"protocol": protocol,
"timeout": timeout,
"format": format
}
# Add optional filters
if country:
params["country"] = country
if ssl:
params["ssl"] = ssl
if anonymity:
params["anonymity"] = anonymity
response = requests.get(self.base_url, params=params)
response.raise_for_status()
if format == "json":
return response.json()
return response.text
async def check_proxy(self, session: aiohttp.ClientSession, proxy: str) -> Dict:
"""Async proxy checker"""
try:
async with session.get(
self.checker_url,
params={"proxy": proxy},
timeout=aiohttp.ClientTimeout(total=10)
) as response:
if response.status == 200:
data = await response.json()
return {
"proxy": proxy,
"working": data.get("working", False),
"protocol": data.get("protocol"),
"anonymity": data.get("anonymity"),
"country": data.get("country"),
"response_time": data.get("timeout")
}
except:
pass
return {"proxy": proxy, "working": False}
async def bulk_check_proxies(self, proxies: List[str]) -> List[Dict]:
"""Check multiple proxies asynchronously"""
async with aiohttp.ClientSession() as session:
tasks = [self.check_proxy(session, proxy) for proxy in proxies]
results = await asyncio.gather(*tasks)
return [r for r in results if r["working"]]
def get_premium_proxies(self) -> List[Dict]:
"""Get highest quality proxies"""
# Get elite HTTPS proxies with low timeout
proxies = self.get_proxies(
protocol="http",
timeout=5000, # 5 seconds max
ssl="yes",
anonymity="elite"
)
# Further filter by response time if available
if isinstance(proxies, list):
return sorted(
proxies,
key=lambda x: x.get("timeout", 9999)
)[:50] # Top 50 fastest
return proxies
# Synchronous usage
client = ProxyScrapeClient()
# Get US elite proxies
us_proxies = client.get_proxies(
country="us",
anonymity="elite",
ssl="yes"
)
print(f"Found {len(us_proxies)} US elite proxies")
# Async proxy validation
async def validate_proxies():
# Get proxies as text list
proxy_text = client.get_proxies(format="textplain")
proxy_list = proxy_text.strip().split('\n')[:20] # Test first 20
# Validate in parallel
working = await client.bulk_check_proxies(proxy_list)
print(f"Found {len(working)} working proxies out of {len(proxy_list)}")
return working
# Run async validation
# working_proxies = asyncio.run(validate_proxies())
Integration with Popular Libraries:
# Requests integration
def get_proxyscrape_session(country="us", timeout=5):
"""Get requests session with ProxyScrape proxies"""
client = ProxyScrapeClient()
proxies = client.get_proxies(
country=country,
anonymity="elite",
ssl="yes",
format="json"
)
if proxies:
proxy = proxies[0] # Use first proxy
proxy_url = f"http://{proxy['ip']}:{proxy['port']}"
session = requests.Session()
session.proxies = {
'http': proxy_url,
'https': proxy_url
}
        # Note: requests has no session-wide timeout; pass timeout=timeout on each session.get() call
return session
return requests.Session()
# Scrapy integration
PROXYSCRAPE_SETTINGS = {
'ROTATING_PROXY_LIST_PATH': 'proxyscrape_proxies.txt',
'ROTATING_PROXY_PAGE_RETRY_TIMES': 2,
'ROTATING_PROXY_CLOSE_SPIDER': False,
'DOWNLOADER_MIDDLEWARES': {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
}
# Download fresh proxy list for Scrapy
client = ProxyScrapeClient()
proxies = client.get_proxies(format="textplain", timeout=5000)
with open("proxyscrape_proxies.txt", "w") as f:
f.write(proxies)
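To connect these settings to an actual crawl, here is a minimal spider sketch; it assumes Scrapy and the scrapy-rotating-proxies package are installed and that the PROXYSCRAPE_SETTINGS dict above is importable in your project (the spider name and URL are placeholders):
import scrapy
class ProxiedExampleSpider(scrapy.Spider):
    name = "proxied_example"
    start_urls = ["https://example.com"]
    # Reuse the rotation settings defined above; Scrapy merges them per spider
    custom_settings = PROXYSCRAPE_SETTINGS
    def parse(self, response):
        # The rotating_proxies middleware assigns a proxy per request and
        # retires proxies that look banned, so the spider code stays unchanged
        yield {"url": response.url, "status": response.status}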
5. GeoNode (Best Free Bandwidth)
Link: https://geonode.com
GeoNode offers a unique free tier with actual bandwidth allocation rather than request limits.
Key Features:
- 1GB free bandwidth/month: More generous than request-based limits
- Sticky sessions: Maintain same IP for up to 30 minutes
- Residential proxies: Access to residential IPs on free tier
- 100+ countries: Wide geographic coverage
- Username/password auth: Secure authentication included
- HTTPS/SOCKS5 support: Full protocol compatibility
Setup and Usage:
import requests
from requests.auth import HTTPProxyAuth
class GeoNodeProxy:
def __init__(self, username, password):
self.username = username
self.password = password
self.endpoint = "premium-residential.geonode.com:6060"
def get_proxy_dict(self, country=None):
"""Get proxy configuration dict"""
# Country-specific proxy format
if country:
proxy_username = f"{self.username}-country-{country}"
else:
proxy_username = self.username
proxy_url = f"http://{proxy_username}:{self.password}@{self.endpoint}"
return {
'http': proxy_url,
'https': proxy_url
}
def create_session(self, country=None, sticky=True):
"""Create requests session with proxy"""
session = requests.Session()
        if sticky:
            # Add a sticky session identifier so the same exit IP is reused
            import random
            session_id = random.randint(10000, 99999)
            username = f"{self.username}-session-{session_id}"
        else:
            username = self.username
        if country:
            username += f"-country-{country}"
proxy_url = f"http://{username}:{self.password}@{self.endpoint}"
session.proxies = {
'http': proxy_url,
'https': proxy_url
}
return session
# Usage
geonode = GeoNodeProxy("your_username", "your_password")
# Simple request
proxies = geonode.get_proxy_dict(country="US")
response = requests.get("https://httpbin.org/ip", proxies=proxies)
# Sticky session for multiple requests
session = geonode.create_session(country="UK", sticky=True)
for url in ["https://example.com/page1", "https://example.com/page2"]:
response = session.get(url)
# Same IP will be used for both requests
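To confirm that a sticky session really keeps the same exit IP, you can hit an IP-echo endpoint a few times through the session created above (a small sketch reusing the GeoNodeProxy class):
session = geonode.create_session(country="US", sticky=True)
seen_ips = set()
for _ in range(3):
    ip = session.get("https://httpbin.org/ip", timeout=15).json()["origin"]
    seen_ips.add(ip)
# With a working sticky session, all three requests should report the same exit IP
print("Exit IPs seen:", seen_ips)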
Additional Free Proxy Sources
6. ProxyNova
Link: https://www.proxynova.com
- Country-specific proxy lists
- Uptime percentage displayed
- Speed test results
7. HideMy.name
Link: https://hidemy.name/en/proxy-list/
- Advanced filtering interface
- Response time graphs
- Export to various formats
8. OpenProxy.space
Link: https://openproxy.space
- Daily updated lists
- Socks5 proxy focus
- Simple text format
Automated Proxy Collection
Multi-Source Proxy Aggregator
Combine multiple free proxy sources for maximum coverage:
import asyncio
import aiohttp
from typing import List, Dict, Set
import json
from datetime import datetime
class ProxyAggregator:
def __init__(self):
self.sources = {
'proxyscrape': self._fetch_proxyscrape,
'proxylist': self._fetch_proxylist,
'freeproxylist': self._fetch_freeproxylist,
'geonode': self._fetch_geonode_free_list
}
self.all_proxies = set()
async def _fetch_proxyscrape(self, session: aiohttp.ClientSession) -> List[str]:
"""Fetch from ProxyScrape API"""
url = "https://api.proxyscrape.com/v2/"
params = {
"request": "get",
"protocol": "http",
"timeout": 5000,
"format": "textplain",
"anonymity": "elite"
}
try:
async with session.get(url, params=params) as response:
text = await response.text()
return text.strip().split('\n')
except:
return []
async def _fetch_proxylist(self, session: aiohttp.ClientSession) -> List[str]:
"""Fetch from proxy-list.download"""
url = "https://www.proxy-list.download/api/v1/get"
params = {"type": "https", "anon": "elite"}
try:
async with session.get(url, params=params) as response:
text = await response.text()
return text.strip().split('\n')
except:
return []
async def _fetch_freeproxylist(self, session: aiohttp.ClientSession) -> List[str]:
"""Parse free-proxy-list.net"""
# Note: This would require HTML parsing
# Simplified for example
return []
async def _fetch_geonode_free_list(self, session: aiohttp.ClientSession) -> List[str]:
"""Get GeoNode's free proxy list"""
# Note: Check their free proxy list page
return []
async def aggregate_proxies(self) -> Set[str]:
"""Fetch proxies from all sources concurrently"""
async with aiohttp.ClientSession() as session:
tasks = [
source_func(session)
for source_func in self.sources.values()
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Combine all results
all_proxies = set()
for result in results:
if isinstance(result, list):
all_proxies.update(result)
self.all_proxies = all_proxies
return all_proxies
async def validate_proxy(self, session: aiohttp.ClientSession, proxy: str) -> Dict:
"""Validate single proxy"""
test_url = "http://httpbin.org/ip"
proxy_url = f"http://{proxy}"
try:
start_time = datetime.now()
async with session.get(
test_url,
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=5)
) as response:
if response.status == 200:
data = await response.json()
response_time = (datetime.now() - start_time).total_seconds()
return {
"proxy": proxy,
"working": True,
"response_time": response_time,
"external_ip": data.get("origin")
}
except:
pass
return {"proxy": proxy, "working": False}
async def get_working_proxies(self, max_workers: int = 100) -> List[Dict]:
"""Aggregate and validate proxies from all sources"""
print("Aggregating proxies from all sources...")
all_proxies = await self.aggregate_proxies()
print(f"Found {len(all_proxies)} unique proxies")
print("Validating proxies...")
async with aiohttp.ClientSession() as session:
# Create validation tasks with semaphore to limit concurrency
semaphore = asyncio.Semaphore(max_workers)
async def validate_with_limit(proxy):
async with semaphore:
return await self.validate_proxy(session, proxy)
tasks = [validate_with_limit(proxy) for proxy in all_proxies]
results = await asyncio.gather(*tasks)
# Filter working proxies and sort by response time
working = [r for r in results if r["working"]]
working.sort(key=lambda x: x["response_time"])
return working
def save_results(self, proxies: List[Dict], filename: str = "aggregated_proxies.json"):
"""Save validated proxies to file"""
with open(filename, 'w') as f:
json.dump({
"timestamp": datetime.now().isoformat(),
"total_tested": len(self.all_proxies),
"working_count": len(proxies),
"proxies": proxies
}, f, indent=2)
# Usage
async def main():
aggregator = ProxyAggregator()
working_proxies = await aggregator.get_working_proxies(max_workers=200)
print(f"\nFound {len(working_proxies)} working proxies")
print("\nTop 10 fastest proxies:")
for proxy in working_proxies[:10]:
print(f" {proxy['proxy']} - {proxy['response_time']:.2f}s")
# Save results
aggregator.save_results(working_proxies)
# Run aggregator
if __name__ == "__main__":
asyncio.run(main())
GitHub Proxy Lists
Many developers maintain curated proxy lists on GitHub:
# Popular GitHub proxy lists
github_proxy_lists = [
"https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt",
"https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt",
"https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/proxy.txt",
"https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt"
]
async def fetch_github_lists():
"""Fetch proxies from GitHub repositories"""
all_proxies = set()
async with aiohttp.ClientSession() as session:
for url in github_proxy_lists:
try:
async with session.get(url) as response:
text = await response.text()
proxies = text.strip().split('\n')
all_proxies.update(proxies)
print(f"Fetched {len(proxies)} from {url}")
except:
print(f"Failed to fetch {url}")
return list(all_proxies)
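Raw GitHub lists often contain blank lines, comments, or entries in other formats, so it is worth normalizing them before validation. A small sketch that keeps only plain IPv4 ip:port entries and removes duplicates:
import re
# Matches plain IPv4 ip:port entries, e.g. "123.45.67.89:8080"
PROXY_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}$")
def clean_proxy_list(raw_lines):
    """Strip whitespace, drop comments and blanks, and deduplicate ip:port entries"""
    cleaned = set()
    for line in raw_lines:
        entry = line.strip()
        if entry and not entry.startswith("#") and PROXY_RE.match(entry):
            cleaned.add(entry)
    return sorted(cleaned)
# Usage with the fetcher above:
# raw = asyncio.run(fetch_github_lists())
# proxies = clean_proxy_list(raw)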
Advanced Proxy Validation
Enterprise-Grade Proxy Validator
Build a robust validation system with detailed metrics:
import asyncio
import aiohttp
import time
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
from enum import Enum
import ssl
import certifi
class ProxyType(Enum):
HTTP = "http"
HTTPS = "https"
SOCKS4 = "socks4"
SOCKS5 = "socks5"
@dataclass
class ProxyTestResult:
proxy: str
working: bool
proxy_type: Optional[ProxyType] = None
response_time: Optional[float] = None
anonymity_level: Optional[str] = None
country: Optional[str] = None
error: Optional[str] = None
supports_https: bool = False
    external_ip: Optional[str] = None
    supports_google: bool = False
class AdvancedProxyValidator:
def __init__(self):
self.test_endpoints = {
'basic': 'http://httpbin.org/ip',
'https': 'https://httpbin.org/ip',
'headers': 'http://httpbin.org/headers',
'google': 'https://www.google.com/robots.txt',
'cloudflare': 'https://www.cloudflare.com/robots.txt'
}
async def validate_proxy(
self,
proxy: str,
session: aiohttp.ClientSession,
full_test: bool = False
) -> ProxyTestResult:
"""Comprehensive proxy validation"""
# Parse proxy format
if '://' in proxy:
proxy_url = proxy
proxy_type = proxy.split('://')[0]
else:
proxy_url = f'http://{proxy}'
proxy_type = 'http'
result = ProxyTestResult(proxy=proxy, working=False)
try:
# Basic connectivity test
start_time = time.time()
async with session.get(
self.test_endpoints['basic'],
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=10),
ssl=False # Disable SSL verification for initial test
) as response:
if response.status == 200:
result.working = True
result.response_time = time.time() - start_time
# Get external IP
data = await response.json()
result.external_ip = data.get('origin', '').split(',')[0].strip()
if full_test:
# Additional tests
await self._test_anonymity(proxy_url, session, result)
await self._test_https_support(proxy_url, session, result)
await self._test_site_compatibility(proxy_url, session, result)
except asyncio.TimeoutError:
result.error = "Timeout"
except aiohttp.ClientProxyConnectionError:
result.error = "Connection failed"
except Exception as e:
result.error = str(e)[:50]
return result
async def _test_anonymity(
self,
proxy_url: str,
session: aiohttp.ClientSession,
result: ProxyTestResult
):
"""Check proxy anonymity level"""
try:
async with session.get(
self.test_endpoints['headers'],
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=5)
) as response:
if response.status == 200:
data = await response.json()
headers = data.get('headers', {})
# Check for revealing headers
revealing_headers = [
'X-Forwarded-For',
'X-Real-Ip',
'Via',
'X-Proxy-Id'
]
found_headers = [h for h in revealing_headers if h in headers]
if not found_headers:
result.anonymity_level = "Elite"
elif 'Via' in found_headers and len(found_headers) == 1:
result.anonymity_level = "Anonymous"
else:
result.anonymity_level = "Transparent"
except:
pass
async def _test_https_support(
self,
proxy_url: str,
session: aiohttp.ClientSession,
result: ProxyTestResult
):
"""Test HTTPS support"""
try:
ssl_context = ssl.create_default_context(cafile=certifi.where())
async with session.get(
self.test_endpoints['https'],
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=5),
ssl=ssl_context
) as response:
result.supports_https = response.status == 200
except:
result.supports_https = False
async def _test_site_compatibility(
self,
proxy_url: str,
session: aiohttp.ClientSession,
result: ProxyTestResult
):
"""Test compatibility with major sites"""
# Quick test against Google
try:
async with session.get(
self.test_endpoints['google'],
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=5),
headers={'User-Agent': 'Mozilla/5.0 (compatible; ProxyTest/1.0)'}
) as response:
if response.status == 200:
result.supports_google = True
except:
pass
async def bulk_validate(
self,
proxies: List[str],
max_concurrent: int = 100,
full_test: bool = False
) -> List[ProxyTestResult]:
"""Validate multiple proxies with rate limiting"""
# Create session with custom connector
connector = aiohttp.TCPConnector(
limit=max_concurrent,
force_close=True,
enable_cleanup_closed=True
)
async with aiohttp.ClientSession(connector=connector) as session:
# Use semaphore for rate limiting
semaphore = asyncio.Semaphore(max_concurrent)
async def validate_with_limit(proxy: str):
async with semaphore:
return await self.validate_proxy(proxy, session, full_test)
# Validate all proxies
tasks = [validate_with_limit(proxy) for proxy in proxies]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Filter out exceptions
valid_results = []
for result in results:
if isinstance(result, ProxyTestResult):
valid_results.append(result)
else:
# Handle exception
print(f"Validation error: {result}")
return valid_results
def filter_results(
self,
results: List[ProxyTestResult],
        max_response_time: Optional[float] = None,
anonymity: Optional[str] = None,
https_only: bool = False
) -> List[ProxyTestResult]:
"""Filter results based on criteria"""
filtered = [r for r in results if r.working]
        if max_response_time:
            filtered = [r for r in filtered if r.response_time and r.response_time <= max_response_time]
if anonymity:
filtered = [r for r in filtered if r.anonymity_level == anonymity]
if https_only:
filtered = [r for r in filtered if r.supports_https]
return sorted(filtered, key=lambda x: x.response_time or 999)
# Usage example
async def test_proxies():
validator = AdvancedProxyValidator()
# Your proxy list
proxy_list = [
"123.45.67.89:8080",
"98.76.54.32:3128",
# ... more proxies
]
print("Starting proxy validation...")
results = await validator.bulk_validate(
proxy_list,
max_concurrent=200,
full_test=True # Enable comprehensive testing
)
# Filter for elite HTTPS proxies
elite_https = validator.filter_results(
results,
        max_response_time=2.0,  # Under 2 seconds
anonymity="Elite",
https_only=True
)
print(f"\nValidation complete:")
print(f"Total tested: {len(proxy_list)}")
print(f"Working: {len([r for r in results if r.working])}")
print(f"Elite HTTPS: {len(elite_https)}")
# Display top proxies
print("\nTop 5 Elite HTTPS Proxies:")
for result in elite_https[:5]:
print(f" {result.proxy} - {result.response_time:.2f}s - {result.external_ip}")
# Export results
import json
with open("validated_proxies.json", "w") as f:
json.dump([
{
"proxy": r.proxy,
"response_time": r.response_time,
"anonymity": r.anonymity_level,
"https": r.supports_https,
"external_ip": r.external_ip
}
for r in elite_https
], f, indent=2)
# Run validation
if __name__ == "__main__":
asyncio.run(test_proxies())
Intelligent Proxy Rotation
Advanced Proxy Rotation System
Implement a sophisticated proxy rotation system with performance tracking and intelligent selection:
import asyncio
import aiohttp
import random
import time
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
import heapq
import json
@dataclass
class ProxyStats:
"""Track detailed proxy performance metrics"""
total_requests: int = 0
successful_requests: int = 0
failed_requests: int = 0
total_response_time: float = 0.0
last_used: Optional[datetime] = None
last_failed: Optional[datetime] = None
consecutive_failures: int = 0
@property
def success_rate(self) -> float:
if self.total_requests == 0:
return 0.0
return self.successful_requests / self.total_requests
@property
def average_response_time(self) -> float:
if self.successful_requests == 0:
return float('inf')
return self.total_response_time / self.successful_requests
class ProxyRotator:
def __init__(
self,
proxies: List[str],
max_failures: int = 3,
cooldown_minutes: int = 30,
min_delay_between_uses: float = 1.0
):
self.proxies = set(proxies)
self.stats: Dict[str, ProxyStats] = defaultdict(ProxyStats)
self.max_failures = max_failures
self.cooldown_period = timedelta(minutes=cooldown_minutes)
self.min_delay = min_delay_between_uses
self.blacklist: Dict[str, datetime] = {}
# Priority queue for proxy selection (lower score = higher priority)
self.proxy_queue = []
self._initialize_queue()
def _initialize_queue(self):
"""Initialize priority queue with all proxies"""
for proxy in self.proxies:
# Initial score of 0 for unused proxies
heapq.heappush(self.proxy_queue, (0, time.time(), proxy))
def _calculate_proxy_score(self, proxy: str) -> float:
"""Calculate proxy score (lower is better)"""
stats = self.stats[proxy]
# New proxies get priority
if stats.total_requests == 0:
return 0
# Scoring factors
failure_rate = 1 - stats.success_rate
avg_response_time = stats.average_response_time
recency_penalty = 0
# Add penalty for recently used proxies
if stats.last_used:
time_since_use = (datetime.now() - stats.last_used).total_seconds()
if time_since_use < self.min_delay:
recency_penalty = 1000 # High penalty for too recent use
else:
recency_penalty = max(0, 10 - time_since_use / 60) # Decay over time
# Combined score (weighted)
score = (
failure_rate * 100 +
avg_response_time * 10 +
recency_penalty +
stats.consecutive_failures * 50
)
return score
def get_proxy(self, retry_blacklisted: bool = True) -> Optional[str]:
"""Get the best available proxy"""
# Check for proxies that can be un-blacklisted
if retry_blacklisted:
self._check_blacklist()
# Clean up the queue and rebuild if necessary
if len(self.proxy_queue) < len(self.proxies) * 0.5:
self._rebuild_queue()
while self.proxy_queue:
score, timestamp, proxy = heapq.heappop(self.proxy_queue)
# Skip if blacklisted
if proxy in self.blacklist:
continue
# Check minimum delay
stats = self.stats[proxy]
if stats.last_used:
time_since_use = (datetime.now() - stats.last_used).total_seconds()
if time_since_use < self.min_delay:
# Re-add to queue with updated score
new_score = self._calculate_proxy_score(proxy)
heapq.heappush(self.proxy_queue, (new_score, time.time(), proxy))
continue
# Update last used time
stats.last_used = datetime.now()
return proxy
return None
def _rebuild_queue(self):
"""Rebuild the priority queue with updated scores"""
self.proxy_queue = []
for proxy in self.proxies:
if proxy not in self.blacklist:
score = self._calculate_proxy_score(proxy)
heapq.heappush(self.proxy_queue, (score, time.time(), proxy))
def _check_blacklist(self):
"""Remove proxies from blacklist after cooldown"""
now = datetime.now()
to_remove = []
for proxy, blacklist_time in self.blacklist.items():
if now - blacklist_time > self.cooldown_period:
to_remove.append(proxy)
# Reset consecutive failures
self.stats[proxy].consecutive_failures = 0
for proxy in to_remove:
del self.blacklist[proxy]
# Re-add to queue
score = self._calculate_proxy_score(proxy)
heapq.heappush(self.proxy_queue, (score, time.time(), proxy))
def record_success(self, proxy: str, response_time: float):
"""Record successful request"""
stats = self.stats[proxy]
stats.total_requests += 1
stats.successful_requests += 1
stats.total_response_time += response_time
stats.consecutive_failures = 0
# Re-add to queue with updated score
score = self._calculate_proxy_score(proxy)
heapq.heappush(self.proxy_queue, (score, time.time(), proxy))
def record_failure(self, proxy: str, permanent: bool = False):
"""Record failed request"""
stats = self.stats[proxy]
stats.total_requests += 1
stats.failed_requests += 1
stats.consecutive_failures += 1
stats.last_failed = datetime.now()
if permanent or stats.consecutive_failures >= self.max_failures:
# Add to blacklist
self.blacklist[proxy] = datetime.now()
print(f"Blacklisted proxy: {proxy} (failures: {stats.consecutive_failures})")
else:
# Re-add to queue with updated score
score = self._calculate_proxy_score(proxy)
heapq.heappush(self.proxy_queue, (score, time.time(), proxy))
def get_stats_summary(self) -> Dict:
"""Get summary of all proxy statistics"""
active_proxies = [p for p in self.proxies if p not in self.blacklist]
summary = {
"total_proxies": len(self.proxies),
"active_proxies": len(active_proxies),
"blacklisted_proxies": len(self.blacklist),
"top_performers": [],
"worst_performers": []
}
# Sort by success rate and response time
proxy_scores = []
for proxy in active_proxies:
stats = self.stats[proxy]
if stats.total_requests > 0:
proxy_scores.append({
"proxy": proxy,
"success_rate": stats.success_rate,
"avg_response_time": stats.average_response_time,
"total_requests": stats.total_requests
})
# Sort by success rate (descending) and response time (ascending)
proxy_scores.sort(
key=lambda x: (-x["success_rate"], x["avg_response_time"])
)
summary["top_performers"] = proxy_scores[:5]
summary["worst_performers"] = proxy_scores[-5:] if len(proxy_scores) > 5 else []
return summary
def export_stats(self, filename: str = "proxy_stats.json"):
"""Export detailed statistics to file"""
export_data = {
"timestamp": datetime.now().isoformat(),
"summary": self.get_stats_summary(),
"detailed_stats": {}
}
for proxy, stats in self.stats.items():
export_data["detailed_stats"][proxy] = {
"total_requests": stats.total_requests,
"successful_requests": stats.successful_requests,
"failed_requests": stats.failed_requests,
"success_rate": stats.success_rate,
"average_response_time": stats.average_response_time,
"consecutive_failures": stats.consecutive_failures,
"is_blacklisted": proxy in self.blacklist
}
with open(filename, 'w') as f:
json.dump(export_data, f, indent=2)
# Usage example with async requests
async def scrape_with_smart_rotation(urls: List[str], proxies: List[str]):
    rotator = ProxyRotator(
proxies,
max_failures=3,
cooldown_minutes=30,
min_delay_between_uses=2.0
)
async def fetch_url(session: aiohttp.ClientSession, url: str) -> Optional[str]:
proxy = rotator.get_proxy()
if not proxy:
print("No available proxies!")
return None
proxy_url = f"http://{proxy}"
start_time = time.time()
try:
async with session.get(
url,
proxy=proxy_url,
timeout=aiohttp.ClientTimeout(total=10)
) as response:
if response.status == 200:
content = await response.text()
response_time = time.time() - start_time
rotator.record_success(proxy, response_time)
return content
else:
rotator.record_failure(proxy)
return None
except Exception as e:
print(f"Request failed with proxy {proxy}: {str(e)}")
rotator.record_failure(proxy)
return None
# Create session and fetch URLs
async with aiohttp.ClientSession() as session:
tasks = []
for url in urls:
            # Schedule immediately as a task so the pause below staggers request start times
            task = asyncio.create_task(fetch_url(session, url))
            tasks.append(task)
            # Small delay between launching requests
            await asyncio.sleep(0.1)
results = await asyncio.gather(*tasks)
# Print statistics
summary = rotator.get_stats_summary()
print(f"\nScraping completed:")
print(f"Active proxies: {summary['active_proxies']}/{summary['total_proxies']}")
print(f"Blacklisted: {summary['blacklisted_proxies']}")
print("\nTop performers:")
for proxy in summary['top_performers'][:3]:
print(f" {proxy['proxy']}: {proxy['success_rate']:.1%} success, "
f"{proxy['avg_response_time']:.2f}s avg")
# Export detailed stats
rotator.export_stats()
return results
# Run the scraper
# urls = ["https://example.com/page1", "https://example.com/page2", ...]
# proxies = ["1.2.3.4:8080", "5.6.7.8:3128", ...]
# results = asyncio.run(scrape_with_smart_rotation(urls, proxies))
Free vs. Paid Proxies: Making the Right Choice
When Free Proxies Work
Free proxies are suitable for:
- Learning and experimentation
- Small-scale personal projects
- Testing scraping logic
- Infrequent data collection
- Non-critical applications
When to Upgrade to Paid Services
Consider paid proxies when you need:
- Reliability: 99.9% uptime guarantees
- Speed: Dedicated bandwidth for fast scraping
- Scale: Thousands of concurrent connections
- Support: Technical assistance and SLA agreements
- Legal compliance: Proper proxy sourcing and documentation
- Advanced features: Residential IPs, mobile proxies, sticky sessions
Best Practices for Using Free Proxies
- Always validate proxies before use - Check connectivity and anonymity
- Implement retry logic - Handle failed requests gracefully
- Respect rate limits - Even with proxies, don't overwhelm target servers
- Monitor proxy health - Track success rates and remove bad proxies
- Use HTTPS proxies - Ensure data security during transmission
- Rotate user agents - Combine proxy rotation with header randomization (see the sketch after this list)
- Keep backup lists - Multiple proxy sources prevent complete failures
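As a compact illustration of several of these practices together, here is a minimal sketch (the proxies and user-agent strings are placeholders) that rotates both proxies and User-Agent headers, retries failures, and pauses between attempts:
import random
import time
import requests
# Placeholders - substitute validated proxies and your own user-agent pool
proxies = ["http://111.111.111.111:8080", "http://222.222.222.222:3128"]
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
def polite_get(url, retries=3):
    """Fetch a URL with a rotating proxy and user agent, simple retries, and a delay"""
    for attempt in range(retries):
        proxy = random.choice(proxies)
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            response = requests.get(url, headers=headers, timeout=10,
                                    proxies={"http": proxy, "https": proxy})
            if response.ok:
                return response
        except requests.RequestException:
            pass  # fall through and try again with a different proxy
        time.sleep(1 + attempt)  # back off and respect the target's rate limits
    return None
# html = polite_get("https://example.com")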
Security Considerations
When using free proxies, be aware of potential risks:
- Data interception: Free proxies may log or modify your traffic
- Malware injection: Some proxies inject malicious scripts
- Credential theft: Never send sensitive data through untrusted proxies
- Legal liability: Ensure proxies aren't sourced from botnets
Security checklist:
def is_proxy_safe(proxy):
"""Basic security checks for proxies"""
checks = {
'supports_https': test_https_support(proxy),
'no_header_injection': test_header_integrity(proxy),
'proper_anonymity': test_anonymity_level(proxy),
'reasonable_latency': test_response_time(proxy) < 5
}
return all(checks.values())
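The helper functions referenced above (test_https_support, test_header_integrity, test_anonymity_level, test_response_time) are not defined in this guide; here is one hedged way to sketch them, using httpbin.org as a neutral test target:
import requests
TEST_HTTP = "http://httpbin.org/headers"
TEST_HTTPS = "https://httpbin.org/ip"
REVEALING_HEADERS = {"X-Forwarded-For", "X-Real-Ip", "Via", "X-Proxy-Id"}
def _proxies(proxy):
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
def test_https_support(proxy):
    """True if the proxy can tunnel an HTTPS request"""
    try:
        return requests.get(TEST_HTTPS, proxies=_proxies(proxy), timeout=5).ok
    except requests.RequestException:
        return False
def test_header_integrity(proxy):
    """True if the proxy does not inject identifying headers into requests"""
    try:
        echoed = requests.get(TEST_HTTP, proxies=_proxies(proxy), timeout=5).json()["headers"]
        return not (REVEALING_HEADERS & set(echoed))
    except requests.RequestException:
        return False
def test_anonymity_level(proxy):
    """True if the target sees the proxy's IP rather than yours"""
    try:
        my_ip = requests.get(TEST_HTTPS, timeout=5).json()["origin"]
        via_proxy = requests.get(TEST_HTTPS, proxies=_proxies(proxy), timeout=5).json()["origin"]
        return my_ip not in via_proxy
    except requests.RequestException:
        return False
def test_response_time(proxy):
    """Response time in seconds (a large value on failure)"""
    try:
        return requests.get(TEST_HTTPS, proxies=_proxies(proxy), timeout=5).elapsed.total_seconds()
    except requests.RequestException:
        return float("inf")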
Conclusion
Free proxy lists provide an accessible entry point for web scraping projects. While they come with limitations—reliability issues, security concerns, and scalability constraints—they serve well for learning, testing, and small-scale applications.
For production environments or business-critical scraping, consider professional solutions like WebScraping.AI. With automated proxy management, guaranteed uptime, and built-in web scraping features, you can focus on extracting valuable data rather than maintaining proxy infrastructure.
Start with free proxies to validate your scraping logic, then scale up with reliable paid services as your needs grow. The time saved managing proxy lists often justifies the investment in professional tools.