Yes, it is possible to scrape HTTPS websites using urllib3. The urllib3 library is a powerful HTTP client for Python that can handle both HTTP and HTTPS requests. When dealing with HTTPS, urllib3 can handle SSL/TLS verification and encryption, making it suitable for scraping secure websites.
However, to scrape HTTPS websites with urllib3, you need to handle SSL certificate verification properly. Since version 1.25, urllib3 verifies certificates by default and emits an InsecureRequestWarning if you make an HTTPS request with verification turned off. Keep verification enabled for security reasons; consider disabling it only when scraping a site with a self-signed or otherwise invalid certificate whose source you trust.
Here is an example of how you can scrape an HTTPS website using urllib3 with SSL certificate verification:

    import urllib3
    from urllib3.util import Retry
    from urllib3.util.timeout import Timeout

    # Create a PoolManager instance (certificates are verified by default)
    http = urllib3.PoolManager()

    # Set up retries and timeout
    retries = Retry(connect=5, read=2, redirect=5)
    timeout = Timeout(connect=10.0, read=7.0)

    # Perform an HTTPS request with certificate verification
    response = None
    try:
        response = http.request(
            'GET',
            'https://example.com',
            retries=retries,
            timeout=timeout
        )
        # Check if the request was successful
        if response.status == 200:
            # Process the response
            html = response.data.decode('utf-8')
            print(html)  # Print the HTML content of the page
        else:
            print(f"Request failed with status code: {response.status}")
    except urllib3.exceptions.SSLError as e:
        print(f"SSL certificate verification failed: {e}")
    except urllib3.exceptions.MaxRetryError as e:
        # With retries enabled, an SSLError usually arrives wrapped in MaxRetryError
        if isinstance(e.reason, urllib3.exceptions.SSLError):
            print(f"SSL certificate verification failed: {e.reason}")
        else:
            print(f"Request failed after retries: {e}")
    finally:
        # Release the connection back to the pool if a response was received
        if response is not None:
            response.release_conn()

In the example above, we create a PoolManager instance that manages a pool of connections and handles SSL certificate verification by default. We also set up retry logic and timeouts to make the scraping more robust.
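The retry and timeout settings can also be attached to the PoolManager itself, so that every request made through it shares the same policy. The following is a minimal sketch of that idea, not a definitive setup: the helper name make_scraping_pool, the retry counts, the backoff factor, and the list of status codes are all illustrative assumptions rather than recommended values.

```python
import urllib3
from urllib3.util import Retry, Timeout


def make_scraping_pool() -> urllib3.PoolManager:
    """Return a PoolManager whose requests share one retry and timeout policy.

    The function name and all numeric values are illustrative assumptions.
    """
    retries = Retry(
        total=5,             # overall cap on retry attempts
        connect=3,           # retries for connection errors
        read=2,              # retries for read errors
        backoff_factor=0.5,  # wait progressively longer between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    )
    timeout = Timeout(connect=10.0, read=7.0)
    # Keyword arguments given here become defaults for every pooled connection.
    return urllib3.PoolManager(retries=retries, timeout=timeout)
```

A pool built this way is used exactly like the one above, for example http = make_scraping_pool() followed by http.request('GET', 'https://example.com'), and individual requests can still pass their own retries or timeout to override the defaults.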
If you need to scrape a website that has SSL issues (e.g., a self-signed certificate), you can disable SSL warnings and skip certificate verification (not recommended for production code):

    import urllib3

    # Disable SSL warnings (not recommended for production code)
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    # Create a PoolManager instance with certificate verification disabled
    http = urllib3.PoolManager(cert_reqs='CERT_NONE')

    # Perform an HTTPS request without certificate verification
    response = http.request('GET', 'https://self-signed.badssl.com')

    # Process the response
    html = response.data.decode('utf-8')
    print(html)  # Print the HTML content of the page

    # Always close the response to release the connection back to the pool
    response.release_conn()

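When you control or trust the server, a safer alternative to switching verification off is to keep it on and point urllib3 at the certificate you trust. The sketch below assumes you have exported the server's self-signed certificate (or your private CA) to a PEM file; the helper name make_trusting_pool and any file path passed to it are hypothetical.

```python
import urllib3


def make_trusting_pool(ca_bundle_path: str) -> urllib3.PoolManager:
    """Build a PoolManager that verifies servers against a specific CA bundle.

    ca_bundle_path is a hypothetical path to a PEM file holding the
    self-signed certificate or private CA that you have decided to trust.
    """
    return urllib3.PoolManager(
        cert_reqs="CERT_REQUIRED",  # keep certificate verification switched on
        ca_certs=ca_bundle_path,    # verify against the certificates in this file
    )
```

Calling make_trusting_pool with the path to your PEM file returns a pool that can be used like any other PoolManager. Requests to servers whose certificates do not chain to that file should still fail verification, which is the point of keeping the check enabled.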
Keep in mind that web scraping can be legally complex, and scraping a website without permission may violate the website's terms of service. Always respect robots.txt and ensure that your scraping activities are compliant with relevant laws and website policies.
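One way to respect robots.txt in practice is to check each URL against the site's rules before requesting it. The sketch below uses Python's standard urllib.robotparser module; the helper name is_allowed, the user agent string, and the sample rules are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, as it might be downloaded from a site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/index.html"))    # True
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))  # False
```

In a real scraper, the robots.txt text could be fetched with the same PoolManager used for the pages themselves, for example by requesting https://example.com/robots.txt and decoding response.data before passing it to is_allowed.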