Web scraping is an essential skill for developers in 2025, enabling automated data extraction from websites for various applications. In this comprehensive guide, we'll build a Python web scraper using Beautiful Soup and pycURL, two powerful libraries that make web scraping efficient and straightforward.
Prerequisites
Before we begin, ensure you have:
- Python 3.7 or higher installed
- Basic knowledge of Python programming
- Understanding of HTML structure
- A terminal or command prompt
Why Use Web Scrapers?
Web scrapers have become indispensable tools across various industries:
- Market Research: Extract competitor pricing, product information, and customer reviews to analyze market trends and make data-driven decisions
- Price Monitoring: Track product prices across e-commerce platforms to optimize pricing strategies and identify opportunities
- News Aggregation: Collect news articles from multiple sources for sentiment analysis or content curation
- Lead Generation: Gather business contact information from directories and professional networks
- Academic Research: Collect data for research papers, studies, and statistical analysis
- Real Estate Analysis: Monitor property listings, prices, and market trends
- Job Market Intelligence: Track job postings, salary trends, and skill requirements
Understanding cURL
cURL (Client URL) is a command-line tool and library for transferring data using various protocols. While modern Python developers often use libraries like requests, cURL remains popular due to its:
- Speed: Extremely fast and lightweight
- Versatility: Supports multiple protocols (HTTP, HTTPS, FTP, etc.)
- Features: Built-in support for cookies, authentication, and proxy servers
- Cross-platform: Works on Windows, macOS, and Linux
Let's test cURL with a simple command:
curl https://httpbin.org/get
This command sends a GET request and displays the response. You should see JSON output showing your request details:
{
"args": {},
"headers": {
"Accept": "*/*",
"Host": "httpbin.org",
"User-Agent": "curl/7.x.x"
},
"origin": "your.ip.address",
"url": "https://httpbin.org/get"
}
Why pycURL for Python?
While Python offers several HTTP libraries like requests and urllib, pycURL provides unique advantages:
- Performance: Direct bindings to libcurl make it faster than pure Python implementations
- Advanced Features: Support for parallel requests, HTTP/2, and complex authentication
- Memory Efficiency: Lower memory footprint for large-scale scraping
To install pycURL, use pip:
pip install pycurl
Note: On Windows, building pycURL from source requires libcurl and its headers, so you might need to install a pre-compiled wheel instead:
pip install --only-binary :all: pycurl
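To confirm the installation worked, a quick check like the one below prints the PycURL and libcurl version string (the exact output varies by system):
import pycurl

# Prints something like "PycURL/7.x libcurl/8.x ..."; the exact string depends on your system
print(pycurl.version)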
Beautiful Soup: The HTML Parser
Beautiful Soup is Python's most popular HTML/XML parsing library, and for good reason:
- Handles Broken HTML: Gracefully parses malformed HTML
- Intuitive API: Simple methods for navigating and searching the parse tree
- Multiple Parser Support: Works with html.parser, lxml, and html5lib
- Encoding Detection: Automatically handles different character encodings
Install Beautiful Soup 4:
pip install beautifulsoup4
For better performance, also install lxml:
pip install lxml
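To see the graceful handling of malformed HTML in action, here is a small sketch (the broken markup is made up purely for illustration) that parses a snippet with unclosed tags:
from bs4 import BeautifulSoup

# A deliberately malformed snippet: unclosed <p> and <b> tags
broken_html = "<html><body><p>First paragraph<p>Second <b>bold text</body>"

# lxml repairs the markup while building the parse tree
soup = BeautifulSoup(broken_html, 'lxml')
for p in soup.find_all('p'):
    print(p.get_text())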
Setting Up Your Project
Let's create a well-structured project for our web scraper:
1. Create Project Directory
mkdir python-web-scraper
cd python-web-scraper
2. Set Up Virtual Environment
# Create virtual environment
python3 -m venv venv
# Activate it (Linux/Mac)
source venv/bin/activate
# Activate it (Windows)
venv\Scripts\activate
3. Install Dependencies
pip install pycurl beautifulsoup4 certifi lxml
4. Create Project Structure
touch scraper.py requirements.txt
Save your dependencies:
pip freeze > requirements.txt
Now let's create our web scraper. Create a file named scraper.py and add the following imports:
import pycurl
import certifi
from io import BytesIO
from bs4 import BeautifulSoup
import json
from urllib.parse import urlencode
Note: There's a typo in many tutorials - the package is certifi, not certify.
Building the Web Scraper
Let's build our scraper step by step, starting with a simple example and then adding more features.
Basic Data Fetching with pycURL
First, let's create a basic function to fetch HTML content:
def fetch_html(url, timeout=30):
"""
Fetch HTML content from a URL using pycURL
Args:
url (str): The URL to scrape
timeout (int): Request timeout in seconds
Returns:
str: HTML content
"""
buffer = BytesIO()
curl = pycurl.Curl()
try:
# Set URL
curl.setopt(curl.URL, url)
# Set buffer to store response
curl.setopt(curl.WRITEDATA, buffer)
# Use certifi for SSL certificate verification
curl.setopt(curl.CAINFO, certifi.where())
# Set timeout
curl.setopt(curl.TIMEOUT, timeout)
# Follow redirects
curl.setopt(curl.FOLLOWLOCATION, True)
# Set user agent to avoid blocks
curl.setopt(curl.USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
# Perform request
curl.perform()
# Get HTTP status code
status_code = curl.getinfo(curl.RESPONSE_CODE)
if status_code != 200:
raise Exception(f"HTTP {status_code} error")
except pycurl.error as e:
raise Exception(f"cURL error: {e}")
finally:
curl.close()
# Decode response
body = buffer.getvalue()
# Try to detect encoding
try:
data = body.decode('utf-8')
except UnicodeDecodeError:
data = body.decode('iso-8859-1')
return data
# Example usage
if __name__ == "__main__":
url = 'https://httpbin.org/html'
html_content = fetch_html(url)
print(f"Fetched {len(html_content)} characters")
Understanding the Code
- Buffer Creation: BytesIO() creates an in-memory buffer to store the response
- SSL Verification: certifi.where() provides CA certificates for HTTPS requests
- User Agent: Setting a browser-like user agent helps avoid blocks
- Error Handling: Proper exception handling for network errors
- Encoding Detection: Attempts UTF-8 first, falls back to ISO-8859-1
When you run this code, you'll see output like:
Fetched 3741 characters
Parsing HTML with Beautiful Soup
Now that we can fetch HTML, let's parse it to extract meaningful data. Beautiful Soup provides powerful methods for navigating and searching the HTML tree.
Basic Parsing Example
Let's create a function to parse HTML and extract specific elements:
def parse_html(html_content):
"""
Parse HTML content using Beautiful Soup
Args:
html_content (str): Raw HTML string
Returns:
BeautifulSoup: Parsed HTML object
"""
# Use lxml parser for better performance
soup = BeautifulSoup(html_content, 'lxml')
return soup
def extract_text_from_tags(soup, tag_name):
"""
Extract text from all instances of a specific tag
Args:
soup (BeautifulSoup): Parsed HTML
tag_name (str): HTML tag to extract
Returns:
list: Text content from all matching tags
"""
elements = soup.find_all(tag_name)
return [elem.text.strip() for elem in elements]
# Example usage
html_content = fetch_html('https://example.com')
soup = parse_html(html_content)
# Extract all paragraph text
paragraphs = extract_text_from_tags(soup, 'p')
for p in paragraphs:
print(p)
Advanced Parsing Techniques
Beautiful Soup offers many ways to find and extract data:
# 1. Find by CSS class
elements = soup.find_all('div', class_='product-item')
# 2. Find by ID
element = soup.find('div', id='main-content')
# 3. CSS selectors
elements = soup.select('.price > span')
# 4. Find by attributes
links = soup.find_all('a', href=True)
# 5. Search by text (string= replaces the deprecated text= argument)
elements = soup.find_all(string='Add to Cart')
# 6. Navigate the tree
for child in soup.body.children:
if child.name:
print(child.name)
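Putting a few of these techniques together, here is a short sketch (using the soup object from the earlier example; any parsed page works) that collects every link's text and href into a list of dictionaries:
def extract_links(soup):
    """Collect the text and href of every anchor tag on the page"""
    links = []
    for a in soup.find_all('a', href=True):
        links.append({'text': a.get_text(strip=True), 'href': a['href']})
    return links

for link in extract_links(soup):
    print(f"{link['text']} -> {link['href']}")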
Complete Web Scraper Example
Let's build a complete web scraper that extracts product information from a sample e-commerce page:
import pycurl
import certifi
from io import BytesIO
from bs4 import BeautifulSoup
import json
import csv
from datetime import datetime
class WebScraper:
def __init__(self, user_agent=None):
self.user_agent = user_agent or 'Mozilla/5.0 (compatible; WebScraper/1.0)'
def fetch_html(self, url, timeout=30):
"""Fetch HTML content from URL"""
buffer = BytesIO()
curl = pycurl.Curl()
try:
curl.setopt(curl.URL, url)
curl.setopt(curl.WRITEDATA, buffer)
curl.setopt(curl.CAINFO, certifi.where())
curl.setopt(curl.TIMEOUT, timeout)
curl.setopt(curl.FOLLOWLOCATION, True)
curl.setopt(curl.USERAGENT, self.user_agent)
curl.perform()
status_code = curl.getinfo(curl.RESPONSE_CODE)
if status_code != 200:
raise Exception(f"HTTP {status_code} error")
except pycurl.error as e:
raise Exception(f"cURL error: {e}")
finally:
curl.close()
body = buffer.getvalue()
return body.decode('utf-8', errors='ignore')
def parse_products(self, html):
"""Extract product information from HTML"""
soup = BeautifulSoup(html, 'lxml')
products = []
# Example: Find all product containers
product_divs = soup.find_all('div', class_='product')
for product_div in product_divs:
product = {}
# Extract title
title_elem = product_div.find('h2', class_='product-title')
product['title'] = title_elem.text.strip() if title_elem else 'N/A'
# Extract price
price_elem = product_div.find('span', class_='price')
if price_elem:
price_text = price_elem.text.strip()
# Clean price (remove currency symbols, etc.)
product['price'] = price_text.replace('$', '').replace(',', '')
else:
product['price'] = 'N/A'
# Extract description
desc_elem = product_div.find('p', class_='description')
product['description'] = desc_elem.text.strip() if desc_elem else 'N/A'
# Extract image URL
img_elem = product_div.find('img')
product['image_url'] = img_elem.get('src', 'N/A') if img_elem else 'N/A'
# Extract product URL
link_elem = product_div.find('a', href=True)
product['url'] = link_elem['href'] if link_elem else 'N/A'
# Add timestamp
product['scraped_at'] = datetime.now().isoformat()
products.append(product)
return products
def save_to_csv(self, products, filename='products.csv'):
"""Save products to CSV file"""
if not products:
print("No products to save")
return
keys = products[0].keys()
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=keys)
writer.writeheader()
writer.writerows(products)
print(f"Saved {len(products)} products to {filename}")
def save_to_json(self, products, filename='products.json'):
"""Save products to JSON file"""
with open(filename, 'w', encoding='utf-8') as jsonfile:
json.dump(products, jsonfile, indent=2, ensure_ascii=False)
print(f"Saved {len(products)} products to {filename}")
# Example usage
if __name__ == "__main__":
# Initialize scraper
scraper = WebScraper()
# Example URL (replace with actual URL)
url = 'https://scrapingbee.com/blog/python-html-parser/'
try:
# Fetch HTML
print(f"Fetching {url}...")
html = scraper.fetch_html(url)
# Parse products
products = scraper.parse_products(html)
# Save results
if products:
scraper.save_to_csv(products)
scraper.save_to_json(products)
else:
print("No products found")
except Exception as e:
print(f"Error: {e}")
Best Practices for Web Scraping
1. Respect robots.txt
Always check the website's robots.txt file before scraping:
def check_robots_txt(domain):
    """Fetch and print a site's robots.txt (domain should include the scheme, e.g. https://example.com)"""
    robots_url = f"{domain}/robots.txt"
    try:
        response = fetch_html(robots_url)
        print(response)
    except Exception:
        print("No robots.txt found")
2. Add Delays Between Requests
Avoid overwhelming servers:
import time
import random
# Add random delay between requests
time.sleep(random.uniform(1, 3))
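One way to bake this in is a small wrapper (a sketch reusing the time and random imports above and the fetch_html() function from earlier; tune the delay bounds to the site you're scraping):
def polite_fetch(url, min_delay=1.0, max_delay=3.0):
    """Fetch a URL, then sleep so consecutive requests are spaced out"""
    html = fetch_html(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return html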
3. Handle Errors Gracefully
def safe_scrape(url, max_retries=3):
for attempt in range(max_retries):
try:
return fetch_html(url)
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt) # Exponential backoff
return None
4. Use Session Management
For sites requiring login:
def create_session():
"""Create a cURL session with cookie support"""
curl = pycurl.Curl()
curl.setopt(curl.COOKIEJAR, 'cookies.txt')
curl.setopt(curl.COOKIEFILE, 'cookies.txt')
return curl
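Building on create_session(), here is a hedged sketch of a login flow. The login URL and the username/password form field names are hypothetical and will differ for every site, but the pattern of POSTing credentials while persisting cookies is the same:
import certifi
from io import BytesIO
from urllib.parse import urlencode

def login(curl, login_url, username, password):
    """POST credentials; cookies set by the server are saved via COOKIEJAR"""
    buffer = BytesIO()
    # 'username' and 'password' are hypothetical form field names
    post_data = urlencode({'username': username, 'password': password})
    curl.setopt(curl.URL, login_url)
    curl.setopt(curl.POSTFIELDS, post_data)
    curl.setopt(curl.WRITEDATA, buffer)
    curl.setopt(curl.CAINFO, certifi.where())
    curl.perform()
    return curl.getinfo(curl.RESPONSE_CODE)

session = create_session()
status = login(session, 'https://example.com/login', 'user', 'secret')  # placeholder URL and credentials
print(f"Login response: HTTP {status}")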
Advanced Features
Handling JavaScript-Rendered Content
For websites that load content dynamically with JavaScript, consider:
- Selenium: For full browser automation
- Playwright: Modern alternative to Selenium (see the sketch after this list)
- Requests-HTML: Lightweight JavaScript support
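For example, a minimal Playwright sketch (after pip install playwright and playwright install) can render a JavaScript-heavy page in headless Chromium and hand the final HTML to Beautiful Soup:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    """Load a page in headless Chromium and return the fully rendered HTML"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html

soup = BeautifulSoup(fetch_rendered_html('https://example.com'), 'lxml')
print(soup.title.text if soup.title else 'No title found')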
Parallel Scraping
For faster scraping of multiple pages:
import concurrent.futures
def scrape_multiple_urls(urls, max_workers=5):
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_url = {executor.submit(fetch_html, url): url for url in urls}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
html = future.result()
results.append((url, html))
except Exception as e:
print(f"Failed to scrape {url}: {e}")
return results
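Since native parallel transfers are one of pycURL's selling points, here is an alternative sketch using libcurl's multi interface (pycurl.CurlMulti), which drives several transfers concurrently on a single thread:
import pycurl
import certifi
from io import BytesIO

def scrape_with_curlmulti(urls):
    """Fetch several URLs concurrently on one thread via the libcurl multi interface"""
    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        curl = pycurl.Curl()
        curl.buffer = BytesIO()  # attach the buffer to the handle so we can read it later
        curl.setopt(curl.URL, url)
        curl.setopt(curl.WRITEDATA, curl.buffer)
        curl.setopt(curl.CAINFO, certifi.where())
        curl.setopt(curl.FOLLOWLOCATION, True)
        multi.add_handle(curl)
        handles.append(curl)
    active = len(handles)
    while active:
        ret, active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            multi.select(1.0)  # wait for network activity before the next perform()
    results = {}
    for curl in handles:
        results[curl.getinfo(curl.EFFECTIVE_URL)] = curl.buffer.getvalue().decode('utf-8', errors='ignore')
        multi.remove_handle(curl)
        curl.close()
    return results

pages = scrape_with_curlmulti(['https://httpbin.org/html', 'https://httpbin.org/get'])
print(f"Fetched {len(pages)} pages")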
Common Challenges and Solutions
1. Handling Different Encodings
def detect_encoding(content):
    """Detect the encoding of raw response bytes"""
    import chardet  # third-party package: pip install chardet
    result = chardet.detect(content)
    return result['encoding']
2. Dealing with Anti-Scraping Measures
- Rotate User Agents: Use different browser identities (see the sketch after this list)
- Use Proxies: Distribute requests across IP addresses
- Respect Rate Limits: Add delays and implement backoff strategies
- Handle CAPTCHAs: Consider using CAPTCHA-solving services when legal
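For the first two points, here is a hedged sketch that rotates the user agent per request and optionally routes through a proxy (the user agent strings are examples and the proxy address is a placeholder you would replace with a real one):
import random
import pycurl
import certifi
from io import BytesIO

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def fetch_with_rotation(url, proxy=None):
    """Fetch a URL with a randomly chosen user agent and an optional proxy"""
    buffer = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(curl.URL, url)
    curl.setopt(curl.WRITEDATA, buffer)
    curl.setopt(curl.CAINFO, certifi.where())
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.setopt(curl.USERAGENT, random.choice(USER_AGENTS))
    if proxy:
        curl.setopt(curl.PROXY, proxy)  # e.g. 'http://user:pass@proxy.example.com:8080' (placeholder)
    try:
        curl.perform()
    finally:
        curl.close()
    return buffer.getvalue().decode('utf-8', errors='ignore')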
3. Data Cleaning
def clean_text(text):
    """Clean extracted text"""
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    # Strip non-ASCII characters (note: this also drops accented letters and other legitimate non-ASCII text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    return text.strip()
Conclusion
You've now learned how to build a robust web scraper using Python, pycURL, and Beautiful Soup. This combination provides excellent performance and flexibility for most web scraping tasks. Remember to:
- Always respect website terms of service and robots.txt
- Implement proper error handling and retries
- Add delays between requests to avoid overwhelming servers
- Consider using web scraping APIs for complex sites
- Keep your scraper maintainable with clean, modular code
Web scraping is a powerful tool for data collection, but use it responsibly. Happy scraping!