How do I use Python to extract metadata from web pages?

Extracting metadata from web pages in Python is commonly done using libraries that can parse HTML and extract information from the <meta> tags. One of the most popular libraries for this purpose is BeautifulSoup.

Here's a step-by-step guide on how you can use Python to extract metadata from web pages using BeautifulSoup:

Step 1: Install Required Libraries

Before starting, you need to have requests and beautifulsoup4 installed. You can install them using pip:

pip install requests beautifulsoup4

Step 2: Fetch the Web Page

Use the requests library to download the HTML content of the web page.

import requests

url = "http://example.com"
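response = requests.get(url, timeout=10)  # time out instead of waiting indefinitely

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
else:
    # Stop here: the later steps need html_content to be defined
    raise SystemExit(f"Failed to retrieve the web page. Status code: {response.status_code}")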
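
If you fetch more than one page, it may be convenient to wrap this step in a small function that sets a timeout and raises on HTTP errors. The following is a minimal sketch, not the only way to do it; the name fetch_html and the 10-second default are illustrative assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors.

    fetch_html is a hypothetical helper name; the timeout default is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) inside a try/except requests.RequestException block handles timeouts, connection errors, and bad status codes in one place.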

Step 3: Parse the HTML Content

Parse the HTML content using BeautifulSoup to create a soup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
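
To see what a soup object gives you without making a network request, you can parse a small HTML string directly. The markup and the variable names below (sample_html, sample_soup) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for response.text
sample_html = """
<html>
  <head>
    <title>Sample page</title>
    <meta name="description" content="A short description.">
  </head>
  <body><p>Hello</p></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags are reachable as attributes or through find()
page_title = sample_soup.title.string
description_tag = sample_soup.find("meta", attrs={"name": "description"})
description = description_tag["content"]

print(page_title)
print(description)
```

The attrs dictionary is used for the lookup because the name keyword of find() already refers to the tag name, not to the HTML name attribute.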

Step 4: Extract Metadata

Extract the metadata by searching for <meta> tags within the soup object.

# Find all meta tags in the HTML
meta_tags = soup.find_all('meta')

# Create a dictionary to hold the metadata
metadata = {}

for tag in meta_tags:
    if 'name' in tag.attrs:
        name = tag.attrs['name']
        content = tag.attrs.get('content', '')
        metadata[name] = content
    elif 'property' in tag.attrs:  # For OpenGraph metadata
        prop = tag.attrs['property']  # "prop" avoids shadowing the built-in property
        content = tag.attrs.get('content', '')
        metadata[prop] = content

# Print the extracted metadata
for key, value in metadata.items():
    print(f"{key}: {value}")
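
Because the dictionary is keyed by name, the loop above keeps only the last value when a key repeats, which can happen with OpenGraph properties such as og:image. Here is a minimal sketch of a variant that keeps every value in a list; the function name extract_metadata and the sample markup are assumptions for illustration.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def extract_metadata(html):
    """Map each meta name/property to a list of its content values."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = defaultdict(list)
    for tag in soup.find_all("meta"):
        # Prefer the name attribute, fall back to property (used by OpenGraph)
        key = tag.get("name") or tag.get("property")
        if key:
            metadata[key].append(tag.get("content", ""))
    return dict(metadata)


# Hypothetical markup with a repeated OpenGraph property
sample_html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="Demo page">
  <meta property="og:image" content="https://example.com/a.png">
  <meta property="og:image" content="https://example.com/b.png">
</head>
"""

result = extract_metadata(sample_html)
print(result)
```

Tags without a name or property attribute, such as the charset declaration, are skipped, just as in the loop above.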

Example:

Here is a complete example that extracts metadata from a web page:

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the web page
url = "http://example.com"
response = requests.get(url, timeout=10)  # time out instead of waiting indefinitely

if response.status_code == 200:
    # Step 2: Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Step 3: Extract metadata
    meta_tags = soup.find_all('meta')
    metadata = {}
    for tag in meta_tags:
        if 'name' in tag.attrs:
            name = tag.attrs['name']
            content = tag.attrs.get('content', '')
            metadata[name] = content
        elif 'property' in tag.attrs:  # For OpenGraph metadata
            prop = tag.attrs['property']  # "prop" avoids shadowing the built-in property
            content = tag.attrs.get('content', '')
            metadata[prop] = content

    # Display the metadata
    for key, value in metadata.items():
        print(f"{key}: {value}")
else:
    print(f"Failed to retrieve the web page. Status code: {response.status_code}")

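This example prints the metadata from every <meta> tag on the specified web page that has a name or property attribute. Tags that carry only other attributes, such as charset or http-equiv, are skipped.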

Remember to respect the terms of service and robots.txt of any website you scrape, and be aware that heavy scraping traffic can be seen as a denial-of-service attack. Always scrape responsibly.
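
One way to check robots.txt rules from Python is the standard library's urllib.robotparser module. The sketch below parses an inline set of rules so that it runs without network access; the rules, the URLs, and the "my-scraper" user agent are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would load the live file with
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "my-scraper" is a made-up user agent name
allowed_home = parser.can_fetch("my-scraper", "https://example.com/")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/page")

print(allowed_home, allowed_private)
```

Checking can_fetch() before each request lets a scraper skip paths that the site has asked crawlers to avoid.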
