Extracting metadata from web pages in Python is commonly done using libraries that can parse HTML and extract information from the <meta> tags. One of the most popular libraries for this purpose is BeautifulSoup.
Here's a step-by-step guide on how you can use Python to extract metadata from web pages using BeautifulSoup:
Step 1: Install Required Libraries
Before starting, you need to have requests and beautifulsoup4 installed. You can install them using pip:
pip install requests beautifulsoup4
Step 2: Fetch the Web Page
Use the requests library to download the HTML content of the web page.
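import requests

url = "http://example.com"
# Set a timeout so the call cannot hang indefinitely
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
else:
    # Stop here, since the later steps need html_content
    raise SystemExit(f"Failed to retrieve the web page. Status code: {response.status_code}")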
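For repeated use, the fetch step can be wrapped in a small helper. The sketch below is one way to do it, not the only one: the name fetch_html, the 10-second default timeout, and the User-Agent string are assumptions chosen for illustration. It passes a timeout because requests does not apply one by default, and it calls raise_for_status() so that 4xx and 5xx responses become exceptions instead of being parsed as if they were the page.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies your script.
DEFAULT_HEADERS = {"User-Agent": "metadata-example/0.1"}


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    # Accept any object with a requests-style get(); default to a new Session.
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("http://example.com") inside a try block that catches requests.RequestException covers both connection failures and bad status codes, since requests.HTTPError is a subclass of it.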
Step 3: Parse the HTML Content
Parse the HTML content using BeautifulSoup to create a soup object.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
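To see what the soup object offers without making a network request, the same call works on any HTML string. The snippet below is a small illustration; the sample markup and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample page, used here instead of a downloaded one.
sample_html = """
<html>
  <head>
    <title>Sample Page</title>
    <meta name="description" content="A short description.">
  </head>
  <body><p>Hello</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags are reachable as attributes; .string gives the text inside the tag.
page_title = soup.title.string

# `name` is already the tag-name parameter of find(), so an HTML attribute
# called "name" has to be matched through the attrs dictionary.
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag["content"] if description_tag else None

print(page_title)
print(description)
```

Looking up a single tag this way is handy when only one field, such as the description, is needed rather than the full set of <meta> tags.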
Step 4: Extract Metadata
Extract the metadata by searching for <meta> tags within the soup object.
# Find all meta tags in the HTML
meta_tags = soup.find_all('meta')

# Create a dictionary to hold the metadata
metadata = {}
for tag in meta_tags:
    if 'name' in tag.attrs:
        name = tag.attrs['name']
        content = tag.attrs.get('content', '')
        metadata[name] = content
    elif 'property' in tag.attrs:  # For OpenGraph metadata
        prop = tag.attrs['property']  # avoid shadowing the built-in property
        content = tag.attrs.get('content', '')
        metadata[prop] = content

# Print the extracted metadata
for key, value in metadata.items():
    print(f"{key}: {value}")
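The loop above keeps only the last value when a key appears more than once, which can happen with properties such as og:image, and it skips tags that have neither a name nor a property attribute. One possible variation, sketched below, collects every value into a list and also records http-equiv and charset tags. The function name extract_metadata and the "charset" dictionary key are illustrative choices, not part of BeautifulSoup, and the sample markup is invented.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def extract_metadata(html):
    """Map each meta key to the list of content values found for it."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = defaultdict(list)

    for tag in soup.find_all("meta"):
        # A key can come from name, property (OpenGraph), or http-equiv.
        key = tag.get("name") or tag.get("property") or tag.get("http-equiv")
        if key:
            metadata[key].append(tag.get("content", ""))
        elif tag.get("charset"):
            # <meta charset="..."> carries its value in the charset attribute.
            metadata["charset"].append(tag["charset"])

    return dict(metadata)


# Made-up markup for demonstration.
sample_html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="Example page">
  <meta property="og:image" content="http://example.com/a.png">
  <meta property="og:image" content="http://example.com/b.png">
</head>
"""

for key, values in extract_metadata(sample_html).items():
    print(f"{key}: {values}")
```

Returning lists makes every lookup slightly more verbose, so the simpler one-value-per-key dictionary remains a reasonable choice when duplicates are not expected.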
Example:
Here is a complete example that extracts metadata from a web page:
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the web page (with a timeout so the call cannot hang)
url = "http://example.com"
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Step 2: Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Step 3: Extract metadata
    meta_tags = soup.find_all('meta')
    metadata = {}
    for tag in meta_tags:
        if 'name' in tag.attrs:
            name = tag.attrs['name']
            content = tag.attrs.get('content', '')
            metadata[name] = content
        elif 'property' in tag.attrs:  # For OpenGraph metadata
            prop = tag.attrs['property']  # avoid shadowing the built-in property
            content = tag.attrs.get('content', '')
            metadata[prop] = content

    # Display the metadata
    for key, value in metadata.items():
        print(f"{key}: {value}")
else:
    print(f"Failed to retrieve the web page. Status code: {response.status_code}")
This example prints the metadata from every <meta> tag on the specified web page that has a name or property attribute. Tags with neither, such as charset or http-equiv declarations, are skipped.
Remember to respect the terms of service and robots.txt of the website that you are scraping, and be aware that heavy traffic to a website caused by scraping can be seen as a denial of service attack. Always scrape responsibly.
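One way to act on this advice is to consult robots.txt before fetching a page. The sketch below uses urllib.robotparser from the Python standard library. The user-agent name and the rules are invented so that the example runs offline; against a live site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent name for this script.
USER_AGENT = "metadata-example"

# Invented rules, standing in for a site's real robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch() reports whether the rules permit this agent to request the URL.
allowed_home = parser.can_fetch(USER_AGENT, "http://example.com/")
allowed_private = parser.can_fetch(USER_AGENT, "http://example.com/private/data.html")

print(f"Home page allowed: {allowed_home}")
print(f"Private page allowed: {allowed_private}")
```

A check like this only covers robots.txt. The site's terms of service and a sensible delay between requests still need to be handled separately.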