Beautiful Soup is a Python library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser and provides Pythonic idioms for iterating and searching the parse tree.
To extract all image sources from a webpage using Beautiful Soup, you'll need to:
- Install Beautiful Soup and, optionally, a third-party parser such as lxml (Python's built-in html.parser needs no installation).
- Make a request to the webpage to get the HTML content.
- Parse the HTML content with Beautiful Soup.
- Find all the <img> tags.
- Extract the src attribute from each <img> tag.
Here's a step-by-step guide with code examples:
Step 1: Install Beautiful Soup and Requests
If you haven't already installed Beautiful Soup and requests (a library for making HTTP requests), you can do so using pip:
pip install beautifulsoup4 lxml requests
Step 2: Request the Webpage
Use the requests library to fetch the content of the webpage:
import requests
url = 'http://example.com'  # Replace with the actual URL
response = requests.get(url)
# Ensure we've got a successful response
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve the webpage: {response.status_code}")
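When the status check fails, html_content is never assigned, so the later parsing step has nothing to work with. One possible alternative is sketched below. It assumes a hypothetical helper named fetch_html and an arbitrary 10-second timeout, neither of which comes from the original example, and lets requests raise an exception on HTTP errors rather than printing a message:

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, raising on network or HTTP errors.

    fetch_html is a hypothetical helper name, not part of requests.
    The 10-second default timeout is an arbitrary illustrative value.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

A call such as html_content = fetch_html('http://example.com') would then either return the HTML or stop with an exception that the caller can catch.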
Step 3: Parse the HTML Content
Parse the retrieved HTML content with Beautiful Soup:
from bs4 import BeautifulSoup
# Parse the HTML content using Beautiful Soup and the lxml parser
soup = BeautifulSoup(html_content, 'lxml') # You can also use 'html.parser'
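To see what the parsed object offers without touching the network, the following sketch parses a small inline document. The sample markup and variable names are made up for illustration, and it uses the built-in 'html.parser' so that it runs even when lxml is not installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
sample_html = """
<html><body>
  <h1>Gallery</h1>
  <img src="/images/cat.png" alt="A cat">
  <img src="https://cdn.example.com/dog.jpg" alt="A dog">
</body></html>
"""

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup(sample_html, 'html.parser')

# Attribute-style access returns the first matching tag in the tree
heading_text = soup.h1.get_text()
first_img_alt = soup.img['alt']

print(heading_text)   # Gallery
print(first_img_alt)  # A cat
```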
Step 4: Find All <img> Tags
Use Beautiful Soup to find all the <img> tags in the parsed HTML:
# Find all <img> tags in the document
images = soup.find_all('img')
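Not every <img> tag carries a src attribute. As a small sketch, using made-up sample markup, find_all can be asked to return only tags that have the attribute, and select offers the same filter through a CSS selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second tag has no src attribute
sample_html = """
<img src="/a.png">
<img alt="placeholder without a source">
<img src="/b.png">
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Every <img> tag, with or without a src attribute
all_images = soup.find_all('img')

# Only <img> tags on which the src attribute is present
with_src = soup.find_all('img', src=True)

# The same filter expressed as a CSS attribute selector
via_css = soup.select('img[src]')

print(len(all_images), len(with_src), len(via_css))  # 3 2 2
```

Either form removes the need to check for a missing attribute later, although an empty src value would still need to be skipped.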
Step 5: Extract the src Attribute
Loop through the found <img> tags and extract the src attribute, which contains the image source URL:
# List to store image URLs
image_urls = []
# Extract the 'src' attribute from each <img> tag
for img in images:
    img_src = img.get('src')
    if img_src:
        # Optionally, resolve the URL if it's relative
        img_url = requests.compat.urljoin(url, img_src)
        image_urls.append(img_url)
# Print the list of image URLs
for img_url in image_urls:
    print(img_url)
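How relative src values resolve depends on the page URL. The sketch below uses urljoin from the standard library's urllib.parse, which is where requests.compat.urljoin appears to come from in current versions of requests. The base URL and the sample src values are invented for illustration:

```python
from urllib.parse import urljoin

# Hypothetical page URL that the src values are resolved against
base_url = 'http://example.com/blog/post.html'

# Hypothetical src values covering the common cases
sample_srcs = [
    'images/photo.jpg',                 # relative to the page's directory
    '/static/logo.png',                 # relative to the site root
    '../banner.gif',                    # one directory up
    '//cdn.example.com/pic.webp',       # protocol-relative
    'https://other.example.org/x.png',  # already absolute
]

resolved = {src: urljoin(base_url, src) for src in sample_srcs}

for src, absolute in resolved.items():
    print(f"{src} -> {absolute}")
```

Importing urljoin from urllib.parse directly avoids relying on the requests.compat module, which exists mainly for the library's own use.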
Complete Code Example
Putting it all together, here's the complete code that you can use to extract all image sources from a webpage:
import requests
from bs4 import BeautifulSoup
# The webpage URL
url = 'http://example.com'  # Replace with the actual URL
# Send a GET request to the webpage
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'lxml')  # or 'html.parser'
    # Find all <img> tags
    images = soup.find_all('img')
    # List to store image URLs
    image_urls = []
    # Extract the 'src' attribute and resolve relative URLs
    for img in images:
        img_src = img.get('src')
        if img_src:
            img_url = requests.compat.urljoin(url, img_src)
            image_urls.append(img_url)
    # Print the image URLs
    for img_url in image_urls:
        print(img_url)
else:
    print(f"Failed to retrieve the webpage: {response.status_code}")
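For reuse and offline testing, the parsing and URL-resolution steps can be separated from the HTTP request. The following is a minimal sketch under a few assumptions: extract_image_urls is a hypothetical function name, it defaults to the built-in 'html.parser', and it drops duplicate URLs, which the script above does not do:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url, parser='html.parser'):
    """Return absolute image URLs found in html, in document order.

    extract_image_urls is a hypothetical helper, not a Beautiful Soup API.
    Duplicate URLs are dropped, which is an added design choice.
    """
    soup = BeautifulSoup(html, parser)
    urls = []
    seen = set()
    for img in soup.find_all('img'):
        # Skip tags with a missing or empty src attribute
        src = (img.get('src') or '').strip()
        if not src:
            continue
        absolute = urljoin(base_url, src)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Usage with made-up markup; a real call would pass response.text and url
sample = '<img src="/a.png"><img src="b.png"><img><img src="/a.png">'
found = extract_image_urls(sample, 'http://example.com/dir/page.html')
print(found)
```

Keeping the function free of network calls means it can be exercised with inline HTML strings, as in the usage lines.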
Note: When you run this code, make sure you're allowed to scrape the website in question. Always check the website's robots.txt file and terms of service to ensure compliance with their scraping policies. Additionally, websites can have relative URLs for images, so it's important to resolve them to absolute URLs, as shown in the code example above (requests.compat.urljoin).
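The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below feeds the parser hypothetical rules from a list so that it runs offline. The rules, URLs, and variable names are assumptions for illustration only:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would read the site's file
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed_home = parser.can_fetch('*', 'http://example.com/index.html')
allowed_private = parser.can_fetch('*', 'http://example.com/private/photo.jpg')

print(allowed_home)     # True
print(allowed_private)  # False
```

Against a live site, calling parser.set_url('http://example.com/robots.txt') followed by parser.read() would load the actual rules instead. This only covers robots.txt; the terms of service still need to be read separately.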