To extract scripts or stylesheets using Beautiful Soup, you would typically search for <script> or <link> tags within the HTML document. Here's a step-by-step guide on how to do this in Python:
Prerequisites
Before you start, make sure Beautiful Soup and requests (used below to fetch pages) are installed. You also need a parser: html.parser ships with Python, while lxml is a separate package. You can install Beautiful Soup and requests using pip:
pip install beautifulsoup4 requests
Optionally, install lxml for faster parsing:
pip install lxml
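Because lxml is optional, a script can fall back to the built-in parser when lxml is not installed. The following is a minimal sketch of that idea; pick_parser is a hypothetical helper name used for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Return 'lxml' if it can be imported, otherwise the built-in 'html.parser'.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    try:
        import lxml  # noqa: F401  (imported only to check availability)
    except ImportError:
        return "html.parser"
    return "lxml"


soup = BeautifulSoup("<p>Hello</p>", pick_parser())
print(soup.p.string)  # Hello
```

Either parser handles the examples in this guide; the choice mainly affects speed and how malformed markup is repaired.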
Extracting Script Tags
The following Python example demonstrates how to find every script tag on a webpage using Beautiful Soup, printing inline JavaScript and the URLs of external script files:
from bs4 import BeautifulSoup
import requests
# Fetch the webpage
url = 'http://example.com'
response = requests.get(url)
html_content = response.text
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(html_content, 'lxml') # or 'html.parser'
# Find all script tags
scripts = soup.find_all('script')
# Iterate over script tags and get the content or the src attribute
for script in scripts:
    # If script tag contains inline JavaScript
    if script.string is not None:
        print(script.string)
    # If script tag references an external JavaScript file
    elif script.get('src'):
        print(f"External script found: {script.get('src')}")
Extracting Stylesheets
Similarly, to extract all linked stylesheets, you would look for <link> tags with a rel attribute of stylesheet. Here's an example of how to do this:
from bs4 import BeautifulSoup
import requests
# Fetch the webpage
url = 'http://example.com'
response = requests.get(url)
html_content = response.text
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(html_content, 'lxml') # or 'html.parser'
# Find all link tags that link to stylesheets
stylesheets = soup.find_all('link', rel='stylesheet')
# Iterate over link tags and get the href attribute
for link in stylesheets:
    print(f"Stylesheet URL: {link.get('href')}")
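The src and href values found this way are often relative paths. One way to turn them into absolute URLs is urljoin from the standard library. The sketch below works on an in-memory HTML string so it runs without network access; external_resources, the sample markup, and the base URL are assumptions made for illustration, and the sketch ignores any <base> tag in the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def external_resources(html, base_url):
    """Return absolute URLs of external scripts and linked stylesheets.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # src=True / href=True skip tags that lack the attribute entirely
    scripts = [
        urljoin(base_url, tag["src"])
        for tag in soup.find_all("script", src=True)
    ]
    styles = [
        urljoin(base_url, tag["href"])
        for tag in soup.find_all("link", rel="stylesheet", href=True)
    ]
    return scripts, styles


sample = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <link rel="icon" href="/favicon.ico">
  <script src="js/app.js"></script>
  <script>console.log("inline");</script>
</head></html>
"""
scripts, styles = external_resources(sample, "http://example.com/blog/")
print(scripts)  # ['http://example.com/blog/js/app.js']
print(styles)   # ['http://example.com/css/site.css']
```

When fetching a live page with requests, response.url is a reasonable value to pass as the base URL, since it reflects any redirects.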
Notes
- When extracting inline script or style content, the string property of a Beautiful Soup tag returns the text within the tag, or None if the tag is empty or has more than one child.
- When extracting external resources, use the src attribute for scripts and the href attribute for stylesheets.
- In real-world scenarios, webpages may include scripts or stylesheets in ways that are not covered by this simple example. For instance, scripts may be loaded dynamically with JavaScript, or styles may be applied inline within style attributes on individual elements.
- Always respect the terms of service and robots.txt of the website you are scraping from. Some websites explicitly disallow scraping, and extracting content from them could be against their terms of service.
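The inline cases mentioned in the notes can be handled with the same find_all approach. The following sketch, which assumes a small made-up HTML sample, collects the contents of <style> blocks and the values of style attributes. It does not cover scripts or styles injected dynamically by JavaScript, since Beautiful Soup only sees the HTML it is given.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration
sample = """
<html><head>
  <style>body { margin: 0; }</style>
</head><body>
  <p style="color: red;">Hello</p>
  <p>No inline style here</p>
</body></html>
"""
soup = BeautifulSoup(sample, "html.parser")

# Contents of <style> blocks embedded in the page (empty blocks are skipped)
style_blocks = [str(tag.string) for tag in soup.find_all("style") if tag.string]

# (tag name, style value) for every element that carries a style attribute
inline_styles = [(tag.name, tag["style"]) for tag in soup.find_all(style=True)]

print(style_blocks)   # ['body { margin: 0; }']
print(inline_styles)  # [('p', 'color: red;')]
```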
This guide should give you a good starting point for extracting scripts and stylesheets from web pages using Beautiful Soup in Python.