Using lxml to follow links and scrape multiple pages involves several steps. The lxml library allows you to parse HTML and XML documents in Python, and you can combine it with the requests library to perform HTTP requests to retrieve the content of web pages.
Here's a step-by-step guide on how to accomplish this task:
Step 1: Install the Necessary Libraries
First, ensure you have lxml and requests installed. You can install them using pip:
pip install lxml requests
Step 2: Fetch the Initial Page
You'll start by fetching the content of the initial page using the requests library.
import requests
from lxml import html
url = "http://example.com"
response = requests.get(url)
Step 3: Parse the Page Content
After fetching the page, you can parse it using lxml.html.
parsed_page = html.fromstring(response.content)
Step 4: Extract Links
Extract the links you want to follow. You'll typically look for <a> tags and their href attributes.
# Find all links on the page
links = parsed_page.xpath('//a/@href') # This will give you a list of href attributes
# Optionally, filter links based on a condition
links_to_follow = [link for link in links if "condition" in link]
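For example, a more concrete filter might keep only links under a particular path and drop duplicates. The "/articles/" path below is a hypothetical placeholder; substitute whatever pattern matches the site you're scraping.
# Hypothetical filter: keep only article links, preserving order and removing duplicates
links_to_follow = []
for link in links:
    if "/articles/" in link and link not in links_to_follow:
        links_to_follow.append(link)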
Step 5: Follow Each Link and Scrape
Now loop through each link, make a request to the new URL, parse it, and extract the information you need.
from urllib.parse import urljoin

base_url = "http://example.com"
for link in links_to_follow:
    # Resolve relative hrefs against the base URL so every link is absolute
    link = urljoin(base_url, link)
    # Fetch and parse the linked page
    response = requests.get(link)
    parsed_page = html.fromstring(response.content)
    # Scrape data from this page; //text() returns a list of text fragments
    data = parsed_page.xpath('//div[@class="data"]//text()')  # Adjust the XPath to your needs
    # Do something with the data (e.g., store it, print it, etc.)
    print(data)
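When you follow many links in a row, it's considerate to pause between requests and to skip pages that don't load cleanly. Here's a minimal sketch of that pattern, reusing links_to_follow and base_url from above; the one-second delay is an arbitrary choice, not a requirement of requests or lxml.
import time
import requests
from lxml import html
from urllib.parse import urljoin

for link in links_to_follow:
    response = requests.get(urljoin(base_url, link))
    if response.status_code != 200:
        # Skip pages that did not load successfully
        continue
    parsed_page = html.fromstring(response.content)
    # ... extract data with XPath as above ...
    time.sleep(1)  # Pause between requests to avoid overloading the server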
Example: Scrape Multiple Pages
Here's a full example that puts it all together, scraping the titles of articles from a blog's index page and then following each link to scrape the article content.
import requests
from lxml import html
from urllib.parse import urljoin
def scrape_article(article_url):
    response = requests.get(article_url)
    article_page = html.fromstring(response.content)
    article_title = article_page.xpath('//h1[@class="article-title"]/text()')[0]
    article_content = article_page.xpath('//div[@class="article-content"]//text()')
    return {
        'title': article_title,
        'content': article_content
    }
# URL of the blog index page
index_url = "http://example.com/blog"
response = requests.get(index_url)
index_page = html.fromstring(response.content)
# Extract article links
article_links = index_page.xpath('//div[@class="article-list"]//a/@href')
# Scrape each article
articles = []
for link in article_links:
    # urljoin resolves relative hrefs against the index URL
    full_link = urljoin(index_url, link)
    article_data = scrape_article(full_link)
    articles.append(article_data)
# Now you have a list of articles with their titles and content
for article in articles:
    print(article['title'])
    # ... do something with the article data ...
Remember to respect the robots.txt of the website and consider the legality and ethical implications of scraping. Some websites may not allow scraping, and heavy traffic from your scraping activities could impact the website's performance. Always scrape responsibly.
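One way to check permissions programmatically is Python's built-in urllib.robotparser, which reads a site's robots.txt and tells you whether a given URL may be fetched. A minimal sketch, assuming the site publishes a robots.txt at the usual location:
import requests
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

# Only fetch the page if the rules allow it for generic crawlers ("*")
if robots.can_fetch("*", "http://example.com/blog"):
    response = requests.get("http://example.com/blog")
else:
    print("Scraping this page is disallowed by robots.txt")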