How do I use lxml to follow links and scrape multiple pages?

Using lxml to follow links and scrape multiple pages involves a few straightforward steps. The lxml library parses HTML and XML documents in Python, and you can combine it with the requests library to fetch the content of web pages over HTTP, extract the links you care about, and then fetch and parse each linked page in turn.

Here's a step-by-step guide on how to accomplish this task:

Step 1: Install the Necessary Libraries

First, ensure you have lxml and requests installed. You can install them using pip:

pip install lxml requests

Step 2: Fetch the Initial Page

You'll start by fetching the content of the initial page using the requests library.

import requests
from lxml import html

url = "http://example.com"
response = requests.get(url)

Step 3: Parse the Page Content

After fetching the page, you can parse it using lxml.html.

parsed_page = html.fromstring(response.content)

Step 4: Extract Links

Extract the links you want to follow. You'll typically look for <a> tags and their href attributes.

# Find all links on the page
links = parsed_page.xpath('//a/@href')  # This will give you a list of href attributes

# Optionally, filter links based on a condition
links_to_follow = [link for link in links if "condition" in link]
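
As an alternative to joining relative URLs yourself later, lxml's HTML elements provide a make_links_absolute() method that rewrites every link in the parsed tree against a base URL. A small sketch using the parsed_page and url from the steps above:

# Alternative: let lxml rewrite relative hrefs into absolute URLs in place
parsed_page.make_links_absolute(url)
absolute_links = parsed_page.xpath('//a/@href')  # every entry is now an absolute URL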

Step 5: Follow Each Link and Scrape

Now loop through each link, make a request to the new URL, parse it, and extract the information you need.

from urllib.parse import urljoin

base_url = "http://example.com"
for link in links_to_follow:
    # Resolve relative URLs against the base URL
    link = urljoin(base_url, link)

    # Fetch the content of the link
    response = requests.get(link)
    parsed_page = html.fromstring(response.content)

    # Scrape data from this page
    data = parsed_page.xpath('//div[@class="data"]//text()')  # Adjust the XPath to your needs

    # Do something with the data (e.g., store it, print it, etc.)
    print(data)
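
In practice you will usually want basic error handling and a short pause between requests so you don't hammer the server. A minimal sketch of the same loop with those additions (the one-second delay and ten-second timeout are arbitrary values, not requirements):

import time

for link in links_to_follow:
    link = urljoin(base_url, link)
    response = requests.get(link, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    parsed_page = html.fromstring(response.content)
    data = parsed_page.xpath('//div[@class="data"]//text()')
    print(data)
    time.sleep(1)  # be polite: wait a second between requests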

Example: Scrape Multiple Pages

Here's a full example that puts it all together, scraping the titles of articles from a blog's index page and then following each link to scrape the article content.

import requests
from lxml import html
from urllib.parse import urljoin

def scrape_article(article_url):
    response = requests.get(article_url)
    article_page = html.fromstring(response.content)
    article_title = article_page.xpath('//h1[@class="article-title"]/text()')[0]
    article_content = article_page.xpath('//div[@class="article-content"]//text()')
    return {
        'title': article_title,
        'content': article_content
    }

# URL of the blog index page
index_url = "http://example.com/blog"
response = requests.get(index_url)
index_page = html.fromstring(response.content)

# Extract article links
article_links = index_page.xpath('//div[@class="article-list"]//a/@href')

# Scrape each article
articles = []
for link in article_links:
    full_link = urljoin(index_url, link)
    article_data = scrape_article(full_link)
    articles.append(article_data)

# Now you have a list of articles with their titles and content
for article in articles:
    print(article['title'])
    # ... do something with the article data ...
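
If the blog spreads its index across several pages, you can keep following a "next page" link until there isn't one. A sketch that assumes the index exposes such a link via rel="next" (adjust the XPath to the site's actual markup):

articles = []
page_url = "http://example.com/blog"
while page_url:
    response = requests.get(page_url)
    page = html.fromstring(response.content)

    # Collect and scrape the article links on the current index page
    for link in page.xpath('//div[@class="article-list"]//a/@href'):
        articles.append(scrape_article(urljoin(page_url, link)))

    # Follow the "next page" link if present, otherwise stop
    next_links = page.xpath('//a[@rel="next"]/@href')
    page_url = urljoin(page_url, next_links[0]) if next_links else None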

Remember to respect the website's robots.txt and consider the legal and ethical implications of scraping. Some websites may not allow scraping, and heavy traffic from your scraping activities could impact the website's performance. Always scrape responsibly.
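
One way to check robots.txt programmatically is Python's built-in urllib.robotparser; a minimal sketch:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

if robots.can_fetch("*", "http://example.com/blog"):
    response = requests.get("http://example.com/blog")
else:
    print("robots.txt disallows fetching this URL")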
