Can MechanicalSoup Follow Links Automatically?

Yes, MechanicalSoup can follow links automatically, though multi-page crawling requires a small amount of programmatic implementation rather than being a single built-in feature. The library provides powerful tools for navigating between pages, extracting links, and implementing automated crawling workflows, which makes it an excellent choice for web scraping projects that need to traverse multiple pages or follow specific navigation patterns.

Understanding Link Following in MechanicalSoup

MechanicalSoup doesn't have a single method that automatically follows every link on a page, but its StatefulBrowser exposes links() and follow_link() helpers (see the short example after this list) plus all the components needed to build sophisticated link-following mechanisms. The library excels at:

  • Extracting links from web pages
  • Following specific links based on criteria
  • Maintaining session state across page transitions
  • Handling relative and absolute URLs
  • Managing cookies and authentication during navigation
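
MechanicalSoup also ships with built-in helpers for following a single, specific link: StatefulBrowser.links() lists anchors matching a URL regex or link text, and follow_link() opens the first match in the same session. A minimal sketch (the URL and regex are placeholders):

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com")

# List links whose href matches a regex (link_text=... works as well)
for link in browser.links(url_regex=r"/blog/"):
    print(link.get("href"), link.get_text(strip=True))

# Follow the first matching link in the same session; raises
# mechanicalsoup.LinkNotFoundError if nothing matches
browser.follow_link(url_regex=r"/blog/")
print(browser.get_url())

Anything beyond a single hop, however, is a loop you write yourself, as the rest of this article shows.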

Basic Link Following Implementation

Here's a fundamental example of how to follow links automatically with MechanicalSoup:

import mechanicalsoup
from urllib.parse import urljoin, urlparse
import time

# Create a MechanicalSoup browser instance
browser = mechanicalsoup.StatefulBrowser()

# Navigate to the starting page
start_url = "https://example.com"
browser.open(start_url)

def extract_links(page, base_url):
    """Extract all links from the current page"""
    links = []

    # Find all anchor tags with href attributes
    for link in page.select('a[href]'):
        href = link.get('href')
        if href:
            # Convert relative URLs to absolute URLs
            absolute_url = urljoin(base_url, href)
            links.append({
                'url': absolute_url,
                'text': link.get_text(strip=True),
                'title': link.get('title', '')
            })

    return links

def follow_links_automatically(browser, max_pages=10, delay=1):
    """Follow links automatically with configurable limits"""
    visited_urls = set()
    pages_crawled = 0

    while pages_crawled < max_pages:
        current_url = browser.get_url()

        if current_url in visited_urls:
            print(f"Already visited: {current_url}")
            break

        visited_urls.add(current_url)
        pages_crawled += 1

        print(f"Crawling page {pages_crawled}: {current_url}")

        # Extract links from current page
        current_page = browser.get_current_page()
        links = extract_links(current_page, current_url)

        # Filter links (example: only follow internal links)
        domain = urlparse(current_url).netloc
        internal_links = [
            link for link in links 
            if urlparse(link['url']).netloc == domain
        ]

        if internal_links:
            # Follow the first available internal link
            next_link = internal_links[0]
            print(f"Following link: {next_link['text']} -> {next_link['url']}")

            try:
                browser.open(next_link['url'])
                time.sleep(delay)  # Be respectful to the server
            except Exception as e:
                print(f"Error following link: {e}")
                break
        else:
            print("No more links to follow")
            break

    return visited_urls

# Start automatic link following
visited = follow_links_automatically(browser, max_pages=5, delay=2)
print(f"Crawled {len(visited)} pages")

Advanced Link Following with Filters

For more sophisticated crawling, you can implement filters and conditions for link selection:

import re
import time
from urllib.parse import urljoin, urlparse

import mechanicalsoup

class LinkFollower:
    def __init__(self, browser, base_domain=None):
        self.browser = browser
        self.base_domain = base_domain
        self.visited_urls = set()
        self.crawl_queue = []
        # Default to no filters so the crawler works even if
        # add_link_filters() is never called
        self.include_patterns = []
        self.exclude_patterns = []

    def add_link_filters(self, include_patterns=None, exclude_patterns=None):
        """Add regex patterns to filter links"""
        self.include_patterns = include_patterns or []
        self.exclude_patterns = exclude_patterns or []

    def should_follow_link(self, url):
        """Determine if a link should be followed based on filters"""
        # Check if already visited
        if url in self.visited_urls:
            return False

        # Check domain restrictions
        if self.base_domain:
            if urlparse(url).netloc != self.base_domain:
                return False

        # Apply include patterns
        if self.include_patterns:
            if not any(re.search(pattern, url) for pattern in self.include_patterns):
                return False

        # Apply exclude patterns
        if self.exclude_patterns:
            if any(re.search(pattern, url) for pattern in self.exclude_patterns):
                return False

        return True

    def extract_and_queue_links(self):
        """Extract links from current page and add to queue"""
        current_page = self.browser.get_current_page()
        current_url = self.browser.get_url()

        for link in current_page.select('a[href]'):
            href = link.get('href')
            if href:
                absolute_url = urljoin(current_url, href)
                if self.should_follow_link(absolute_url):
                    self.crawl_queue.append({
                        'url': absolute_url,
                        'text': link.get_text(strip=True),
                        'source_page': current_url
                    })

    def crawl_automatically(self, max_pages=50, delay=1):
        """Perform automatic crawling with queue management"""
        pages_crawled = 0

        # Add current page to queue if not already visited
        current_url = self.browser.get_url()
        if current_url not in self.visited_urls:
            self.crawl_queue.append({'url': current_url, 'text': 'Start Page'})

        while self.crawl_queue and pages_crawled < max_pages:
            # Get next link from queue
            next_link = self.crawl_queue.pop(0)
            url = next_link['url']

            if url in self.visited_urls:
                continue

            try:
                print(f"Crawling: {url}")
                self.browser.open(url)
                self.visited_urls.add(url)
                pages_crawled += 1

                # Extract new links from this page
                self.extract_and_queue_links()

                time.sleep(delay)

            except Exception as e:
                print(f"Error crawling {url}: {e}")
                continue

        return {
            'pages_crawled': pages_crawled,
            'visited_urls': list(self.visited_urls),
            'remaining_queue': len(self.crawl_queue)
        }

# Usage example
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com")

crawler = LinkFollower(browser, base_domain="example.com")
crawler.add_link_filters(
    include_patterns=[r'/articles/', r'/blog/'],
    exclude_patterns=[r'\.pdf$', r'/admin/', r'/login']
)

results = crawler.crawl_automatically(max_pages=20, delay=2)
print(f"Crawling completed: {results}")

Following Specific Link Types

You can also target specific types of links for automatic following:

def follow_pagination_links(browser, max_pages=10):
    """Automatically follow pagination links"""
    page_count = 0

    while page_count < max_pages:
        current_page = browser.get_current_page()

        # Look for common pagination patterns
        next_link = (
            current_page.select_one('a[rel="next"]') or
            # soupsieve (BeautifulSoup's CSS engine) uses :-soup-contains() for text matching
            current_page.select_one('a:-soup-contains("Next")') or
            current_page.select_one('.pagination .next a') or
            current_page.select_one('a[aria-label*="next" i]')
        )

        if not next_link or not next_link.get('href'):
            print("No more pages to follow")
            break

        next_url = urljoin(browser.get_url(), next_link['href'])
        print(f"Following pagination to: {next_url}")

        try:
            browser.open(next_url)
            page_count += 1
            time.sleep(1)
        except Exception as e:
            print(f"Error following pagination: {e}")
            break

    return page_count

def follow_category_links(browser, category_selector):
    """Follow links within specific categories"""
    current_page = browser.get_current_page()
    category_links = current_page.select(category_selector)

    followed_links = []

    for link in category_links:
        href = link.get('href')
        if href:
            absolute_url = urljoin(browser.get_url(), href)

            try:
                # Use a separate browser instance so the main browser's page and session state stay untouched
                temp_browser = mechanicalsoup.StatefulBrowser()
                temp_browser.open(absolute_url)

                followed_links.append({
                    'url': absolute_url,
                    'title': link.get_text(strip=True),
                    'content': temp_browser.get_current_page()
                })

                time.sleep(0.5)

            except Exception as e:
                print(f"Error following category link {absolute_url}: {e}")

    return followed_links
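
Here's how the two helpers above might be combined; the CSS selector is a placeholder that you'd adapt to the target site's markup:

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/blog")

# Walk up to 5 pages of results, then fetch every article linked from the final page
pages_followed = follow_pagination_links(browser, max_pages=5)
articles = follow_category_links(browser, "div.article-list a")  # hypothetical selector

print(f"Followed {pages_followed} pagination pages, fetched {len(articles)} article pages")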

JavaScript Alternative for Client-Side Navigation

MechanicalSoup does not execute JavaScript, so for web applications that rely heavily on client-side rendering you may need a real browser environment and client-side link-following logic instead:

// JavaScript implementation for browser automation
class AutomaticLinkFollower {
    constructor(options = {}) {
        this.maxPages = options.maxPages || 10;
        this.delay = options.delay || 1000;
        this.visitedUrls = new Set();
        this.includePatterns = options.includePatterns || [];
        this.excludePatterns = options.excludePatterns || [];
    }

    extractLinks() {
        const links = Array.from(document.querySelectorAll('a[href]'));
        return links
            .map(link => ({
                url: new URL(link.href, window.location.href).href,
                text: link.textContent.trim(),
                element: link
            }))
            .filter(link => this.shouldFollowLink(link.url));
    }

    shouldFollowLink(url) {
        // Check if already visited
        if (this.visitedUrls.has(url)) return false;

        // Check include patterns
        if (this.includePatterns.length > 0) {
            if (!this.includePatterns.some(pattern => new RegExp(pattern).test(url))) {
                return false;
            }
        }

        // Check exclude patterns
        if (this.excludePatterns.length > 0) {
            if (this.excludePatterns.some(pattern => new RegExp(pattern).test(url))) {
                return false;
            }
        }

        return true;
    }

    async followLinksAutomatically() {
        let pagesVisited = 0;

        while (pagesVisited < this.maxPages) {
            const currentUrl = window.location.href;

            if (this.visitedUrls.has(currentUrl)) {
                console.log('Already visited this page, stopping');
                break;
            }

            this.visitedUrls.add(currentUrl);
            pagesVisited++;

            console.log(`Visiting page ${pagesVisited}: ${currentUrl}`);

            // Extract links from current page
            const links = this.extractLinks();

            if (links.length === 0) {
                console.log('No more links to follow');
                break;
            }

            // Follow the first available link
            const nextLink = links[0];
            console.log(`Following link: ${nextLink.text} -> ${nextLink.url}`);

            // Navigate to the next page
            window.location.href = nextLink.url;

            // Wait for page load
            await new Promise(resolve => setTimeout(resolve, this.delay));
            break; // Exit as navigation will reload the page
        }

        return Array.from(this.visitedUrls);
    }
}

// Usage example
const follower = new AutomaticLinkFollower({
    maxPages: 5,
    delay: 2000,
    includePatterns: ['/blog/', '/articles/'],
    excludePatterns: ['\\.pdf$', '/admin/']
});

follower.followLinksAutomatically();

Integration with Session Management

When implementing automatic link following, it's crucial to maintain proper session management, especially when dealing with authenticated areas:

def authenticated_crawl(browser, login_url, username, password):
    """Perform authenticated crawling with automatic link following"""

    # Login first
    browser.open(login_url)
    browser.select_form('form[action*="login"]')
    browser['username'] = username
    browser['password'] = password
    browser.submit_selected()

    # Verify login success (this check is site-specific; adjust it to your target site)
    if "dashboard" not in browser.get_url().lower():
        raise Exception("Login failed")

    # Now crawl authenticated pages
    crawler = LinkFollower(browser)
    crawler.add_link_filters(
        include_patterns=[r'/dashboard/', r'/profile/'],
        exclude_patterns=[r'/logout', r'/delete']
    )

    return crawler.crawl_automatically(max_pages=15)
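
Because StatefulBrowser wraps a requests.Session, cookies set during login persist automatically across every subsequent open() call in the same browser instance. If you need to reuse an authenticated session between runs, one option is to persist the cookie jar; a rough sketch (the file name is arbitrary):

import pickle

# Save the authenticated session's cookies after a successful login
with open("cookies.pkl", "wb") as f:
    pickle.dump(browser.get_cookiejar(), f)

# Later, load them into a fresh browser before crawling protected pages
new_browser = mechanicalsoup.StatefulBrowser()
with open("cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        new_browser.get_cookiejar().set_cookie(cookie)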

Error Handling and Robustness

Implement comprehensive error handling for robust automatic link following:

import requests

def robust_link_following(browser, start_url, max_pages=20):
    """Robust link following with comprehensive error handling"""

    class CrawlStats:
        def __init__(self):
            self.pages_crawled = 0
            self.errors_encountered = 0
            self.timeout_errors = 0
            self.http_errors = 0

    stats = CrawlStats()
    visited_urls = set()
    failed_urls = set()

    try:
        browser.open(start_url)
    except Exception as e:
        print(f"Failed to open start URL {start_url}: {e}")
        return {'stats': stats, 'visited_urls': [], 'failed_urls': []}

    while stats.pages_crawled < max_pages:
        try:
            current_url = browser.get_url()

            if current_url in visited_urls:
                break

            visited_urls.add(current_url)
            stats.pages_crawled += 1

            print(f"Crawling page {stats.pages_crawled}: {current_url}")

            # Extract links safely
            try:
                current_page = browser.get_current_page()
                links = extract_links(current_page, current_url)
            except Exception as e:
                print(f"Error extracting links from {current_url}: {e}")
                stats.errors_encountered += 1
                break

            # Find next valid link
            next_url = None
            for link in links:
                if (link['url'] not in visited_urls and 
                    link['url'] not in failed_urls):
                    next_url = link['url']
                    break

            if not next_url:
                print("No more valid links to follow")
                break

            # Navigate to next page with error handling
            try:
                browser.open(next_url)
                time.sleep(1)  # Rate limiting

            except requests.exceptions.Timeout:
                print(f"Timeout error for {next_url}")
                failed_urls.add(next_url)
                stats.timeout_errors += 1
                continue

            except requests.exceptions.HTTPError as e:
                print(f"HTTP error for {next_url}: {e}")
                failed_urls.add(next_url)
                stats.http_errors += 1
                continue

            except Exception as e:
                print(f"Unexpected error for {next_url}: {e}")
                failed_urls.add(next_url)
                stats.errors_encountered += 1
                continue

        except KeyboardInterrupt:
            print("Crawling interrupted by user")
            break

        except Exception as e:
            print(f"Critical error during crawling: {e}")
            stats.errors_encountered += 1
            break

    return {
        'stats': stats,
        'visited_urls': list(visited_urls),
        'failed_urls': list(failed_urls)
    }
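
A short usage sketch for the function above (the start URL is a placeholder):

browser = mechanicalsoup.StatefulBrowser()
report = robust_link_following(browser, "https://example.com", max_pages=10)

print(f"Crawled {report['stats'].pages_crawled} pages, "
      f"{len(report['failed_urls'])} URLs failed")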

Best Practices for Automatic Link Following

When implementing automatic link following with MechanicalSoup, consider these best practices:

  1. Respect Rate Limits: Always include delays between requests to avoid overwhelming servers
  2. Handle Errors Gracefully: Implement proper exception handling for network issues
  3. Set Crawl Limits: Define maximum page limits to prevent infinite crawling
  4. Filter Links Intelligently: Use domain restrictions and pattern matching to stay focused
  5. Maintain State: Preserve cookies and session data across page transitions
  6. Monitor Progress: Log crawling activity for debugging and monitoring
  7. Implement Backoff Strategies: Use exponential backoff for failed requests
  8. Validate URLs: Check URL structure before attempting to follow links
  9. Respect robots.txt: Check and respect website crawling policies
  10. Use Appropriate User Agents: Set realistic user agent strings (points 7, 9, and 10 are combined in the sketch after this list)
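
Several of these points are straightforward to bolt onto the crawlers above. The sketch below is one possible combination, assuming a generic site at https://example.com: it sets a descriptive User-Agent on the underlying requests session, consults robots.txt with the standard library's urllib.robotparser, and retries failed opens with exponential backoff.

import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import mechanicalsoup

USER_AGENT = "MyCrawler/1.0 (+https://example.com/contact)"  # placeholder identity

browser = mechanicalsoup.StatefulBrowser()
browser.session.headers["User-Agent"] = USER_AGENT  # session is a requests.Session

def allowed_by_robots(url, user_agent=USER_AGENT):
    """Check the site's robots.txt before crawling a URL."""
    parsed = urlparse(url)
    robots_url = urljoin(f"{parsed.scheme}://{parsed.netloc}", "/robots.txt")
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        return True  # robots.txt unreachable: fail open here, adjust to taste
    return parser.can_fetch(user_agent, url)

def open_with_backoff(browser, url, max_retries=3, base_delay=1.0):
    """Open a URL, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return browser.open(url)
        except Exception as e:
            wait = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed for {url}: {e}; retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

url = "https://example.com/blog/"
if allowed_by_robots(url):
    open_with_backoff(browser, url)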

Performance Optimization Techniques

For large-scale link following operations, consider these optimization strategies:

import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import mechanicalsoup

class ParallelLinkFollower:
    def __init__(self, max_workers=5):
        self.max_workers = max_workers
        self.url_queue = queue.Queue()
        self.visited_urls = set()
        self.results = []
        self.lock = threading.Lock()

    def worker_crawl(self, worker_id):
        """Worker function for parallel crawling"""
        browser = mechanicalsoup.StatefulBrowser()

        while True:
            try:
                url = self.url_queue.get(timeout=5)

                with self.lock:
                    if url in self.visited_urls:
                        self.url_queue.task_done()
                        continue
                    self.visited_urls.add(url)

                print(f"Worker {worker_id} crawling: {url}")

                # Crawl the page
                browser.open(url)
                page = browser.get_current_page()

                # Extract new links
                new_links = extract_links(page, url)

                with self.lock:
                    for link in new_links:
                        if link['url'] not in self.visited_urls:
                            self.url_queue.put(link['url'])

                    self.results.append({
                        'url': url,
                        'title': page.title.string if page.title else '',
                        'links_found': len(new_links),
                        'worker_id': worker_id
                    })

                self.url_queue.task_done()
                time.sleep(1)  # Rate limiting

            except queue.Empty:
                print(f"Worker {worker_id} finished - no more URLs")
                break
            except Exception as e:
                print(f"Worker {worker_id} error: {e}")
                self.url_queue.task_done()

    def crawl_parallel(self, start_urls, max_pages=100):
        """Perform parallel crawling"""

        # Add start URLs to queue
        for url in start_urls:
            self.url_queue.put(url)

        # Start worker threads
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = [
                executor.submit(self.worker_crawl, i) 
                for i in range(self.max_workers)
            ]

            # Wait for completion or max pages
            while (len(self.results) < max_pages and 
                   not self.url_queue.empty()):
                time.sleep(1)

            # Signal workers to stop by emptying queue
            while not self.url_queue.empty():
                try:
                    self.url_queue.get_nowait()
                    self.url_queue.task_done()
                except queue.Empty:
                    break

        return self.results[:max_pages]

# Usage example for parallel crawling
parallel_crawler = ParallelLinkFollower(max_workers=3)
results = parallel_crawler.crawl_parallel(
    start_urls=["https://example.com", "https://example.com/blog"],
    max_pages=50
)
print(f"Parallel crawling completed: {len(results)} pages")

Conclusion

While MechanicalSoup doesn't provide automatic link following out of the box, it offers all the necessary tools to build sophisticated crawling mechanisms. By combining link extraction, URL filtering, and session management, you can create robust automated navigation systems. For more complex scenarios involving JavaScript-heavy sites, you might also consider how to handle AJAX requests using Puppeteer or explore browser session handling techniques for advanced automation needs.

The key to successful automatic link following lies in understanding your target website's structure, implementing appropriate filters and limits, and ensuring respectful crawling behavior that doesn't overwhelm the target servers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
