How can I handle relative URLs when scraping with jsoup?

When web scraping with jsoup, one of the most common challenges developers face is handling relative URLs. Websites often use relative paths for links, images, and other resources, which can cause issues when extracting and following these URLs programmatically. This comprehensive guide will show you how to properly resolve relative URLs to absolute URLs using jsoup's built-in functionality.

Understanding Relative URLs in Web Scraping

Relative URLs are paths that don't include the full domain and protocol. They're commonly used in HTML documents to reference resources on the same domain. For example:

  • /products/item.html (root-relative path)
  • ../category/products.html (relative path with directory traversal)
  • item.html (relative to current directory)
  • #section (fragment identifier)

When scraping websites, these relative URLs need to be converted to absolute URLs to be useful for further processing, data extraction, or navigation between pages.
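
Before looking at jsoup's API in detail, here is a quick offline sketch of how the forms above resolve against a base URI (the snippet and base URL are illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RelativeUrlDemo {
    public static void main(String[] args) {
        String html = "<a href='/products/item.html'>a</a>"
                    + "<a href='../category/products.html'>b</a>"
                    + "<a href='item.html'>c</a>"
                    + "<a href='#section'>d</a>";

        // The second argument sets the base URI used to resolve relative links
        Document doc = Jsoup.parse(html, "https://example.com/shop/current/");

        for (Element link : doc.select("a[href]")) {
            // absUrl() combines the base URI with the raw href value
            System.out.println(link.attr("href") + " -> " + link.absUrl("href"));
        }
    }
}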

Basic Relative URL Resolution with jsoup

jsoup provides excellent built-in support for handling relative URLs through the absUrl() method. Here's how to use it:

Simple URL Resolution

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RelativeUrlHandler {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/products/";
        Document doc = Jsoup.connect(url).get();

        // Get all links and convert relative URLs to absolute
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            String relativeUrl = link.attr("href");
            String absoluteUrl = link.absUrl("href");

            System.out.println("Relative: " + relativeUrl);
            System.out.println("Absolute: " + absoluteUrl);
            System.out.println("---");
        }
    }
}
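
Note that absUrl() returns an empty string when a URL cannot be resolved to an absolute form, for example when the document was parsed from a string without a base URI. When fetching with Jsoup.connect(), the base URI is set automatically from the request URL, so absUrl() works out of the box.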

Handling Different URL Types

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.URL;

public class ComprehensiveUrlHandler {

    public static void processUrls(String baseUrl) throws Exception {
        Document doc = Jsoup.connect(baseUrl).get();

        // Handle different types of links
        handleLinks(doc);
        handleImages(doc);
        handleStylesheets(doc);
        handleScripts(doc);
    }

    private static void handleLinks(Document doc) {
        Elements links = doc.select("a[href]");
        System.out.println("Processing " + links.size() + " links:");

        for (Element link : links) {
            String href = link.attr("href");
            String absoluteUrl = link.absUrl("href");

            // Filter out empty URLs and fragments
            if (!absoluteUrl.isEmpty() && !href.startsWith("#")) {
                System.out.println("Link: " + absoluteUrl);

                // Optional: Validate URL format
                if (isValidUrl(absoluteUrl)) {
                    // Process the valid URL
                    processValidUrl(absoluteUrl);
                }
            }
        }
    }

    private static void handleImages(Document doc) {
        Elements images = doc.select("img[src]");
        System.out.println("Processing " + images.size() + " images:");

        for (Element img : images) {
            String absoluteUrl = img.absUrl("src");

            if (!absoluteUrl.isEmpty()) {
                System.out.println("Image: " + absoluteUrl);
            }
        }
    }

    private static void handleStylesheets(Document doc) {
        Elements stylesheets = doc.select("link[rel=stylesheet]");
        for (Element stylesheet : stylesheets) {
            String absoluteUrl = stylesheet.absUrl("href");
            if (!absoluteUrl.isEmpty()) {
                System.out.println("CSS: " + absoluteUrl);
            }
        }
    }

    private static void handleScripts(Document doc) {
        Elements scripts = doc.select("script[src]");
        for (Element script : scripts) {
            String absoluteUrl = script.absUrl("src");
            if (!absoluteUrl.isEmpty()) {
                System.out.println("Script: " + absoluteUrl);
            }
        }
    }

    private static boolean isValidUrl(String url) {
        try {
            new URL(url);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    private static void processValidUrl(String url) {
        // Add your URL processing logic here
        // This could include adding to a queue for further scraping
    }
}

Advanced URL Resolution Techniques

Custom Base URL Resolution

Sometimes you need to resolve URLs against a different base URL than the document's original URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.URI;
import java.net.URISyntaxException;

public class CustomBaseUrlResolver {

    public static String resolveUrl(String baseUrl, String relativeUrl) {
        try {
            URI base = new URI(baseUrl);
            URI resolved = base.resolve(relativeUrl);
            return resolved.toString();
        } catch (URISyntaxException e) {
            System.err.println("Error resolving URL: " + e.getMessage());
            return relativeUrl;
        }
    }

    public static void main(String[] args) throws Exception {
        String htmlContent = """
            <html>
                <body>
                    <a href="/category/products">Products</a>
                    <a href="../about.html">About</a>
                    <a href="contact.html">Contact</a>
                </body>
            </html>
        """;

        String customBaseUrl = "https://mystore.com/shop/electronics/";
        Document doc = Jsoup.parse(htmlContent, customBaseUrl);

        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href");
            String absoluteUrl = link.absUrl("href");
            String customResolved = resolveUrl(customBaseUrl, href);

            System.out.println("Original: " + href);
            System.out.println("jsoup resolved: " + absoluteUrl);
            System.out.println("Custom resolved: " + customResolved);
            System.out.println("---");
        }
    }
}
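
For well-formed inputs, both approaches should produce the same result. The practical difference is their failure behavior: absUrl() returns an empty string when it cannot resolve a URL, while the helper above falls back to the original relative URL, so pick whichever failure mode suits your pipeline.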

Building a URL Collection System

Here's a practical example that builds a comprehensive URL collection system:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class UrlCollectionSystem {
    private Set<String> processedUrls = ConcurrentHashMap.newKeySet();
    private Queue<String> urlQueue = new LinkedList<>();
    private String baseDomain;

    public UrlCollectionSystem(String baseDomain) {
        this.baseDomain = baseDomain;
    }

    public void scrapeUrls(String startUrl, int maxDepth) {
        urlQueue.offer(startUrl);

        for (int depth = 0; depth < maxDepth && !urlQueue.isEmpty(); depth++) {
            int currentLevelSize = urlQueue.size();

            for (int i = 0; i < currentLevelSize; i++) {
                String currentUrl = urlQueue.poll();
                if (currentUrl != null && !processedUrls.contains(currentUrl)) {
                    processUrl(currentUrl);
                }
            }
        }
    }

    private void processUrl(String url) {
        try {
            if (processedUrls.contains(url)) {
                return;
            }

            processedUrls.add(url);
            System.out.println("Processing: " + url);

            Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                .timeout(5000)
                .get();

            // Extract and process all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String absoluteUrl = link.absUrl("href");

                if (isValidUrl(absoluteUrl) && isSameDomain(absoluteUrl)) {
                    urlQueue.offer(absoluteUrl);
                }
            }

            // Extract other resources
            extractResources(doc);

        } catch (Exception e) {
            System.err.println("Error processing " + url + ": " + e.getMessage());
        }
    }

    private void extractResources(Document doc) {
        // Extract images
        Elements images = doc.select("img[src]");
        for (Element img : images) {
            String absoluteUrl = img.absUrl("src");
            if (!absoluteUrl.isEmpty()) {
                System.out.println("Found image: " + absoluteUrl);
            }
        }

        // Extract stylesheets
        Elements stylesheets = doc.select("link[rel=stylesheet]");
        for (Element css : stylesheets) {
            String absoluteUrl = css.absUrl("href");
            if (!absoluteUrl.isEmpty()) {
                System.out.println("Found stylesheet: " + absoluteUrl);
            }
        }
    }

    private boolean isValidUrl(String url) {
        return url != null && 
               !url.isEmpty() && 
               (url.startsWith("http://") || url.startsWith("https://")) &&
               !url.contains("#") && // Skip any URL that contains a fragment
               !url.startsWith("mailto:") &&
               !url.startsWith("tel:");
    }

    private boolean isSameDomain(String url) {
        try {
            String host = new URI(url).getHost();
            // Compare the host itself rather than doing a substring match,
            // which would also match the domain in paths or query strings
            return host != null
                && (host.equals(baseDomain) || host.endsWith("." + baseDomain));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public Set<String> getProcessedUrls() {
        return new HashSet<>(processedUrls);
    }
}
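
A minimal usage sketch (the domain and start URL are placeholders for your target site):

public class UrlCollectionDemo {
    public static void main(String[] args) {
        // "example.com" restricts crawling to that host and its subdomains
        UrlCollectionSystem collector = new UrlCollectionSystem("example.com");
        collector.scrapeUrls("https://example.com/", 2);

        System.out.println("Collected " + collector.getProcessedUrls().size() + " URLs");
    }
}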

Error Handling and Best Practices

Robust URL Processing

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Logger;
import java.util.logging.Level;

public class RobustUrlProcessor {
    private static final Logger logger = Logger.getLogger(RobustUrlProcessor.class.getName());

    public static void processUrlsSafely(String baseUrl) {
        try {
            Document doc = Jsoup.connect(baseUrl)
                .timeout(10000)
                .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                .followRedirects(true)
                .get();

            Elements links = doc.select("a[href]");

            for (Element link : links) {
                try {
                    String href = link.attr("href");
                    if (href == null || href.trim().isEmpty()) {
                        continue;
                    }

                    String absoluteUrl = link.absUrl("href");

                    if (validateUrl(absoluteUrl)) {
                        processValidatedUrl(absoluteUrl, link.text());
                    }

                } catch (Exception e) {
                    logger.log(Level.WARNING, "Error processing individual link", e);
                }
            }

        } catch (Exception e) {
            logger.log(Level.SEVERE, "Error processing page: " + baseUrl, e);
        }
    }

    private static boolean validateUrl(String url) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }

        // Fragment-only hrefs such as "#section" point back to the same page
        if (url.startsWith("#")) {
            return false;
        }

        try {
            URL urlObj = new URL(url);

            // Accept only HTTP(S) URLs
            String protocol = urlObj.getProtocol();
            return "http".equals(protocol) || "https".equals(protocol);

        } catch (MalformedURLException e) {
            logger.log(Level.FINE, "Invalid URL format: " + url, e);
            return false;
        }
    }

    private static void processValidatedUrl(String url, String linkText) {
        System.out.println("Valid URL found: " + url + " (Text: " + linkText + ")");
        // Add your processing logic here
    }
}

Integration with Modern Web Scraping Workflows

When working with complex websites that require JavaScript rendering, you might need to combine jsoup with other tools. For sites with dynamic content loading, consider using browser automation tools for handling AJAX requests before processing URLs with jsoup.
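
For example, if a headless browser has already produced the rendered HTML, you can hand that string to jsoup and supply the page URL as the base URI so that absUrl() still resolves correctly. A sketch, where renderedHtml stands in for the output of your rendering tool:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RenderedHtmlHandler {
    // renderedHtml would come from a browser automation tool;
    // pageUrl is the URL the HTML was rendered from
    public static void extractLinks(String renderedHtml, String pageUrl) {
        // Passing pageUrl as the base URI keeps absUrl() working even though
        // jsoup did not fetch the page itself
        Document doc = Jsoup.parse(renderedHtml, pageUrl);

        for (Element link : doc.select("a[href]")) {
            String absoluteUrl = link.absUrl("href");
            if (!absoluteUrl.isEmpty()) {
                System.out.println("Link: " + absoluteUrl);
            }
        }
    }
}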

Performance Optimization Tips

  1. URL Caching: Cache resolved URLs to avoid repeated processing (see the caching sketch after the parallel example below)
  2. Batch Processing: Process URLs in batches for better performance
  3. Connection Pooling: Reuse HTTP connections when possible
  4. Parallel Processing: Use concurrent processing for independent URLs, as shown below

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelUrlProcessor {
    private ExecutorService executor = Executors.newFixedThreadPool(5);

    public List<String> processUrlsInParallel(List<String> urls) {
        List<CompletableFuture<String>> futures = urls.stream()
            .map(url -> CompletableFuture.supplyAsync(() -> processUrl(url), executor))
            .collect(Collectors.toList());

        return futures.stream()
            .map(CompletableFuture::join)
            .filter(result -> result != null)
            .collect(Collectors.toList());
    }

    private String processUrl(String url) {
        try {
            Connection connection = Jsoup.connect(url)
                .timeout(5000)
                .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)");

            // Process and return result
            return connection.get().title();
        } catch (Exception e) {
            return null;
        }
    }

    // Shut down the pool when finished; otherwise its non-daemon worker
    // threads can keep the JVM alive
    public void shutdown() {
        executor.shutdown();
    }
}
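
For tip 1 above, a simple resolved-URL cache can be built on ConcurrentHashMap. This is a sketch that resolves via java.net.URI, not a jsoup API:

import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UrlResolutionCache {
    // Maps "baseUrl|relativeUrl" to the resolved absolute URL
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String resolve(String baseUrl, String relativeUrl) {
        // computeIfAbsent resolves each (base, relative) pair only once
        return cache.computeIfAbsent(baseUrl + "|" + relativeUrl, key -> {
            try {
                return new URI(baseUrl).resolve(relativeUrl).toString();
            } catch (Exception e) {
                return ""; // unresolvable; empty mirrors jsoup's absUrl() behavior
            }
        });
    }
}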

Conclusion

Handling relative URLs properly is crucial for effective web scraping with jsoup. The absUrl() method provides a simple and reliable way to convert relative URLs to absolute ones, while custom resolution techniques offer more control when needed. By implementing proper error handling, validation, and performance optimization strategies, you can build robust scraping systems that handle URLs effectively across different website structures and requirements.

Remember to always respect robots.txt files and implement appropriate delays between requests to avoid overwhelming target servers. For complex scenarios involving dynamic content, consider combining jsoup with other tools in your scraping toolkit.
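
For instance, a fixed delay between requests is the simplest politeness policy (the 1000 ms value is illustrative, not a recommendation):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

public class PoliteScraper {
    public static void fetchAll(List<String> urls) throws InterruptedException {
        for (String url : urls) {
            try {
                Document doc = Jsoup.connect(url).timeout(10000).get();
                System.out.println(url + " -> " + doc.title());
            } catch (Exception e) {
                System.err.println("Failed: " + url + " (" + e.getMessage() + ")");
            }
            // Pause between requests so the target server is not overwhelmed
            Thread.sleep(1000);
        }
    }
}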

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
