Does WebMagic support XPath and CSS selectors?

Yes, WebMagic, a Java web-scraping framework, supports both XPath and CSS selectors for extracting information from web pages. The framework is built around the concept of selectors for pulling elements out of an HTML document, and it offers several selector types you can choose from depending on the needs of the task at hand.

Below are examples of how to use XPath and CSS selectors with WebMagic:

XPath Selector Example:

To use an XPath selector, you can use the XpathSelector class directly or, more commonly, the xpath method provided by the Selectable interface in WebMagic.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

public class XPathSelectorExample implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Use XPath to select elements
        Selectable xpathSelectable = page.getHtml().xpath("//div[@class='some-class']/a");
        // Extract the link text using XPath
        String linkText = xpathSelectable.xpath("//a/text()").toString();
        System.out.println("Link Text: " + linkText);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new XPathSelectorExample())
              .addUrl("http://example.com")
              .thread(5)
              .run();
    }
}

CSS Selector Example:

For CSS selectors, WebMagic provides the CssSelector class and the css method (also available as $) on the Selectable interface.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

public class CssSelectorExample implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Use a CSS selector to select the link elements
        Selectable cssSelectable = page.getHtml().css("div.some-class a");
        // Extract the link text using CSS; the second argument of css() names what to extract ("text")
        String linkText = cssSelectable.css("a", "text").toString();
        System.out.println("Link Text: " + linkText);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new CssSelectorExample())
              .addUrl("http://example.com")
              .thread(5)
              .run();
    }
}

In both examples, the process method is where you write the extraction logic using selectors. The Site object holds the crawler configuration, such as the retry count and the sleep time between requests, and the Spider class drives the crawl.
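
If you need finer control over the crawler, Site exposes additional settings beyond retry count and sleep time. A minimal sketch, with placeholder values and a hypothetical user-agent string:

import us.codecraft.webmagic.Site;

public class SiteConfigSketch {

    // Build a Site with a few common options; the values are placeholders, not recommendations.
    public static Site buildSite() {
        return Site.me()
                .setRetryTimes(3)       // retry a failed request up to 3 times
                .setSleepTime(1000)     // wait 1000 ms between requests
                .setTimeOut(10000)      // HTTP timeout in milliseconds
                .setCharset("UTF-8")    // override the detected response charset
                .setUserAgent("Mozilla/5.0 (compatible; MyCrawler/1.0)"); // hypothetical UA
    }
}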

WebMagic's selector system is flexible: you can chain selectors and combine XPath and CSS expressions to navigate complex HTML structures and extract content precisely.
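
As an illustration of chaining, the sketch below narrows the page with a CSS selector, extracts a heading with XPath, and then trims the result with a regular expression. The div.article / h1 structure is an assumed example layout, not taken from a real site:

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ChainedSelectorExample implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // CSS narrows to the container, XPath pulls the heading text,
        // and the regex keeps at most the first 50 characters.
        String title = page.getHtml()
                .css("div.article")      // assumed container class
                .xpath("//h1/text()")
                .regex("(.{1,50})")
                .toString();
        page.putField("title", title);
    }

    @Override
    public Site getSite() {
        return site;
    }
}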
