Yes. WebMagic, a Java web scraping framework, supports both XPath and CSS selectors for extracting information from web pages. Extraction in WebMagic is built around selectors that fetch elements from the parsed HTML document, so you can pick whichever style suits your preference or the task at hand.
Below is an example of how you can use both XPath and CSS selectors with WebMagic:
XPath Selector Example:
To use an XPath selector, you can use the XpathSelector class or the xpath method provided by the Selectable interface in WebMagic.
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;

    public class XPathSelectorExample implements PageProcessor {

        private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Use XPath to select elements
            Selectable xpathSelectable = page.getHtml().xpath("//div[@class='some-class']/a");
            // Extract the link text using XPath
            String linkText = xpathSelectable.xpath("//a/text()").toString();
            System.out.println("Link Text: " + linkText);
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new XPathSelectorExample())
                    .addUrl("http://example.com")
                    .thread(5)
                    .run();
        }
    }
CSS Selector Example:
For a CSS selector, WebMagic provides the CssSelector class or the css method from the Selectable interface.
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;

    public class CssSelectorExample implements PageProcessor {

        private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Use CSS to select elements
            Selectable cssSelectable = page.getHtml().css("div.some-class a");
            // Chain an XPath call on the CSS result to extract the link text
            String linkText = cssSelectable.xpath("//a/text()").toString();
            System.out.println("Link Text: " + linkText);
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new CssSelectorExample())
                    .addUrl("http://example.com")
                    .thread(5)
                    .run();
        }
    }
In both examples, the process method is where you write your logic to extract data using selectors. The Site object represents the configuration for the crawler, such as retry times and sleep time between requests. The Spider class is responsible for the execution of the web scraping process.
WebMagic's selector calls can be chained, and XPath and CSS selectors can be mixed within one chain. This helps with complex HTML structures: narrow the document down to a container with one selector, then extract the values you need from it with another.
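As a minimal sketch of that chaining, the class below runs selectors against an in-memory HTML string instead of a crawled page, so it needs no network access. It assumes webmagic-core is on the classpath and that an Html object can be built directly from a string. The class name SelectorChainingSketch, the sample markup, and the two helper methods are hypothetical and exist only for illustration.

```java
import java.util.List;

import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;

public class SelectorChainingSketch {

    // Hypothetical sample markup, used only for illustration.
    static final String SAMPLE =
            "<html><body>"
            + "<div class='some-class'>"
            + "<a href='/first'>First</a>"
            + "<a href='/second'>Second</a>"
            + "</div>"
            + "<div class='other'><a href='/third'>Third</a></div>"
            + "</body></html>";

    // CSS narrows the document to the links inside div.some-class,
    // then XPath extracts the text of each selected link.
    static List<String> linkTexts(String markup) {
        Selectable links = new Html(markup).css("div.some-class a");
        return links.xpath("//a/text()").all();
    }

    // XPath narrows the document to the container div,
    // then CSS reads the href attribute of each link inside it.
    static List<String> linkHrefs(String markup) {
        Selectable container = new Html(markup).xpath("//div[@class='some-class']");
        return container.css("a", "href").all();
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE));
        System.out.println(linkHrefs(SAMPLE));
    }
}
```

Inside a real PageProcessor, the same chains would start from page.getHtml() rather than from a hand-built Html object. Calling all() returns every match, whereas toString() in the examples above returns only the first one.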