WebMagic is an open-source Java framework for web scraping. It fetches web page content and extracts data without requiring you to write a lot of boilerplate code.
To extract attributes from HTML elements using WebMagic, you typically work with the Selectable interface, which offers several methods for extracting data, including element attributes.
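In practice, attribute extraction comes down to a handful of calls on the Html object (which implements Selectable). A quick preview with placeholder selectors; the steps below walk through the full setup:

// Inside a PageProcessor's process(Page page) method (selectors are placeholders):
page.getHtml().css("a", "href").all();    // CSS selector plus attribute name
page.getHtml().xpath("//img/@src").all(); // XPath @attribute expression
page.getHtml().links().all();             // shortcut for the href of every link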
Here's a step-by-step guide on how to extract attributes from HTML elements with WebMagic:
Set Up WebMagic: First, you need to add WebMagic to your project. If you're using Maven, you can include the following dependency in your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
Make sure you check for the latest version on the Maven Repository.
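If your project uses Gradle instead of Maven, the equivalent declaration uses the same coordinates (a sketch; again, substitute the current version):

dependencies {
    implementation 'us.codecraft:webmagic-core:0.7.3'
}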
Create a Processor: Implement the PageProcessor interface to customize how the webpage is processed.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Use CSS selectors to target the HTML elements
        // and extract the attributes you need.
        // For example, to get the "href" attribute of all links:
        page.putField("links", page.getHtml().css("a", "href").all());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
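If you also want the spider to visit more than the start URL, the same process method is where you queue follow-up links. A sketch, assuming you only want to stay on example.com (the regex filter is a placeholder):

@Override
public void process(Page page) {
    // Queue every on-site link for crawling; the regex filter is an example
    page.addTargetRequests(page.getHtml().links().regex(".*example\\.com.*").all());

    // ... attribute extraction as shown below ...
}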
Extract Attributes: Within the process method of your PageProcessor, use the Selectable methods to extract attributes. The Selectable object represents a part of the HTML that you can interact with. For example, to extract the href attribute from all anchor tags, you could do something like this:

@Override
public void process(Page page) {
    // Extract the href attribute of every link (requires java.util.List)
    List<String> links = page.getHtml().$("a").links().all();
    page.putField("links", links);
}
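The same approach works for any attribute, not just href. A short sketch with placeholder field names, using the css(selector, attribute) overload and an XPath @attribute expression:

@Override
public void process(Page page) {
    // "src" attribute of every <img> element, via the CSS-selector overload
    page.putField("imageSrcs", page.getHtml().css("img", "src").all());

    // "alt" attribute of every <img> element, via an XPath expression
    page.putField("imageAlts", page.getHtml().xpath("//img/@alt").all());
}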
Run the Spider: After setting up your PageProcessor, you can create a Spider to crawl the web pages.

import us.codecraft.webmagic.Spider;

public class MyCrawler {
    public static void main(String[] args) {
        Spider.create(new MyPageProcessor())
                .addUrl("http://example.com") // The starting URL
                .thread(5)                    // Number of threads
                .run();                       // Start the crawler
    }
}
Pipeline: If you want to persist the extracted data, implement a Pipeline to process the result items.

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

public class MyPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        // Here you can handle the extracted data
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
And add the pipeline to your spider:

Spider.create(new MyPageProcessor())
        .addUrl("http://example.com")
        .addPipeline(new MyPipeline())
        .thread(5)
        .run();
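For quick experiments you may not need a custom pipeline at all: webmagic-core ships ready-made ones in us.codecraft.webmagic.pipeline, such as ConsolePipeline and JsonFilePipeline. A sketch (the output directory is an arbitrary example):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;

public class MyCrawlerWithBuiltInPipelines {
    public static void main(String[] args) {
        Spider.create(new MyPageProcessor())
                .addUrl("http://example.com")
                .addPipeline(new ConsolePipeline())                 // print result items to stdout
                .addPipeline(new JsonFilePipeline("/tmp/webmagic")) // write results as JSON files
                .thread(5)
                .run();
    }
}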
Remember that web scraping should be done responsibly, respecting the website's robots.txt rules and terms of service. Always ensure that your activities comply with legal requirements and ethical considerations.