WebMagic is a simple and flexible Java framework used for web scraping. It provides a way to fetch, parse, and extract data from web pages using various selectors. Here's a breakdown of the selectors supported by WebMagic:
1. XPath Selector
XPath is a query language for selecting nodes in XML documents, and it works with HTML as well. WebMagic supports XPath expressions for navigating through elements and attributes and extracting data from a parsed page.
Example (Java):
Html html = page.getHtml();
// Collect the href attribute of every <a> element on the page.
List<String> links = html.xpath("//a/@href").all();
// Get the text content of the <title> element (first match only).
String title = html.xpath("//title/text()").get();
2. CSS Selector
CSS selectors are patterns that match elements by id, class, tag name, attributes, and attribute values. WebMagic lets you use Jsoup-style CSS selectors to extract data from pages; by default a CSS selection returns the element's outer HTML, so to get its text or one of its attributes you pass the attribute name ("text", "href", and so on) as a second argument.
Example (Java):
Html html = page.getHtml();
// Jsoup-based CSS selectors have no ::text pseudo-element; pass "text" as
// a second argument to extract the element's text instead.
List<String> articleTitles = html.css("h1.article-title", "text").all();
String author = html.css("div.author", "text").get();
3. Regex Selector
Regular expressions match patterns in raw text. WebMagic's regex selector applies a pattern directly to the page source: if the pattern contains a capture group, the content of the first group is returned; a pattern without groups returns the whole match.
Example (Java):
Html html = page.getHtml();
// With a capture group, the group's content is returned (here, the script body).
String scriptContent = html.regex("<script>(.*?)</script>").get();
// Without a capture group, each whole match is returned.
List<String> emailAddresses = html.regex("[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,6}").all();
4. JsonPath Selector
JsonPath is to JSON what XPath is to XML. It provides a way to extract data from a JSON structure. If the target page provides JSON-formatted data, WebMagic can use JsonPath selectors to parse and extract the needed information.
Example (Java):
Json json = page.getJson();
// Assuming a response body like {"items":[{"title":"First"},{"title":"Second"}]}.
List<String> titles = json.jsonPath("$.items[*].title").all();
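JsonPath isn't limited to pages that are pure JSON. If a page embeds a JSON blob inside HTML (in a <script> tag, say), you can pull the blob out with a regex and then query it. A rough sketch, assuming the page defines a var data = {...}; variable (the variable name and JSON shape are made up for illustration):
Example (Java):
// Suppose the page embeds: <script>var data = {"items":[{"title":"First"}]};</script>
String blob = page.getHtml().regex("var data = (\\{.*?\\});").get();
// Json (us.codecraft.webmagic.selector.Json) applies JsonPath to any JSON string.
List<String> embeddedTitles = new Json(blob).jsonPath("$.items[*].title").all();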
When using WebMagic, you can combine these selectors to fit the structure and complexity of the page you are scraping. Because every selector call returns a Selectable, calls can also be chained, as in the sketch below.
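A rough chained sketch (the main-content class and the .pdf filter are just illustrative):
Example (Java):
// Narrow to a container with XPath, collect its links, then keep only PDFs.
List<String> pdfLinks = page.getHtml()
        .xpath("//div[@class='main-content']")
        .links()
        .regex(".*\\.pdf")
        .all();
Here's an example of how you might put everything together and run a scrape: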
Example (Java):
Spider.create(new YourCustomPageProcessor())
        .addUrl("http://example.com")       // seed URL to start crawling from
        .addPipeline(new ConsolePipeline()) // print extracted fields to the console
        .thread(5)                          // download with 5 threads
        .run();
You would need to define YourCustomPageProcessor to apply the appropriate selectors to the data you want to extract.
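Here is what such a processor might look like. This is a rough sketch rather than WebMagic's canonical example: the CSS class names and field names are assumptions for illustration.
Example (Java):
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class YourCustomPageProcessor implements PageProcessor {

    // Be polite: retry failed downloads a few times and pause between requests.
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Extract fields with the selectors described above (class names assumed).
        page.putField("title", page.getHtml().css("h1.article-title", "text").get());
        page.putField("author", page.getHtml().css("div.author", "text").get());
        // Queue every link found on the page for crawling.
        page.addTargetRequests(page.getHtml().links().all());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
With the ConsolePipeline from the snippet above, every field stored via putField() is printed to the console as pages are processed.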
Remember that when scraping websites, you should always check the site's robots.txt to understand its crawling policy and make sure you comply with its terms of service. Also be respectful of the site's resources: don't overload the server with too many requests in a short period; the setSleepTime() call in the sketch above is one way to throttle.