WebMagic is an open-source web crawling framework written in Java, designed for simplicity and ease of use. When you're scraping data with WebMagic, you can specify the output format by implementing a custom Pipeline. A Pipeline is responsible for processing the results extracted by WebMagic spiders.
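For reference, the Pipeline contract in us.codecraft.webmagic.pipeline is a single-method interface, so a custom implementation only needs to handle the extracted fields and the task that produced them:

public interface Pipeline {

    // Called with the fields extracted from each successfully processed page
    void process(ResultItems resultItems, Task task);
}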
By default, WebMagic comes with several built-in Pipeline implementations, such as ConsolePipeline, FilePipeline, and JsonFilePipeline, which output the scraped data to the console, a text file, or a JSON file, respectively. However, if you want to customize the output format (e.g., XML, CSV, or a database), you will need to implement your own Pipeline.
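For instance, if JSON output is all you need, you can attach the built-in JsonFilePipeline directly; in this sketch, the output directory is a placeholder and MyPageProcessor stands for your own PageProcessor implementation (a sketch of one appears later):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;

// Each page's result items are written as a JSON file under the given directory
Spider.create(new MyPageProcessor())
        .addUrl("http://example.com")
        .addPipeline(new JsonFilePipeline("/tmp/webmagic"))
        .run();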
Here's a basic example of how to create a custom Pipeline that formats scraped data as CSV and writes it to a file:
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class CsvPipeline implements Pipeline {

    private final String filePath;

    public CsvPipeline(String filePath) {
        this.filePath = filePath;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Open the file in append mode so rows from successive pages accumulate
        try (FileWriter writer = new FileWriter(filePath, true);
             PrintWriter printWriter = new PrintWriter(writer)) {
            // Assume the page processor put two fields into resultItems: 'title' and 'price'
            String title = resultItems.get("title");
            String price = resultItems.get("price");
            // Format the data as a quoted CSV row and write it to the file
            printWriter.println(String.format("\"%s\",\"%s\"", title, price));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
You can then add your custom Pipeline to your spider like so:
import us.codecraft.webmagic.Spider;

public class MySpider {

    public static void main(String[] args) {
        // ... other spider settings
        Spider.create(new MyPageProcessor())
                // Set the start URL
                .addUrl("http://example.com")
                // Add the custom CSV pipeline
                .addPipeline(new CsvPipeline("output.csv"))
                // Start crawling
                .run();
    }
}
In this example, the CsvPipeline constructor takes the file path as an argument, which is where the CSV data will be written. The process method is called for every ResultItems object that the spider processes. Inside process, we extract the desired data from ResultItems, format it as CSV, and write it to the specified file.
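The fields read inside process have to be put there by your page processor. The MyPageProcessor used when creating the spider is assumed to look roughly like this minimal sketch, where the XPath expressions are placeholders for your target site's markup:

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyPageProcessor implements PageProcessor {

    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // putField makes the values available to pipelines via ResultItems
        page.putField("title", page.getHtml().xpath("//h1/text()").toString());
        page.putField("price", page.getHtml().xpath("//span[@class='price']/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}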
Remember to handle exceptions properly and ensure that file I/O operations are done safely. The example above uses a try-with-resources statement to automatically close the FileWriter and PrintWriter after writing, which is good practice for avoiding resource leaks.
Keep in mind that in a real-world scenario, you might need to write a header row, escape special characters (commas, quotes, newlines) in field values, and synchronize writes if your spider runs with multiple threads. More robust CSV handling will also help accommodate varied data types and ensure proper escaping of CSV values, as shown in the sketch below.
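As one possible approach (the class and helper names here are illustrative, not part of WebMagic), you can double embedded quotes per RFC 4180 and declare process as synchronized so that rows written by concurrent spider threads don't interleave:

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class SafeCsvPipeline implements Pipeline {

    private final String filePath;

    public SafeCsvPipeline(String filePath) {
        this.filePath = filePath;
    }

    // synchronized serializes writes when the spider runs with multiple threads
    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        try (PrintWriter printWriter = new PrintWriter(new FileWriter(filePath, true))) {
            String title = resultItems.get("title");
            String price = resultItems.get("price");
            printWriter.println(escapeCsv(title) + "," + escapeCsv(price));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Wrap the value in quotes and double any embedded quotes (RFC 4180 style)
    private static String escapeCsv(String value) {
        return "\"" + (value == null ? "" : value.replace("\"", "\"\"")) + "\"";
    }
}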