Scrapy can be integrated into a Django project so that your spiders use Django's models and ORM (object-relational mapper) to store the data they scrape.
Here’s how to use Scrapy with Django:
- Set up a Django project
If you already have a Django project, you can skip this step. If not, create a new one with the following console command:
django-admin startproject myproject
Then navigate into the project folder with:
cd myproject
- Create a new Django app
You can create a new Django app using the following command:
python manage.py startapp myapp
- Create Django models
In your Django app's models.py file, create the models you wish to use. They define the structure of the data you'll scrape. Here's an example of a model for a blog post:
from django.db import models


class BlogPost(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    date_published = models.DateTimeField()
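The model can only be used once Django knows about the app. A minimal sketch of the relevant part of myproject/settings.py follows, assuming the default app list that startproject generates and the app name myapp from the previous step:

```python
# myproject/settings.py (excerpt)
# Assumption: the first six entries are the defaults generated by
# `django-admin startproject`; only the last entry is new.
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "myapp",  # registers the app so Django can find BlogPost
]
```

With the app registered, running python manage.py makemigrations myapp followed by python manage.py migrate should create the database table for BlogPost.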
- Install Scrapy
If you haven't installed Scrapy yet, you can do so with the following command:
pip install scrapy
- Create a Scrapy project
Use the Scrapy command line tool to create a new Scrapy project:
scrapy startproject myscrapyproject
- Integrate Django into Scrapy
Scrapy needs to know where your Django settings live before it can use your models. Set the DJANGO_SETTINGS_MODULE environment variable and initialise Django in the Scrapy settings file by adding the following lines to myscrapyproject/settings.py:
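import os
import sys

# Make the Django project importable from Scrapy.
DJANGO_PROJECT_PATH = '../../myproject/'
sys.path.insert(0, DJANGO_PROJECT_PATH)

# Tell Django which settings module to load.
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

# Initialise Django so that models can be imported in spiders.
import django
django.setup()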
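Replace '../../myproject/' with the relative path to your Django project and 'myproject.settings' with your Django settings module. Note that a relative path is resolved against the directory you run Scrapy from, not against the location of settings.py.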
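Because a relative path depends on the current working directory, one option is to locate the Django project from the settings file itself. The following is a minimal sketch, not part of Scrapy or Django: the helper name find_django_root and its marker parameter are hypothetical, and it assumes the Django project folder is an ancestor of the Scrapy settings file and contains manage.py.

```python
from pathlib import Path


def find_django_root(start, marker="manage.py"):
    """Walk upward from `start` and return the first directory containing `marker`.

    Hypothetical helper: assumes the Django project folder (the one holding
    manage.py) is an ancestor of the given path.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


# Possible use inside myscrapyproject/settings.py:
#     sys.path.insert(0, str(find_django_root(__file__)))
```

This keeps the path independent of where the scrapy command is run. If the two projects live side by side rather than nested, an absolute path or an environment variable may be a better fit.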
- Create a Scrapy spider
Now you can create a Scrapy spider and use your Django models inside it. Here's an example of a spider that uses the BlogPost model:
import scrapy

# The Django project folder is on sys.path, so the app is imported by its own name.
from myapp.models import BlogPost


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://myblog.com']

    def parse(self, response):
        # Each div.post element is treated as one blog post.
        for post in response.css('div.post'):
            title = post.css('h2 a::text').get()
            content = post.css('div.content::text').get()
            # Raw text; it must be in a format DateTimeField can parse.
            date_published = post.css('div.date::text').get()
            BlogPost.objects.create(
                title=title,
                content=content,
                date_published=date_published,
            )
In this example, the spider scrapes blog posts from a web page and creates a BlogPost instance in the Django database for each one. The objects.create() method builds and saves a new object (a database record) in a single call.
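The date scraped from the page is plain text, while date_published is a DateTimeField, so it usually needs converting first. Here is a minimal sketch of such a conversion. The helper name parse_published and the list of date formats are assumptions for illustration; the real formats depend on the site being scraped, and treating the dates as UTC is also an assumption.

```python
from datetime import datetime, timezone

# Assumed formats; adjust to whatever the target site actually uses.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %b %Y")


def parse_published(raw):
    """Convert scraped date text to a timezone-aware datetime, or None."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            # Assumption: dates on the page are interpreted as UTC.
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # try the next format
    return None
```

In the spider, the result of parse_published(...) could be passed as date_published, and posts for which it returns None could be skipped or logged.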
- Run the Scrapy spider
Finally, run your spider from the Scrapy project directory (the one containing scrapy.cfg) with the following command:
scrapy crawl blogspider
In a real-world project, always validate and clean the data before saving it to the database, and respect the website's robots.txt rules while scraping (Scrapy's ROBOTSTXT_OBEY setting controls this).
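As an illustration of the cleaning step, here is a minimal sketch that normalises the text fields before they reach the database. The function name clean_post and its rules (collapse whitespace, skip empty posts, truncate long titles) are assumptions rather than requirements of Scrapy or Django; the length limit mirrors max_length=200 on the title field.

```python
MAX_TITLE_LENGTH = 200  # mirrors max_length on BlogPost.title


def clean_post(title, content):
    """Return a dict of cleaned fields, or None if the post should be skipped."""
    # Collapse runs of whitespace in the title and trim the content.
    title = " ".join((title or "").split())
    content = (content or "").strip()
    if not title or not content:
        return None  # missing data: do not save this post
    if len(title) > MAX_TITLE_LENGTH:
        title = title[:MAX_TITLE_LENGTH]
    return {"title": title, "content": content}
```

A spider could call clean_post(title, content) and only create a BlogPost when the result is not None. In larger projects this kind of logic is often moved into a Scrapy item pipeline so that spiders stay focused on extraction.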