How do I use Scrapy with Django?

Using Scrapy with Django means loading your Django project's settings inside a Scrapy project. Once that is done, your Scrapy spiders can use Django's models and ORM (Object-Relational Mapping) to store the data they scrape.

Here’s how to use Scrapy with Django:

Set up a Django project

If you already have a Django project, you can skip this step. If not, create a new one with the following console command:

   django-admin startproject myproject

Then navigate into the project folder with:

   cd myproject

Create a new Django app

You can create a new Django app using the following command:

   python manage.py startapp myapp

Create Django models

In your Django app's models.py, define the models that will hold the scraped data. Here's an example of a model for a blog post:

   from django.db import models

   class BlogPost(models.Model):
       title = models.CharField(max_length=200)
       content = models.TextField()
       date_published = models.DateTimeField()

Then add 'myapp' to INSTALLED_APPS in myproject/settings.py and run python manage.py makemigrations followed by python manage.py migrate, so that the table for BlogPost exists before the spider writes to it.

Install Scrapy

If you haven't installed Scrapy yet, you can do so with the following command:

   pip install scrapy

Create a Scrapy project

Use the Scrapy command line tool to create a new Scrapy project:

   scrapy startproject myscrapyproject

Integrate Django into Scrapy

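Scrapy needs to know where your Django project lives and which settings module to load. Add the following lines to the Scrapy settings file (settings.py inside the myscrapyproject package):

   import os
   import sys

   import django

   # Folder that contains manage.py. A relative path is resolved against the
   # directory you run scrapy from, not against this settings file.
   DJANGO_PROJECT_PATH = '../../myproject/'
   sys.path.insert(0, DJANGO_PROJECT_PATH)

   # Point Django at its settings module, then load the app registry so that
   # models can be imported
   os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'
   django.setup()

Replace '../../myproject/' with the path to the folder that contains manage.py, and 'myproject.settings' with your Django settings module. Because a relative path is resolved against the current working directory, an absolute path is more robust.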
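
One way to avoid depending on the working directory is to resolve the path against the settings file itself. The helper below is a minimal sketch using only the standard library. The name add_django_project_to_path is hypothetical, and the relative path you pass in is an assumption about your own folder layout.

```python
import sys
from pathlib import Path


def add_django_project_to_path(settings_file, relative_path):
    """Resolve relative_path against the folder holding settings_file and
    put the result at the front of sys.path (hypothetical helper)."""
    settings_dir = Path(settings_file).resolve().parent
    project_dir = (settings_dir / relative_path).resolve()
    # Avoid adding the same entry twice if the settings are imported again
    if str(project_dir) in sys.path:
        sys.path.remove(str(project_dir))
    sys.path.insert(0, str(project_dir))
    return project_dir
```

In the Scrapy settings file the call might look like add_django_project_to_path(__file__, '../..'), where '../..' stands for wherever manage.py sits relative to that file in your project.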

Create a Scrapy spider

Now you can create a Scrapy spider and use your Django models inside it. Here's an example of a spider that uses the BlogPost model:

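   import scrapy

   # myapp sits next to manage.py, so it is imported as a top-level package
   from myapp.models import BlogPost


   class BlogSpider(scrapy.Spider):
       name = 'blogspider'
       start_urls = ['http://myblog.com']

       def parse(self, response):
           # Each div.post element is assumed to hold one blog post
           for post in response.css('div.post'):
               # .get() returns None when the selector matches nothing
               title = post.css('h2 a::text').get()
               content = post.css('div.content::text').get()
               date_published = post.css('div.date::text').get()

               # Create and save one database row per scraped post
               BlogPost.objects.create(
                   title=title,
                   content=content,
                   date_published=date_published,
               )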

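In this example, the spider scrapes blog posts from a webpage and creates a BlogPost record in the Django database for each one. The objects.create() method builds a model instance and saves it in a single step. Note that date_published is scraped as a string: DateTimeField only accepts strings in a format Django can parse (such as ISO 8601), so other formats must be converted to a datetime first.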
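
Scraped dates rarely arrive in one predictable format, so it can help to normalise them before they reach the model. The function below is a minimal sketch. The name parse_published_date, the list of formats, and the choice of UTC are all assumptions to adapt to the site you scrape.

```python
from datetime import datetime, timezone

# Assumed formats; replace them with the ones your target site actually uses
DATE_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d", "%d/%m/%Y")


def parse_published_date(raw):
    """Return a timezone-aware datetime (UTC assumed), or None if the text
    is missing or matches none of DATE_FORMATS."""
    if not raw:
        return None
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        return parsed.replace(tzinfo=timezone.utc)
    return None
```

In the spider, the result of parse_published_date(date_published) could be passed to BlogPost.objects.create(), with posts skipped when it returns None.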

Run the Scrapy spider

Finally, you can run your Scrapy spider with the following command:

   scrapy crawl blogspider

Remember to always validate and clean the data before saving it into the database in a real-world project, and respect the website's robots.txt rules while scraping.
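
As a rough illustration of that validation step, the sketch below tidies the scraped text and rejects incomplete posts. The function name clean_post is hypothetical, and the 200-character limit is an assumption taken from the max_length of the title field in the example model.

```python
def clean_post(title, content, max_title_length=200):
    """Return a dict of cleaned fields, or None if a required field is empty."""
    # Collapse runs of whitespace (including newlines) in the title
    title = " ".join((title or "").split())
    content = (content or "").strip()
    if not title or not content:
        return None
    # Truncate so the title fits the CharField's max_length
    return {"title": title[:max_title_length], "content": content}
```

A spider could call clean_post(title, content) and save the post only when the result is not None.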
