
Scrapy

Production-grade web scraping framework in Python with built-in crawling, pipelines, and middleware

Quick Summary
You are an expert in Scrapy, the Python web scraping framework, for building production-grade crawlers and data extraction pipelines.

## Key Points

- **Obey `robots.txt`** — keep `ROBOTSTXT_OBEY = True` and respect crawl-delay directives.
- **Use `response.follow`** instead of manually constructing absolute URLs. It handles relative URLs and referer headers automatically.
- **Define Items** for structured data instead of yielding raw dicts. Items enable validation and cleaner pipeline processing.
- **Set `DOWNLOAD_DELAY`** to at least 1 second for polite crawling. Use `AUTOTHROTTLE_ENABLED = True` for adaptive rate limiting.
- **Use `errback` on requests** to handle failures gracefully: `yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)`.
- **Export data incrementally** using feed exports (`FEEDS` setting) rather than accumulating everything in memory.
- **Run spiders with `scrapy crawl`** for full middleware/pipeline support, not by calling `parse()` directly.
## Common Pitfalls

- **Not handling missing selectors.** Use `.get('')` to supply a string default rather than bare `.get()`, which returns `None`; with the default in place, chaining `.strip()` is safe.
- **Circular crawls without `allowed_domains`.** Without domain restrictions, a CrawlSpider can follow links to external sites indefinitely.
- **Blocking the Twisted reactor.** Never use `time.sleep()` or synchronous I/O in callbacks. Use Scrapy's async mechanisms or Deferred chains.
- **Pipeline ordering confusion.** Pipeline priority numbers control execution order (lower numbers run first). Ensure cleaning pipelines run before storage pipelines.
- **Memory issues on large crawls.** Enable `JOBDIR` for persistence and resumability. Monitor with `MEMUSAGE_ENABLED = True`.

## Quick Example

```bash
pip install scrapy
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
```

# Scrapy — Web Scraping

You are an expert in Scrapy, the Python web scraping framework, for building production-grade crawlers and data extraction pipelines.

## Core Philosophy

### Overview

Scrapy is a comprehensive Python framework for web crawling and scraping. It provides an asynchronous architecture built on Twisted, with built-in support for request scheduling, middleware, item pipelines, and export formats. Scrapy is designed for large-scale scraping projects where you need robust error handling, rate limiting, and data processing.

## Setup & Configuration

```bash
pip install scrapy
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
```

Project structure:

```text
myproject/
  scrapy.cfg
  myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py
      example.py
```

Key settings in `settings.py`:

```python
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1.0
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 800,
}
FEEDS = {
    'output.jsonl': {'format': 'jsonlines'},
}
```

## Core Patterns

### Basic spider

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for card in response.css('div.product-card'):
            yield {
                'name': card.css('h2.title::text').get('').strip(),
                'price': card.css('span.price::text').get('').strip(),
                'url': response.urljoin(card.css('a::attr(href)').get('')),
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
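
The `response.urljoin` and `response.follow` calls above resolve relative hrefs against the page's own URL. The resolution rules are essentially those of the stdlib's `urllib.parse.urljoin` (Scrapy additionally honors an HTML `<base>` tag when present), which makes it easy to check what a given href will turn into. The URLs below are illustrative:

```python
from urllib.parse import urljoin

# Base URL of the page being parsed (illustrative).
base = 'https://example.com/products?page=2'

# Path-relative href: resolved against the base path's directory.
assert urljoin(base, 'item/42') == 'https://example.com/item/42'

# Root-relative href: replaces the whole path.
assert urljoin(base, '/products/item/42') == 'https://example.com/products/item/42'

# Query-only href: keeps the path, swaps the query string.
assert urljoin(base, '?page=3') == 'https://example.com/products?page=3'

# Absolute href: used as-is, which is how a crawl can wander off-site.
assert urljoin(base, 'https://other.example/x') == 'https://other.example/x'
```

The last case is why `allowed_domains` matters: absolute links resolve to wherever they point, and only the offsite filter stops the crawl from following them.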

### Using XPath selectors

```python
def parse(self, response):
    for row in response.xpath('//table[@class="data"]//tr[position()>1]'):
        yield {
            'col1': row.xpath('td[1]/text()').get('').strip(),
            'col2': row.xpath('td[2]/text()').get('').strip(),
            'link': row.xpath('td[3]/a/@href').get(),
        }
```

### CrawlSpider with rules

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteCrawler(CrawlSpider):
    name = 'site_crawler'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'description': response.css('div.description::text').getall(),
        }
```

### Item and pipeline

```python
# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()


# pipelines.py
import sqlite3


class CleaningPipeline:
    def process_item(self, item, spider):
        if item.get('price'):
            price_str = item['price'].replace('$', '').replace(',', '')
            item['price'] = float(price_str)
        return item


class DatabasePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL, url TEXT)'
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO products VALUES (?, ?, ?)',
            (item['name'], item['price'], item['url']),
        )
        return item
```
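
The `CleaningPipeline` above calls `float()` on whatever survives the `replace()` chain, so a value like `'N/A'` raises `ValueError` inside the pipeline. A more defensive sketch (`parse_price` is a hypothetical helper, not part of Scrapy):

```python
def parse_price(raw):
    """Parse a price string like '$1,299.00' into a float, or None if malformed."""
    if not isinstance(raw, str):
        return None  # missing selector -> None, or a non-string sneaked in
    cleaned = raw.replace('$', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # 'N/A', 'Call for price', empty string, etc.
```

In `process_item` this becomes `item['price'] = parse_price(item.get('price'))`, and you can raise `scrapy.exceptions.DropItem` when the result is `None` if partial records are unwanted.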

### Handling authentication

```python
import scrapy


class AuthSpider(scrapy.Spider):
    name = 'auth_spider'
    login_url = 'https://example.com/login'

    def start_requests(self):
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        csrf_token = response.css('input[name="csrf"]::attr(value)').get()
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass', 'csrf': csrf_token},
            callback=self.after_login,
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            yield scrapy.Request('https://example.com/data', callback=self.parse_data)

    def parse_data(self, response):
        yield {'content': response.css('div.data::text').getall()}
```

## Best Practices

- **Obey `robots.txt`** — keep `ROBOTSTXT_OBEY = True` and respect crawl-delay directives.
- **Use `response.follow`** instead of manually constructing absolute URLs. It handles relative URLs and referer headers automatically.
- **Define Items** for structured data instead of yielding raw dicts. Items enable validation and cleaner pipeline processing.
- **Set `DOWNLOAD_DELAY`** to at least 1 second for polite crawling. Use `AUTOTHROTTLE_ENABLED = True` for adaptive rate limiting.
- **Use `errback` on requests** to handle failures gracefully: `yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)`.
- **Export data incrementally** using feed exports (the `FEEDS` setting) rather than accumulating everything in memory.
- **Run spiders with `scrapy crawl`** for full middleware/pipeline support, not by calling `parse()` directly.
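
The throttling bullet above maps to a handful of AutoThrottle settings in `settings.py`; a minimal sketch (the numbers are illustrative starting points, not canonical recommendations):

```python
# settings.py — AutoThrottle adapts the delay to observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote site
AUTOTHROTTLE_DEBUG = False             # set True to log every throttling decision
```

With AutoThrottle on, `DOWNLOAD_DELAY` acts as a floor rather than a fixed delay, so the two settings compose rather than conflict.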

## Common Pitfalls

- **Not handling missing selectors.** Use `.get('')` with a default rather than `.get()`, which returns `None`. Chain `.strip()` only after ensuring the value is a string.
- **Circular crawls without `allowed_domains`.** Without domain restrictions, a CrawlSpider can follow links to external sites indefinitely.
- **Blocking the Twisted reactor.** Never use `time.sleep()` or synchronous I/O in callbacks. Use Scrapy's async mechanisms or Deferred chains.
- **Duplicate requests not filtered.** Scrapy deduplicates requests by URL by default, but POST requests or URLs with different parameters may slip through. Use `dont_filter=True` only when intentional.
- **Pipeline ordering confusion.** Pipeline priority numbers control execution order (lower numbers run first). Ensure cleaning pipelines run before storage pipelines.
- **Memory issues on large crawls.** Enable `JOBDIR` for persistence and resumability. Monitor with `MEMUSAGE_ENABLED = True`.
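
For the last pitfall, persistence and memory monitoring are ordinary settings; a sketch for `settings.py` (the `JOBDIR` path and limit are illustrative, and the job directory can instead be passed per run with `scrapy crawl products -s JOBDIR=crawls/products-run1`):

```python
# settings.py — crawl persistence and memory monitoring.
JOBDIR = 'crawls/products-run1'  # scheduler queue + dupefilter state persisted here;
                                 # rerun with the same dir to resume an interrupted crawl
MEMUSAGE_ENABLED = True          # track the crawler process's memory usage
MEMUSAGE_LIMIT_MB = 2048         # shut the spider down cleanly past this limit
```

Resuming only works if each run of the same logical crawl reuses the same `JOBDIR`; starting a fresh crawl with an old job directory replays the old state instead.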

## Anti-Patterns

- **Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- **Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- **Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- **Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- **Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month or to you next year.
