Scrapy 2.0 documentation

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Getting help

Having trouble? We’d like to help!

Try the FAQ – it’s got answers to some common questions.
Looking for specific information? Try the Index or Module Index.
Ask or search questions on Stack Overflow using the scrapy tag.
Ask or search questions in the Scrapy subreddit.
Search for questions on the archives of the scrapy-users mailing list.
Ask a question in the #scrapy IRC channel.
Report bugs with Scrapy in our issue tracker.

First steps

Scrapy at a glance: Understand what Scrapy is and how it can help you.
Installation guide: Get Scrapy installed on your computer.
Scrapy Tutorial: Write your first Scrapy project.
Examples: Learn more by playing with a pre-made Scrapy project.

Basic concepts

Command line tool: Learn about the command-line tool used to manage your Scrapy project.
Spiders: Write the rules to crawl your websites.
Selectors: Extract the data from web pages using XPath or CSS expressions.
Scrapy shell: Test your extraction code in an interactive environment.
Items: Define the data you want to scrape.
Item Loaders: Populate your items with the extracted data.
Item Pipeline: Post-process and store your scraped data.
Feed exports: Output your scraped data using different formats and storages.
Requests and Responses: Understand the classes used to represent HTTP requests and responses.
Link Extractors: Convenient classes to extract links to follow from pages.
Settings: Learn how to configure Scrapy and see all available settings.
Exceptions: See all available exceptions and their meaning.

Built-in services

Logging: Learn how to use Python’s built-in logging in Scrapy.
Stats Collection: Collect statistics about your scraping crawler.
Sending e-mail: Send email notifications when certain events occur.
Telnet Console: Inspect a running crawler using a built-in Python console.
Web Service: Monitor and control a crawler using a web service.

Solving specific problems

Frequently Asked Questions: Get answers to the most frequently asked questions.
Debugging Spiders: Learn how to debug common problems of your Scrapy spider.
Spiders Contracts: Learn how to use contracts for testing your spiders.
Common Practices: Get familiar with some common Scrapy practices.
Broad Crawls: Tune Scrapy for crawling a lot of domains in parallel.
Using your browser’s Developer Tools for scraping: Learn how to scrape with your browser’s developer tools.
Selecting dynamically-loaded content: Read webpage data that is loaded dynamically.
Debugging memory leaks: Learn how to find and get rid of memory leaks in your crawler.
Downloading and processing files and images: Download files and/or images associated with your scraped items.
Deploying Spiders: Deploy your Scrapy spiders and run them on a remote server.
AutoThrottle extension: Adjust crawl rate dynamically based on load.
Benchmarking: Check how Scrapy performs on your hardware.
Jobs: pausing and resuming crawls: Learn how to pause and resume crawls for large spiders.
Coroutines: Use the coroutine syntax.
asyncio: Use asyncio and asyncio-powered libraries.

Extending Scrapy

Architecture overview: Understand the Scrapy architecture.
Downloader Middleware: Customize how pages get requested and downloaded.
Spider Middleware: Customize the input and output of your spiders.
Extensions: Extend Scrapy with your custom functionality.
Core API: Use it in extensions and middlewares to extend Scrapy functionality.
Signals: See all available signals and how to work with them.
Item Exporters: Quickly export your scraped items to a file (XML, CSV, etc.).

All the rest

Release notes: See what has changed in recent Scrapy versions.
Contributing to Scrapy: Learn how to contribute to the Scrapy project.
Versioning and API Stability: Understand Scrapy versioning and API stability.