Scrapy 2.0 documentation¶
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Getting help¶
Having trouble? We’d like to help!
- Try the FAQ – it’s got answers to some common questions.
- Looking for specific information? Try the Index or Module Index.
- Ask or search questions on Stack Overflow using the scrapy tag.
- Ask or search questions in the Scrapy subreddit.
- Search for questions on the archives of the scrapy-users mailing list.
- Ask a question in the #scrapy IRC channel.
- Report bugs with Scrapy in our issue tracker.
Basic concepts¶
- Command line tool
- Learn about the command-line tool used to manage your Scrapy project.
- Spiders
- Write the rules to crawl your websites.
- Selectors
- Extract the data from web pages using XPath.
- Scrapy shell
- Test your extraction code in an interactive environment.
- Items
- Define the data you want to scrape.
- Item Loaders
- Populate your items with the extracted data.
- Item Pipeline
- Post-process and store your scraped data.
- Feed exports
- Output your scraped data using different formats and storages.
- Requests and Responses
- Understand the classes used to represent HTTP requests and responses.
- Link Extractors
- Convenient classes to extract links to follow from pages.
- Settings
- Learn how to configure Scrapy and see all available settings.
- Exceptions
- See all available exceptions and their meaning.
Built-in services¶
- Logging
- Learn how to use Python’s built-in logging in Scrapy.
- Stats Collection
- Collect statistics about your crawler.
- Sending e-mail
- Send email notifications when certain events occur.
- Telnet Console
- Inspect a running crawler using a built-in Python console.
- Web Service
- Monitor and control a crawler using a web service.
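The Item Pipeline entry above can be sketched as a plain class with a `process_item` method. The pipeline name, the item field, and the price format here are hypothetical examples:

```python
class PriceToFloatPipeline:
    """Hypothetical post-processing step: normalize a price string
    such as "$19.99" into a float before the item is stored."""

    def process_item(self, item, spider):
        # A pipeline receives every scraped item and must return it
        # (or raise scrapy.exceptions.DropItem to discard it).
        item["price"] = float(item["price"].lstrip("$"))
        return item
```

A pipeline is enabled per project through the `ITEM_PIPELINES` setting, e.g. `ITEM_PIPELINES = {"myproject.pipelines.PriceToFloatPipeline": 300}`, where the number sets the order in which pipelines run.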
Solving specific problems¶
- Frequently Asked Questions
- Get answers to the most frequently asked questions.
- Debugging Spiders
- Learn how to debug common problems of your Scrapy spider.
- Spiders Contracts
- Learn how to use contracts for testing your spiders.
- Common Practices
- Get familiar with some Scrapy common practices.
- Broad Crawls
- Tune Scrapy for crawling a lot of domains in parallel.
- Using your browser’s Developer Tools for scraping
- Learn how to scrape with your browser’s developer tools.
- Selecting dynamically-loaded content
- Read webpage data that is loaded dynamically.
- Debugging memory leaks
- Learn how to find and get rid of memory leaks in your crawler.
- Downloading and processing files and images
- Download files and/or images associated with your scraped items.
- Deploying Spiders
- Deploy your Scrapy spiders and run them on a remote server.
- AutoThrottle extension
- Adjust crawl rate dynamically based on load.
- Benchmarking
- Check how Scrapy performs on your hardware.
- Jobs: pausing and resuming crawls
- Learn how to pause and resume crawls for large spiders.
- Coroutines
- Use the coroutine (async def) syntax in your spiders.
Extending Scrapy¶
- Architecture overview
- Understand the Scrapy architecture.
- Downloader Middleware
- Customize how pages get requested and downloaded.
- Spider Middleware
- Customize the input and output of your spiders.
- Extensions
- Extend Scrapy with your custom functionality.
- Core API
- Use it in extensions and middlewares to extend Scrapy functionality.
- Signals
- See all available signals and how to work with them.
- Item Exporters
- Quickly export your scraped items to a file (XML, CSV, etc.).