To activate a middleware, add it to the settings. For example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
The key is the path of the middleware to activate, and the value is its order number. Scrapy itself ships with many built-in middlewares, so when activating a middleware you wrote yourself, look up the order numbers of the default middlewares in the documentation so that you can insert yours at the right position.
The default middlewares are as follows:
{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
The smaller a middleware's order number, the closer it is to the engine; the larger the number, the closer it is to the downloader.
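For example, to slot a custom middleware in just before the built-in UserAgentMiddleware (order 500) and disable the built-in one entirely, the settings could look like this (a sketch; the path myproject.middlewares.CustomUserAgentMiddleware is hypothetical):

# settings.py -- a sketch; the custom middleware path is hypothetical
DOWNLOADER_MIDDLEWARES = {
    # runs just before where the built-in UserAgentMiddleware (500) would run
    'myproject.middlewares.CustomUserAgentMiddleware': 499,
    # assigning None to a built-in middleware disables it
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}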
Each downloader middleware has at most four methods: process_request, process_response, process_exception, and from_crawler. A middleware we write must define at least one of them.
On the path where the engine sends a request to the downloader, every middleware is called in turn to process that request (in ascending order of the numbers).
On the path where the downloader sends a response back to the engine, every middleware is called in turn to process that response (in descending order of the numbers).
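As a rough sketch of this call order, a middleware like the following (hypothetical class name, logging only) sees each request on its way out and the matching response on its way back:

# middlewares.py -- minimal sketch of a downloader middleware (hypothetical)
import logging

logger = logging.getLogger(__name__)

class LoggingDownloaderMiddleware:
    def process_request(self, request, spider):
        # called on the engine -> downloader path (ascending order numbers)
        logger.debug('request out: %s', request.url)
        return None  # None means: continue with the next middleware

    def process_response(self, request, response, spider):
        # called on the downloader -> engine path (descending order numbers)
        logger.debug('response back: %s %s', response.status, response.url)
        return response  # must return a Response (or a Request, or raise IgnoreRequest)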
Below is an introduction to these four methods:
process_request(request, spider)
Parameters:
request (Request object) – the request being processed
spider (Spider object) – the spider for which this request is intended
process_request can return None, return a Response object, return a Request object, or raise IgnoreRequest.
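A hedged sketch of these four outcomes (the cache dictionary and the blocked-domain set are invented details for illustration):

# sketch: the four possible outcomes of process_request (names are assumptions)
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse

class RequestGateMiddleware:
    cached_pages = {}        # hypothetical local cache: url -> body bytes
    blocked_domains = set()  # hypothetical blocklist

    def process_request(self, request, spider):
        if any(d in request.url for d in self.blocked_domains):
            raise IgnoreRequest('blocked domain')  # drop the request
        if request.url in self.cached_pages:
            # returning a Response short-circuits the download entirely
            return HtmlResponse(url=request.url,
                                body=self.cached_pages[request.url],
                                encoding='utf-8')
        if request.url.startswith('http://'):
            # returning a Request reschedules it; the original is not downloaded
            return request.replace(url=request.url.replace('http://', 'https://', 1))
        return None  # continue to the next middleware and finally the downloader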
process_response(request, response, spider)
Parameters:
request (Request object) – the request that originated the response
response (Response object) – the response being processed
spider (Spider object) – the spider for which this response is intended
process_response can return a Response object, return a Request object, or raise an IgnoreRequest exception.
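A minimal sketch of these three outcomes (the retry-on-403 rule and the meta keys are an invented example policy):

# sketch: the three possible outcomes of process_response (invented policy)
from scrapy.exceptions import IgnoreRequest

class ResponseGateMiddleware:
    def process_response(self, request, response, spider):
        if response.status == 403 and not request.meta.get('retried_403'):
            # returning a Request re-schedules it instead of passing the response on
            return request.replace(meta={**request.meta, 'retried_403': True},
                                   dont_filter=True)
        if response.status == 404:
            raise IgnoreRequest('not found')  # hands control to the request errback
        return response  # normal case: pass the (possibly modified) response along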
process_exception(request, exception, spider)
Parameters:
request (Request object) – the request that generated the exception
exception (Exception object) – the raised exception
spider (Spider object) – the spider for which this request is intended
process_exception can return either None, a Response object, or a Request object.
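A sketch of these three outcomes (the proxy fallback and the meta keys are a hypothetical recovery strategy):

# sketch: the three possible outcomes of process_exception (hypothetical fallback)
from twisted.internet.error import TimeoutError
from scrapy.http import HtmlResponse

class ExceptionFallbackMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError) and not request.meta.get('via_proxy'):
            # returning a Request re-schedules it, here through a (hypothetical) proxy
            return request.replace(meta={**request.meta,
                                         'via_proxy': True,
                                         'proxy': 'http://127.0.0.1:8888'},
                                   dont_filter=True)
        if request.meta.get('allow_empty_page'):
            # returning a Response stops exception handling; process_response runs next
            return HtmlResponse(url=request.url, body=b'', encoding='utf-8')
        return None  # let the remaining middlewares (and finally the errback) handle it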
from_crawler(cls, crawler)
If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. The Crawler object provides access to all Scrapy core components such as settings and signals; it is a way for the middleware to access them and hook its functionality into Scrapy.
Parameters:
crawler (Crawler object) – crawler that uses this middleware
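A sketch of a typical from_crawler: read a setting (the MYMW_ENABLED name is hypothetical), bail out with NotConfigured when disabled, and hook a signal through the crawler:

# sketch: from_crawler wiring (the MYMW_ENABLED setting name is an assumption)
from scrapy import signals
from scrapy.exceptions import NotConfigured

class ConfiguredMiddleware:
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('MYMW_ENABLED', True):
            raise NotConfigured  # the middleware is then skipped entirely
        mw = cls(crawler.settings.get('USER_AGENT', 'Scrapy'))
        # hook into Scrapy signals via the crawler
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        spider.logger.info('%s enabled for %s', type(self).__name__, spider.name)

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.user_agent)
        return None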
The above covers the different return-value cases of the three processing methods of a downloader middleware (process_request, process_response, and process_exception).
Original post: https://www.cnblogs.com/--here--gold--you--want/p/12945125.html