crawl spider

Date: 2018-06-18 18:06:57

crawlspider

Usage
scrapy genspider -t crawl <spider name> <domain or URL>

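What is CrawlSpider?
It is also a spider: a subclass of Spider, so it can do everything Spider does and more.
The extra feature is link extraction: it pulls out the links that match a given set of rules.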

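Link extractor
LinkExtractor(
allow=xxx, # regular expression; links that match are extracted (*)
deny=xxx, # regular expression; links that match are skipped
restrict_xpaths=xxx, # XPath; only links inside the selected regions are extracted (*)
restrict_css=xxx, # CSS selector; only links inside the selected regions are extracted (*)
deny_domains=xxx, # domains whose links are never extracted
)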

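Extracting links with a regular expression
links = LinkExtractor(allow=r'/movie/\?page=\d')
Every href whose URL matches this regular expression is collected and returned.
Call links.extract_links(response) to inspect the extracted links.
[Note] Duplicate URLs are removed.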
Extracting links with XPath
links = LinkExtractor(restrict_xpaths='//ul[@class="pagination pagination-sm"]/li/a')
Extracting links with CSS
links = LinkExtractor(restrict_css='.pagination > li > a')

Original post: https://www.cnblogs.com/airapple/p/9195467.html
