The Scrapy Framework

Date: 2020-07-10 00:21:06

Creating a Scrapy project

Use the command:

scrapy startproject [project name]

Creating a basic spider file

scrapy genspider [spider name] [allowed domain]  # uses the basic template by default

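Creating a crawl spider file

scrapy genspider -t crawl [spider name] [allowed domain]  # uses the crawl template

Data cleaning

xpath

Call response.xpath() to query the response directly. It returns a SelectorList of Selector objects rather than plain strings; call get() or getall() on the result to extract the text.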

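Defining an Item

An Item is a container for scraped data; it is used much like a Python dict. You can also use a plain dict in Scrapy, but an Item adds protection against undefined-field errors caused by typos.

As with an ORM, you define an Item by creating a scrapy.Item subclass and declaring class attributes of type scrapy.Field. (If you are not familiar with ORMs, don't worry; this step is very simple.)

First, model the item on the data to be scraped from dmoz.org. We need the name, URL, and description of each site, so define the corresponding fields in the item. Edit the items.py file in the tutorial directory: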

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

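Data storage

In pipelines.py

Define three methods (plus an optional __init__):

class TestPipeline(object):
    # Initialization
    def __init__(self):
        pass

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Called for every item; do the storage work here
    def process_item(self, item, spider):
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        pass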
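
To make the call order concrete, the sketch below drives a pipeline by hand in the order Scrapy would: open_spider once, process_item for each item, close_spider once. CleanupPipeline and the sample items are hypothetical; plain dicts stand in for items and None stands in for the spider. In a real project Scrapy makes these calls itself, and the pipeline also has to be enabled through the ITEM_PIPELINES setting in settings.py.

```python
class CleanupPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: set up state or resources
        self.seen = 0
        self.closed = False

    def process_item(self, item, spider):
        # Called for every scraped item: clean it, then pass it on
        item["title"] = item["title"].strip()
        self.seen += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources
        self.closed = True


pipeline = CleanupPipeline()
pipeline.open_spider(spider=None)
cleaned = [
    pipeline.process_item(item, spider=None)
    for item in ({"title": "  Site A "}, {"title": "Site B\n"})
]
pipeline.close_spider(spider=None)
```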

Saving as JSON

JsonItemExporter and JsonLinesItemExporter make this simpler.

First approach (a single JSON list):

from scrapy.exporters import JsonItemExporter

class QsbkPipeline(object):
    def __init__(self):
        # Exporters write bytes, so open the file in binary mode
        self.file = open('FileName.json', 'wb')
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False, encoding="utf-8")
        # Start the JSON export
        self.exporter.start_exporting()

    # Called when the spider is opened
    def open_spider(self, spider):
        print('spider opened')

    # Called for every item; do the storage work here
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        print('spider closed')
        # Finish the JSON export, then close the file
        self.exporter.finish_exporting()
        self.file.close()

Second approach (one JSON object per line):

from scrapy.exporters import JsonLinesItemExporter

class QsbkPipeline(object):
    def __init__(self):
        # Exporters write bytes, so open the file in binary mode
        self.file = open('FileName.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False, encoding="utf-8")

    # Called when the spider is opened
    def open_spider(self, spider):
        print('spider opened')

    # Called for every item; do the storage work here
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        print('spider closed')
        self.file.close()

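Third approach (writing each line by hand with json.dumps, without an exporter):

import json

class QsbkPipeline(object):
    def __init__(self):
        self.file = open("FileName.json", 'w', encoding="utf-8")

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # json.dumps cannot serialize a scrapy.Item directly, so convert it to a dict first
        content = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(content + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()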

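CrawlSpider

Create the spider with the command:

scrapy genspider -t crawl [spider name] [domain]

LinkExtractors (link extractors)

With a LinkExtractor you do not have to extract the URLs you need and send the requests yourself. That work is handed to the LinkExtractor, which finds every URL matching its rules on each crawled page so that the crawl proceeds automatically. A brief introduction to the LinkExtractor class follows: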

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),
    canonicalize = False,
    unique = True,
    process_value = None
)

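Main parameters:

allow: allowed URLs. Every URL that matches this regular expression (or list of regular expressions) is extracted.
deny: forbidden URLs. Any URL that matches this regular expression is not extracted.
allow_domains: allowed domains. Only URLs on the domains listed here are extracted.
deny_domains: forbidden domains. URLs on the domains listed here are never extracted.
restrict_xpaths: XPath expressions that limit the regions of the page in which links are searched for; they filter links together with allow.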

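The Rule class

Rule defines the crawling rules of the spider. A brief introduction to the class follows:

class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)

Main parameters:

link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback to run for URLs that match this rule. CrawlSpider uses parse internally, so do not override parse or use it as your own callback.
follow: whether links extracted from the response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links: the links obtained from link_extractor are passed to this function, which can filter out links that should not be crawled.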

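Scrapy Shell

The Scrapy shell is a convenient place to test data-extraction code.
To run any scrapy command, you must first activate the environment in which Scrapy is installed.
To have the shell read a project's settings, change into that project's directory first, then run scrapy shell [target url].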

Original: https://www.cnblogs.com/RashoMon/p/13276742.html
