Python Web Scraping with the Scrapy Framework

Date: 2021-06-17 23:15:08

Scrapy commands

Common Scrapy commands

1. Create a project:

scrapy startproject <project name>

2. Create a spider:

cd <project name>
scrapy genspider <spider name> <allowed domain>

3. Run a spider:

scrapy crawl <spider name>

Common settings.py options

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof a browser User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

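What Scrapy is

Scrapy is an application framework written for crawling websites and extracting structured data.

How Scrapy works

Flow:

1. The spider's start URLs are built into request objects --> spider middleware --> engine --> scheduler
2. The scheduler hands a request back --> engine --> downloader middleware --> downloader
3. The downloader sends the request and gets a response --> downloader middleware --> engine --> spider middleware --> spider
4. The spider extracts URLs and builds new request objects --> spider middleware --> engine --> scheduler, then step 2 repeats
5. The spider extracts data --> engine --> pipeline, which processes and saves it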
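
The loop described by these steps can be sketched in plain Python. This is only an illustration under loud assumptions: `FAKE_WEB`, `downloader`, `spider_parse` and `run_engine` are hypothetical stand-ins invented here, not Scrapy APIs, and the middlewares are left out.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> what the page contains.
FAKE_WEB = {
    "http://example.test/page1": {"items": ["a", "b"], "links": ["http://example.test/page2"]},
    "http://example.test/page2": {"items": ["c"], "links": []},
}


def downloader(url):
    """Pretend to fetch a URL and return its 'response'."""
    return FAKE_WEB[url]


def spider_parse(response):
    """Yield extracted data first, then follow-up requests."""
    for value in response["items"]:
        yield {"type": "item", "value": value}
    for link in response["links"]:
        yield {"type": "request", "url": link}


def run_engine(start_urls):
    scheduler = deque(start_urls)   # step 1: start requests enter the scheduler
    seen = set(start_urls)
    stored = []                     # what the pipeline "saved"
    while scheduler:
        url = scheduler.popleft()   # step 2: the scheduler hands over a request
        response = downloader(url)  # step 3: the downloader returns a response
        for result in spider_parse(response):
            if result["type"] == "request" and result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])  # step 4: new requests are queued
            elif result["type"] == "item":
                stored.append(result["value"])   # step 5: items reach the pipeline
    return stored


print(run_engine(["http://example.test/page1"]))
```

The point of the sketch is that the spider never talks to the downloader or the pipeline directly: everything it yields goes back to the loop in the middle, which plays the engine's role.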

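Notes:

The Chinese labels in the diagram were added afterwards to make it easier to follow
The green lines in the diagram show how data is passed along
Note where the middlewares sit in the diagram; their position determines what they can do
Note where the engine sits: all modules are independent of one another and interact only with the engine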

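What each Scrapy module does

The role of each module in Scrapy:

Engine: passes data and signals between the different modules
Scheduler: implements a queue that holds the request objects sent by the engine
Downloader: sends the requests passed on by the engine, gets the responses and hands them back to the engine
Spider: handles the responses passed on by the engine, extracts data and URLs, and hands them back to the engine
Pipeline: processes the data passed on by the engine, for example by storing it
Downloader middleware: a customizable download extension, for example for setting a proxy IP
Spider middleware: customizes requests and filters responses; its role overlaps with that of the downloader middleware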

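Structure of a Scrapy project

Three built-in objects

request: the request object

response: the response object

item: the data object

Five components

spider: the spider module

pipeline: the item pipeline

scheduler: the scheduler

downloader: the downloader

engine: the engine

Two middlewares

process_request(self, request, spider)

process_response(self, request, response, spider)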
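
As a rough sketch of how these two hooks might be used in a downloader middleware, the class below sets a proxy and a User-Agent on the way out and records non-200 responses on the way back. The class name, the proxy address and the `failed_urls` attribute are assumptions made up for illustration; only the two method signatures and the `request.meta['proxy']` key come from Scrapy itself. The request, response and spider objects are simple stand-ins so the block runs without Scrapy installed.

```python
from types import SimpleNamespace


class ProxyAndUserAgentMiddleware:
    """Hypothetical downloader middleware; the values below are placeholders."""

    PROXY_URL = "http://127.0.0.1:8888"  # made-up proxy address
    USER_AGENT = "Mozilla/5.0"

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY_URL
        request.headers["User-Agent"] = self.USER_AGENT
        return None  # None means: carry on processing this request

    def process_response(self, request, response, spider):
        if response.status != 200:
            spider.failed_urls.append(request.url)  # hypothetical attribute
        return response  # hand the response on towards the spider


# Stand-in objects so the sketch can be exercised by hand.
request = SimpleNamespace(url="http://example.test/", meta={}, headers={})
response = SimpleNamespace(status=503)
spider = SimpleNamespace(name="demo", failed_urls=[])

middleware = ProxyAndUserAgentMiddleware()
middleware.process_request(request, spider)
middleware.process_response(request, response, spider)
print(request.meta, spider.failed_urls)
```

In a real project a class like this would also have to be listed in the DOWNLOADER_MIDDLEWARES setting before Scrapy calls it.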

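Scrapy project development workflow

Create the project

scrapy startproject <project name>

Example: scrapy startproject mySpider

Create the spider

cd <project name>

scrapy genspider <spider name> <allowed domain>

Example:

cd mySpider

scrapy genspider itcast itcast.cn

Data modeling

Middleware

Spider file (itcast.py)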

import scrapy

class ItcastSpider(scrapy.Spider):  # inherits from scrapy.Spider
    # spider name
    name = 'itcast'
    # domains the spider is allowed to crawl
    allowed_domains = ['itcast.cn']
    # URLs to start crawling from
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data-extraction method; receives the response that the engine passes to the spider
    def parse(self, response):
        # Scrapy response objects support XPath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # to get the actual text, group by <li> first
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # one dict per teacher
            item = {}
            # locate elements with Scrapy's XPath selectors, then call extract() or extract_first() to get the result
            item['name'] = li.xpath('.//h3/text()').extract_first()  # teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # teacher's level
            item['text'] = li.xpath('.//p/text()').extract_first()  # teacher's introduction
            print(item)

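Notes:

The parts you need to change are allowed_domains, start_urls and parse()

Locating elements and extracting data or attribute values:

response.xpath() returns a list-like object containing Selector objects; it behaves like a list but has some extra methods
Extra method extract(): returns a list of strings
Extra method extract_first(): returns the first string in the list, or None if the list is empty

Common attributes of the response object

response.url: URL of the current response
response.request.url: URL of the request that produced this response
response.headers: response headers
response.request.headers: headers of the request that produced this response
response.body: response body, i.e. the HTML, as bytes
response.status: HTTP status code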

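Saving data

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

The key is the pipeline class to use, written as a dotted path: the first part is the project package, the second the file (module), and the third the pipeline class defined in it.

The value sets the order in which pipelines run: the smaller the number, the earlier the pipeline runs. Values are usually kept below 1000.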
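
To make the key and value conventions concrete, here is a small sketch that sorts such a dictionary by value and splits each key into module path and class name. The two pipeline names and the `execution_order` helper are hypothetical, and this is not how Scrapy itself loads pipelines; it only mirrors the ordering rule stated above.

```python
# Hypothetical pipeline names, in the same shape as the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    "myspider.pipelines.SavePipeline": 400,
    "myspider.pipelines.CleanPipeline": 300,
}


def execution_order(pipelines):
    """Return (module_path, class_name) pairs, smallest value first."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [tuple(path.rsplit(".", 1)) for path, _ in ordered]


for module_path, class_name in execution_order(ITEM_PIPELINES):
    print(module_path, class_name)
```

Here CleanPipeline (300) is listed before SavePipeline (400), even though it appears second in the dictionary.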

Running Scrapy

From the project directory, run:

scrapy crawl <spider name>

Example: scrapy crawl itcast

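Using Scrapy

user-agent

In settings.py, change or add:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof a browser User-Agent

cookie

Fixed cookies: suitable for sites whose cookies stay valid for a long time (common on less rigorous sites) and where the amount of data is small enough to fetch everything before the cookie expires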

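Method 1: override Scrapy's start_requests method and return a request carrying the cookies parameter to the engine

In the spider file:

def start_requests(self):  # override start_requests
    # cookies_str is captured from the browser's network traffic
    cookies_str = '...'
    # convert cookies_str into cookies_dict; split on the first '=' only, since a value may itself contain '='
    cookies_dict = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies_str.split('; ')}
    yield scrapy.Request(  # hand the cookie-carrying request to the engine
        self.start_urls[0],
        callback=self.parse,
        cookies=cookies_dict
    )

Note:

In Scrapy, cookies cannot be placed in headers; Request has a dedicated cookies parameter that accepts cookies as a dict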
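
Turning a copied Cookie header into the dict that the cookies parameter expects is easy to get subtly wrong, because a cookie value may itself contain '='. The helper below is a minimal sketch of one way to do it; the function name and the sample cookie string are made up for illustration.

```python
def cookie_header_to_dict(cookie_header):
    """Split a raw Cookie header string into a name -> value dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        # partition() splits on the first '=' only, so '=' inside a value survives
        name, separator, value = pair.strip().partition("=")
        if separator:  # skip fragments that contain no '=' at all
            cookies[name] = value
    return cookies


raw = "sessionid=abc123; token=a=b==; theme=dark"  # made-up cookie string
print(cookie_header_to_dict(raw))
```

The resulting dict can be passed as cookies=... when building a scrapy.Request.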

Method 2: send a POST request with scrapy.FormRequest(); suitable for sites whose cookies change frequently

import scrapy

class Login2Spider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        # build the POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,  # extracted above; the form needs it
                "utf8": utf8,
                "commit": commit,
                "login": "username",
                "password": "***"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        print(response.body)

Note:

Setting COOKIES_DEBUG = True in settings.py prints the cookies being passed along in the terminal

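IP

Pagination requests

Data modeling (items)

Define the fields to extract in items.py:

class MyspiderItem(scrapy.Item):
    name = scrapy.Field()   # lecturer's name
    title = scrapy.Field()  # lecturer's title
    desc = scrapy.Field()   # lecturer's introduction

Import and instantiate it in the spider file; after that it is used the same way as a dict

itcast.py:

from myspider.items import MyspiderItem   # import the Item; mind the path
...
    def parse(self, response):
        # node below is one selector from the loop over the teacher list, which is omitted here
        item = MyspiderItem()  # once instantiated it can be used directly

        item['name'] = node.xpath('./h3/text()').extract_first()
        item['title'] = node.xpath('./h4/text()').extract_first()
        item['desc'] = node.xpath('./p/text()').extract_first()

        print(item)

In the line from myspider.items import MyspiderItem, make sure the import path of the item is correct, and ignore the error PyCharm flags on it

Rule of thumb for Python import paths: import relative to wherever the program is run from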

Saving/cleaning data (pipelines)

Pipelines clean and save data; several pipelines can be defined, each with a different job

Saving data

#### One spider

#### Multiple spiders

import json

from itemadapter import ItemAdapter
from pymongo import MongoClient

class ItcastspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itcast':
            self.file = open('./itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itcast':
            self.file.close()

class ItcspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itc':
            self.file = open('./itc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itc':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itc':
            self.file.close()

class itMongoPipeline(object):
    def open_spider(self, spider):
        if spider.name == 'itcast':
            con = MongoClient()
            self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # dict() just copies the item if an earlier pipeline already converted it to a dict;
            # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
            self.collection.insert_one(dict(item))
        return item
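
When several spiders need the same treatment, the per-spider classes can be folded into one class that derives the file name from spider.name. The following is a sketch of that idea, not the author's code: the class name, the output_dir argument and the JSON Lines layout (one JSON object per line, no trailing comma) are assumptions, and the spider is a stand-in object so the block runs without Scrapy.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class PerSpiderJsonLinesPipeline:
    """Hypothetical pipeline: writes <spider.name>.jsonl, one JSON object per line."""

    def __init__(self, output_dir="."):
        self.output_dir = output_dir
        self.file = None

    def open_spider(self, spider):
        path = os.path.join(self.output_dir, f"{spider.name}.jsonl")
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with a stand-in spider object.
out_dir = tempfile.mkdtemp()
spider = SimpleNamespace(name="itcast")
pipeline = PerSpiderJsonLinesPipeline(out_dir)
pipeline.open_spider(spider)
pipeline.process_item({"name": "Alice", "level": "lecturer"}, spider)
pipeline.close_spider(spider)

with open(os.path.join(out_dir, "itcast.jsonl"), encoding="utf-8") as handle:
    saved = [json.loads(line) for line in handle]
print(saved)
```

Writing one object per line keeps every line independently parseable, whereas appending ',\n' after each object produces a file that is not valid JSON as a whole.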

Enabling the pipelines:

Enable the pipelines in settings.py

......
ITEM_PIPELINES = {
   'itcastspider.pipelines.ItcastspiderPipeline': 300,  # 300 is the priority; the smaller the value, the earlier the pipeline runs
   'itcastspider.pipelines.ItcspiderPipeline': 301,
   'itcastspider.pipelines.itMongoPipeline': 400,
}
......

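Points to note

Pipelines must be enabled in settings before they are used.
In the ITEM_PIPELINES setting, the key gives the pipeline's location (so a pipeline can live anywhere in the project) and the value expresses its distance from the engine; data passes through the nearest pipeline first, i.e. pipelines with smaller values run first
Different pipelines can handle data from different spiders, told apart by the spider.name attribute
Different pipelines can apply different processing to one or more spiders, for example one cleans the data and another saves it
A single pipeline class can also handle data from different spiders, again told apart by spider.name
With multiple pipelines, process_item must return item; otherwise the next pipeline receives None
A pipeline must define process_item; without it the item cannot be received or processed
process_item(self, item, spider): processes the item; it receives the item and the spider, where spider is the spider that sent the item
If an earlier pipeline has already converted the item to a dict, it does not need converting again. Whether another pipeline ran first depends on the pipeline settings: the smaller the value, the earlier it runs.
open_spider(self, spider): runs once when the spider is opened
close_spider(self, spider): runs once when the spider is closed
These two methods are often used for database interaction: open the connection when the spider starts and close it when the spider finishes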
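
The ordering and return item rules above can be tried out with a toy driver. Everything in this sketch is a hypothetical stand-in (the three pipeline classes and run_chain are invented here and are not Scrapy's implementation); it only mimics the behaviour the notes describe.

```python
from types import SimpleNamespace


class StripWhitespacePipeline:
    """Cleaning step: returns a cleaned copy of the item."""

    def process_item(self, item, spider):
        return {key: value.strip() for key, value in item.items()}


class ForgetfulPipeline:
    """Buggy step: modifies the item but forgets to return it."""

    def process_item(self, item, spider):
        item["seen"] = "yes"


class CollectPipeline:
    """Saving step: remembers whatever reaches it."""

    def open_spider(self, spider):
        self.collected = []

    def process_item(self, item, spider):
        self.collected.append(item)
        return item


def run_chain(pipelines_with_priority, items, spider):
    """Toy stand-in for the item flow: smaller priority values run first."""
    chain = [p for p, _ in sorted(pipelines_with_priority, key=lambda pair: pair[1])]
    for pipeline in chain:
        if hasattr(pipeline, "open_spider"):
            pipeline.open_spider(spider)
    for item in items:
        for pipeline in chain:
            item = pipeline.process_item(item, spider)
    for pipeline in chain:
        if hasattr(pipeline, "close_spider"):
            pipeline.close_spider(spider)


spider = SimpleNamespace(name="itcast")

good = CollectPipeline()
run_chain([(good, 400), (StripWhitespacePipeline(), 300)], [{"name": "  Alice "}], spider)
print(good.collected)

bad = CollectPipeline()
run_chain([(bad, 400), (ForgetfulPipeline(), 300)], [{"name": "Bob"}], spider)
print(bad.collected)
```

In the first run the saving step receives the cleaned item because the cleaning step has the smaller value; in the second run it receives None because the earlier step returned nothing.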

Saving data to MongoDB

itcast.py

......
    def parse(self, response):
        ...
        yield item  # the spider must yield the item to the engine so that the pipelines receive it
......

pipelines.py

from pymongo import MongoClient

class MongoPipeline(object):
    def open_spider(self, spider):
        # 127.0.0.1 and 27017 are MongoDB's default host and port, so both can be omitted on a local machine
        con = MongoClient(host='127.0.0.1', port=27017)
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # dict() just copies the item if an earlier pipeline already converted it to a dict;
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.collection.insert_one(dict(item))
        return item

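Enable the pipeline in settings.py

......
ITEM_PIPELINES = {
    'itcastspider.pipelines.MongoPipeline': 500,  # the smaller the value, the earlier it runs; itcastspider is the name of this project
}
......

Start MongoDB

MongoDB --> bin --> double-click mongod.exe

Check that the data was stored in MongoDB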

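Saving data to MySQL

Cleaning data

Scrapy practice projects

robots and UA experiment

Cookie experiment

Logging in to gitee with a cookies parameter

1. Create the gitee project

scrapy startproject giteeproject

cd giteeproject
scrapy genspider giteespider gitee.com

2. Modify the gitee project

giteespider.py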

import scrapy


class GiteeSpider(scrapy.Spider):
    name = 'gitee'
    # allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/profile/account_information']

    # override start_requests
    def start_requests(self):
        url = self.start_urls[0]
        temp = 'cookie string copied from a logged-in gitee session'
        # split the cookie string into key/value pairs, on the first '=' only since a value may contain '='
        cookies = {data.split('=', 1)[0]: data.split('=', 1)[-1] for data in temp.split('; ')}
        # hand the cookie-carrying request to the engine
        yield scrapy.Request(
            url=url,
            callback=self.parse,  # parse is the default callback, so callback could be omitted
            cookies=cookies
        )

    def parse(self, response):
        title = response.xpath('//div[@class="user-info"]/a/text()').extract_first()
        print(title)

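settings.py

Uncomment ROBOTSTXT_OBEY, USER_AGENT and LOG_LEVEL (adding any that are missing) and set them to:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
USER_AGENT = 'Mozilla/5.0'  # spoof a browser User-Agent
LOG_LEVEL = "WARNING"  # minimum log level to print

No other files need to be changed

3. Run the gitee project

scrapy crawl gitee  # crawl takes the spider's name attribute, which is 'gitee' in giteespider.py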

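Logging in to GitHub with a POST request

Target site: the GitHub login page

Analysis

Open the GitHub login page, press F12 to open the developer tools, tick Network --> Preserve log, then click the Sign in button

You can see that a POST request carrying the username, password and other parameters is sent to https://github.com/session

Work out which parameters change: only authenticity_token, timestamp and timestamp_secret vary between logins; the rest stay the same

Find the parameter values: start with the login page itself; all three values can be read from the source of the login page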
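
The idea of reading the changing parameters out of the login page source can be shown with the standard library alone. The sketch below collects every hidden input from a form and merges in the credentials; the HTML snippet, its values and the HiddenInputCollector class are made up for illustration and are not GitHub's real markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Made-up snippet shaped like a login form; all values are placeholders.
LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN123">
  <input type="hidden" name="timestamp" value="1623942908000">
  <input type="hidden" name="timestamp_secret" value="SECRET456">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

collector = HiddenInputCollector()
collector.feed(LOGIN_HTML)
formdata = dict(collector.fields, login="your-username", password="your-password")
print(formdata)
```

Inside a spider the same values would normally be pulled out with response.xpath(). Scrapy also provides FormRequest.from_response(), which pre-fills a form's fields from the response, as an alternative to listing each hidden field by hand.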

Create the GitHub spider project

scrapy startproject githubProject

cd githubProject

scrapy genspider githubSpider github.com

Fill in the code

In githubSpider.py:

import scrapy


class GithubspiderSpider(scrapy.Spider):
    name = 'githubSpider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # extract the values the POST must carry from the login page source
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        # print(f'{authenticity_token}\n{timestamp}\n{timestamp_secret}')
        yield scrapy.FormRequest(  # send the request with FormRequest
            'https://github.com/session',
            formdata={
                'commit': 'Sign in',
                'authenticity_token': authenticity_token,
                'login': 'your GitHub account',
                'password': 'your GitHub password',
                'webauthn-support': 'supported',
                'webauthn-iuvpaa-support': 'supported',
                'timestamp': timestamp,
                'timestamp_secret': timestamp_secret,
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # response.text is the decoded body; str(response.body) would only give the bytes repr
        if 'email' in response.text:
            print('yes')
        else:
            print('error')

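In settings.py, change or add the following variables:

USER_AGENT = 'Mozilla/5.0'  # spoof a browser User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

Run the GitHub spider project

scrapy crawl githubSpider

Logging in to gitee with a POST request (unfinished)

Press Ctrl+Shift+N to open an incognito window, go to the gitee login page, press F12 to bring up the developer tools, and tick Network --> Preserve log

Enter your username and password, click the login button and watch the Network tab: a POST request carrying the username and password is sent to https://gitee.com/login, followed by a 302 redirect

Log out and log in again the same way: the authenticity_token and encrypt_data[user[password]] fields of the login request change between attempts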

IP experiment

items experiment

pipeline experiment

Saving itcast teacher information to MongoDB

Target site

Source code:

itcast.py

import scrapy
from itcastspider.items import ItcastspiderItem

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    # allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajavaee']

    def parse(self, response):
        teachers = response.xpath('//div[@class="maincon"]/ul/li')
        for node in teachers:
            # temp={}
            item = ItcastspiderItem()
            item['name'] = node.xpath('.//div[@class="main_bot"]//text()').extract()
            item['desc'] = node.xpath('.//div[@class="main_mask"]//text()').extract()
            yield item

items.py

import scrapy

class ItcastspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    desc = scrapy.Field()

pipelines.py

from itemadapter import ItemAdapter
from pymongo import MongoClient

class MongoPipeline(object):
    def open_spider(self, spider):
        con = MongoClient()  # host and port can be omitted on a local machine
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # convert the item object to a dict before inserting;
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.collection.insert_one(dict(item))
        return item

settings.py

ROBOTSTXT_OBEY = False
LOG_LEVEL = "WARNING"

ITEM_PIPELINES = {
   'itcastspider.pipelines.MongoPipeline': 200,
}

Saving data to MySQL

Middleware experiment

scrapy_redis experiment


Original: https://www.cnblogs.com/Merak21/p/14883315.html
