一、增量式爬虫:检测网站数据更新情况,只爬取网站最近更新出来的数据。
核心思路:将爬取过的详情url存储到redis的set集合。
爬虫文件:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementPro.items import IncrementproItem
class MovieSpider(CrawlSpider):
    """Incremental crawler: detects site updates and only crawls detail
    pages whose URL has not been seen before.

    Deduplication store: a Redis set named 'urls' holding every detail-page
    URL that has already been scheduled.
    """
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        # The dot before "html" is escaped so it matches a literal '.'
        # (the original pattern's bare '.' matched any character).
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'),
             callback='parse_item', follow=True),
    )

    # Redis connection object used as the "already crawled" URL registry.
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        """Extract detail-page URLs from a listing page and schedule only
        the ones Redis has not seen before.

        sadd returns 1 when the member is new, 0 when it already exists,
        which is what makes the crawl incremental.
        """
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # Build the absolute detail-page URL.
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # Try to record the URL in the Redis set.
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('该url没有被爬取过,可以进行数据的爬取')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('数据还没有更新,暂无新数据可爬取!')

    def parse_detail(self, response):
        """Parse the movie name and kind from a detail page and yield the
        item for pipeline persistence.
        """
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        # kind spans several text nodes; join them into one string.
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item

    # Backward-compatible alias for the original misspelled method name.
    parst_detail = parse_detail
管道文件:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from redis import Redis
class IncrementproPipeline(object):
    """Persist scraped movie items into the Redis list 'movieData'."""

    conn = None  # Redis connection, created when the spider opens

    def open_spider(self, spider):
        """Create the Redis connection once per spider run."""
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        """Serialize the item and push it onto the 'movieData' Redis list.

        redis-py does not accept a dict as an lpush value, so the record
        is JSON-encoded first (the original code pushed the raw dict,
        which raises redis.exceptions.DataError at runtime).
        """
        import json  # local import keeps this block self-contained
        dic = {
            'name': item['name'],
            'kind': item['kind'],
        }
        print(dic)
        self.conn.lpush('movieData', json.dumps(dic, ensure_ascii=False))
        return item
原文:https://www.cnblogs.com/ajiling/p/14774363.html