
Scraping book ratings across categories with Scrapy and MongoDB

Date: 2020-07-19 23:01:56

Overall result:

[Image: screenshot of the result]

[Image: screenshot of the result]

Overall approach:

Use the category links on the tag page to collect the links to all books.

Step 1: Adjust the settings file

ROBOTSTXT_OBEY = False   # do not obey robots.txt
DOWNLOAD_DELAY = 1  # download delay; best left enabled
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
}  # add request headers so the requests look like they come from a browser
Create a new start.py file to launch the Scrapy crawl, which makes later debugging easier:
from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())
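
cmdline.execute expects an argument list, which is why the command string is split on whitespace. Below is a minimal sketch of building that list. The helper build_crawl_argv and the file name books.json are hypothetical; -o is Scrapy's option for exporting items to a file.

```python
def build_crawl_argv(spider_name, output_path=None):
    """Build the argument list for scrapy.cmdline.execute (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    if output_path is not None:
        # -o asks Scrapy to also export the scraped items to a feed file.
        argv += ["-o", output_path]
    return argv


if __name__ == "__main__":
    print(build_crawl_argv("douban"))
    print(build_crawl_argv("douban", "books.json"))
```

Passing the returned list to cmdline.execute should behave like typing the same command in a terminal.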

Step 2: Define the desired fields in the item

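import scrapy

class DushuItem(scrapy.Item):
    book_title = scrapy.Field()  # book title
    book_link = scrapy.Field()   # link to the book's page
    range_nums = scrapy.Field()  # rating score
    pl = scrapy.Field()          # rating-count text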

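Step 3: The real work begins: crawl the links contained in the tag page

start_urls = ['https://book.douban.com/tag/?view=type&icn=index-sorttags-all']

The start URL, https://book.douban.com/tag/?view=type&icn=index-sorttags-all, contains the links to all tags. Parse it and join each relative link with the site address to build the URLs to request.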

def parse(self, response):
    # the second div inside the article block holds the tag tables
    divs = response.xpath('//div[@class="article"]/div[2]')
    for div in divs:
        tables = div.xpath('.//table[@class="tagCol"]/tbody')
        for table in tables:
            for row in table.xpath('.//tr'):
                hrefs = row.xpath('.//a/@href').getall()
                for href in hrefs:
                    # the hrefs are relative, so resolve them against the page URL
                    detail_url = response.urljoin(href)
                    # request the tag's listing page
                    yield scrapy.Request(url=detail_url, callback=self.parse_tag)
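
The tag links on the page are relative paths, so each one has to be joined with the site address before it can be requested. Here is a small sketch of that joining step using only the standard library. The href values are illustrative and not taken from a real response.

```python
from urllib.parse import urljoin

START_URL = "https://book.douban.com/tag/?view=type&icn=index-sorttags-all"


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Illustrative hrefs (assumed shape of the tag links).
links = absolute_links(START_URL, ["/tag/小说", "/tag/历史"])
```

Inside a spider, response.urljoin(href) performs the same resolution against the response's own URL.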

Step 4: Parse the detail page to extract the desired fields, and return them to the pipeline as items.

def parse_tag(self, response):
    lis = response.xpath('//div[@id="subject_list"]/ul')
    for li in lis:
        book_title = li.xpath('.//div[@class="info"]/h2/a/@title').getall()
        book_link = li.xpath('.//div[@class="info"]/h2/a/@href').getall()
        range_nums = li.xpath('.//div[@class="star clearfix"]/span[2]/text()').getall()
        pl = li.xpath('.//div[@class="star clearfix"]/span[3]/text()').getall()
        # strip surrounding whitespace from the rating-count text
        pls = [text.strip() for text in pl]
        for a, b, c, d in zip(book_title, book_link, range_nums, pls):
            item = DushuItem(book_link=b, book_title=a, range_nums=c, pl=d)
            yield item
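
Because the four lists are collected for the whole page and then zipped, a book that lacks one of the values would leave the lists with different lengths, and zip would then pair later values with the wrong book. The sketch below shows the effect with made-up values, followed by a per-book alternative that keeps each record together.

```python
# Made-up values: the second book has no rating on the page.
titles = ["Book A", "Book B", "Book C"]
ratings = ["9.0", "8.1"]  # only A and C are rated

# Page-wide lists: zip stops at the shorter list and shifts the pairing.
zipped = list(zip(titles, ratings))

# Per-book records: a missing value stays attached to its own book.
books = [
    {"title": "Book A", "rating": "9.0"},
    {"title": "Book B", "rating": None},
    {"title": "Book C", "rating": "8.1"},
]
paired = [(book["title"], book["rating"]) for book in books]
```

In the spider, the per-book form would roughly correspond to looping over each li element and calling .get() on XPath expressions relative to it.
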
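Step 5: Get the next-page link, check that it exists, and request it
    # still inside parse_tag, after the loop over the book list
    next_url = response.xpath('//span[@class="next"]/a/@href').get()
    if next_url:
        next_link = response.urljoin(next_url)
        yield scrapy.Request(url=next_link, callback=self.parse_tag)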
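
The pagination logic amounts to following the next-page link until a page has none. Below is a self-contained sketch of that control flow over a fake site. The PAGES mapping and its paths are hypothetical stand-ins for real responses.

```python
from urllib.parse import urljoin

BASE = "https://book.douban.com"

# Hypothetical stand-in for the site: page path -> href of its "next" link.
PAGES = {
    "/tag/demo": "/tag/demo?start=20",
    "/tag/demo?start=20": "/tag/demo?start=40",
    "/tag/demo?start=40": None,  # the last page has no next link
}


def crawl_order(first_path):
    """Return the absolute URLs visited by following next links."""
    visited = []
    path = first_path
    while path is not None:
        visited.append(urljoin(BASE, path))
        path = PAGES[path]
    return visited


order = crawl_order("/tag/demo")
```

Scrapy expresses the same loop through callbacks: each parse_tag call yields at most one request for the following page.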

Step 6: Store the items in MongoDB

Start the MongoDB service:

In cmd, run  mongod --dbpath="D:\MongoDB\db"

In the pipelines file, write:

from pymongo import MongoClient
from scrapy import Item


class MongoDBPipeline(object):

    # open the database connection
    def open_spider(self, spider):
        db_uri = spider.settings.get('MONGODB_URI', 'mongodb://localhost:27017')
        db_name = spider.settings.get('MONGODB_DB_NAME', 'scrapy_db')

        self.db_client = MongoClient(db_uri)
        self.db = self.db_client[db_name]

    # close the database connection
    def close_spider(self, spider):
        self.db_client.close()

    # process each item
    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    # insert the item into the books collection
    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db.books.insert_one(item)
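
Plain inserts store a book again every time the crawl is rerun. One possible variation, not part of the original pipeline, is to upsert on book_link using PyMongo's update_one with upsert=True. The sketch below accepts any collection-like object. FakeCollection is a hypothetical in-memory stand-in, so the logic can be tried without a MongoDB server, and the example.com link is made up.

```python
def upsert_book(collection, item):
    """Insert the book, or update it if the same book_link already exists."""
    doc = dict(item)
    collection.update_one(
        {"book_link": doc["book_link"]},  # match on the book's URL
        {"$set": doc},
        upsert=True,
    )


class FakeCollection:
    """Hypothetical stand-in that mimics only this use of update_one."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter["book_link"]
        if key in self.docs or upsert:
            self.docs.setdefault(key, {}).update(update["$set"])


fake = FakeCollection()
upsert_book(fake, {"book_link": "https://example.com/1", "book_title": "A", "pl": "(10)"})
upsert_book(fake, {"book_link": "https://example.com/1", "book_title": "A", "pl": "(12)"})
```

With a real database, self.db.books would be passed as the collection argument.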

Add to settings:
MONGODB_URI = 'mongodb://127.0.0.1:27017'
MONGODB_DB_NAME = 'scrapy_db'
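
The pipeline looks these values up by name and falls back to a default when a name is missing, so a misspelled setting name fails silently. The sketch below shows that behaviour with a plain dict standing in for the Scrapy settings object. The value books_db is made up.

```python
# A plain dict stands in for the Scrapy settings object (assumption).
settings = {
    "MONGODB_URI": "mongodb://127.0.0.1:27017",
    "MONGODB_DB_NAME": "books_db",  # made-up, non-default name
}

# Correct key: the configured value is returned.
configured = settings.get("MONGODB_DB_NAME", "scrapy_db")

# Misspelled key: no error is raised, the default is used instead.
fallback = settings.get("MONOGDB_DB_NAME", "scrapy_db")
```

Keeping the name in settings and the name in the lookup identical is what makes a non-default database name take effect.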

Step 7: Enable the pipeline
ITEM_PIPELINES = {
    # 'dushu.pipelines.DushuPipeline': 300,
    'dushu.pipelines.MongoDBPipeline': 300,
}

Start the crawl.

 


Original: https://www.cnblogs.com/kkdadao/p/13341175.html
