
Scrapy crawler framework (Part 2)

Posted: 2020-02-16 23:40:19


Saving the scraped data to a JSON file

In settings.py, enable the item pipeline. The number is the pipeline's priority: the smaller the value, the higher the priority.

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'qsbkSpider.pipelines.QsbkspiderPipeline': 300,
}
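
When several pipelines are enabled, each item passes through them in ascending order of this number. A sketch with a second, purely hypothetical DedupPipeline (not part of this project) to make the ordering concrete:

ITEM_PIPELINES = {
    'qsbkSpider.pipelines.DedupPipeline': 200,       # hypothetical; lower value, runs first
    'qsbkSpider.pipelines.QsbkspiderPipeline': 300,  # runs second
}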

qsbk.py

# -*- coding: utf-8 -*-
import scrapy


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['www.yicommunity.com']
    start_urls = ['http://www.yicommunity.com/']

    def parse(self, response):
        print("=" * 80)
        contents = response.xpath('//div[@class="col1"]/div')
        print(contents)
        print("=" * 80)
        for content in contents:
            author = content.xpath("./div[@class='author']/text()").get()
            word = content.xpath("./div[@class='content']/text()").get()
            print(author, word)
            duanzi = {"author": author, "word": word}
            # yield turns parse() into a generator; items are returned one at a time as it is iterated
            yield duanzi  # handed to the engine, which passes it on to the pipeline
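
The XPath expressions used in parse() can be tried out interactively in the Scrapy shell before running the full spider. What these selectors return depends on the live page, so treat the session below as a sketch:

scrapy shell "http://www.yicommunity.com/"
>>> contents = response.xpath('//div[@class="col1"]/div')
>>> contents[0].xpath("./div[@class='author']/text()").get()
>>> contents[0].xpath("./div[@class='content']/text()").get()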

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class QsbkspiderPipeline(object):
    def __init__(self):  # runs once, when the pipeline object is created
        self.fp = open("duanzi.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output file
        item_json = json.dumps(item, ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def open_spider(self, spider):
        print("Spider started!")

    def close_spider(self, spider):
        self.fp.close()
        print("Spider finished!")

Running the spider

scrapy crawl qsbk


A duanzi.json file is generated at the same time.



Optimization

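The pipeline above serializes by hand and opens the output file in __init__, even if the spider never runs. A common refinement, shown here as a sketch rather than a transcription of the original code, is to open the file in open_spider and let Scrapy's built-in JsonLinesItemExporter do the serialization:

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonLinesItemExporter


class QsbkspiderPipeline(object):
    def open_spider(self, spider):
        # the exporter writes bytes, so open the file in binary mode
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.fp.close()

Alternatively, Scrapy's feed exports can produce the same JSON Lines file with no pipeline code at all: scrapy crawl qsbk -o duanzi.jl.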


Original post: https://www.cnblogs.com/senup/p/12319119.html
