6-Web Scraping-

Date: 2020-07-09 15:47:09

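How to scrape image data (binary data) with Scrapy

1. In the spider file, parse out each image's URL and name, wrap them in an item object, and submit the item to the pipeline.
2. In the pipeline file:
    - from scrapy.pipelines.images import ImagesPipeline
    - Define a pipeline class that inherits from ImagesPipeline
    - Override three methods of the parent class:
        - get_media_requests
        - file_path: only needs to return the image file name
        - item_completed
3. Add the following setting to the configuration file:
    - IMAGES_STORE = 'path/to/folder'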
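
As a rough sketch of step 3, the relevant part of settings.py might look like the following. The folder name `./imgs` and the priority `300` are placeholder assumptions, and the dotted path `xiaohuaPro.pipelines.XiaohuaproPipeline` is inferred from the project and class names used in this post. The pipeline class also has to be registered in ITEM_PIPELINES for Scrapy to run it, and ImagesPipeline depends on the Pillow library being installed.

```python
# settings.py (sketch; the concrete values are illustrative assumptions)

# Directory where ImagesPipeline saves downloaded images.
# Whatever file_path() returns is treated as a path relative to this folder.
IMAGES_STORE = "./imgs"

# Enable the custom pipeline. Lower numbers run earlier (customary range 0-1000).
ITEM_PIPELINES = {
    "xiaohuaPro.pipelines.XiaohuaproPipeline": 300,
}
```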

xiaohua.py

# -*- coding: utf-8 -*-
import scrapy
from xiaohuaPro.items import XiaohuaproItem

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        # Extract the image URL and name from each <li> in the listing
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            # The src attribute is site-relative, so prepend the site root
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
            # Use the alt text as the file name
            img_name = li.xpath('./a[1]/img/@alt').extract_first() + '.jpg'
            item = XiaohuaproItem()
            item['img_name'] = img_name
            item['img_src'] = img_src

            yield item
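
The spider above concatenates strings directly, which raises a TypeError whenever `extract_first()` returns None (that is, when an `<li>` has no matching `<img>`). A minimal sketch of a more defensive way to build the two fields, using only the standard library, could look like this. `build_image_fields` and `BASE_URL` are hypothetical names introduced for illustration; they are not part of Scrapy or of the original project.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.521609.com"  # site root used by the spider in this post


def build_image_fields(src, alt, base_url=BASE_URL):
    """Return (img_src, img_name), or None if either attribute is missing.

    Hypothetical helper for illustration only.
    """
    if not src or not alt:
        # extract_first() yields None when the XPath matches nothing
        return None
    # urljoin copes with both site-relative and already-absolute src values
    img_src = urljoin(base_url, src)
    img_name = alt.strip() + ".jpg"
    return img_src, img_name
```

Inside `parse`, the loop could then skip any `<li>` for which the helper returns None instead of crashing on it.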

pipelines.py

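import scrapy
from scrapy.pipelines.images import ImagesPipeline


class XiaohuaproPipeline(ImagesPipeline):
    # Send a download request for the image data.
    # The item parameter is the item submitted by the spider file.
    def get_media_requests(self, item, info):
        # meta carries a dict through to the file_path method
        yield scrapy.Request(item['img_src'], meta={'item': item})

    # Specify the path where the image is stored (relative to IMAGES_STORE)
    def file_path(self, request, response=None, info=None):
        # Recover the image name from the item attached to the request
        item = request.meta['item']
        img_name = item['img_name']
        return img_name

    # Returning the item passes it on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item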
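
The `item_completed` override above ignores its `results` argument. In Scrapy's media pipelines that argument is a list of `(success, detail)` pairs, where `detail` is a dict with keys such as `url` and `path` when the download succeeded and a failure object otherwise. A small sketch of how the successful paths might be collected follows; `downloaded_paths` is a hypothetical helper, and the sample data is made up, with a plain `Exception` standing in for the failure object.

```python
def downloaded_paths(results):
    """Collect the stored paths of the images that downloaded successfully.

    `results` has the shape passed to item_completed: a list of
    (success, detail) pairs. Hypothetical helper for illustration only.
    """
    return [detail["path"] for ok, detail in results if ok]


# Sample shaped like the pipeline's results argument (values are invented)
sample_results = [
    (True, {"url": "http://www.521609.com/a.jpg", "path": "a.jpg", "checksum": "abc"}),
    (False, Exception("download failed")),  # stand-in for a failure object
]
print(downloaded_paths(sample_results))
```

One possible use is to call this inside `item_completed` and raise `scrapy.exceptions.DropItem` when the list is empty, so that items whose image failed to download do not reach later pipelines.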

Original: https://www.cnblogs.com/wgwg/p/13273967.html
