Scrapy Framework (3): Full-Site Data Crawling

Date: 2020-06-30 21:59:48

Full-site data crawling with Scrapy's Spider class

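Most websites paginate the data they display. In a crawler, "full-site data crawling" means scraping the page behind every page number.
How is this done with Scrapy? There are two options:
1. Put the URL of every page into the spider's start URL list (start_urls). (Not recommended)
2. Issue the follow-up requests manually with scrapy.Request. (Recommended)
Task: scrape the names of the photos on 校花网 (www.521609.com).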
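To make the first option concrete, here is a minimal sketch of how the start URL list could be built up front. The helper name `build_start_urls` is hypothetical, the URL pattern is the one used by the spider in this post, and the page count of 23 is an assumption carried over from that spider. The sketch also shows why this option is discouraged: every URL and the total page count must be known before the crawl starts.

```python
# Sketch of option 1: precompute every page URL for start_urls.
# FIRST_PAGE and URL_TEMPLATE come from the spider in this post;
# LAST_PAGE = 23 is an assumption, not a value read from the site.
FIRST_PAGE = 'http://www.521609.com/daxuemeinv/'
URL_TEMPLATE = 'http://www.521609.com/daxuemeinv/list8%s.html'
LAST_PAGE = 23


def build_start_urls(first_page, template, last_page):
    """Return the first page followed by pages 2..last_page (hypothetical helper)."""
    return [first_page] + [template % n for n in range(2, last_page + 1)]


start_urls = build_start_urls(FIRST_PAGE, URL_TEMPLATE, LAST_PAGE)
```

Assigning the resulting list to the spider's `start_urls` attribute would make Scrapy request every page with the default `parse` callback, but the hardcoded page count goes stale as soon as the site adds or removes pages.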

# -*- coding: utf-8 -*-
import scrapy


class XiahuaSpider(scrapy.Spider):
    name = 'xiahua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    # URL template for pages 2 and later; %s is replaced by the page number
    url = 'http://www.521609.com/daxuemeinv/list8%s.html'
    page_num = 2  # number of the next page to request

    def parse(self, response):
        # Each <li> holds one photo; its name is in the img alt attribute
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_name = li.xpath('./a[1]/img/@alt').extract_first()
            print(img_name)
        if self.page_num <= 23:  # assume 23 pages in total
            new_url = self.url % self.page_num
            self.page_num += 1
            # Issue the request manually; the response is parsed by the
            # function given as callback (here, parse itself)
            yield scrapy.Request(url=new_url, callback=self.parse)
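The control flow of the spider above can be easier to follow without the framework. The sketch below is a standard-library simulation only, not Scrapy's actual engine: `FakeRequest`, `PagingSketch`, and `run` are hypothetical stand-ins, and no network access happens. It shows that each call to `parse` schedules at most one follow-up request, so the chain of callbacks ends once the page counter passes the assumed last page.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class PagingSketch:
    """Mimics the paging logic of the spider; it downloads nothing."""

    start_url = 'http://www.521609.com/daxuemeinv/'
    url = 'http://www.521609.com/daxuemeinv/list8%s.html'
    last_page = 23  # assumed page count, as in the post

    def __init__(self):
        self.page_num = 2
        self.visited = []

    def parse(self, response_url):
        # Record the page being "parsed", then schedule the next one
        self.visited.append(response_url)
        if self.page_num <= self.last_page:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield FakeRequest(new_url, self.parse)


def run(spider):
    """Toy scheduler: pop a request, call its callback, queue what it yields."""
    queue = deque([FakeRequest(spider.start_url, spider.parse)])
    while queue:
        request = queue.popleft()
        for follow_up in request.callback(request.url):
            queue.append(follow_up)
    return spider.visited


visited = run(PagingSketch())
```

In the real framework the scheduler, downloader, and response objects are far more involved, but the pattern is the same: yielding a request from a callback is what keeps the crawl going.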

Original: https://www.cnblogs.com/sxy-blog/p/13215776.html
