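I have only just started learning Python, and when it comes to scraping data from the web I am still at the stage of copying code by rote. Without further ado, here is my first crawl.
1. Create the project
1) Command to create the project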
scrapy startproject wooyun
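This command creates a wooyun folder in the current directory.
2) Define items.py
Scrapy provides the Item class for holding data scraped from a page. It is somewhat like deserialization in Java, except that deserialization turns a byte stream into a Java object, whereas Item is a general-purpose class that stores and retrieves data as key/value pairs. Every field of an Item is declared with scrapy.Field(), and a field's value can be of any type, such as an integer, a string, or a list.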
import scrapy


class WooyunItem(scrapy.Item):
    commitDate = scrapy.Field()
    bugName = scrapy.Field()
    author = scrapy.Field()
3) I save the scraped data in a MongoDB database, so I set the following in settings.py
# Disable cookies to avoid getting banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,  # pipeline order, customarily 0-1000; lower values run first
}

MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE = "local"
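To make the number in ITEM_PIPELINES concrete: Scrapy runs the enabled pipelines in ascending order of that value. The sketch below only mimics that ordering with a plain dict and does not touch Scrapy itself. The second entry, wooyun.pipelines.CleanupPipeline, and the helper pipeline_order are hypothetical names that do not exist in this project.

```python
# Stand-in for the ITEM_PIPELINES setting; CleanupPipeline is hypothetical.
ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,
    'wooyun.pipelines.CleanupPipeline': 100,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by their order value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values the hypothetical cleanup step would see each item before the MongoDB pipeline does.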
4) Set up the pipeline in pipelines.py
# -*- coding: utf-8 -*-
import datetime

import pymongo

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WooyunPipeline(object):
    # Class attributes are evaluated once, when this module is imported
    now = datetime.datetime.now()
    collection_name = "wooyun_" + now.strftime('%Y%m%d')

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert
        self.db[self.collection_name].insert_one(dict(item))
        return item
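The collection name in the pipeline is built from the current date, and because it is a class attribute it is computed once, when pipelines.py is imported. Here is a minimal sketch of the same naming scheme as a standalone function. The name dated_collection_name and its parameters are hypothetical and are not part of Scrapy or pymongo.

```python
import datetime


def dated_collection_name(prefix="wooyun_", when=None):
    """Build a collection name from a prefix and a date in YYYYMMDD form."""
    if when is None:
        when = datetime.datetime.now()
    return prefix + when.strftime('%Y%m%d')


# A fixed date makes the result predictable
name = dated_collection_name(when=datetime.datetime(2016, 2, 22))
print(name)
```

Calling such a function from open_spider instead would pick up the date at the moment each crawl starts, which may matter for a long-running process.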
5) Finally, write the spider and define in it the data to scrape
# -*- coding: utf-8 -*-
import scrapy

from wooyun.items import WooyunItem


class WooyunSpider(scrapy.Spider):
    name = "wooyun"  # must match the name passed to "scrapy crawl"
    allowed_domains = ["wooyun.org"]
    start_urls = [
        "http://www.wooyun.org/bugs/page/1",
    ]

    def parse(self, response):
        # Scrape the first page, then queue pages 2 to 19
        for item in self.parse_news(response):
            yield item
        for i in range(2, 20):
            next_page_url = "http://www.wooyun.org/bugs/page/" + str(i)
            yield scrapy.Request(next_page_url, callback=self.parse_news)

    def parse_news(self, response):
        news_page_num = 20  # rows per listing page
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                row = "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]"
                item = WooyunItem()
                # Only fields declared in WooyunItem can be assigned, so the
                # link href (td[1]/a/@href) is not stored and author stays unset
                item['bugName'] = response.xpath(row + "/td[1]/a/text()").extract()
                item['commitDate'] = response.xpath(row + "/th[1]/text()").extract()
                yield item
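The spider addresses each table row by position (tr[1], tr[2], and so on) and then reads cells relative to that row. The sketch below shows the same idea outside Scrapy, using the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree supports only a subset of XPath (no text() or @href steps), so the text and the href are read through .text and .get(). The PAGE markup is a hypothetical, heavily simplified stand-in and is not the real structure of the wooyun.org pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one listing page (must be well-formed XML here)
PAGE = """
<html><body><div class="content">
  <table/><table/>
  <table><tbody>
    <tr><th>2016-02-20</th><td><a href="/bugs/example-1">First bug</a></td></tr>
    <tr><th>2016-02-21</th><td><a href="/bugs/example-2">Second bug</a></td></tr>
  </tbody></table>
</div></body></html>
""".strip()

root = ET.fromstring(PAGE)
rows = []
for j in range(1, 21):
    # Positional lookup: the j-th row of the third table inside div.content
    row = root.find(".//div[@class='content']/table[3]/tbody/tr[%d]" % j)
    if row is None:
        break  # fewer than 20 rows on this page
    link = row.find("td[1]/a")
    date = row.find("th[1]")
    rows.append({
        'title': link.text,
        'url': link.get('href'),
        'date': date.text,
    })

for r in rows:
    print(r)
```

Stopping at the first missing row avoids emitting empty records when a page holds fewer than 20 entries.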
6) Run the crawl command
scrapy crawl wooyun
Done!
This article comes from the "月中笙歌" blog; please be sure to keep this attribution: http://727229447.blog.51cto.com/10866573/1744509