
scrapy-scrapyd-scrapydweb

Posted: 2020-02-11 20:47:51

Site to scrape:

http://sohu.com/c/8/1463?spm=smpc.null.side-nav.16.1581303075427Zowrm4P

 

 

Final file structure

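The original screenshot is lost; based on the files edited below, the layout should look roughly like this (scrapy.db appears after the first crawl, scrapydweb_settings_v10.py after the first scrapydweb run):

news/
    scrapy.cfg
    scrapy.db
    scrapydweb_settings_v10.py
    news/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            souhunews.py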

 

 

scrapy 

First, create the basic spider project structure at the command prompt:

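The command screenshot is lost; the standard Scrapy commands that produce this skeleton (using the project and spider names from the rest of the post) would be:

scrapy startproject news
cd news
scrapy genspider souhunews sohu.com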

 

 

Edit the news\news\spiders\souhunews.py file:

# -*- coding: utf-8 -*-
import scrapy
from news.items import NewsItem
import sqlite3


class SouhunewsSpider(scrapy.Spider):
    name = 'souhunews'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/c/8/1463?'
                  'spm=smpc.null.side-nav.16.1581303075427Zowrm4P']

    def parse(self, response):
        # Drop and recreate the books table so every crawl starts from an
        # empty table (parse() runs once, since there is a single start URL)
        conn = sqlite3.connect('scrapy.db')
        c = conn.cursor()
        try:
            c.execute('DROP TABLE books')
        except sqlite3.OperationalError:
            pass

        c.execute('CREATE TABLE books(title text primary key, link text, fro text)')
        conn.commit()
        conn.close()

        for x in response.xpath("//div[@data-role='news-item']"):
            item = NewsItem()
            item['link'] = x.css("h4>a::attr(href)").get()
            item['title'] = x.xpath("h4/a/text()").get()
            item['fro'] = x.xpath("div/span[@class='name']/a/text()").get()
            yield item

 

Edit the news\news\items.py file:

import scrapy


class NewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    link = scrapy.Field()   # article URL
    title = scrapy.Field()  # headline
    fro = scrapy.Field()    # source / publisher name

 

Edit the news\news\pipelines.py file:

import sqlite3
import smtplib
from email.mime.text import MIMEText


class NewsPipeline(object):
    def __init__(self):
        pass

    def process_item(self, item, spider):
        return item


class SQLitePipeline(object):

    # open the database
    def open_spider(self, spider):
        db_name = spider.settings.get('SQLITE_DB_NAME', 'scrapy.db')
        self.db_conn = sqlite3.connect(db_name)
        self.db_cur = self.db_conn.cursor()

    # close the database, then email the crawl results
    def close_spider(self, spider):
        self.db_conn.commit()
        self.db_conn.close()

        conn = sqlite3.connect(r'E:\testpy\news\scrapy.db')
        c = conn.cursor()
        c.execute('SELECT * FROM books')
        data = c.fetchall()
        conn.close()

        txt = ''
        for d in data:
            txt += '<p>Title: {}</p><p><a href="{}">Link</a></p><p>Source: {}</p>'.format(d[0], d[1], d[2])

        # mail account settings (masked in the original post)
        user = '#########@qq.com'
        pwd = '#############'  # the SMTP authorization code, not the login password
        to = '#############@qq.com'

        msg = MIMEText(txt, 'html', 'utf-8')
        msg['Subject'] = 'News'
        msg['From'] = user
        msg['To'] = to

        s = smtplib.SMTP()
        s.connect('smtp.qq.com', 25)
        s.login(user, pwd)
        s.sendmail(user, to, msg.as_string())
        s.quit()

    # process the scraped data
    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    # insert one row; the [2:-2] and [2:] slices are kept from the original
    # post (with .get() the values are plain strings, so they may not be needed)
    def insert_db(self, item):
        values = (
            item['title'][2:-2].strip(),
            item['link'][2:],
            item['fro'],
        )
        sql = 'INSERT INTO books VALUES(?,?,?)'
        self.db_cur.execute(sql, values)

 

Add the following to the news\news\settings.py file (add it, do not replace the file):

ITEM_PIPELINES = {
    'news.pipelines.SQLitePipeline': 400,
}  # wires up the SQLite pipeline defined above

SQLITE_DB_NAME = 'scrapy.db'

Finally, enter at the command prompt:

scrapy crawl souhunews
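
To confirm the crawl actually populated the database, a quick check (a sketch, assuming the default scrapy.db in the project folder):

import sqlite3

conn = sqlite3.connect('scrapy.db')
for row in conn.execute('SELECT * FROM books LIMIT 5'):
    print(row)  # (title, link, fro)
conn.close()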

 

scrapyd

pip install scrapyd

pip install scrapyd-client

After installing, start scrapyd at the command prompt, then open the default_scrapyd.conf file and change the parameter bind_address = 0.0.0.0.
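
The screenshot is gone; after the edit, the relevant part of default_scrapyd.conf should look roughly like this (other defaults omitted):

[scrapyd]
bind_address = 0.0.0.0
http_port    = 6800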


 

 

Open two command prompt windows; in one of them, start scrapyd first by entering the scrapyd command (this step is important, otherwise you will get a Deploy failed error later).

Edit the news\scrapy.cfg file:

[settings]
default = news.settings

[deploy:demo]
url = http://localhost:6800/
project = news

 

Locate the scrapyd-deploy script; in the folder that contains it, create a scrapyd-deploy.bat file so the command can be run directly on Windows. Its content is (adjust the paths to your own installation):

@echo off

"D:\python\Anaconda\anaconda\python.exe" "D:\python\Anaconda\anaconda\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9


 

 

In the command prompt, change into the news folder, then enter:

scrapyd-deploy demo -p souhunews

curl http://localhost:6800/schedule.json -d project=news -d spider=souhunews
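
If the schedule request succeeds, scrapyd answers with JSON along these lines (the jobid will differ on every run):

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}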


 

 

Then open 127.0.0.1:6800 in a browser


 

scrapydweb

Make sure scrapyd is running, then enter scrapydweb at the command prompt (install it first with pip install scrapydweb). The first run generates the scrapydweb_settings_v10.py file edited below.

Edit news\scrapydweb_settings_v10.py:
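
The screenshot of the edit is gone; the option names below are real scrapydweb settings, but exactly which ones the post changed is a guess. The most likely edit:

SCRAPYDWEB_BIND = '0.0.0.0'    # address the web UI listens on
SCRAPYDWEB_PORT = 5000         # matches the http://127.0.0.1:5000 URL below
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',          # the scrapyd instance started earlier
]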


 

Enter scrapydweb at the command prompt again, then visit http://127.0.0.1:5000


 

For the rest, see: https://github.com/my8100/files/blob/master/scrapydweb/README.md

One problem remains: every email sent has the same content. The database handling is the likely culprit: close_spider reads from the hard-coded path E:\testpy\news\scrapy.db, while the pipeline writes to SQLITE_DB_NAME relative to the working directory, so when the spider runs under scrapyd the two can be different files and the email keeps reporting a stale database.

References:

 

https://www.jianshu.com/p/ddd28f8b47fb

https://www.jianshu.com/p/060ffe018491

https://www.cnblogs.com/du-jun/p/10515376.html

https://scrapyd.readthedocs.io/en/stable/

https://www.jianshu.com/p/1df101fe6408

https://github.com/my8100/files/blob/master/scrapydweb/README.md


Original post: https://www.cnblogs.com/puddingsmall/p/12296415.html
