Learning by Example: Scraping the Full Text of the Novel 《斗破苍穹》 (Doupo Cangqiong)

Posted: 2019-08-10 01:30:47

Prerequisites: basic Python syntax and regular expressions

Development environment: (Windows) Eclipse + PyDev

Target site: www.doupoxs.com/doupocangqiong/

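import re
import time

import requests

# Send a browser-like User-Agent header, which makes the scraper more stable
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

# Create the txt file and open it in append mode.
# The raw string keeps the Windows backslashes literal; encoding='utf-8' avoids
# encode errors from the platform default codec when writing Chinese text.
f = open(r'D:\Pyproject\doupo\doupo.txt', 'a+', encoding='utf-8')


def get_info(url):
    """Scrape the text of a single page and append it to the txt file."""
    res = requests.get(url, headers=headers)
    if res.status_code == 200:  # 200 means the request succeeded; other responses are skipped
        # Decode the body as UTF-8, then capture the text inside every <p>...</p>
        contents = re.findall('<p>(.*?)</p>', res.content.decode('UTF-8'), re.S)
        for content in contents:
            f.write(content + '\n')  # Write each matched paragraph on its own line


if __name__ == '__main__':
    # Pages to scrape: 2.html through 1664.html (range() excludes the end value)
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(i) for i in range(2, 1665)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # Wait one second between requests
f.close()  # Close the file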
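
To see what the regular expression in get_info does without sending any requests, here is a minimal sketch that runs the same pattern on a small HTML string. The sample markup and the helper name extract_paragraphs are made up for illustration; the real pages may be structured differently.

```python
import re

# Hypothetical sample markup, not taken from the real site.
sample_html = """<div class="content">
<p>First paragraph.</p>
<p>Second paragraph
spans two lines.</p>
</div>"""


def extract_paragraphs(html):
    """Return the text inside every <p>...</p> pair, in document order."""
    # (.*?) is non-greedy, so each match stops at the nearest </p>.
    # re.S lets "." match newlines, so multi-line paragraphs are captured too.
    return re.findall('<p>(.*?)</p>', html, re.S)


paragraphs = extract_paragraphs(sample_html)
print(paragraphs)

# Without re.S, "." stops at a line break, so the two-line paragraph is missed.
print(re.findall('<p>(.*?)</p>', sample_html))
```

The second call finds only the single-line paragraph, which is why the scraper passes re.S. Note that the pattern captures everything between the tags as-is, so any nested tags or HTML entities inside a paragraph end up in the output file unchanged.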

Result:

[Image]

For how to obtain the request headers and related details, see my other post; I won't repeat it here: https://www.cnblogs.com/junecode/p/11306266.html

Original post: https://www.cnblogs.com/junecode/p/11330183.html
