bs4: scraping a novel

Date: 2019-08-24 19:54:04

bs4

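bs4 can be used in two ways: parsing a local resource (an HTML file on disk) or parsing a network resource (HTML fetched over HTTP).

Local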

from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Open the local file in a with-block so the handle is closed after parsing
    with open("wl.html", 'r', encoding="utf8") as fr:
        soup = BeautifulSoup(fr, 'lxml')
    print(soup)
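
The local case can be tried without an existing wl.html. The following is a minimal sketch, assuming a throwaway file written to a temporary directory and the built-in "html.parser" backend in place of lxml (both are assumptions made so the example is self-contained):

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content, written to a temporary file for the demo
page = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wl.html")
    with open(path, "w", encoding="utf8") as fw:
        fw.write(page)
    # BeautifulSoup reads the whole file while constructing the object,
    # so the soup stays usable after the file is closed
    with open(path, "r", encoding="utf8") as fr:
        soup = BeautifulSoup(fr, "html.parser")

title_text = soup.title.string
print(title_text)
```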

Network

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}

if __name__ == '__main__':
    url="https://www.cnblogs.com/zx125/p/11404486.html"
    res=requests.get(url=url,headers=headers)
    soup=BeautifulSoup(res.text,'lxml')
    print(soup)

Methods on the instantiated object

soup.tagname

Returns the first tag with that name as a Tag object, or None if the document has no such tag.
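
A minimal sketch of attribute-style access, assuming a small inline HTML string and the built-in "html.parser" backend (neither is from the original post):

```python
from bs4 import BeautifulSoup

html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p        # first <p> in the document
no_span = soup.span     # no <span> exists, so this is None

print(first_p.text)
print(no_span)
```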

soup.find()

soup.find(tagname) behaves like soup.tagname: it returns the first matching tag.

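soup.find(tagname, class_="...") filters on the tag's class attribute. The keyword is class_ with a trailing underscore because class is a reserved word in Python.

Filters can be given as class_, id, or attrs (a dict of attribute names to values).

Combining a tag name with an attribute filter narrows the match, but find() still returns only the first matching tag.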
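
The three kinds of filter can be sketched as follows. The HTML snippet, the class names, and the data-kind attribute are made up for illustration, and "html.parser" is used so no extra parser is needed:

```python
from bs4 import BeautifulSoup

html = """
<div id="main" class="zx box">
  <a class="link" href="/a" data-kind="x">A</a>
  <a class="link" href="/b" data-kind="y">B</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("a", class_="link")             # first <a class="link">
by_id = soup.find("div", id="main")                  # <div id="main">
by_attrs = soup.find("a", attrs={"data-kind": "y"})  # arbitrary attribute filter

print(by_class["href"], by_id["class"], by_attrs.text)
```

attrs is useful for attribute names such as data-kind that cannot be written as Python keyword arguments.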

soup.find_all()
Takes the same arguments as find(), but returns every tag that matches instead of only the first.
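
A short sketch of find_all(), assuming a made-up HTML snippet and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a class="link" href="/a">A</a></li>
  <li><a class="link" href="/b">B</a></li>
  <li><a href="/c">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a", class_="link")   # every <a class="link">
hrefs = [a["href"] for a in links]
only_one = soup.find_all("a", limit=1)      # limit caps the number of results
nothing = soup.find_all("span")             # no match gives an empty list

print(hrefs)
```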

soup.select()

select('a CSS selector such as #id, .class, or a tag name') returns a list.

It supports CSS selectors.

Hierarchy selectors

soup.select('.zx > ul > li > a'): each > steps down exactly one level (direct child).
soup.select('.zx > ul a'): a space matches descendants at any depth, so this finds every a under the ul.
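
The difference between the child combinator and the descendant combinator can be seen on a small, made-up document (the .zx class comes from the text above; the rest of the markup is an assumption):

```python
from bs4 import BeautifulSoup

html = """
<div class="zx">
  <ul>
    <li><a href="/1">one</a></li>
    <li><span><a href="/2">two</a></span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags that are direct children of an <li>
direct = [a["href"] for a in soup.select(".zx > ul > li > a")]
# Every <a> at any depth below the <ul>
nested = [a["href"] for a in soup.select(".zx > ul a")]

print(direct, nested)
```

The second link is wrapped in a span, so only the descendant form finds it.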

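Getting a tag's text content

soup.select('.zx > ul a')[0].text / .string / .get_text()

select() returns a list, so index into it (or loop over it) before reading the text.

text and get_text() return all the text under the tag, including text inside nested tags.
string returns only the tag's own direct text; it is None when the tag has more than one child.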
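
A small sketch comparing the three accessors, assuming a made-up snippet in which the first link contains a nested b tag:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/1">one <b>bold</b></a></li>
  <li><a href="/2">two</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first, second = soup.select("ul a")

all_text = first.get_text()    # text of the tag and everything nested in it
same_text = first.text         # property form of get_text()
first_string = first.string    # None: the tag has two children
second_string = second.string  # the single text child

print(all_text, first_string, second_string)
```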

Getting a tag's attribute values

soup.a["href"]
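
Attribute access works like a dictionary lookup on the tag. A minimal sketch, assuming a made-up link:

```python
from bs4 import BeautifulSoup

html = '<a href="/book/1.html" class="chapter link">Chapter 1</a>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a
href = a["href"]          # raises KeyError if the attribute is missing
missing = a.get("title")  # returns None instead of raising
classes = a["class"]      # class is multi-valued, so this is a list

print(href, missing, classes)
```

Using get() is the safer choice when scraping pages where some tags may lack the attribute.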

Example: scraping a novel's chapter titles and contents

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}
def work():
    url="http://www.shicimingju.com/book/sanguoyanyi.html"
    res=requests.get(url=url,headers=headers).text
    # Parse the book's index page
    soup=BeautifulSoup(res,"lxml")
    # Select every <a> tag that holds a chapter title
    titles=soup.select(".book-mulu > ul > li > a")
    with open("./sangup.txt","w",encoding="utf8") as fw:
        for i in titles:
            # Chapter title
            title=i.text
            # Build the full chapter URL from the relative href
            url_title="http://www.shicimingju.com"+i['href']
            res2=requests.get(url=url_title,headers=headers).text
            soup2=BeautifulSoup(res2,"lxml")
            # Extract the body text of the chapter
            content=soup2.find("div",class_="chapter_content").text
            context_all=title+"\n"+content+"\n"
            # Write the title and the chapter text to the local file
            fw.write(context_all)
            print(title+" written successfully")

if __name__ == '__main__':
    work()
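
The example builds each chapter URL by string concatenation, which only works while every href starts with a slash. As a hedged alternative, urllib.parse.urljoin from the standard library resolves an href against the page it came from. The helper name chapter_url and the href values below are hypothetical; they are not taken from the site:

```python
from urllib.parse import urljoin

INDEX_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"

def chapter_url(href, base=INDEX_URL):
    """Resolve an href found on the index page into an absolute URL."""
    return urljoin(base, href)

# Hypothetical href values covering the common cases
root_relative = chapter_url("/book/sanguoyanyi/1.html")
page_relative = chapter_url("sanguoyanyi/1.html")
absolute = chapter_url("http://example.com/x.html")

print(root_relative)
print(page_relative)
print(absolute)
```

Inside work(), the concatenation line could then become url_title = chapter_url(i['href']).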


Original: https://www.cnblogs.com/zx125/p/11405594.html
