MOOC: Python Web Information Retrieval, Week 2

Date: 2019-10-25 14:38:58

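I. The BeautifulSoup library

Also called bs4 or BeautifulSoup4, it is a library for parsing, traversing, and maintaining a tag tree.

Function: it parses HTML and XML documents into a tree and extracts information from them.

Installation (in cmd): pip install beautifulsoup4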

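A. Parsing HTML code (two basic steps: import, then parse):

1. Import: from bs4 import BeautifulSoup

2. Fetch and parse: l = requests.get("http://www.baidu.com"), then x = BeautifulSoup(l.text, "html.parser"). The first argument is the HTML text to parse, and "html.parser" is the parser.

3. Alternatively, open a local file: x = BeautifulSoup(open("local path"), "html.parser")

4. Finally, print the parsed bs4 object in indented form: print(x.prettify())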
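The steps above can be tried without network access. The following is a minimal sketch that parses an inline HTML string instead of a downloaded page; the markup and the names `html_doc`, `soup`, and `pretty` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page (the r.text of a response).
html_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, bs4</p></body></html>"
)

# Hand the markup and a parser name to BeautifulSoup.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the tree as an indented string, one node per line.
pretty = soup.prettify()
print(pretty)
```

Replacing `html_doc` with the text of a real response should give the same kind of object, assuming the request succeeds.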

B. Parsers

1. bs4's HTML parser: BeautifulSoup(m, "html.parser"). Requires only bs4.

2. lxml's HTML parser: BeautifulSoup(m, "lxml"). Install with pip install lxml.

3. lxml's XML parser: BeautifulSoup(m, "xml"). Install with pip install lxml.

4. html5lib parser: BeautifulSoup(m, "html5lib"). Install with pip install html5lib.
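Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below shows one possible way to do that; `make_soup` is a hypothetical helper, not part of bs4, and it relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, or with html.parser if it is unavailable."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the one that ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this is only reasonable when the input is well-formed or the differences do not matter.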

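C. Basic elements

1. Tag: a tag, opened with <> and closed with </>.

Get a tag (the first one of that name in the document): s.title s.a s.p s.body s.style

2. Name: the tag's name, e.g. 'p'. Format: <tag>.name

Get tag names: s.a.name s.a.parent.name s.a.parent.parent.name

3. Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs

Get attributes: s.a.attrs['class'] s.a.attrs['href']

4. NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string

s.p.string

5. Comment: the comment part of a string inside a tag. Checking the type of .string shows whether it is a comment:

type(s.b.string)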
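The five element types can be seen side by side on a small document. This is a sketch only: the markup, the example.com URL, and the attribute values are invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Invented markup: one <b> holding only a comment, one <a> with attributes and text.
html_doc = (
    "<html><head><title>Demo</title></head><body>"
    '<p class="title"><b><!--This is a comment--></b></p>'
    '<p class="course">'
    '<a href="http://example.com/python" class="py1" id="link1">Basic Python</a>'
    "</p></body></html>"
)
s = BeautifulSoup(html_doc, "html.parser")

tag = s.a                       # Tag: the first <a> in the document
print(tag.name)                 # Name of the tag
print(tag.parent.name)          # Name of the enclosing tag
print(tag.attrs)                # Attributes as a dictionary
print(tag.attrs["href"])        # A single attribute value
print(type(tag.string))         # NavigableString: the text inside the tag
print(type(s.b.string))         # Comment: the comment inside <b>
```

Note that `class` can hold several values in HTML, so bs4 returns it as a list rather than a plain string.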

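D. Three ways to traverse the tag tree
1. Downward traversal

.contents: list of child nodes (all children stored in a list)

.children: iterator over the child nodes; like .contents, used to loop over the children

.descendants: iterator over all descendant nodes, used in loops

2. Upward traversal

.parent: the node's parent tag

.parents: iterator over the node's ancestors, used in loops

3. Sibling traversal

.next_sibling: the next sibling node in HTML text order

.previous_sibling: the previous sibling node in HTML text order

.next_siblings: iterator over all following sibling nodes in HTML text order

.previous_siblings: iterator over all preceding sibling nodes in HTML text order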
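A short sketch of the three traversal directions on a tiny, made-up document follows. The markup is written without whitespace between tags so that every sibling is a tag; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up markup with no whitespace between the tags.
html_doc = '<body><p id="first">one</p><p id="second">two</p><p id="third">three</p></body>'
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body
second = soup.find(id="second")

# Downward traversal
n_contents = len(body.contents)                      # direct children, as a list
child_ids = [child["id"] for child in body.children]
n_descendants = len(list(body.descendants))          # the <p> tags plus their text nodes

# Upward traversal
parent_name = second.parent.name
ancestor_names = [p.name for p in second.parents]    # ends at the document object

# Sibling traversal
next_id = second.next_sibling["id"]
prev_id = second.previous_sibling["id"]
later_ids = [sib["id"] for sib in second.next_siblings]
earlier_ids = [sib["id"] for sib in second.previous_siblings]

print(child_ids, ancestor_names, next_id, prev_id)
```

In real pages the siblings and children of a tag often include whitespace strings, so loops over them usually need a type check before using tag-only features such as `["id"]`.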

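II. Organizing and extracting information

Benefits of information markup: marked-up information forms an organizational structure, which adds a dimension to the information; it can be used for communication, storage, and display; the markup structure is as valuable as the information itself; and it makes the information easier for programs to understand and use.

HTML (HyperText Markup Language) is the way information is organized on the WWW; it brings together sound, images, video, and text.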

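A. Basic types of information markup in international use

XML (eXtensible Markup Language): tag-based; the earliest general-purpose markup language; highly extensible but verbose. Used for exchanging and transmitting information over the Internet.

JSON: typed key-value pairs; because values carry types, it suits processing by programs (JavaScript in particular) and is more concise than XML. Used for program interfaces, such as communication between mobile apps, the cloud, and nodes; it has no comments.

YAML: untyped key-value pairs; it has the highest proportion of actual text content and reads well. Used for the configuration files of all kinds of systems.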
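To make the comparison concrete, the sketch below writes one made-up record in all three formats and parses the XML and JSON versions with the Python standard library. The record and its field names are invented; the YAML text is shown only for comparison, since parsing it would need a third-party library.

```python
import json
import xml.etree.ElementTree as ET

# The same invented record in three markup formats.
xml_text = "<person><name>Alice</name><age>30</age></person>"
json_text = '{"name": "Alice", "age": 30}'
yaml_text = "name: Alice\nage: 30\n"  # not parsed here

# XML: tags carry the structure; element text comes back as strings.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: values are typed, so the age comes back as an integer.
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
print(len(xml_text), len(json_text), len(yaml_text))  # relative verbosity
```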

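B. General methods of information extraction

Method 1: fully parse the markup, then extract the key information. This needs a markup parser, e.g. bs4's tag-tree traversal.

Pros: the information is parsed accurately. Cons: extraction is tedious and slow.

Method 2: ignore the markup and search directly for the key information, using text search functions on the raw text.

Pros: extraction is simple and fast. Cons: the accuracy of the result depends on the content of the information.

Method 3: combine the two, which needs both a markup parser and text search functions.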
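The trade-off between the three methods can be sketched on a small, made-up snippet in which one link is commented out. The markup, URLs, and variable names are assumptions for illustration; the regular expression is a deliberately naive text search.

```python
import re
from bs4 import BeautifulSoup

# Invented markup: two live links and one link inside an HTML comment.
html_doc = (
    '<p><a href="http://example.com/a">A</a></p>'
    '<!-- <a href="http://example.com/old">old</a> -->'
    '<p><a href="http://example.com/b">B</a></p>'
)

# Method 1: parse the markup fully, then walk the tag tree.
soup = BeautifulSoup(html_doc, "html.parser")
parsed_links = [a.get("href") for a in soup.find_all("a")]

# Method 2: ignore the markup and search the raw text.
searched_links = re.findall(r'href="([^"]+)"', html_doc)

# Method 3: let the parser find the tags, and a text pattern filter them.
combined_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"/b$"))]

print(parsed_links)
print(searched_links)
print(combined_links)
```

The text search also picks up the commented-out link, which is the kind of inaccuracy the second method risks; the parser-based methods skip it because the comment is not part of the tag tree.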

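C. HTML search methods in bs4

1. <>.find_all(name, attrs, recursive, string, **kwargs): searches the descendants and returns all matches as a list

2. <>.find(): returns only the first match, as a single node rather than a list (None if nothing matches)

3. <>.find_parents(): searches the ancestor nodes and returns a list

4. <>.find_parent(): returns a single result from the ancestor nodes, as a single node

5. <>.find_next_siblings(): returns multiple results from the following sibling nodes, as a list

6. <>.find_next_sibling(): returns a single result from the following sibling nodes, as a single node

7. <>.find_previous_siblings(): returns multiple results from the preceding sibling nodes, as a list

8. <>.find_previous_sibling(): returns a single result from the preceding sibling nodes, as a single node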
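The search methods listed above can be exercised on a small invented document. This is a sketch under the assumption of the markup shown; the ids, classes, and variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Invented markup: a <div> holding a paragraph, two links, and a <b>.
html_doc = (
    '<div id="box">'
    '<p class="intro">Hello</p>'
    '<a href="/one" id="link1">One</a>'
    '<a href="/two" id="link2">Two</a>'
    "<b>bold</b>"
    "</div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# find_all with the different kinds of filter
all_links = soup.find_all("a")                         # by tag name
links_and_bold = soup.find_all(["a", "b"])             # by a list of names
by_id_pattern = soup.find_all(id=re.compile("link"))   # keyword argument with a regex
by_class = soup.find_all("p", "intro")                 # a string for attrs matches the class
by_string = soup.find_all(string="One")                # match on the text itself

# The single-result and directional variants, starting from the first link
first = soup.find("a")
box = first.find_parent("div")
div_ancestors = first.find_parents("div")
next_link = first.find_next_sibling("a")
following = first.find_next_siblings()
previous = first.find_previous_sibling("p")
preceding = first.find_previous_siblings()

print(first["href"], box["id"], next_link["id"], previous["class"])
```

Calling a tag or the soup object directly, as in `soup("a")`, is shorthand for `soup.find_all("a")`.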

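D. Example: extracting the links on the Baidu home page

>>> import requests
>>> from bs4 import BeautifulSoup
>>> l=requests.get("http://www.baidu.com")
>>> ll=l.text
>>> lll=BeautifulSoup(ll,"html.parser")
>>> llll=lll.prettify()

>>> for link in lll.find_all('a'):
...     print(link.get('href'))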

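III. Example

Program structure:

1. Fetch the page content
2. Extract the information from the content into a suitable data structure
3. Use the data structure to display and output the results

Code: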

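#CrawUnivRankingA.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    """Fetch the page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    """Append [rank, name, total score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page, or no table body found
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between rows
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    """Print the first num entries as a tab-separated table."""
    print("{:^10}\t{:^6}\t{:^10}".format("Rank", "University", "Total score"))
    for u in ulist[:num]:
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs

if __name__ == "__main__":
    main()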
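The extraction step can be checked offline against a small table. The sketch below is not the ranking page itself: the table content is invented, the four-column layout (rank, name, region, score) is an assumption, and `extract_rows` is a hypothetical stand-in for the list-filling function above.

```python
import bs4
from bs4 import BeautifulSoup

# Invented table, not real ranking data. Assumed columns: rank, name, region, score.
sample_html = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>Region X</td><td>95.0</td></tr>
    <tr><td>2</td><td>University B</td><td>Region Y</td><td>90.5</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return [rank, name, score] for every row of the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("tbody").children:
        # The children include the whitespace between rows, so keep only tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr("td")
            rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows


rows = extract_rows(sample_html)
for rank, name, score in rows:
    print("{:^10}\t{:^6}\t{:^10}".format(rank, name, score))
```

The `isinstance` check matters here because the newlines and indentation between the `<tr>` elements appear as string children of `<tbody>`.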

Source: https://www.cnblogs.com/hongyuz/p/11737851.html
