007 Python Web Crawlers and Information Extraction: Chinese University Ranking Crawler

Date: 2020-11-19 22:04:32

[A] Introduction to the targeted Chinese university ranking crawler

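　　Functional description

　　　　Input: the URL of the university ranking page

　　　　Output: the ranking information printed to the screen (rank, university name, total score)

　　　　Technical approach: requests, bs4

　　　　Targeted crawler: it crawls only the input URL and does not follow links to other pages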

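　　Program structure:

　　　　Step 1: fetch the content of the university ranking page from the network

　　　　　　　　Function: getHTMLText()

　　　　Step 2: extract the information from the page content into a suitable data structure

　　　　　　　　Function: fillUnivList()

　　　　Step 3: use the data structure to format and print the results

　　　　　　　　Function: printUnivList()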
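Step 2 can be tried offline before any network request is made. The following is a minimal sketch, assuming the ranking table stores each row as a `<tr>` of `<td>` cells inside a `<tbody>`. The sample markup, its placeholder values, and the helper name `parse_rows` are hypothetical and are not taken from the real page.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical sample markup with placeholder values, standing in for the real page
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>90.0</td></tr>
    <tr><td>2</td><td>University B</td><td>80.0</td></tr>
  </tbody>
</table>
"""


def parse_rows(html):
    """Return one [rank, name, score] list per <tr> inside <tbody>."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('tbody').children:
        # .children also yields the whitespace between tags as
        # NavigableString objects, so keep only real Tag nodes
        if isinstance(tr, bs4.element.Tag):
            rows.append([td.get_text(strip=True) for td in tr('td')])
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

The result is a list of lists, one inner list per table row, which is the shape that step 3 indexes by row and column.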

[B] Writing the targeted Chinese university ranking crawler

　　　　Three functions are defined, which 1. fetch, 2. store, and 3. display the crawled results

import requests
from bs4 import BeautifulSoup
import bs4

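# Chinese university ranking

# 1. Fetch the HTML at url and return it; return '' if the request fails
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # guess the encoding from the body so Chinese text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''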


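# 2. Parse the HTML and return the table rows as a list of lists
def fillUnivList(html):
    ulist = []
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or no table body found
        return ulist
    for tr in tbody.children:
        # .children also yields whitespace strings; keep only tag nodes
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            # get_text() always returns a str, unlike .string, which is None
            # when a cell holds nested markup
            ulist.append([item.get_text(strip=True) for item in tds])
    return ulist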


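# 3. Print rows start..end (inclusive, 0-based) as rank, name, and score
def printUnivList(ulist, start, end):
    print('{:^10}{:^12}{:^15}'.format('Rank', 'University', 'Score'))
    # clamp the upper bound so a short list does not raise IndexError
    for i in range(start, min(end + 1, len(ulist))):
        print('{:^10}{:^12}{:^15}'.format(ulist[i][0], ulist[i][1], ulist[i][2]))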


# Main program
def main():
    url = 'http://www.gaosan.com/gaokao/299262.html'
    html = getHTMLText(url)
    uinfor = fillUnivList(html)
    printUnivList(uinfor, 5, 30)


# Run the main program
if __name__ == '__main__':
    main()


[C] Optimizing the targeted Chinese university ranking crawler
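The source gives no content for this section. One optimization commonly applied to this kind of output, offered here as an assumption rather than the author's stated method, is to pad the university-name column with the full-width space `chr(12288)` instead of the ASCII space. Chinese characters are rendered wider than ASCII spaces, so default padding leaves the columns misaligned. The helper name `format_row` and the sample values below are hypothetical.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character


def format_row(rank, name, score, fill=FULL_WIDTH_SPACE):
    # {3} supplies the fill character for the name column only;
    # the rank and score columns keep the default ASCII-space padding
    return '{0:^10}{1:{3}^12}{2:^10}'.format(rank, name, score, fill)


print(format_row('排名', '学校名称', '分数'))
# '示例大学' ("Example University") is a placeholder, not real ranking data
print(format_row('1', '示例大学', '0.0'))
```

Only the name column needs the full-width fill, because it is the only column whose values are expected to be Chinese text in every row.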

Original: https://www.cnblogs.com/carreyBlog/p/14008062.html