Python Advanced Application Programming Assignment

Posted: 2019-12-20 02:13:14

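I. Topic-Focused Web Crawler Design (15 points)

1. Name of the crawler

1.1 Crawling the NetEase News website

2. Crawled content and data-feature analysis

2.1 Crawled content

Title, link, date, click count, source, content

2.2 Data-feature analysis

2.2.1 Plot the click counts as a line chart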

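3. Overview of the crawler design (implementation approach and technical difficulties)

3.1 Implementation approach

Define a News class to hold each article and a master() function as the crawl entry point. Before crawling, a pathlib.Path method checks whether the Excel file already exists; if it does, the file is read directly for data analysis. Site content is fetched with requests and parsed with BeautifulSoup, as illustrated below.

[Image]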
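The check-then-crawl flow described above can be sketched as follows. This is only a minimal sketch under assumptions, not the author's code: `load_or_crawl` and its `crawl` and `load` callables are hypothetical names introduced here, standing in for master() and the pandas read step.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_crawl(path: Path,
                  crawl: Callable[[Path], None],
                  load: Callable[[Path], T]) -> T:
    """Run the crawl only if the data file is missing, then load it.

    `crawl` and `load` are hypothetical callables: the first is expected
    to write the file at `path`, the second to read it back.
    """
    if not path.exists():
        crawl(path)
    return load(path)
```

With this shape, a second run skips the crawl (which takes several minutes) and goes straight to the analysis.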

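3.2 Technical difficulties

No obstacles came up during crawling: no request headers had to be set, and no request was redirected to a login page (the whole crawl takes 5-6 minutes).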

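II. Structural Analysis of the Target Pages (15 points)

1. Structural features of the target pages

The crawler targets this week's click ranking. Because the page width is limited, the ranking page truncates each article title, so the crawler follows each link to read the full title from the article page and scrapes the news content there as well.

[Image]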

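2. HTML page parsing

<div class="tabContents active"> ranking table container; each <tr> inside it is one news item

[Image]

Within each row, the title link sits in a <td> with class "red", "gray", or "rank", and the click count in <td class="cBlue">. On the article page, the title is the <h1> inside <div class="post_content_main">, the time and source are in <div class="post_time_source">, and the body is in <div class="post_text">.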

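3. Node (tag) lookup and traversal methods (draw the node tree if necessary)

Nodes are located with BeautifulSoup's element search, using built-in methods such as find and find_all to extract the required data. The search goes from the whole (the table container) to the parts (tr): first determine which HTML node contains the target data, then collect all the rows under it, each of which is one news item, iterate over them with a for loop, and parse the data of each item, as illustrated below.

[Image]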
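The whole-to-part traversal can be tried on a small inline snippet. The markup below is a made-up stand-in that only mirrors the class names the crawler looks for (tabContents active, red, gray, rank, cBlue); it is not a copy of the real page, and `parse_rank_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the structure the crawler expects
SAMPLE_HTML = """
<div class="tabContents active">
  <table>
    <tr><th>Title</th><th>Clicks</th></tr>
    <tr><td class="red"><a href="http://example.com/a.html">First</a></td><td class="cBlue">120</td></tr>
    <tr><td class="gray"><a href="http://example.com/b.html">Second</a></td><td class="cBlue">85</td></tr>
  </table>
</div>
"""


def parse_rank_rows(html):
    """Return (title, href, clicks_text) for every row that holds a news link."""
    soup = BeautifulSoup(html, "html.parser")
    # Whole: the container that holds the ranking table
    container = soup.find("div", class_="tabContents active")
    rows = []
    # Parts: one <tr> per news item
    for tr in container.find_all("tr"):
        cell = None
        for name in ("red", "gray", "rank"):
            cell = tr.find("td", class_=name)
            if cell is not None:
                break
        if cell is None:
            continue  # header rows have no title cell
        link = cell.find("a", href=True)
        if link is None:
            continue
        clicks = tr.find("td", class_="cBlue")
        clicks_text = clicks.get_text(strip=True) if clicks else ""
        rows.append((link.get_text(strip=True), link["href"], clicks_text))
    return rows
```

The header row is skipped because it has no title cell, which is the same filtering the crawler applies with its nested `if not exist` checks.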

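III. Crawler Program Design (60 points)

1. The main body of the crawler program should include the following parts, with source code and reasonably detailed comments, and a screenshot of the output after each part.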

# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def show():
    # Read the saved spreadsheet and plot the click counts as a line chart
    df = pd.read_excel("网易新闻数据.xlsx")
    click = df["点击数"]
    click.plot(kind="line")
    plt.show()



def write_to_excel(news):
    book_name = "网易新闻数据.xlsx"
    sheet_name = "网易新闻"
    columns = ["标题", "链接", "日期", "点击数", "来源", "内容"]
    # Initialize the workbook
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    # Set the sheet name
    sheet.title = sheet_name
    index = 1
    # Write the header row
    for i in range(0, len(columns)):
        sheet.cell(index, i + 1, columns[i])

    for key, value in news.items():
        index += 1
        sheet.cell(index, 1, value.title)
        sheet.cell(index, 2, value.href)
        sheet.cell(index, 3, value.time)
        sheet.cell(index, 4, value.click)
        sheet.cell(index, 5, value.source)
        sheet.cell(index, 6, value.content)

    workbook.save(book_name)


class News:
    def __init__(self, href, click):
        self.href = href
        self.click = click
        self.title = ""
        self.source = ""
        self.time = ""
        self.content = ""

    def set_source(self, source):
        self.source = source

    def set_time(self, time):
        self.time = time

    def set_content(self, content):
        self.content = content

    def set_title(self, title):
        self.title = title

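    # Fetch the article page and fill in title, time, source, and content
    def get_content(self):
        resp = requests.get(self.href)
        if resp.status_code != 200:
            print("Failed to fetch article page, link: " + self.href)
            return

        # Use the charset declared in the Content-Type header, if there is one
        ct = resp.headers.get("Content-Type", "")
        encoding = ct.split("charset=")[1].lower() if "charset=" in ct else None
        bs4 = BeautifulSoup(resp.content, features="html.parser", from_encoding=encoding)
        content = bs4.find("div", "post_content_main")
        title = content.find("h1").get_text()
        self.set_title(title)
        sourceDom = content.find("div", class_="post_time_source")
        # stripped_strings is a generator over the non-empty text pieces
        sg = sourceDom.stripped_strings
        # First piece: the timestamp followed by the "来源:" (source) label
        tg = next(sg)
        self.set_time(tg.split("　来源:")[0])
        # Second piece: the source name
        sd = next(sg)
        self.set_source(sd)
        pt = bs4.find("div", class_="post_text").get_text()
        self.set_content(pt)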


# ====================== Main flow =========================
def master():
    baseUrl = "http://news.163.com/special/0001386F/rank_sports.html"
    cntDom = "tabContents active"

    response = requests.get(baseUrl)
    # Check the status code before touching the headers or the body
    if response.status_code != 200:
        print("response code err", response.status_code)
        raise SystemExit(1)

    # Read the page encoding from the header and pass it to BeautifulSoup
    ct = response.headers.get("Content-Type", "")
    encoding = ct.split("charset=")[1].lower() if "charset=" in ct else None

    body = response.content
    bs4 = BeautifulSoup(body, features="html.parser", from_encoding=encoding)
    cnt = bs4.find("div", cntDom).find_all("tr")

    news_array = {}
    count = 0
    for item in cnt:
        count += 1
        exist = item.find("td", class_="red")
        # No td with class "red": try class "gray"
        if not exist:
            exist = item.find("td", class_="gray")
            # No td with class "gray" either: try class "rank"
            if not exist:
                exist = item.find("td", class_="rank")
                if not exist:
                    continue

        aDom = exist.find("a", href=True)
        if not aDom:
            continue
        # Title as shown on the ranking page (may be truncated)
        title = aDom.get_text()
        # Link to the article page
        href = aDom["href"]
        # Click count, stored as an int so that it can be plotted later
        clickDom = item.find("td", class_="cBlue")
        text = clickDom.get_text(strip=True) if clickDom else ""
        click = int(text) if text.isdigit() else 0

        new = News(href, click)
        print("Crawling %s (item %d)" % (title, count))
        new.get_content()
        news_array[new.title] = new

    # Write to the Excel file
    write_to_excel(news_array)


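# Crawl only when the Excel file is missing, then read and analyze it


if __name__ == "__main__":
    from pathlib import Path

    if not Path("网易新闻数据.xlsx").exists():
        master()
    show()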

2. Data cleaning

2.1 Read the data from the Excel file

df = pd.read_excel("网易新闻数据.xlsx")

2.2 Iterate over all the data

resp = requests.get(self.href)
if resp.status_code != 200:
    print("Failed to fetch article page, link: " + self.href)
    return

# Use the charset declared in the Content-Type header, if there is one
ct = resp.headers.get("Content-Type", "")
encoding = ct.split("charset=")[1].lower() if "charset=" in ct else None
bs4 = BeautifulSoup(resp.content, features="html.parser", from_encoding=encoding)
content = bs4.find("div", "post_content_main")
title = content.find("h1").get_text()
self.set_title(title)
sourceDom = content.find("div", class_="post_time_source")
# stripped_strings is a generator over the non-empty text pieces
sg = sourceDom.stripped_strings
tg = next(sg)
self.set_time(tg.split("　来源:")[0])
sd = next(sg)
self.set_source(sd)
pt = bs4.find("div", class_="post_text").get_text()
self.set_content(pt)
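Splitting the Content-Type header on "charset=" by hand fails when the server sends no charset parameter. As an alternative, the parameter can be parsed with the standard library. This is a sketch under assumptions: `charset_from_content_type` is a hypothetical helper, not part of the original program, and returning None is meant to let BeautifulSoup detect the encoding on its own.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Extract the charset parameter from a Content-Type header value.

    Returns `default` when the header carries no charset, so the caller
    can leave encoding detection to the HTML parser instead.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    return charset if charset is not None else default
```

The result could then be passed as `from_encoding` when building the BeautifulSoup object.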

2.4 Data cleaning

count += 1
exist = item.find("td", class_="red")
# No td with class "red": try class "gray"
if not exist:
    exist = item.find("td", class_="gray")
    # No td with class "gray" either: try class "rank"
    if not exist:
        exist = item.find("td", class_="rank")
        if not exist:
            continue
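Besides skipping rows that have no title cell, the click count also has to be numeric before it can be plotted. The sketch below shows one way to normalize the scraped text. `parse_click_count` is a hypothetical helper, and it assumes the page shows plain integers, possibly with thousands separators; it would not handle abbreviated values such as "1.5万".

```python
import re


def parse_click_count(text):
    """Convert scraped click text such as ' 12,345 ' to an int (0 if no digits)."""
    # Drop everything that is not a digit: spaces, commas, stray markup text
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else 0
```

Storing integers in the "点击数" column means the later `df["点击数"].plot(kind="line")` call receives numeric data rather than strings.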

2.5 Draw a line chart

df = pd.read_excel("网易新闻数据.xlsx")
click = df["点击数"]
click.plot(kind="line")
plt.show()

3. Text analysis (optional): jieba word segmentation, wordcloud visualization

4. Data analysis and visualization

4.1 Plot the click counts as a line chart

df = pd.read_excel("网易新闻数据.xlsx")
click = df["点击数"]
click.plot(kind="line")
plt.show()

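The resulting chart is shown below:

[Image]

5. Data persistence

The data is written to an Excel file (网易新闻数据.xlsx) by write_to_excel().

[Image]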
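If a CSV copy of the data is wanted, the standard csv module is enough. The sketch below is an assumption-laden illustration rather than part of the original program: `write_to_csv` is a hypothetical function, and rows are assumed to be dicts keyed by the same six column names used for the spreadsheet.

```python
import csv

# Same column names as the Excel header row
COLUMNS = ["标题", "链接", "日期", "点击数", "来源", "内容"]


def write_to_csv(rows, path):
    """Write rows (dicts keyed by COLUMNS) to a CSV file with a header row."""
    # utf-8-sig adds a BOM so spreadsheet tools detect the encoding;
    # newline="" lets the csv module handle line breaks inside fields
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Article bodies contain line breaks, which is why the file is opened with `newline=""` and the quoting is left to the csv module.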

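IV. Conclusions (10 points)
1. What conclusions can be drawn from analyzing and visualizing the data?

1.1 Readers prefer the top-ranked posts

1.2 Most posts are published by NetEase Sports itself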
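The second conclusion can be checked by counting the values in the source column. A minimal sketch, assuming the sources are available as a list of strings; `top_source_share` is a hypothetical helper name.

```python
from collections import Counter


def top_source_share(sources):
    """Return (most common source, its share of non-empty entries).

    Returns (None, 0.0) when there are no non-empty entries.
    """
    counts = Counter(s for s in sources if s)
    if not counts:
        return None, 0.0
    name, n = counts.most_common(1)[0]
    return name, n / sum(counts.values())
```

With the spreadsheet loaded into a DataFrame, the input could be something like `df["来源"].dropna()`.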

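2. A brief summary of how this programming assignment went.

In this assignment I applied the web-crawling and data-analysis techniques I had learned to analyze NetEase News, which lays a foundation for my future work. It also prompted me to think about next steps, such as more diverse data analysis and improving the depth and speed of the crawl.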


Original: https://www.cnblogs.com/jinxiating/p/12070628.html
