To set up the Python environment, refer to the Runoob tutorial:
Link: https://www.runoob.com/w3cnote/python-pip-install-usage.html
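The two scripts below rely on the third-party packages requests, lxml, and xlwt. Assuming pip is already available on your PATH, they can typically be installed with:

pip install requests lxml xlwt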
1. Scrape web page data and print it
import requests
from lxml import etree

# Fetch the page source
html = requests.get("https://www.ghpym.com/category/videos")
# Print the raw source if you want to inspect it
# print(html.text)
etree_html = etree.HTML(html.text)  # parse the source into a form that XPath can match against

# XPath copied from the browser for a single item:
# //*[@id="wrap"]/div/div/div/ul/li[1]/div[2]/h2/a/text()
content = etree_html.xpath('//*[@id="wrap"]/div/div/div/ul/li/div[2]/h2/a/@href')
for each in content:
    replace = each.replace('\n', '').replace(' ', '')  # strip newlines and spaces
    if replace == '\n' or replace == "":
        continue
    else:
        print(replace)

content = etree_html.xpath('//*[@id="wrap"]/div/div/div/ul/li/div[2]/h2/a/text()')
for each in content:
    replace = each.replace('\n', '').replace(' ', '')
    if replace == '\n' or replace == "":
        continue
    else:
        print(replace)
print("Done")
2. Write the scraped data into an .xls spreadsheet
# coding:utf-8
from lxml import etree
import requests
import xlwt

title = []

def get_film_name(url):
    html = requests.get(url).text  # it usually helps to print html first to confirm the page returned content
    # print(html)
    s = etree.HTML(html)  # parse the source into a form that XPath can match against
    filename = s.xpath('//*[@id="wrap"]/div/div/div/ul/li/div[2]/h2/a/@href')  # returns a list
    print(filename)
    title.extend(filename)

def get_all_film_name():
    for i in range(0, 250, 25):
        # Note: this URL contains no '{}' placeholder, so .format(i) returns the same page on
        # every iteration; substitute the site's real pagination pattern to crawl more pages.
        url = 'https://www.ghpym.com/category/videos'.format(i)
        get_film_name(url)

if __name__ == '__main__':
    myxls = xlwt.Workbook()
    sheet1 = myxls.add_sheet(u'top250', cell_overwrite_ok=True)
    get_all_film_name()
    for i in range(0, len(title)):
        sheet1.write(i, 0, i + 1)     # column 0: running index starting at 1
        sheet1.write(i, 1, title[i])  # column 1: scraped link
    myxls.save('top250.xls')
    print("Done")
Simple crawler operations: 1. scrape web page data and print it; 2. write the scraped data into an .xls spreadsheet.
Original post: https://www.cnblogs.com/jessezs/p/12584505.html