[Python] Web Scraping: Locating Elements with XPath Using etree

Posted: 2019-12-03 00:14:52

Operating system: macOS Mojave

Python version: Python 3.7

Dependencies: requests, lxml (which provides the etree module)

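For installing the dependencies, I recommend the Anaconda + PyCharm combination. Each library usually depends on other packages, and Anaconda resolves and installs those dependencies automatically so you do not have to track them down by hand, much like Maven manages third-party libraries for Java. PyCharm is one of the most common Python IDEs. To make PyCharm use the packages installed by Anaconda, configure it as follows: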

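0-0. Open PyCharm > Preferences to enter the settings.

[Screenshot: PyCharm Preferences]

0-1. Select the Python executable from the Anaconda installation.

[Screenshot: selecting the Anaconda Python executable]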
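
Once the interpreter is selected, a quick check like the following can confirm that it actually sees the two libraries used in this post. This is only a minimal sketch: the helper name `missing_packages` is made up for illustration, and it relies solely on the standard library, so it runs even when the packages are absent.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that the current interpreter cannot import."""
    # find_spec returns None when a top-level package cannot be located
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # requests and lxml are the two libraries this post relies on
    missing = missing_packages(["requests", "lxml"])
    print("Missing packages:", missing if missing else "none")
```

If the list is not empty, the selected interpreter is probably not the Anaconda one, or the packages have not been installed into that environment yet.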

1. Fetching and converting the page source

import requests
from lxml import etree

r = requests.get("http://www.baidu.com")
# print("Status code:", r.status_code)
# print("Page source:", r.text)
# print("Response headers:", r.headers)

html = etree.HTML(r.text)  # parse the HTML text; returns the root element of the tree
etreeResult = etree.tostring(html)  # serialize the tree; the result is bytes
strResult = etreeResult.decode('utf-8')  # decode the bytes as UTF-8; the result is now a str
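
To see what each step returns without making a network request, the same calls can be tried on a small inline string. The sketch below assumes a made-up fragment (`sample`) standing in for `r.text`; it is deliberately incomplete markup, to show that the HTML parser repairs it.

```python
from lxml import etree

# Hypothetical fragment standing in for r.text: no <html>/<body>, and <div> is never closed
sample = "<div><p>hello</p>"

root = etree.HTML(sample)        # the parser repairs the markup and returns the root <html> element
raw = etree.tostring(root)       # serialized tree as bytes
text = raw.decode("utf-8")       # decoded to str

# Alternative: ask tostring for a str directly instead of decoding afterwards
as_str = etree.tostring(root, encoding="unicode")

print(root.tag)
print(type(raw).__name__, type(text).__name__, type(as_str).__name__)
print(text)
```

Printing `text` is a handy way to inspect the structure that lxml actually built, which can differ from the raw page source when the original markup is malformed.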

2. Getting nodes, attribute values, and content

The syntax is as follows:

[Image: XPath syntax table]
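
As a rough stand-in for a syntax reference, the sketch below exercises the most commonly used XPath expressions against a small made-up document. The markup and variable names are assumptions for illustration only, and the selection may not match the original table exactly.

```python
from lxml import etree

# Hypothetical document used only for illustration
doc = etree.HTML("""
<div id="box">
  <a href="/one">first</a>
  <p><a href="/two">second</a></p>
</div>
""")

from_root = doc.xpath("/html/body/div")          # /   : step from the root, one level at a time
div = doc.xpath("//div")[0]                      # //  : match at any depth
direct_links = div.xpath("./a")                  # ./  : children of the current node
all_links = div.xpath(".//a")                    # .// : descendants of the current node
parent_tag = all_links[1].xpath("..")[0].tag     # ..  : parent of the current node
hrefs = doc.xpath("//a/@href")                   # @   : attribute values
by_attr = doc.xpath('//div[@id="box"]')          # [@attr="value"] : filter by attribute

print(len(direct_links), len(all_links), parent_tag, hrefs)
```

Note that `xpath()` always returns a list, even when only one node matches, so single results are taken with `[0]`.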

Example code:

import requests
from lxml import etree

r = requests.get("http://www.baidu.com")
html = etree.HTML(r.text)  # parse the HTML text; returns the root element of the tree

resultAll = html.xpath('//*')  # select all element nodes
# print("All nodes:", resultAll)
resultDivAll = html.xpath('//div')  # select all div elements at any depth
# print("All div nodes:", resultDivAll)
resultDiv_img = html.xpath('//div/img')  # select img elements that are direct children of a div
# print("img nodes under div:", resultDiv_img)
resultDiv_imgSrc = html.xpath('//div/img/@src')  # get the src attribute values of those img elements
print("src values of img under div:", resultDiv_imgSrc)

Corresponding output:

[Screenshot: program output]
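
The example above covers nodes and attribute values; text content can be retrieved in the same way with `text()`. The following is a minimal sketch on a made-up page (the markup and class names are assumptions, not taken from the site used above), so the results are reproducible without a network request.

```python
from lxml import etree

# Hypothetical page used only for illustration
page = etree.HTML("""
<div class="item"><img src="a.png" alt="A"><span>Alpha</span></div>
<div class="item"><img src="b.png" alt="B"><span>Beta</span></div>
<div class="other"><span>Gamma</span></div>
""")

srcs = page.xpath("//div/img/@src")                       # attribute values
labels = page.xpath('//div[@class="item"]/span/text()')   # text content of matching spans
alt = page.xpath("//div/img")[0].get("alt")               # read an attribute from an element object

# Run relative queries on each matched div to keep related values together
pairs = [
    (div.xpath("./img/@src")[0], div.xpath("./span/text()")[0])
    for div in page.xpath('//div[@class="item"]')
]

print(srcs)
print(labels)
print(alt)
print(pairs)
```

Querying relative to each matched element, as in `pairs`, tends to be more robust than zipping two independent result lists, because a missing child in one block cannot shift the alignment of the others.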


Original: https://www.cnblogs.com/fightccc/p/10808590.html
