Python Web Scraping (13): The lxml Module

Posted: 2020-02-29 21:12:12

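lxml is an HTML/XML parsing library. Its main job is parsing HTML/XML documents and extracting data from them.

Like Python's re module, lxml is implemented in C, so it is fast. With XPath syntax we can quickly locate specific elements and node information. It is a third-party package, so install it with pip (pip install lxml).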
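As a minimal sketch of what locating elements with XPath looks like, the snippet below runs two queries against a small made-up list. The markup, the "job" class name and the link paths are assumptions chosen only for illustration.

```python
from lxml import etree

# Hypothetical sample markup, made up for this illustration.
html = """
<ul id="jobs">
  <li class="job"><a href="/job/1">Python Developer</a></li>
  <li class="job"><a href="/job/2">Data Engineer</a></li>
</ul>
"""

root = etree.HTML(html)

# text() selects the text node of each link inside a "job" item.
titles = root.xpath('//li[@class="job"]/a/text()')
# @href selects the attribute value instead of the element itself.
links = root.xpath('//li[@class="job"]/a/@href')

print(titles)  # ['Python Developer', 'Data Engineer']
print(links)   # ['/job/1', '/job/2']
```

xpath() always returns a list, so an expression that matches nothing yields an empty list rather than raising an error.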

Usage:

1. Parsing an HTML string

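from lxml import etree

text = """
# a piece of HTML code
"""

# etree.HTML() parses the string and returns the root <html> element
htmlElement = etree.HTML(text)
# tostring() returns bytes, so decode them before printing
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))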

This uses etree.HTML().

No parser needs to be specified: etree.HTML() uses an HTML parser by default.
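Because etree.HTML() goes through an HTML parser, it also repairs incomplete markup. The sketch below feeds it a made-up fragment (an assumption for illustration) that has no html or body wrapper and leaves its li tags unclosed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body>, and the <li> tags are never closed.
fragment = "<ul><li>first<li>second</ul>"

root = etree.HTML(fragment)

# The returned root is the <html> element the parser added.
print(root.tag)
# Serializing shows the added wrapper elements and the closed <li> tags.
print(etree.tostring(root, encoding="utf-8").decode("utf-8"))

items = root.xpath("//li/text()")
print(items)
```

This is why a string scraped from a real page can usually be passed to etree.HTML() as-is.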

2. Parsing a file that contains HTML

from lxml import etree

htmlElement = etree.parse("xxx.html")
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

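This uses etree.parse("xxx.html").

However, when no parser is passed, etree.parse() uses an XML parser, so it cannot handle malformed markup such as unclosed tags.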
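The limitation can be reproduced with a small sketch. The page content below is made up for illustration, and a temporary file stands in for a real saved page.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page: the <br> and <p> tags are never closed, which is
# common in HTML but is not well-formed XML.
broken_html = "<html><body><p>hello<br>world</body></html>"

with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as f:
    f.write(broken_html)
    path = f.name

try:
    etree.parse(path)  # no parser argument, so the XML parser is used
    parse_failed = False
except etree.XMLSyntaxError as exc:
    parse_failed = True
    print("XML parser rejected the file:", exc)
finally:
    os.remove(path)
```

The XML parser stops at the first mismatched tag instead of trying to recover, which is the behavior you want for strict XML but not for pages collected from the web.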

So create an HTML parser with one extra line, parser=etree.HTMLParser(encoding='utf-8'), and pass it to etree.parse():

from lxml import etree

# An HTML parser tolerates malformed markup such as unclosed tags
parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("lagou.html", parser=parser)

print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

Result:

[Image: screenshot of the printed HTML output]

Original post: https://www.cnblogs.com/zhaoxinhui/p/12386010.html
