
Web Scraping: Parsing with XPath


Today let's look at one way to parse scraped data: XPath. XPath is one of the most important and widely used parsing techniques.

1. Installation: pip install lxml
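
To quickly confirm the install worked, printing lxml's version string is enough (etree.__version__ is part of lxml's public API):

from lxml import etree
print(etree.__version__)  # if this prints a version number, lxml imported fine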

2. How it works

  1. Get the page's source HTML.

  2. Instantiate an etree object and load the page source into it.

  3. Call that object's xpath method to locate the target tags.

  4. Note: the xpath method only works together with an XPath expression, which does the tag locating and content capturing (see the minimal sketch below).
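
Here is a minimal, self-contained sketch of those four steps, run against a hard-coded HTML snippet instead of a live request (the tag names and classes are made up for illustration):

from lxml import etree

# step 1: the page source (hard-coded here; normally requests.get(...).text)
page_text = '<html><body><ul class="books"><li><a href="/b1">Book 1</a></li><li><a href="/b2">Book 2</a></li></ul></body></html>'
# step 2: instantiate an etree object and load the source into it
tree = etree.HTML(page_text)
# steps 3 and 4: call xpath with an XPath expression to locate tags and capture content
titles = tree.xpath('//ul[@class="books"]/li/a/text()')  # ['Book 1', 'Book 2']
hrefs = tree.xpath('//ul[@class="books"]/li/a/@href')    # ['/b1', '/b2']
print(titles, hrefs)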

If that still sounds abstract, straight to the examples!!!!

1. Parsing second-hand housing data from 58.com

 

# import requests
import requests
# import etree from lxml
from lxml import etree
# target URL
url = 'https://bj.58.com/ershoufang/sub/l16/s2242/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.bdpcpz_bt&PGTID=0d30000c-0000-1139-b00c-643d0d315a04&ClickID=1'
# spoofed request headers, to prove we are a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# fetch the whole current page
page_text = requests.get(url, headers=headers).text
# load the page source into an etree object before parsing
tree = etree.HTML(page_text)
# //ul[@class="house-list-wrap"]/li is the XPath expression; // matches the ul at any depth in the document
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
# print(li_list)  # each item is an Element like <Element li at 0x202a8c62288>
# loop over the matches
for li in li_list:
    # parse again, relative to each li (./), to pull out the exact data!!!
    title = li.xpath('./div[2]/h2[1]/a/text()')[0]
    print(title)
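One pitfall in the loop above: xpath() always returns a list, and the list is empty when a listing lacks the expected tag, so the [0] index raises IndexError. A defensive variant (the first_or helper is my own addition, not part of lxml):

def first_or(results, default=''):
    # xpath() returns a list; take the first hit or fall back to a default
    return results[0] if results else default

# usage inside the loop above:
# title = first_or(li.xpath('./div[2]/h2[1]/a/text()'))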

 

2. Bonus, bonus!!!! Downloading image data from 彼岸图网 (pic.netbian.com)

import os
import requests
from lxml import etree
# note: this is the Python 3 import style!!!
import urllib.request

url = 'http://pic.netbian.com/4kmeinv/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# fire off the request, no questions asked
response = requests.get(url, headers=headers)

# if there is no imgs folder under the current directory, create it!!!
if not os.path.exists('./imgs'):
    os.mkdir('./imgs')

# the response body
page_text = response.text
# load the current page
tree = etree.HTML(page_text)
# XPath parsing
li_list = tree.xpath('//div[@class="slist"]/ul/li')
# loop to pull out the exact data
for li in li_list:
    img_name = li.xpath('./a/b/text()')[0]
    # fix garbled Chinese: requests decoded the page as ISO-8859-1, but it is really GBK
    img_name = img_name.encode('iso-8859-1').decode('gbk')
    # build the full image URL
    img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    # the image's file name
    img_path = './imgs/' + img_name + '.jpg'
    # let urllib write straight to disk instead of opening a file ourselves
    urllib.request.urlretrieve(url=img_url, filename=img_path)
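Why the encode('iso-8859-1').decode('gbk') trick works: requests guessed ISO-8859-1 from the response headers, while the page is actually GBK-encoded. Assuming the page really is GBK, an arguably cleaner fix is to declare the encoding once, before reading .text:

import requests

url = 'http://pic.netbian.com/4kmeinv/'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.encoding = 'gbk'   # declare the real page encoding up front
page_text = response.text   # decoded correctly; no per-string re-encoding needed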

3. Parsing all the city names (https://www.aqistudy.cn/historydata/)

import requests
from lxml import etree

url = 'https://www.aqistudy.cn/historydata/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

res = requests.get(url=url, headers=headers).text
tree = etree.HTML(res)
city_list = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //div[@class="bottom"]/ul/div[2]/li/a/text()')  # | is the XPath union operator: take nodes matched by either expression
city = ''.join(city_list)
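The | in that expression merges the node-sets matched by the two sub-expressions into one result list. A tiny self-contained demonstration on made-up HTML:

from lxml import etree

html = '<ul><li class="hot"><a>Beijing</a></li><li class="all"><a>Shanghai</a></li></ul>'
tree = etree.HTML(html)
# nodes matched by either side of | come back in one combined list
cities = tree.xpath('//li[@class="hot"]/a/text() | //li[@class="all"]/a/text()')
print(cities)  # ['Beijing', 'Shanghai']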

4. Scraping images from 煎蛋网 (jandan.net)

# jandan.net images
import requests
from lxml import etree
# base64 is used to decode the obfuscated image hashes
import base64
import os

if not os.path.exists('./jiandan'):
    os.mkdir('./jiandan')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = 'http://jandan.net/ooxx'

res1 = requests.get(url, headers=headers).text
tree = etree.HTML(res1)

span_list = tree.xpath('//span[@class="img-hash"]/text()')
for span_hash in span_list:
    # decode the Base64 hash (UTF-8) and prepend the scheme to get a full URL
    img_url = 'http:' + base64.b64decode(span_hash).decode('utf-8')
    # fetch the binary image data
    img_data = requests.get(url=img_url, headers=headers).content

    filepath = './jiandan/' + img_url.split('/')[-1]
    # save the bytes to disk ('wb' because .content is binary)
    with open(filepath, 'wb') as f:
        f.write(img_data)
    print(filepath, 'downloaded!')

print('over')
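jandan.net serves each image URL as a Base64 string inside the span, which is why b64decode recovers it. The round trip is plain standard-library Base64 (the URL below is a made-up example, not real jandan data):

import base64

encoded = base64.b64encode(b'//example.com/pic/001.jpg').decode('utf-8')
print(encoded)                                              # 'Ly9l...' (what the page would carry)
print('http:' + base64.b64decode(encoded).decode('utf-8'))  # http://example.com/pic/001.jpg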

5. Scraping resume templates

import requests
from lxml import etree
import random
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

if not os.path.exists('./jianli'):
    os.mkdir('./jianli')
# scrape the first 3 pages
for i in range(1, 4):
    if i == 1:
        # the first page has its own URL
        url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        # pages after the first follow the free_<n>.html pattern
        url = 'http://sc.chinaz.com/jianli/free_%s.html' % i

    response = requests.get(url=url, headers=headers)
    # fix the encoding, otherwise names come out like: æ±?è??ç?µå­?ç??ç®?å??å??è´¹ä¸?è½½
    response.encoding = 'utf-8'

    res = response.text

    tree = etree.HTML(res)

    a_list = tree.xpath('//a[@class="title_wl"]')
    for a in a_list:
        name = a.xpath('./text()')[0]
        jl_url = a.xpath('./@href')[0]

        response = requests.get(url=jl_url, headers=headers)
        response.encoding = 'utf-8'
        res1 = response.text
        tree = etree.HTML(res1)
        download_url_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        download_url = random.choice(download_url_list)

        res3 = requests.get(url=download_url, headers=headers).content

        filepath = './jianli/' + name + '.rar'
        # res3 is bytes (.content), so remember to open the file in 'wb' mode
        with open(filepath, 'wb') as f:
            f.write(res3)
        print(name, 'downloaded!')

print('over')
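These .rar templates are small, so holding the whole body in memory via .content is fine; for bigger downloads a streamed variant is common. This uses only the standard requests stream=True / iter_content API, nothing site-specific:

import requests

def download(url, filepath, headers=None):
    # stream=True defers the body; iter_content writes it chunk by chunk
    with requests.get(url, headers=headers, stream=True) as r:
        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

# download('http://example.com/big.rar', './jianli/big.rar')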

6. Downloading images from 站长之家 (sc.chinaz.com): image lazy loading

import requests
from lxml import etree
import os
import urllib.request

if not os.path.exists('./tupian'):
    os.mkdir('./tupian')

url = 'http://sc.chinaz.com/tupian/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
res = response.text
tree = etree.HTML(res)
url_list = tree.xpath('//div[@id="container"]/div/div/a/img/@src2')  # the img tags carry the pseudo-attribute src2; it is swapped to src once the image scrolls into view

for url in url_list:
    filepath = './tupian/' + url.rsplit('/', 1)[-1]
    urllib.request.urlretrieve(url, filepath)
    print(filepath, 'downloaded!')

print('over')
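Not every lazy-loading site uses src2, and some images may already be swapped to src by the time the HTML is served. A defensive variant of the extraction above (my own tweak, reusing the tree object from the example):

# iterate the <img> elements instead of selecting @src2 directly
img_list = tree.xpath('//div[@id="container"]/div/div/a/img')
url_list = []
for img in img_list:
    # prefer the lazy-load attribute, fall back to the regular src
    src = img.xpath('./@src2') or img.xpath('./@src')
    if src:
        url_list.append(src[0])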

 


Original post: https://www.cnblogs.com/lzqrkn/p/10494351.html
