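Note: the page timeout problem is still unresolved, so I am setting it aside for now; the crawl stops partway through.
It started when I saw some pictures I liked and could not resist downloading them. Saving them one by one is tedious, so I asked a crawler for help.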
https://www.ivsky.com/bizhi/code_geass_t1300/
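Page analysis:
First, each page holds many images.
So we first need to find the URL of each page. Right-click the page and choose Inspect.
The page number in the URL (index_1.html, index_2.html, and so on) increases by one each time, so we now have the URLs of all the pages.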
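A minimal sketch of that pattern, assuming the `index_N.html` naming used by the script further down. The helper name `page_urls` and the page range 1 to 3 are only for illustration.

```python
# Common prefix of every list page (taken from the script in this post).
BASE = "https://www.ivsky.com/bizhi/code_geass_t1300/index_"


def page_urls(first, last):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [f"{BASE}{n}.html" for n in range(first, last + 1)]


for u in page_urls(1, 3):
    print(u)
```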
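Next, we need to extract the URLs of the images on each page.
Inspecting the list shows the link to each image's own page under the corresponding node; open one of those links.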
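To make the node lookup concrete, here is a small sketch run against a made-up fragment shaped like the list page. The `il` and `il_img` class names come from the script in this post; the `href` values are placeholders. It uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup, so it runs without extra installs. Real pages still need lxml's `etree.HTML` as in the script.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real pages contain far more markup around it.
SAMPLE = """
<ul class="il">
  <li><div class="il_img"><a href="/bizhi/sample_a.html"><img src="a_thumb.jpg"/></a></div></li>
  <li><div class="il_img"><a href="/bizhi/sample_b.html"><img src="b_thumb.jpg"/></a></div></li>
</ul>
"""


def detail_links(fragment):
    """Collect the href of every <a> inside a div.il_img of the list."""
    root = ET.fromstring(fragment.strip())
    # root is the <ul class="il"> itself, so the path starts at its <li> children.
    return [a.get("href") for a in root.findall("./li/div[@class='il_img']/a")]


print(detail_links(SAMPLE))
```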
Right-click and inspect again in the same way.
There you can find the address of the image itself.
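The script further down builds full URLs by string concatenation (`'https://www.ivsky.com' + link` and `'https:' + src`). As an alternative sketch, `urllib.parse.urljoin` handles both a site-relative link and a scheme-less `//host/...` address in one call. The sample paths and the `img.example.com` host below are placeholders, not real addresses from the site.

```python
from urllib.parse import urljoin

# The page the links were found on; used as the base for resolving them.
PAGE = "https://www.ivsky.com/bizhi/code_geass_t1300/index_20.html"


def absolute(base, link):
    """Resolve a relative or scheme-less link against the page it came from."""
    return urljoin(base, link)


print(absolute(PAGE, "/bizhi/sample_a.html"))
print(absolute(PAGE, "//img.example.com/sample.jpg"))
```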
With that, we can write the code directly.
import requests
from lxml import etree
from urllib.request import urlopen, Request


class PNimag():
    def __init__(self):
        self.base_url = 'https://www.ivsky.com/bizhi/code_geass_t1300/index_1.html'
        self.headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}

    def get_html(self, url):
        # headers must be passed by keyword: the second positional
        # argument of requests.get is params, not headers
        response = requests.get(url, headers=self.headers, timeout=5)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
        return None

    def get_url(self, html):
        # Links to the per-image pages found on one list page
        x_html = etree.HTML(html)
        return x_html.xpath('//ul[@class="il"]/li/div[@class="il_img"]/a/@href')

    def get_image(self, html):
        # Address of the image on a per-image page
        x_html = etree.HTML(html)
        return x_html.xpath('//img[@id="imgis"]/@src')

    def save_image(self, url, name):
        req = Request(url=url, headers=self.headers)
        content = urlopen(req).read()
        with open("C:/Users/25766/AppData/Local/Programs/Python/Python38/imgs/LC/LC" + name, 'wb') as f:
            f.write(content)
        print(name, 'finish...')


url = "https://www.ivsky.com/bizhi/code_geass_t1300/index_"
bian = PNimag()
p = 2000  # counter used to name the saved files
for i in range(20, 28):
    url2 = url + str(i) + '.html'
    html = bian.get_html(url2)
    if html is None:  # list page did not return 200
        continue
    lis = bian.get_url(html)
    for j in lis:
        s = 'https://www.ivsky.com' + j
        l = bian.get_html(s)
        if l is None:  # image page did not return 200
            continue
        k = bian.get_image(l)
        if not k:  # no matching <img> on this page
            continue
        k[0] = 'https:' + k[0]  # the src has no scheme, so add one
        print(k[0])
        bian.save_image(k[0], str(p) + '.jpg')
        p += 1
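On the timeout noted at the top: one common way to make a crawl more tolerant of slow responses is to retry a failed call a few times before giving up. The helper below is a generic sketch, not part of the original script. The name `with_retries` and its parameters are hypothetical, and whether it cures the mid-crawl stop on this site is untested. The demo uses a simulated failure so it runs without network access.

```python
import time


def with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for n in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if n == attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)


# Demo with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


print(with_retries(flaky, attempts=3, delay=0))
```

In the script, a call such as `with_retries(lambda: bian.get_html(url2), exceptions=(requests.exceptions.RequestException,))` could wrap each page request, under the assumption that the failures are transient.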
Result of the run
Much more convenient.
Original post: https://www.cnblogs.com/rstz/p/12581124.html