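As a basketball fan since childhood, I often browse forums such as Hupu Basketball (虎扑篮球) and Shihuhu (湿乎乎). These forums host lots of great pictures: NBA teams, CBA stars, gossip news, sneakers, models, and so on. Saving them one by one with right-click "Save as" would wear out my hand, so as a programmer I would rather write a program to do it.
So I used Python + Selenium + regular expressions + urllib to crawl the pictures in bulk.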
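I have already written many articles on Python crawlers, covering Sina blogs, Wikipedia Infoboxes, Baidu Baike, and pictures from Yxdown (游迅网), as well as the Selenium installation process. See my two columns for details:
Python Learning Series
Python Crawlers with Selenium + PhantomJS + CasperJS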
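The results are shown in the figures below. The first figure shows the galleries crawled from Hupu for the tag 马刺 (Spurs), and the second shows the galleries for the tag 陈露. Each folder is named after the title of the corresponding gallery, and every picture is downloaded in full.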
http://photo.hupu.com/nba/tag/马刺
http://photo.hupu.com/nba/tag/陈露
# -*- coding: utf-8 -*-
"""
Crawling pictures by selenium and urllib
url: Hupu 马刺 http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA
url: Hupu 陈露 http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2
Created on 2015-10-24
@author: Eastmount CSDN
"""
# Note: this is Python 2 code (print statements, raw_input, urllib.urlretrieve).

import time
import re
import os
import sys
import urllib
import shutil
import datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open PhantomJS (raw string so the backslashes in the Windows path stay literal)
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
# driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)


# Download one picture with urllib
def loadPicture(pic_url, pic_path):
    pic_name = os.path.basename(pic_url)    # strip the path, keep the file name
    pic_name = pic_name.replace('*', '')    # drop '*' to avoid "invalid mode ('wb') or filename"
    urllib.urlretrieve(pic_url, pic_path + pic_name)


# Crawl every picture of one gallery, page by page
def getScript(elem_url, path, nums):
    try:
        # Picture pages look like http://photo.hupu.com/nba/p29556-1.html,
        # so splicing http://..../p29556-<number>.html replaces clicking "next picture"
        count = 1
        t = elem_url.find(r'.html')
        while (count <= nums):
            html_url = elem_url[:t] + '-' + str(count) + '.html'
            #print html_url
            # Disabled alternative that reads the picture URL through Selenium:
            '''
            driver_pic.get(html_url)
            elem = driver_pic.find_element_by_xpath("//div[@class='pic_bg']/div/img")
            url = elem.get_attribute("src")
            '''
            # Use regular expressions to take the 3rd <div></div>, then extract the picture URL
            content = urllib.urlopen(html_url).read()
            start = content.find(r'<div class="flTab">')
            end = content.find(r'<div class="comMark" style>')
            content = content[start:end]
            div_pat = r'<div.*?>(.*?)<\/div>'
            div_m = re.findall(div_pat, content, re.S|re.M)
            #print div_m[2]
            link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", div_m[2])
            #print link_list
            url = link_list[0]    # there is only one URL
            loadPicture(url, path)
            count = count + 1

    except Exception as e:
        print 'Error:', e
    finally:
        # count points at the next page, so count - 1 pictures were downloaded
        print 'Download ' + str(count - 1) + ' pictures\n'


# Crawl the gallery URLs and titles from the tag page
def getTitle(url):
    try:
        count = 0
        print 'Function getTitle(url)'
        driver.get(url)
        wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='piclist3']"))
        print 'Title: ' + driver.title + '\n'

        # Thumbnail URL (unused here), picture count, title (folder name); the order matters
        elem_url = driver.find_elements_by_xpath("//a[@class='ku']/img")
        elem_num = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dd[1]")
        elem_title = driver.find_elements_by_xpath("//div[@class='piclist3']/table/tbody/tr/td/dl/dt/a")
        for img in elem_url:
            pic_url = img.get_attribute("src")
            html_url = elem_title[count].get_attribute("href")
            print elem_title[count].text
            print html_url
            print pic_url
            print elem_num[count].text

            # Create the picture folder
            path = "E:\\Picture_HP\\" + elem_title[count].text + "\\"
            m = re.findall(r'(\w*[0-9]+)\w*', elem_num[count].text)    # number of pictures
            nums = int(m[0])
            count = count + 1
            if os.path.isfile(path):         # delete file
                os.remove(path)
            elif os.path.isdir(path):        # delete dir
                shutil.rmtree(path, True)
            os.makedirs(path)                # create the directory
            getScript(html_url, path, nums)  # visit pages

    except Exception as e:
        print 'Error:', e
    finally:
        print 'Find ' + str(count) + ' pages with key\n'


# Entry function
def main():
    # Create folder
    basePathDirectory = "E:\\Picture_HP"
    if not os.path.exists(basePathDirectory):
        os.makedirs(basePathDirectory)

    # Input the key for search: str => unicode => utf-8
    key = raw_input("Please input a key: ").decode(sys.stdin.encoding)
    print 'The key is : ' + key

    # Set URL list, sum: 1-2 pages
    print 'Ready to start the Download!!!\n\n'
    starttime = datetime.datetime.now()
    num = 1
    while num <= 1:
        #url = 'http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1'
        url = 'http://photo.hupu.com/nba/tag/%E9%A9%AC%E5%88%BA'
        print 'Page ' + str(num), 'url:' + url
        # Determine whether the title contains key
        getTitle(url)
        time.sleep(2)
        num = num + 1
    else:
        print 'Download Over!!!'

    # Get the runtime
    endtime = datetime.datetime.now()
    print 'The Running time : ', (endtime - starttime).seconds
    driver.quit()    # shut down the PhantomJS process


main()
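The two tricks inside getScript, splicing one page URL per picture and pulling the picture link out of the HTML with a regular expression, can be tried offline. The following is only a rough Python 3 sketch (the script itself is Python 2). The helper names build_page_urls and extract_first_href, the assumption that the gallery link ends in p29556.html, and the sample HTML fragment are all made up for illustration and are not Hupu's actual page source.

```python
import re


def build_page_urls(gallery_url, nums):
    """Turn .../p29556.html into .../p29556-1.html, .../p29556-2.html, ..."""
    pos = gallery_url.rfind(".html")
    if pos == -1:
        raise ValueError("expected a URL ending in .html: %r" % gallery_url)
    stem = gallery_url[:pos]
    return ["%s-%d.html" % (stem, i) for i in range(1, nums + 1)]


# Matches href="..." or href='...' and captures the value in group 2.
HREF_PAT = re.compile(r"""href=(["'])(.+?)\1""")


def extract_first_href(fragment):
    """Return the first href value in an HTML fragment, or None."""
    m = HREF_PAT.search(fragment)
    return m.group(2) if m else None


# Hypothetical fragment standing in for the <div> that wraps the picture.
SAMPLE = ('<div class="pic_bg"><a href="http://example.com/img/spurs_01.jpg">'
          '<img src="thumb.jpg"></a></div>')

pages = build_page_urls("http://photo.hupu.com/nba/p29556.html", 3)
link = extract_first_href(SAMPLE)
print(pages)
print(link)
```

Raising an error when ".html" is missing is a deliberate difference from the script, where find() would return -1 and silently cut off the last character of the URL.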
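The main steps of the program are as follows:
1. The entry function main creates the picture folder Picture_HP on drive E and then sets the URL of the gallery list. I originally planned to take a tag as input and visit the matching page, because the URL looks like this: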
http://photo.hupu.com/nba/tag/马刺
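However, the Chinese characters in the URL kept failing to parse, so I switched to supplying the URL directly (the script still prompts for a key, but the URL is hard-coded in main). This does not affect the overall result. You may also have noticed that the while loop condition is num<=1, so it runs only once; to download a particular page of galleries, just assign its URL. Hupu's other pages use links like the one below, so looping over all pages by splicing the URL is also feasible.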
http://photo.hupu.com/nba/tag/%E9%99%88%E9%9C%B2?p=2&o=1
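The trouble with the Chinese tag usually comes down to percent-encoding the tag as UTF-8 before it goes into the URL. Below is a minimal Python 3 sketch of that idea (in Python 2 the counterpart is urllib.quote applied to a UTF-8 byte string). The helper name build_tag_url is hypothetical. Treating p as the page number and keeping o=1 fixed is an assumption read off the example link, not documented Hupu behavior.

```python
from urllib.parse import quote

BASE = "http://photo.hupu.com/nba/tag/"


def build_tag_url(tag, page=1):
    """Percent-encode the tag and, for later pages, append the paging query."""
    url = BASE + quote(tag)  # quote() encodes str as UTF-8 by default
    if page > 1:
        # Assumption: p is the page number and o=1 stays constant.
        url += "?p=%d&o=1" % page
    return url


print(build_tag_url("马刺"))
print(build_tag_url("陈露", page=2))
```

With a helper like this, the while loop in main could iterate num over the desired pages and pass build_tag_url(key, num) to getTitle instead of a hard-coded link.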
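2. main calls the getTitle(url) function, which uses Selenium and PhantomJS to analyze the DOM structure of the HTML. Through find_elements_by_xpath it collects, for each gallery, the thumbnail URL, the link to the gallery page, the title, and the number of pictures, as shown in the figure.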
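The picture count comes back from the page as text, so getTitle extracts the digits with a regular expression before calling int(). A minimal Python 3 sketch of that step follows. The sample strings are invented (the real text on Hupu may differ), and the simpler pattern \d+ is a swapped-in choice: in Python 3, \w also matches Chinese characters, so the script's pattern (\w*[0-9]+)\w* could capture them together with the digits and make int() fail.

```python
import re


def parse_picture_count(text):
    """Return the first run of digits in text as an int."""
    m = re.search(r"\d+", text)
    if m is None:
        raise ValueError("no picture count found in %r" % text)
    return int(m.group())


# Hypothetical examples of what the count element might contain.
print(parse_picture_count("共20张"))
print(parse_picture_count("(8)"))
```

Raising ValueError on a missing count keeps a malformed gallery entry from being mistaken for an empty one.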
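3. getScript(html_url, path, nums) is called for each gallery. Because every picture page follows the pattern http://photo.hupu.com/nba/p29556-1.html, the function splices the picture number into the URL instead of clicking "next picture", and then uses regular expressions to extract the picture URL from each page.
4. The last step simply downloads the picture with urllib.urlretrieve(pic_url, pic_path + pic_name).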
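Of course, you may run into the error "Error: [Errno 22] invalid mode ('wb') or filename"; see Stack Overflow for reference. The script avoids it by removing '*' from the file name before saving.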
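On Windows, Errno 22 usually means the target path contains a character that is not allowed in file names. The script strips only '*'. The sketch below (Python 3, with a hypothetical helper safe_name) removes the full set of characters that Windows reserves in file names: \ / : * ? " < > |. It could be applied both to the picture file name and to the folder name built from the gallery title. Whether Hupu titles actually contain such characters is an assumption.

```python
import os
import re

# Characters that Windows does not allow in file or folder names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_name(name, fallback="untitled"):
    """Drop reserved characters; fall back if nothing is left."""
    cleaned = INVALID_CHARS.sub("", name).strip()
    return cleaned or fallback


# Made-up inputs for illustration.
pic_url = "http://example.com/img/spurs*01.jpg"
print(safe_name(os.path.basename(pic_url)))
print(safe_name('Spurs vs. Heat: "Game 7"?'))
```

Removing the characters rather than replacing them mirrors what the script already does for '*'.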
[Python crawler] Targeted crawling of Hupu basketball pictures in bulk with Selenium
Original post: http://www.cnblogs.com/eastmount/p/5055921.html