
An HTTP request class based on Python's third-party requests module

Posted: 2016-12-31 21:45:02

Tags: web crawling   dom   .get   useragent   install   req   pri   pytho   except

A downloader built on the requests module. First, install the third-party requests library:

pip install requests
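Before wrapping requests in a class, it helps to see the bare building blocks the downloader uses: a headers dict with a randomly chosen User-Agent passed to a GET request. A minimal sketch (the URL and User-Agent strings here are placeholders, not from the original post); `requests.Request(...).prepare()` lets us inspect the outgoing headers without touching the network:

```python
import random

import requests

# A hypothetical pool of User-Agent strings; replace with your own list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

# Build a GET request with a randomly chosen User-Agent but do not
# send it: .prepare() produces a PreparedRequest we can inspect offline
req = requests.Request('GET', 'http://example.com',
                       headers={'User-Agent': random.choice(USER_AGENTS)})
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```

Calling `requests.get(url, headers=headers, timeout=timeout)` sends the same request over the network, which is exactly what the class below does.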


```python
import random
import time

import requests


class StrongDownload(object):

    def __init__(self):
        # Proxy IP list and User-Agent list: supply your own
        self.iplist = []      # populate with your own proxy IPs
        self.UserAgent = []   # populate with your own User-Agent strings

    def get(self, url, timeout, proxy=False, num_retries=3):
        """Fetch url with the given timeout and return a Response object;
        retry on failure, then fall back to a random proxy."""
        # Disguise the client with a random User-Agent
        UA = random.choice(self.UserAgent)
        headers = {'User-Agent': UA}
        if not proxy:
            try:
                return requests.get(url, timeout=timeout, headers=headers)
            except Exception as e:
                print(e)
                if num_retries > 0:
                    print('Error fetching page, retrying in 1s...')
                    time.sleep(1)
                    return self.get(url, timeout, False, num_retries - 1)
                else:
                    # Retries exhausted: switch to a proxy
                    time.sleep(1)
                    return self.get(url, timeout, True, 5)
        else:
            # Fetch through a randomly chosen proxy
            try:
                print('Switching to a proxy')
                IP = random.choice(self.iplist)
                proxies = {'http': IP}
                print(proxies['http'])
                return requests.get(url, headers=headers,
                                    timeout=timeout, proxies=proxies)
            except Exception:
                if num_retries > 0:
                    print('Proxy fetch failed, retrying in 1s...')
                    time.sleep(1)
                    return self.get(url, timeout, True, num_retries - 1)
                else:
                    # The proxies did not help either
                    time.sleep(30)
                    print('Proxies failed as well...')
                    return None
```
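The control flow of `get()` (try direct fetches a few times, then fall back to randomly chosen proxies, and give up only when both fail) can be exercised without the network. In this stdlib-only sketch, `fetch` is a hypothetical stand-in for `requests.get` that raises on failure; the function name and signature are illustrative, not from the original post:

```python
import random


def get_with_retries(fetch, url, iplist, num_retries=3):
    """Mirror StrongDownload.get's flow: try a direct fetch num_retries
    times, then retry through random proxies; return None if all fail.
    `fetch` is any callable fetch(url, proxy) that raises on error."""
    for _ in range(num_retries):
        try:
            return fetch(url, proxy=None)  # direct attempt
        except Exception:
            pass
    for _ in range(num_retries):
        try:
            # Direct attempts exhausted: pick a random proxy and retry
            return fetch(url, proxy=random.choice(iplist))
        except Exception:
            pass
    return None  # proxies did not help either
```

Writing the flow as two loops instead of recursion avoids the original class's unbounded mutual recursion between the direct branch and the proxy branch when both keep failing.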

 


Original post: http://www.cnblogs.com/diaosir/p/6240221.html
