Requests: automatically fetch HTML pages and submit network requests
robots: the Robots Exclusion Protocol for web crawlers
Beautiful Soup: parse HTML pages
Hands-on practice
Re: regular expressions in detail, for extracting key information from pages
Scrapy*: a crawler framework
Week 1: Rules
Unit 1: Getting started with the Requests library
1. Installation
Run the command prompt as administrator
Enter pip install requests
Verify:
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
requests.request(): constructs a request; the base method that underlies all of the methods below
requests.get(): the main method for fetching an HTML page, corresponding to HTTP GET
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters appended to the URL, as a dict or byte sequence; optional (see the sketch after this list)
**kwargs: 12 optional keyword arguments that control access
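A minimal sketch of params in action, using httpbin.org as a neutral echo endpoint (the key/value names are arbitrary examples, not from the original notes):

import requests

# requests encodes the params dict into the query string of the request URL.
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=kv)
print(r.url)          # http://httpbin.org/get?key1=value1&key2=value2
print(r.status_code)  # 200 if the request succeeded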
Attributes of the Response object
r.status_code: the HTTP status code of the response; 200 means the connection succeeded, any other code (e.g. 404) means it failed
r.text: the response body as a string, i.e. the content of the page at the URL
r.encoding: the response encoding guessed from the HTTP headers
r.apparent_encoding: the response encoding inferred from the content itself
r.content: the response body in binary (bytes) form
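A short sketch that inspects these attributes on Baidu's homepage; the exact encodings and content depend on what the server returns at the time of the request:

import requests

# Inspect the Response attributes listed above; actual values vary with the server response.
r = requests.get("http://www.baidu.com")
print(r.status_code)        # e.g. 200
print(r.encoding)           # guessed from the headers, often 'ISO-8859-1'
print(r.apparent_encoding)  # inferred from the content, e.g. 'utf-8'
r.encoding = r.apparent_encoding
print(r.text[:200])         # first part of the decoded page content
print(type(r.content))      # <class 'bytes'>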
General code framework:
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an HTTPError exception if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "Exception occurred"

if __name__ == "__main__":
    url = "www.baidu.com"   # the scheme (http://) is missing, so requests raises an exception
    print(getHTMLText(url))

Output:
Exception occurred
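For comparison, a usage sketch with a well-formed URL; this assumes getHTMLText has already been defined as in the framework above:

# With the scheme included, the same framework returns the page text.
if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url)[:200])   # prints the first 200 characters of the page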
requests.head(): fetch only the page headers, corresponding to HTTP HEAD (see the sketch after this list)
requests.post(): submit a POST request to an HTML page, corresponding to HTTP POST
requests.put(): submit a PUT request, corresponding to HTTP PUT
requests.patch(): submit a partial-modification request, corresponding to HTTP PATCH
requests.delete(): submit a delete request, corresponding to HTTP DELETE
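A minimal sketch of head() and post(), again using httpbin.org as a test endpoint; the payload names are arbitrary examples:

import requests

# head() fetches only the response headers, not the body.
r = requests.head("http://httpbin.org/get")
print(r.headers)      # response header fields
print(r.text)         # empty string: no body is downloaded

# post() with a dict sends the payload as form data in the request body.
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.status_code)  # 200 if the request succeeded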
requests.request(method, url, **kwargs)
method: the request method, one of the seven shown below (GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS); an equivalence sketch follows the list
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)
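As a quick illustration (a sketch added here, not from the original notes): requests.get() is a convenience wrapper around requests.request(), so the two calls below build the same request.

import requests

# The convenience method and the base method produce the same request URL.
r1 = requests.request('GET', "http://httpbin.org/get", params={'k': 'v'})
r2 = requests.get("http://httpbin.org/get", params={'k': 'v'})
print(r1.url)            # http://httpbin.org/get?k=v
print(r1.url == r2.url)  # True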
**kwargs: optional keyword arguments that control access
params: dict or byte sequence, added to the URL as query parameters
data: dict, byte sequence, or file object, sent as the body of the Request
json: data in JSON format, sent as the body of the Request
headers: dict of custom HTTP headers attached to the Request (a sketch of these arguments follows this list)
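A minimal sketch of the data, json, and headers arguments, again posting to httpbin.org; the payload and header values are arbitrary examples:

import requests

# httpbin.org simply echoes the request, which makes the differences easy to inspect.
payload = {'key1': 'value1'}
r1 = requests.post("http://httpbin.org/post", data=payload)   # payload sent as form data
r2 = requests.post("http://httpbin.org/post", json=payload)   # payload sent as a JSON body
hd = {'user-agent': 'Chrome/10'}
r3 = requests.get("http://httpbin.org/get", headers=hd)       # custom HTTP request header
print(r1.status_code, r2.status_code, r3.status_code)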
Example of a site's Robots Exclusion Protocol file: https://www.baidu.com/robots.txt
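A crawler can fetch and read that file like any other resource; a minimal sketch:

import requests

# Fetch Baidu's robots.txt and print the beginning of its crawling rules.
r = requests.get("https://www.baidu.com/robots.txt")
r.raise_for_status()
print(r.text[:300])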
Crawling examples with the Requests library
>>> import requests >>> url = "https://item.jd.com/2967929.html" >>> try: r = requests.get(url) r.raise_for_status() r.encoding = r.apparent_encoding print(r.text[:1000]) except: print("爬取失败") <!DOCTYPE HTML> <html lang="zh-CN"> <head> <!-- shouji --> <meta http-equiv="Content-Type" content="text/html; charset=gbk" /> <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title> <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/> <meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货,并包括HUAWEI荣耀8网购指南,以及华为荣耀8图片、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息,网购华为荣耀8上京东,放心又轻松" /> <meta name="format-detection" content="telephone=no"> <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/2967929.html"> <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/2967929.html"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <link rel="canonical" href="//item.jd.com/2967929.html"/> <link rel="dns-prefetch" href="//misc.360buyimg.com"/> <link rel="dns-prefetch" href="//static.360buyimg.com"/> <link rel="dns-prefetch" href="//img10.360buyimg.com"/> <link rel="dns
>>> import requests >>> url = "https://www.amazon.cn/gp/product/B01MBL5Z3Y" >>> try: kv = {‘user-agent‘:‘Mozilla/5.0‘} r = requests.get(url,headers = kv) r.raise_for_status() r.encoding = r.apparent_encoding print(r.text[1000:2000]) except: print("Fail") ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1], ue_sn = "opfcaptcha.amazon.cn", ue_id = ‘HB12BAYVB85FMA4VRS38‘; } </script> </head> <body> <!-- To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases. --> <!-- Correios.DoNotSend --> <div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important"> <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto"> <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div> <div class="a-box a-alert a-alert-info a-spacing-base"> <div class="a-box-inner">
Baidu / 360 Search keyword submission (the code below uses 360 Search; a Baidu variant follows it)
import requests

keyword = 'Python'
try:
    kv = {'q': keyword}   # 360 Search takes the keyword in the 'q' parameter
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")
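The Baidu variant mentioned in the heading differs only in the parameter name: Baidu's search interface takes the keyword as 'wd'.

import requests

# Same keyword submission against Baidu; the search parameter is 'wd' instead of 'q'.
keyword = 'Python'
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")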
Downloading an image
import requests
import os

url = "http://wx1.sinaimg.cn/mw600/0076BSS5ly1g6hmmj82tpj30u018wdos.jpg"
root = "E://pics//"
path = root + url.split('/')[-1]   # use the last part of the URL as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:   # the with block closes the file automatically
            f.write(r.content)        # r.content is the binary form of the response
        print("File saved")
    else:
        print("File already exists")
except:
    print("Crawl failed")
IP address lookup
import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')   # the IP to look up is appended to the query string
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-300:])
except:
    print("Crawl failed")
Source: https://www.cnblogs.com/kmxojer/p/11260085.html