Python 3 Web Scraping: A First Look at urllib

Date: 2019-08-01 16:22:56

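urllib consists of four modules: request (sends requests), error (exception handling), parse (URL utilities), and robotparser (reads a site's robots.txt file to determine whether crawling is allowed).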
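As a small illustration of the robotparser module, the sketch below parses robots.txt rules from an in-memory list instead of downloading them. The rules and the example.com URLs are made up for this illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real crawler would call
# set_url() and read() to download the file from the site.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# can_fetch() reports whether the given user agent may crawl the URL.
allowed_public = parser.can_fetch("*", "http://example.com/index.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```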

request (the request module)

1. request.urlopen (sending a request)

import urllib.request

# Send a GET request and print the decoded response body.
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Usage

Full signature of urlopen:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

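  1. url

Required. The URL to request.

  2. data

Optional. If data is supplied, the request is sent as a POST instead of a GET, and the value must be of type bytes. For example: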

import urllib.parse
import urllib.request

# urlencode() returns a str, so convert it to bytes before sending.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
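To make the bytes requirement concrete, the following sketch performs the two steps separately and sends nothing over the network: urlencode() returns a str, which then has to be encoded to bytes. The dictionary contents are only sample values.

```python
import urllib.parse

form = {"word": "hello", "lang": "zh cn"}  # sample values

# Step 1: urlencode() turns the dict into a query string (type str).
query = urllib.parse.urlencode(form)

# Step 2: encode the string to bytes before passing it as data.
data = query.encode("utf-8")

print(type(query).__name__, query)
print(type(data).__name__, data)
```

Calling query.encode("utf-8") is equivalent to the bytes(query, encoding="utf8") form used above.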

  3. timeout

Sets the timeout in seconds. If no response arrives within this time, an exception is raised:

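import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout while connecting is wrapped in URLError.
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
except socket.timeout:
    # A timeout while reading the response is raised directly.
    print('TIME OUT')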

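  4. Other parameters

The context parameter must be an ssl.SSLContext instance and specifies the SSL settings. The cafile and capath parameters specify a file of CA certificates and a directory of CA certificate files, respectively; both are useful when requesting HTTPS URLs.

The cadefault parameter is deprecated; its default value is False. Since Python 3.6, cafile and capath are deprecated as well, in favor of context.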
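As one way to use the context parameter, here is a minimal sketch that builds a context with ssl.create_default_context() and passes it to urlopen. The fetch_https helper name is made up for this example, and the function is only defined, not called, so no request is sent.

```python
import ssl
import urllib.request

# A default context verifies the server certificate and host name
# against the system's trusted CA certificates.
context = ssl.create_default_context()


def fetch_https(url, timeout=10):
    """Hypothetical helper: fetch an HTTPS URL with an explicit SSL context."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")


print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```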

2. The request.Request class

Purpose: builds a complete request.

import urllib.request

# Build a Request object first, then pass it to urlopen().
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Request parameters

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

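  1. url

Required. The URL to request.

  2. data

If data is passed, it must be of type bytes. If the data is a dictionary, first encode it with urlencode() from the urllib.parse module:

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')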

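  3. headers

headers is a dictionary holding the request headers. They can be passed through the headers parameter when constructing the request, or added afterwards by calling add_header() on the request instance.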
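Both ways of setting headers can be compared without sending anything. In this sketch the User-Agent string is just a sample value. Note that Request stores header names in capitalized form, so the lookup key is "User-agent".

```python
from urllib import request

url = "http://httpbin.org/get"
agent = "Mozilla/5.0 (compatible; example-bot)"  # sample value

# Option 1: pass the headers when constructing the request.
req_a = request.Request(url, headers={"User-Agent": agent})

# Option 2: add the header afterwards with add_header().
req_b = request.Request(url)
req_b.add_header("User-Agent", agent)

# Header names are stored capitalized, e.g. "User-agent".
print(req_a.get_header("User-agent"))
print(req_b.header_items())
```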

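  4. origin_req_host

The host name or IP address of the party making the request.

  5. unverifiable

Indicates whether the request is unverifiable; the default is False. An unverifiable request is one that the user had no option to approve. For example, when we request an image embedded in an HTML document but have no permission to fetch images automatically, unverifiable should be True.

  6. method

A string indicating the HTTP method to use for the request, such as GET, POST, or PUT.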

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data to avoid shadowing the built-in dict type.
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
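How data and method determine the HTTP verb can be checked with get_method(), again without any network traffic. This is a small sketch; the URL is only a placeholder.

```python
from urllib import request

url = "http://httpbin.org/anything"  # placeholder; no request is sent
body = b"name=Germey"

# Without data, the method defaults to GET.
get_req = request.Request(url)
# With data, it defaults to POST.
post_req = request.Request(url, data=body)
# An explicit method overrides the default.
put_req = request.Request(url, data=body, method="PUT")

for req in (get_req, post_req, put_req):
    print(req.get_method())
```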

3. Advanced urllib

See https://cuiqingcai.com/5500.html (stopping here for now).

Original: https://www.cnblogs.com/hardykay/p/10822575.html
