Basic libraries for Python 3 web crawlers

Date: 2020-07-05 16:49:02

1. The urllib library

urllib is Python's built-in HTTP request library. It contains the following modules:

urllib.request: sending requests
urllib.error: exception handling
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing

1.1 urllib.request

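1.1.1 The urlopen function

1) Return type of urlopen

urlopen returns a response object (http.client.HTTPResponse for HTTP URLs), not the page content itself. Calling read() on that object returns bytes, which must be decoded before the text can be viewed.

# urlopen
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
# response.read() returns bytes, so decode it as UTF-8 to get a str
print(response.read().decode('utf-8'))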
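
To make the bytes-versus-text distinction concrete, here is a minimal sketch that runs without network access. It assumes a data: URL (which urllib can open locally) in place of a real site; the sample text and variable names are illustrative and not from the original.

```python
import urllib.parse
import urllib.request

# Assumption: a data: URL stands in for a real web page so the sketch runs offline.
sample_text = "你好, urllib"
url = "data:text/plain;charset=utf-8," + urllib.parse.quote(sample_text)

with urllib.request.urlopen(url) as response:
    raw = response.read()        # read() returns bytes
text = raw.decode("utf-8")       # decode the bytes to get a str

print(type(raw).__name__)
print(type(text).__name__)
print(text)
```

For an http:// URL the response object is an http.client.HTTPResponse rather than the local wrapper used here, but the read-then-decode step is the same.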

2) Parameters of urlopen

The first parameter of urlopen is the URL, the second (data) is the data sent along with the request, and the third (timeout) is the timeout in seconds.
If data is passed, the request is submitted as a POST, and data must be a bytes object.

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# Passing data makes urlopen send a POST request; data must be bytes
# timeout is in seconds; if no response arrives within that time, an exception is raised
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=2)
print(response.read().decode('utf-8'))
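
The GET-versus-POST rule can be checked without sending anything. The sketch below builds two Request objects and asks each which HTTP method it would use; the variable names are illustrative, and no request is actually made.

```python
import urllib.parse
import urllib.request

form = {"word": "hello"}
# urlencode produces a str; encode it to get the bytes that data requires
data = urllib.parse.urlencode(form).encode("utf-8")

get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=data)

print(data)                     # b'word=hello'
print(get_req.get_method())     # GET, because no data was given
print(post_req.get_method())    # POST, because data was given
```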

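Handling a timeout:

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout while connecting is wrapped in URLError
    if isinstance(e.reason, socket.timeout):
        print("Timed out")
except socket.timeout:
    # A timeout while waiting for the response is raised directly
    print("Timed out")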
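
A timeout can reach the caller in two shapes: wrapped inside a URLError (its reason is a socket.timeout), or as a bare socket.timeout. The helper below is a small sketch that recognizes both. The name is_timeout is hypothetical, and the exceptions are constructed by hand so the example runs offline.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc represents a timeout, wrapped in URLError or not."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


# Hand-built exceptions standing in for what urlopen might raise
wrapped = urllib.error.URLError(socket.timeout("timed out"))
bare = socket.timeout("timed out")
refused = urllib.error.URLError("connection refused")

print(is_timeout(wrapped), is_timeout(bare), is_timeout(refused))
```

In a crawler, a helper like this could sit in a generic `except OSError as e:` branch to decide whether a retry is worthwhile.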

1.1.2 Response

# Response
import urllib.request
response = urllib.request.urlopen('http://cnblogs.com/hgzero')
print(type(response))
print(response.status)                     # status code
print(response.getheaders())               # response headers, as a list of (name, value) tuples
print(response.getheader('Content-Type'))  # note: getheader, without the trailing "s"
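
To try these attributes without depending on an external site, the sketch below starts a throwaway HTTP server on 127.0.0.1 and queries it. DemoHandler and inspect_response are hypothetical names, and the opener with an empty ProxyHandler is only an assumption made so that system proxy settings do not interfere with the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small text body."""

    def do_GET(self):
        body = "hello".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # Empty ProxyHandler: go straight to the local server, ignoring proxies
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return {
                "status": response.status,
                "headers": response.getheaders(),
                "content_type": response.getheader("Content-Type"),
                "body": response.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()


info = inspect_response()
print(info["status"], info["content_type"])
for name, value in info["headers"]:
    print(name, "=", value)
```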

1.1.3 Request

# Request
import urllib.request
request = urllib.request.Request('http://cnblogs.com/hgzero')
response = urllib.request.urlopen(request)   # pass the Request object to urlopen
print(response.read().decode('utf-8'))

Customizing the HTTP headers and data of a request:

# Option 1: build a Request object with custom headers and data, then pass it to urlopen
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))


# Option 2: add HTTP headers by calling the Request object's add_header method
from urllib import request, parse
url = 'http://httpbin.org/post'
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # add one HTTP header
response = request.urlopen(req)
print(response.read().decode('utf-8'))
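
Both ways of setting headers produce an equivalent Request. The sketch below checks this by inspecting the Request objects directly, without sending anything; req_a and req_b are illustrative names.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
data = parse.urlencode({"name": "Germey"}).encode("utf-8")

# Option 1: headers passed to the constructor
req_a = request.Request(url, data=data, headers={"User-Agent": user_agent}, method="POST")

# Option 2: header added afterwards with add_header
req_b = request.Request(url, data=data, method="POST")
req_b.add_header("User-Agent", user_agent)

# Request stores header names capitalized, so the key to look up is "User-agent"
print(req_a.get_header("User-agent") == req_b.get_header("User-agent"))
print(req_b.get_method(), req_b.data)
```

add_header is the more convenient choice when headers are decided one at a time, for example inside a loop or behind a condition.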

1.1.4 Proxies

import urllib.request
proxy_handler = urllib.request.ProxyHandler(
    {
        'http': 'http://127.0.0.1:25379',
        # 'https': 'https://127.0.0.1:25379'
    }
)
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.youtobe.com/')
print(response.read().decode('utf-8'))
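
The same pattern can be wrapped in a small factory. This is a sketch under assumptions: make_opener is a hypothetical helper, and it assumes one proxy address serves both http and https traffic, which depends on the proxy actually in use. Nothing is sent over the network here.

```python
import urllib.request


def make_opener(proxy_url=None):
    """Return an opener that uses proxy_url for http and https, or no proxy at all."""
    if proxy_url is None:
        proxies = {}  # empty mapping: ignore any system-wide proxy settings
    else:
        proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


proxied = make_opener("http://127.0.0.1:25379")  # proxy address taken from the text above
direct = make_opener()

# Optional: make plain urllib.request.urlopen() calls use the proxy as well.
# urllib.request.install_opener(proxied)
print(type(proxied).__name__)
```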

1.1.5 Cookie

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Saving and loading cookies:

import http.cookiejar, urllib.request
# Save cookies in the MozillaCookieJar format
filename = "cookie_first.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Save cookies in the LWPCookieJar format; either of the two formats will do
filename = "cookie_second.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Load the saved cookies for a later request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie_second.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
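
The ignore_discard flag decides whether session cookies (those marked to be discarded when the session ends) are written to the file. The sketch below illustrates this offline with a hand-built cookie. make_cookie and round_trip are hypothetical helpers, and the cookie name, value, and domain are made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (all values are illustrative)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie, load it back, and return the (name, value) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookies.txt")
        jar = http.cookiejar.LWPCookieJar(path)
        jar.set_cookie(make_cookie("session", "abc123", "example.com"))
        jar.save(ignore_discard=ignore_discard, ignore_expires=True)

        restored = http.cookiejar.LWPCookieJar()
        restored.load(path, ignore_discard=ignore_discard, ignore_expires=True)
        return [(c.name, c.value) for c in restored]


kept = round_trip(ignore_discard=True)      # session cookie survives
dropped = round_trip(ignore_discard=False)  # session cookie is skipped on save
print(kept, dropped)
```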

1.2 urllib.error

from urllib import request, error
try:
    response = request.urlopen('http://hgzerowzhpray.com')
except error.HTTPError as e:  # narrower: HTTPError is a subclass of URLError, so catch it first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:   # broader
    print(e.reason)
else:
    print('Request successful')

Inspecting the information carried by HTTPError and URLError.
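
Because HTTPError is a subclass of URLError, the order of the checks matters: the narrower type has to be tested first. The sketch below shows this with hand-built exceptions so it runs offline. describe is a hypothetical helper, and the URL, status code, and messages are made up.

```python
import email.message
import io
from urllib import error


def describe(exc):
    """Summarize an exception, checking the narrower HTTPError before URLError."""
    if isinstance(exc, error.HTTPError):
        return "HTTPError %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return "URLError: %s" % exc.reason
    return "other: %r" % (exc,)


# Hand-built exceptions standing in for what urlopen might raise
http_err = error.HTTPError(
    "http://example.com/missing", 404, "Not Found",
    email.message.Message(), io.BytesIO(b""),
)
url_err = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))
print(describe(http_err))
print(describe(url_err))
```

If the two isinstance checks were swapped, every HTTPError would be reported as a plain URLError and its status code would be lost.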

1.3 urllib.parse
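
A minimal sketch of the urllib.parse calls a crawler tends to need: splitting a URL, building and reading query strings, resolving relative links, and percent-encoding. The URLs and values below are illustrative and not from the original.

```python
from urllib.parse import parse_qs, quote, urlencode, urljoin, urlparse

# Split a URL into its six components
parts = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(parts.scheme, parts.netloc, parts.path, parts.params, parts.query, parts.fragment)

# Build a query string from a dict, then parse it back
query = urlencode({"name": "Germey", "age": 22})
print(query)
print(parse_qs(query))

# Resolve a relative link against the page it was found on
print(urljoin("http://www.baidu.com/a/b.html", "c.html"))

# Percent-encode text so it is safe inside a URL
print(quote("web crawler"))
```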

1.4 urllib.robotparser

This module is rarely used, so it is skipped here.

2. The Requests library

Original article: https://www.cnblogs.com/hgzero/p/13246468.html
