This section covers networking-related facilities in the Python language.
1. urlopen
The main code lives in the request.py module of the urllib package, which also supports access over SSL. Let's look at the module's main classes and functions, starting with the source of urlopen:
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False):
    global _opener
    if cafile or capath or cadefault:
        if not _have_ssl:
            raise ValueError('SSL support not available')
        context = ssl._create_stdlib_context(cert_reqs=ssl.CERT_REQUIRED,
                                             cafile=cafile,
                                             capath=capath)
        https_handler = HTTPSHandler(context=context, check_hostname=True)
        opener = build_opener(https_handler)
    elif _opener is None:
        _opener = opener = build_opener()
    else:
        opener = _opener
    return opener.open(url, data, timeout)
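The source above shows that urlopen lazily builds a default opener with build_opener and caches it in the module-level _opener, while install_opener replaces that cached default. A minimal offline sketch (no network access; _opener is a private attribute, inspected here only for illustration):

```python
import urllib.request

# build_opener() assembles an OpenerDirector with the default handler set;
# urlopen() builds one of these lazily and caches it in _opener.
opener = urllib.request.build_opener()
print(type(opener).__name__)             # OpenerDirector

# install_opener() replaces the cached default, so later urlopen()
# calls go through this opener.
urllib.request.install_opener(opener)
print(urllib.request._opener is opener)  # True
```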
Calling urlopen directly performs a web request; the key argument is the URL to fetch.
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    ResponseData = urllib.request.urlopen('http://www.baidu.com/robots.txt')
    strData = ResponseData.read()          # raw bytes
    strShow = strData.decode('utf-8')      # decode to text
    # Optional extras:
    # print(ResponseData.geturl())   # the final URL, after any redirects
    # print(ResponseData.info())     # the response headers
    print(ResponseData.__sizeof__())
    print(strShow)
    ResponseData.close()
    print('\nMain Thread Exit :', __name__)
The output:
Main Thread Run : __main__ 32 User-agent: Baiduspider Disallow: /baidu Disallow: /s? Disallow: /ulink? Disallow: /link? User-agent: Googlebot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: MSNBot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Baiduspider-image Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: YoudaoBot Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou web spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou inst spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou spider2 Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou blog Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou News Spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sogou Orion spider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: ChinasoSpider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: Sosospider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: yisouspider Disallow: /baidu Disallow: /s? 
Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: EasouSpider Disallow: /baidu Disallow: /s? Disallow: /shifen/ Disallow: /homepage/ Disallow: /cpro Disallow: /ulink? Disallow: /link? User-agent: * Disallow: / Main Thread Exit : __main__
2. urlretrieve
The urlretrieve function fetches the page at a given URL and saves it to a local file. It returns a two-element tuple: the first element is the local file name, the second is the HTTP response headers returned by the web server.
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.
    """
A quick test:
import urllib.request
if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    data = urllib.request.urlretrieve('http://www.baidu.com/robots.txt', 'robots.txt')
    print('--filename--:', data[0])
    print('--response--:', data[1])
    print('\nMain Thread Exit :', __name__)
Main Thread Run : __main__
--filename--: robots.txt
--response--: Date: Mon, 22 Sep 2014 08:08:05 GMT
Server: Apache
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Set-Cookie: BAIDUID=4FB847BEE916A0F72ABC5093271CD2BC:FG=1; expires=Tue, 22-Sep-15 08:08:05 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
Last-Modified: Thu, 17 Jul 2014 07:10:38 GMT
ETag: "91e-4fe5e56791780"
Accept-Ranges: bytes
Content-Length: 2334
Vary: Accept-Encoding,User-Agent
Connection: Close
Content-Type: text/plain

Main Thread Exit : __main__
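urlretrieve also accepts a reporthook callback for progress reporting, and it understands file:// URLs, which lets us sketch both offline (the temporary file names here are for illustration only, not part of the original example):

```python
import os
import tempfile
import urllib.request

# Create a small local file so the example needs no network access;
# urlretrieve handles file:// URLs just like http:// ones.
src = tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False)
src.write(b'hello urlretrieve')
src.close()

progress = []
def reporthook(blocknum, blocksize, totalsize):
    # Called once before the transfer starts and once per block read.
    progress.append((blocknum, blocksize, totalsize))

dest = src.name + '.copy'
filename, headers = urllib.request.urlretrieve(
    'file://' + src.name, dest, reporthook)

print(filename == dest)        # True: first tuple element is the local name
with open(dest, 'rb') as f:
    print(f.read())            # b'hello urlretrieve'
print(len(progress) >= 1)      # True: the hook fired at least once

os.remove(src.name)
os.remove(dest)
```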
3. request_host
The request_host function extracts the host from a URL. Its only parameter is a Request instance (the Request class is introduced below). Its source:
def request_host(request):
    """Return request-host, as defined by RFC 2965.
    Variation from RFC: returned value is lowercased, for convenient
    comparison.
    """
    url = request.full_url
    host = urlparse(url)[1]
    if host == "":
        host = request.get_header("Host", "")
    # remove port, if present
    host = _cut_port_re.sub("", host, 1)
    return host.lower()
A test:
import urllib.request
if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    Req = urllib.request.Request('http://www.baidu.com/robots.txt')
    host = urllib.request.request_host(Req)
    print(host)
    print('\nMain Thread Exit :', __name__)
Output:
Main Thread Run : __main__
www.baidu.com

Main Thread Exit : __main__
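Because building a Request performs no network I/O, request_host can be exercised entirely offline; note from the source that it strips the port and lowercases the host (the example.com URL below is made up for illustration):

```python
import urllib.request

# Constructing a Request does not contact the server.
req = urllib.request.Request('http://WWW.Example.COM:8080/robots.txt')
host = urllib.request.request_host(req)
print(host)  # www.example.com -- port removed, host lowercased
```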
4. The Request class
Next, the module's main class: Request. Note the capital R. First the source:
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        self.full_url = url
        self.headers = {}
        self.unredirected_hdrs = {}
        self._data = None
        self.data = data
        self._tunnel_host = None
        for key, value in headers.items():
            self.add_header(key, value)
        if origin_req_host is None:
            origin_req_host = request_host(self)
        self.origin_req_host = origin_req_host
        self.unverifiable = unverifiable
        if method:
            self.method = method
    @property
    def full_url(self):
        if self.fragment:
            return '{}#{}'.format(self._full_url, self.fragment)
        return self._full_url
    @full_url.setter
    def full_url(self, url):
        # unwrap('<URL:type://host/path>') --> 'type://host/path'
        self._full_url = unwrap(url)
        self._full_url, self.fragment = splittag(self._full_url)
        self._parse()
    @full_url.deleter
    def full_url(self):
        self._full_url = None
        self.fragment = None
        self.selector = ''
    @property
    def data(self):
        return self._data
    @data.setter
    def data(self, data):
        if data != self._data:
            self._data = data
            # issue 16464
            # if we change data we need to remove content-length header
            # (cause it's most probably calculated for previous value)
            if self.has_header("Content-length"):
                self.remove_header("Content-length")
    @data.deleter
    def data(self):
        self.data = None
    def _parse(self):
        self.type, rest = splittype(self._full_url)
        if self.type is None:
            raise ValueError("unknown url type: %r" % self.full_url)
        self.host, self.selector = splithost(rest)
        if self.host:
            self.host = unquote(self.host)
    def get_method(self):
        """Return a string indicating the HTTP request method."""
        default_method = "POST" if self.data is not None else "GET"
        return getattr(self, 'method', default_method)
    def get_full_url(self):
        return self.full_url
    def set_proxy(self, host, type):
        if self.type == 'https' and not self._tunnel_host:
            self._tunnel_host = self.host
        else:
            self.type= type
            self.selector = self.full_url
        self.host = host
    def has_proxy(self):
        return self.selector == self.full_url
    def add_header(self, key, val):
        # useful for something like authentication
        self.headers[key.capitalize()] = val
    def add_unredirected_header(self, key, val):
        # will not be added to a redirected request
        self.unredirected_hdrs[key.capitalize()] = val
    def has_header(self, header_name):
        return (header_name in self.headers or
                header_name in self.unredirected_hdrs)
    def get_header(self, header_name, default=None):
        return self.headers.get(
            header_name,
            self.unredirected_hdrs.get(header_name, default))
    def remove_header(self, header_name):
        self.headers.pop(header_name, None)
        self.unredirected_hdrs.pop(header_name, None)
    def header_items(self):
        hdrs = self.unredirected_hdrs.copy()
        hdrs.update(self.headers)
        return list(hdrs.items())
def __init__(self, url, data=None, headers={},
             origin_req_host=None, unverifiable=False,
             method=None):
Note the key parameters: url is the address to access, data is the payload to send with a POST, headers is a dict of extra fields to include in the HTTP request headers, and method selects the HTTP method. When method is not given, the default is POST if data is supplied and GET otherwise.
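This default, and the header handling, can be checked offline since building a Request sends nothing over the network:

```python
import urllib.request

# With no data the request defaults to GET...
req = urllib.request.Request('http://www.baidu.com/robots.txt')
print(req.get_method())       # GET

# ...and supplying data (bytes) switches the default to POST.
req_post = urllib.request.Request('http://www.baidu.com/robots.txt',
                                  data=b'key=value')
print(req_post.get_method())  # POST

# add_header() stores keys via str.capitalize(), so lookups must match
# that form ('User-Agent' is stored as 'User-agent').
req.add_header('User-Agent', 'Mozilla/5.0')
print(req.get_header('User-agent'))  # Mozilla/5.0
```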
Req = urllib.request.Request('http://www.baidu.com/robots.txt')
For example, to add a User-Agent field to the request headers:
USER_AGENT = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
Req = urllib.request.Request(url='http://www.baidu.com/robots.txt', headers=USER_AGENT)
To change the default socket timeout:
import socket
socket.setdefaulttimeout(10)  # 10s
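Note that urlopen also takes a per-call timeout argument in seconds (visible in its signature above), which avoids changing the process-wide default. The default itself can be inspected offline:

```python
import socket

socket.setdefaulttimeout(10)       # process-wide default, in seconds
print(socket.getdefaulttimeout())  # 10.0
# Per call, urlopen(url, timeout=5) overrides this default for that request.
```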
5. Using a proxy
Next, proxies. The proxy must be configured and installed before any web request is made. Example:
import socket 
import urllib.request
socket.setdefaulttimeout(10)  # 10s
if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    proxy = urllib.request.ProxyHandler({'http':'http://www.baidu.com:8080'})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    content = urllib.request.urlopen('http://www.baidu.com/robots.txt').read()
    print('\nMain Thread Exit :', __name__)
6. Error and exception handling
Python's network functions signal failures through exceptions, handled with try/except blocks. One useful guideline: wrap each risky call in its own try/except so the failing step can be pinpointed. Example:
    from urllib.error import HTTPError, URLError

    try:
        reqUrl = urllib.request.Request(url='http://www.baidu.com/robots.txt', headers=USER_AGENT)
    except HTTPError:
        print('urllib.error.HTTPError')
    except URLError:
        print('urllib.error.URLError')
    except OSError:
        print('urllib.error.OSError')
    try:
        responseData = urllib.request.urlopen(reqUrl)
    except HTTPError:
        print('urllib.error.HTTPError')
    except URLError:
        print('urllib.error.URLError')
    except OSError:
        print('urllib.error.OSError')
    try:
        pageData = responseData.read()
    except HTTPError:
        responseData.close()
        print('urllib.error.HTTPError')
    except URLError:
        responseData.close()
        print('urllib.error.URLError')
    except OSError:
        print('urllib.error.OSError')
    print(pageData)
    responseData.close()
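The ordering of the except clauses matters: HTTPError is a subclass of URLError, which in turn derives from OSError, so the most specific class must come first or the broader clause will swallow it. A quick offline check:

```python
from urllib.error import HTTPError, URLError

# HTTPError -> URLError -> OSError: catch in most-specific-first order.
print(issubclass(HTTPError, URLError))  # True
print(issubclass(URLError, OSError))    # True
```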
7. Notes
The above covers the basic web-access functions and classes; many other functions and methods can achieve the same results, so pick whichever suits your needs. These are just my beginner study notes, kept here for later reference by newcomers and by myself.
Original (Chinese): http://blog.csdn.net/microzone/article/details/39476893