Python Crawler Pragmatic Series II

Posted: 2015-03-27 20:01:26

Notes:

In the previous post, we learned how to download a web page and run a simple analysis on it.

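Goal for this post:

Download the page of one company listed on Ganji (赶集网) and save it to a text file, then extract the useful company information from the page and store it in Excel. (Note: this installment is harder than the previous one.)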

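Downloading the page:

Reuse the download code from the previous post and set the initial url to "http://bj.ganji.com/fuwu_dian/354461215x/" (at the time of writing, this is the first company in the first column on Ganji). Running it produces file.txt, a text file of about 65 KB that holds the company's page.

Code: omitted.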
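
For readers who do not have the previous post at hand, the download step might look roughly like the sketch below. It is not the code from that post: it is written in Python 3 syntax (an assumption; the listing later in this post is Python 2), and the function name download and its opener parameter are hypothetical, introduced only so the function can be exercised without network access.

```python
from urllib.request import urlopen


def download(url, path="file.txt", opener=urlopen):
    """Fetch url and save the raw response bytes to path.

    Returns the number of bytes written. The opener argument is a
    hypothetical hook: any callable that takes a URL and returns an
    object with read() and close() can stand in for urlopen.
    """
    response = opener(url)
    try:
        data = response.read()
    finally:
        response.close()
    # Write bytes unchanged so that no encoding guess is made at this stage.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling download("http://bj.ganji.com/fuwu_dian/354461215x/") would write the response to file.txt, assuming the page is still reachable.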

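Analyzing the page:

The goal this time is to extract the information in the "联系店主" (contact the shop owner) module of the page at the url above: company name, service features, services offered, and so on, eight items in all (the working-hours item is skipped). As shown in the figure below:

Because the page is fairly complex, matching the whole page with regular expressions alone is hard (my skills are limited; I gave up after finding only about half of the data that way). So we turn to a more powerful tool, BeautifulSoup. To learn the tool, see "Analyzing HTML with BeautifulSoup" and "Searching HTML with Soup".

BeautifulSoup parses the whole page into a document tree. We can then reach its nodes by following the structure of the HTML tree, which is much easier than using regular expressions alone.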
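
To make the idea of walking the tree concrete, here is a minimal sketch on a tiny, made-up fragment. The markup and the company data are hypothetical and far simpler than the real Ganji page; only the ids and class names mirror the ones used in the listing in this post. It uses Python 3 syntax and the parser bundled with Python ("html.parser"), both of which are assumptions rather than what the original script ran with.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real page is much larger.
html = """
<html><body>
<div id="wrapper">
  <div class="clearfix">
    <div id="dzcontactus">
      <div class="con">
        <ul>
          <li><h1>Example Moving Co.</h1></li>
          <li><p>Fast and careful</p></li>
          <li><a href="#">Home moving</a><a href="#">Office moving</a></li>
        </ul>
      </div>
    </div>
  </div>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Descend step by step: wrapper div -> contact block -> "con" div -> ul -> li items.
wrapper = soup.find(id="wrapper")
contact = wrapper.find(id="dzcontactus")
items = contact.find(attrs={"class": "con"}).find("ul").find_all("li")

# contents[0] is the first child node of a tag, here its text.
company_name = items[0].find("h1").contents[0]
feature = items[1].find("p").contents[0]
services = [a.contents[0] for a in items[2].find_all("a")]
```

Each find call narrows the search to the subtree of the previous result, so a value is located by its position in the document rather than by a pattern over the raw text.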

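To store the extracted information in Excel, we use xlwt (an extension library for writing Excel files). For reading and writing Excel from Python, see "Reading and Writing Excel with Python".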

Code:

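# -*- coding: utf-8 -*-
# Python 2 code
import re
import sys

from bs4 import BeautifulSoup
import xlwt

# Python 2 only workaround: make utf-8 the default codec used for
# implicit conversions between str and unicode
reload(sys)
sys.setdefaultencoding('utf-8')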

def analysis():
    '''
    Parse the saved page source and extract the company information.
    '''
    # Read the whole file; the with block closes the file object afterwards
    with open("file.txt", 'r') as f:
        html = f.read()

    # Build a BeautifulSoup parse tree and walk it in this order:
    # soup --> body --> (div with id "wrapper") --> (div with class "clearfix")
    # --> (div with id "dzcontactus") --> (div with class "con") --> ul --> (each li under the ul)
    soup = BeautifulSoup(html)
    body = soup.body  # same as soup.find('body'); not used below
    wrapper = soup.find(id="wrapper")
    # Index 6 depends on this particular page's layout
    clearfix = wrapper.find_all(attrs={'class': 'clearfix'})[6]
    dzcontactus = clearfix.find(id="dzcontactus")
    con = dzcontactus.find(attrs={'class': 'con'})
    ul = con.find('ul')
    li = ul.find_all('li')

    # All information for one company, stored in a dict so that values can be
    # read back by key; a list would also work
    record = {}

    # Company name
    record['companyName'] = li[1].find('h1').contents[0]

    # Service features
    record['serviceFeature'] = li[2].find('p').contents[0]

    # Services offered: one entry per link
    serviceProvider = []
    for service in li[3].find_all('a'):
        serviceProvider.append(service.contents[0])
    record['serviceProvider'] = serviceProvider

    # Service area: one entry per link
    serviceScope = []
    for scope in li[4].find_all('a'):
        serviceScope.append(scope.contents[0])
    record['serviceScope'] = serviceScope

    # Contact person: strip surrounding whitespace, keep as utf-8 bytes
    contacts = li[5].find('p').contents[0]
    record['contacts'] = str(contacts).strip().encode("utf-8")

    # Business address: remove the HTML tags from the <p> element
    addressResultSet = li[6].find('p')
    re_h = re.compile(r'</?\w+[^>]*>')  # matches an HTML tag
    address = re_h.sub('', str(addressResultSet))
    record['address'] = address.encode("utf-8")

    # Business QQ: first run of 5 to 10 digits inside the li (li[7], working hours, is skipped)
    qq_regex = r'(\d{5,10})'
    qqNum = re.search(qq_regex, str(li[8]))
    record['qqNum'] = qqNum.group()

    # Phone number
    phoneNum = li[9].find('p').contents[0]
    record['phoneNum'] = int(phoneNum)

    # Company website
    record['companySite'] = li[10].find('a').contents[0]

    return record

def writeToExcel(record):
    '''
    Write one company record to the first row of xinrui.xls.
    Note: record['serviceProvider'] is extracted by analysis() but is not written here.
    '''
    wb = xlwt.Workbook()
    ws = wb.add_sheet('CompanyInfoSheet')

    # Company name
    ws.write(0, 0, record['companyName'])

    # Service features
    ws.write(0, 1, record['serviceFeature'])

    # Service area, joined into one comma-separated cell
    ws.write(0, 2, ','.join(record['serviceScope']))

    # Contact person (stored as utf-8 bytes, so decode before writing)
    ws.write(0, 3, record['contacts'].decode("utf-8"))

    # Business address (stored as utf-8 bytes, so decode before writing)
    ws.write(0, 4, record['address'].decode("utf-8"))

    # QQ number
    ws.write(0, 5, record['qqNum'])

    # Phone number, written as text
    phoneNum = str(record['phoneNum']).encode("utf-8")
    ws.write(0, 6, phoneNum.decode("utf-8"))

    # Company website
    ws.write(0, 7, record['companySite'])
    wb.save('xinrui.xls')


if __name__ == '__main__':
    writeToExcel(analysis())
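
The two regular expressions in the listing can be tried in isolation. The sketch below reuses the same patterns, wrapped in two hypothetical helpers (strip_tags and find_qq are names introduced here, not part of the original script), and runs them on made-up fragments. It is written in Python 3 syntax, which is an assumption; the patterns themselves behave the same way in Python 2.

```python
import re

TAG_RE = re.compile(r"</?\w+[^>]*>")  # same tag pattern as in the listing
QQ_RE = re.compile(r"\d{5,10}")       # same digit-run pattern as in the listing


def strip_tags(fragment):
    """Remove every opening and closing HTML tag, keeping the text between them."""
    return TAG_RE.sub("", fragment)


def find_qq(fragment):
    """Return the first run of 5 to 10 digits, or None when there is none."""
    match = QQ_RE.search(fragment)
    return match.group() if match else None


# Made-up fragments, for illustration only.
address = strip_tags('<p>Chaoyang District, <a href="#">Beijing</a></p>')
qq = find_qq('<li>QQ: <a href="tencent://message/?uin=123456789">chat</a></li>')
missing = find_qq("<li>no number here</li>")
```

Unlike the listing, find_qq returns None when nothing matches; calling group() directly on a failed search, as the listing does, would raise AttributeError on a page without a QQ number.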

Screenshot of the resulting Excel file:

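Lessons learned:

I ran into many problems along the way. The worst was encoding: the script kept raising "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 108: ordinal not in range(128)". I tried quite a few suggested fixes and none of them worked. In the end I fell back on the encode and decode methods of the string class, which finally got Chinese text stored correctly.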
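
The error itself can be reproduced in a few lines. The sketch below uses Python 3 syntax (an assumption; the script in this post is Python 2), where the decode has to be requested explicitly. Python 2 performs the same ASCII decode implicitly whenever a byte string holding non-ASCII data is mixed with a unicode string, which is why the error can show up far from any visible decode call. The sample text is made up.

```python
text = "公司"                 # a Unicode string
data = text.encode("utf-8")   # its UTF-8 bytes: b'\xe5\x85\xac\xe5\x8f\xb8'

# Decoding those bytes as ASCII fails on the first byte, 0xe5.
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the codec that produced the bytes round-trips cleanly.
restored = data.decode("utf-8")
```

The rule of thumb this suggests: encode once when text leaves the program, decode once when bytes enter it, and always name the codec.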

I have heard that Python 3.x distinguishes Unicode str from byte arrays and that its default encoding is no longer ASCII (it seems about time to move to 3).
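
A short Python 3 sketch of that distinction, for illustration only (the sample text is made up):

```python
import sys

default_codec = sys.getdefaultencoding()  # 'utf-8' on Python 3

text = "赶集网"              # str: a sequence of Unicode code points
raw = text.encode("utf-8")   # bytes: the encoded form, 3 bytes per character here

# Python 3 refuses to mix the two types instead of guessing a codec.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because str and bytes never convert into each other silently, the kind of hidden ASCII decode described above cannot happen; a conversion only takes place where encode or decode is written out.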

To be continued.

Original: http://blog.csdn.net/whiterbear/article/details/44680089
