
Usage of XPath, the Crawler's Power Tool (Part 4)

Posted: 2016-03-06 14:19:30

Scraping a Baidu Tieba thread with XPath and a thread pool. The script below fetches several pages of one thread in parallel, pulls each reply's author, time, and body out of the HTML with XPath, and appends them to content.txt.

# encoding=utf-8
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json
import sys

# Python 2: make utf-8 the default codec so non-ASCII text can be written to the file
reload(sys)

sys.setdefaultencoding('utf-8')

'''Delete content.txt before re-running: the file is opened in append mode, so old output accumulates.'''

def towrite(contentdict):
    f.writelines(u'Reply time: ' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'Reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'Replied by: ' + contentdict['user_name'] + '\n\n')

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # every reply sits in a div with exactly this class string (note the trailing spaces)
    content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright  "]')
    item = {}
    for each in content_field:
        # the div's data-field attribute carries JSON metadata about the reply
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot;', ''))
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/div[@class="d_post_content j_d_post_content  clearfix"]/text()')[0]
        reply_time = reply_info['content']['date']
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)          # four worker threads
    f = open('content.txt', 'a')  # module-level, so towrite() can see it
    page = []
    for i in range(1, 10):        # pages 1..9 of the thread
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)

    results = pool.map(spider, page)  # fetch and parse pages in parallel
    pool.close()
    pool.join()
    f.close()
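
A note on the pool: multiprocessing.dummy exposes the multiprocessing Pool API but runs its workers as threads rather than processes. That suits this I/O-bound job, since the GIL is released while requests waits on the network, so the four threads really do overlap their downloads.

Each reply div in the Tieba markup of that era carried a data-field attribute holding JSON, which json.loads turns into a nested dict. Purely for illustration (the key paths come from the code above; the values and any omitted keys are invented), the parsed object looks roughly like this:

# Illustrative shape only: the key paths ('author'/'user_name' and
# 'content'/'date') are taken from the script; the values are invented.
reply_info = {
    "author": {"user_name": "some_user"},
    "content": {"date": "2016-03-06 14:19"},
}
reply_info['author']['user_name']   # -> 'some_user'
reply_info['content']['date']       # -> '2016-03-06 14:19'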

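The script above is Python 2 (reload(sys), sys.setdefaultencoding, unicode). Here is a minimal sketch of the same approach on Python 3, using concurrent.futures in place of multiprocessing.dummy. The XPath expressions are copied unchanged and assume the 2016 Tieba markup, which may have changed since.

# encoding=utf-8
# Python 3 sketch; same thread id as the article, markup assumptions as above.
import json
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

def spider(url):
    selector = etree.HTML(requests.get(url).text)
    rows = []
    for each in selector.xpath('//div[@class="l_post j_l_post l_post_bright  "]'):
        info = json.loads(each.xpath('@data-field')[0])
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content  clearfix"]/text()')
        rows.append((info['content']['date'],
                     info['author']['user_name'],
                     content[0].strip() if content else ''))
    return rows

if __name__ == '__main__':
    urls = ['http://tieba.baidu.com/p/3522395718?pn=%d' % i for i in range(1, 10)]
    with ThreadPoolExecutor(max_workers=4) as pool, \
         open('content.txt', 'a', encoding='utf-8') as f:
        for rows in pool.map(spider, urls):
            for date, name, text in rows:
                f.write('Reply time: %s\nReply content: %s\nReplied by: %s\n\n'
                        % (date, text, name))

Returning the parsed rows and writing them in the main thread also sidesteps the original design's shared file handle, where four threads append to content.txt concurrently.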
Original: http://www.cnblogs.com/gide/p/5247146.html
