
A small crawler: scraping Lianjia second-hand housing listings

Posted: 2019-06-26 21:39:58

Use Python to scrape the listing information from Lianjia's second-hand housing section, including the community name, floor plan, decoration status, unit price, and total price.

Requirements:

  1. Tools: urllib.request (requests), re (parsing), csv (storage)

  2. Programming paradigm: object-oriented

  3. Anti-anti-crawling measures: use time and random to throttle the request frequency, and spoof the User-Agent request header
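The measures in point 3 come down to two tricks: pause a random interval between requests, and send a browser-like User-Agent header. A minimal sketch of the idea (the helper name `polite_request` and the delay bounds are illustrative, not part of the original code):

```python
import random
import time
from urllib import request


def polite_request(url, min_delay=1, max_delay=3):
    # Throttle request frequency with a random pause, then
    # attach a spoofed User-Agent so the request looks browser-like.
    time.sleep(random.randint(min_delay, max_delay))
    return request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


req = polite_request("https://hf.lianjia.com/ershoufang/pg1")
```

Note that `urllib.request.Request` stores header keys in capitalized form, so the header can be read back with `req.get_header("User-agent")`.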

The code is as follows:

from urllib import request
import re
import csv
import time
import random


class LianjiaSpider(object):
    def __init__(self, pages=1):
        self.base_url = "https://hf.lianjia.com/ershoufang/pg{}"
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.pages = pages

    # Send the request and get the response
    def get_page(self, url):
        req = request.Request(url, headers=self.headers)
        res = request.urlopen(req)
        html = res.read().decode("utf-8")
        self.parse_page(html)

    # Parse the response content
    def parse_page(self, html):
        pattern = re.compile(
            r'<div class="houseInfo".*?data-el="region">(.*?)</a>(.*?)</div>'
            r'.*?<div class="totalPrice".*?<span>(.*?)</span>',
            re.S)
        house_list = pattern.findall(html)
        for house in house_list:
            print(house[1].split("|"))
        self.save_csv(house_list)

    # Save the parsed content to CSV
    def save_csv(self, house_list):
        house_new_list = []
        for house in house_list:
            house = (house[0].strip(),
                     house[1].split("|")[1].strip(),
                     house[1].split("|")[2].strip()[:-2],  # drop the trailing "平米" (square metres) unit
                     house[1].split("|")[4].strip(),
                     house[2].strip(),)
            house_new_list.append(house)

        with open("lianjia_ershoufang.csv", "a+", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(house_new_list)

    def main(self):
        for page in range(1, self.pages + 1):
            url = self.base_url.format(str(page))
            self.get_page(url)
            print("Page %d downloaded successfully" % page)
            time.sleep(random.randint(1, 10))


if __name__ == "__main__":
    # Write the CSV header row once before crawling starts
    with open("lianjia_ershoufang.csv", "a+", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Community", "Layout", "Area", "Decoration", "Total price"])
    spider = LianjiaSpider(100)
    spider.main()
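To see what the regular expression in `parse_page` actually captures, it can be run against a small hand-written fragment that imitates Lianjia's listing markup (the fragment below is an assumption for illustration; the real page structure may differ or change over time):

```python
import re

# Hypothetical fragment imitating Lianjia's houseInfo / totalPrice markup
html = '''<div class="houseInfo"><span class="houseIcon"></span>
<a data-el="region">Some Estate </a> | 3室2厅 | 120.5平米 | 南 | 精装 | 有电梯</div>
<div class="totalPrice"><span>250</span>万</div>'''

pattern = re.compile(
    r'<div class="houseInfo".*?data-el="region">(.*?)</a>(.*?)</div>'
    r'.*?<div class="totalPrice".*?<span>(.*?)</span>',
    re.S)

for name, info, price in pattern.findall(html):
    fields = [x.strip() for x in info.split("|")]
    # fields[1] = layout, fields[2] = area (unit stripped), fields[4] = decoration
    print(name.strip(), fields[1], fields[2][:-2], fields[4], price)
```

This prints `Some Estate 3室2厅 120.5 精装 250`, which is exactly the tuple shape that `save_csv` expects: the pipe-separated `houseInfo` string is split, and indices 1, 2, and 4 pick out the layout, area, and decoration fields.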


Original: https://www.cnblogs.com/yuxiangyang/p/11093471.html
