Python crawler: scraping Han Han's Sina blog posts

Posted: 2015-12-12 20:09:37

Blog URL: http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Scraping the first page of posts

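# -*- coding: utf-8 -*-
# Note: this listing is Python 2; in Python 3, urlopen lives in urllib.request.
import re      # regular expression module
import urllib  # urllib library

# Address of the first page of the article list
url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

# Open the URL with urllib's urlopen();
# the step of building a Request object is omitted here
response = urllib.urlopen(url)

# Read the response body into html; at this point the page has been fetched
html = response.read()

# Uncomment to print the fetched HTML to the terminal
# print(html)

# Regular expression: group 1 is the href value, group 2 is the link text
pattern = re.compile('<a title=.*?href=(.*?)>(.*?)</a>', re.S)

# findall returns every (href, text) pair found in the fetched HTML
blog_address = re.findall(pattern, html)
for i in blog_address:
    print(i[0])  # first group: the post URL
    print(i[1])  # second group: the post title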
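
The listing above skips building a Request object. As a rough sketch only, here is how the same first-page scrape might look in Python 3 with that step written out. The function names, the User-Agent value, the timeout, and the quote stripping are my own choices for illustration and are not part of the original script; UTF-8 decoding is assumed.

```python
import re
import urllib.request

LIST_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

# Same expression as in the original listing, compiled once.
LINK_PATTERN = re.compile(r'<a title=.*?href=(.*?)>(.*?)</a>', re.S)


def build_request(url):
    """Build the Request object explicitly (the step the original omits)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    return urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})


def fetch_html(url, encoding='utf-8'):
    """Download a page and decode it; UTF-8 is an assumption."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode(encoding)


def extract_links(html):
    """Return (href, title) pairs with surrounding quotes stripped from href."""
    return [(href.strip('"\' '), title.strip())
            for href, title in LINK_PATTERN.findall(html)]


def main():
    """Fetch the first list page and print each URL and title."""
    for href, title in extract_links(fetch_html(LIST_URL)):
        print(href)
        print(title)
```

Calling main() performs the actual download; extract_links() can be tried on any HTML string without network access.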

Part of the output:

[image: partial program output]

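Problems encountered:

1. The results contain two extra entries: the first and the last match are not post links. Why?

2. Printing with print(i[0],i[1]) produces garbled output. Why?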
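
A likely explanation for both problems, sketched in Python 3 on a made-up HTML fragment (the markup is an assumption, not copied from the Sina page). The first pattern accepts any `<a title=...>` element, so links outside the post list match as well. And in Python 2, print(i[0],i[1]) prints a tuple, which shows the repr of each element; for an undecoded byte string that is a run of \x escapes rather than readable characters. The stricter pattern below, built on negated character classes, is my own alternative and not the author's fix.

```python
import re

# Hypothetical fragment: one navigation link and one post link.
SAMPLE_HTML = (
    '<a title="home" href="http://example.com/">Home</a>\n'
    '<a title="" target="_blank" href="http://example.com/post1.html">博文一</a>\n'
)

# The pattern from the first listing: matches every <a title=...> element.
LOOSE = re.compile(r'<a title=.*?href=(.*?)>(.*?)</a>', re.S)

# Stricter alternative: [^"] and [^<] cannot run past the current tag,
# and the target attribute is required between title and href.
STRICT = re.compile(
    r'<a title="[^"]*"\s+target="[^"]*"\s+href="([^"]+)">([^<]*)</a>')

loose_hits = LOOSE.findall(SAMPLE_HTML)    # navigation link and post link
strict_hits = STRICT.findall(SAMPLE_HTML)  # post link only

href, title = strict_hits[0]

# Undecoded bytes, like the result of response.read() in the first listing.
raw_title = title.encode('utf-8')

# Displaying a tuple uses repr() on each element, so the bytes show as escapes.
as_tuple = repr((href, raw_title))

# Decoding first and formatting the values as text keeps them readable.
as_text = '%s %s' % (href, raw_title.decode('utf-8'))

print(len(loose_hits), len(strict_hits))
print(as_tuple)
print(as_text)
```

Printing each value on its own, as the listings here do, sidesteps the tuple display entirely.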

Handling multiple pages with a while loop

# -*- coding: utf-8 -*-
# Python 2, like the first listing
import re
import urllib

page = 1
while page <= 7:  # loop over list pages 1 through 7
    # Build the URL of the current list page
    url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    response = urllib.urlopen(url)
    # Decode the raw bytes so the titles print as text
    html = response.read().decode('utf-8')
    # print(html)
    # The pattern now also requires target= between title= and href=
    pattern = re.compile('<a title=.*?target=.*?href=(.*?)>(.*?)</a>', re.S)
    blog_address = re.findall(pattern, html)
    for i in blog_address:
        print(i[0])  # post URL
        print(i[1])  # post title
    page = page + 1
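
For comparison, a minimal Python 3 sketch of the same multi-page loop, using range() in place of the manual counter. The names page_urls, fetch_html and crawl are hypothetical helpers introduced here; the page range 1 to 7 and the pattern are taken from the listing above, and the injectable fetch argument is only there so the loop can be tried without network access.

```python
import re
import urllib.request

URL_TEMPLATE = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_{page}.html'
LINK_PATTERN = re.compile(r'<a title=.*?target=.*?href=(.*?)>(.*?)</a>', re.S)


def page_urls(first=1, last=7):
    """Yield the list-page URLs for pages first..last inclusive."""
    for page in range(first, last + 1):
        yield URL_TEMPLATE.format(page=page)


def fetch_html(url):
    """Download one page and decode it as UTF-8, as the listing above does."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode('utf-8')


def crawl(fetch=fetch_html, first=1, last=7):
    """Collect (href, title) pairs from every list page.

    fetch is a parameter so a stub can replace the real download.
    """
    results = []
    for url in page_urls(first, last):
        results.extend(LINK_PATTERN.findall(fetch(url)))
    return results
```

Iterating over crawl() and printing each href and title reproduces the behavior of the while loop.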

The last part of the output:

[image: final part of the program output]


Original article: http://www.cnblogs.com/wujiadong2014/p/5041705.html
