Web Crawler - Python

Date: 2014-12-07 06:53:56

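I had some free time over the weekend, so I wrote a small web crawler. First, what it does: it is a small program for grabbing articles, blog posts and the like from web pages. Start by finding the articles you want to crawl, for example Han Han's Sina blog. Open his article index and note its URL, for example http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html. Every article listed on that page has its own link, so all we need to do is follow each link and copy the article into a file on your own computer. And that's the articles crawled, haha. Enough talk, here is the code.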

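import time
import urllib.request

# Index (article list) page
index_url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'
# Decode to text so str.find can be used; the links themselves are plain ASCII
con = urllib.request.urlopen(index_url).read().decode('utf-8', 'ignore')

url = []
title = con.find('<a title=')       # position of the first '<a title='
href = con.find('href=', title)     # position of the next 'href=' after it
html = con.find('.html', href)      # position of the next '.html' after that

# The index page lists roughly 50 articles
while title != -1 and href != -1 and html != -1 and len(url) < 50:
    url.append(con[href + 6:html + 5])     # slice out the link: skip 'href="', keep '.html'
    print(url[-1])
    title = con.find('<a title=', html)    # advance to the next article entry
    href = con.find('href=', title)
    html = con.find('.html', href)

# Download only the links that were actually found
for link in url:
    content = urllib.request.urlopen(link).read()   # fetch the article page as bytes
    filename = link[-26:]                  # last 26 characters of the link serve as the file name
    with open(filename, 'wb') as f:        # write the page to a local file
        f.write(content)
    print('downloading', link)
    time.sleep(1)                          # pause one second between requests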
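The script above does its parsing and its network access in one go, so it is hard to try out without hitting the site. Below is a minimal sketch of the two pure steps, pulled out into functions that can be run on a made-up snippet. The names `extract_article_links` and `filename_from_url` are hypothetical, and the `example.com` sample HTML is invented for illustration. `filename_from_url` also swaps the fixed 26-character slice for the last path component of the URL (via `urllib.parse`), which is a substitution and not what the original script does.

```python
import posixpath
from urllib.parse import urlparse


def extract_article_links(page, limit=50):
    """Collect up to `limit` article links from index-page text.

    Uses the same find-based scan as the script: locate '<a title=',
    then the next 'href=', then the next '.html'.
    """
    links = []
    title = page.find('<a title=')
    while title != -1 and len(links) < limit:
        href = page.find('href=', title)
        html = page.find('.html', href) if href != -1 else -1
        if href == -1 or html == -1:
            break
        # Skip the 6 characters of 'href="' and keep the 5 of '.html'
        links.append(page[href + 6:html + 5])
        title = page.find('<a title=', html)
    return links


def filename_from_url(url):
    """Return the last path component of a URL (assumed alternative to url[-26:])."""
    return posixpath.basename(urlparse(url).path)


# Invented sample markup, shaped like the '<a title=... href=...>' entries the script expects
sample = (
    '<a title="First" target="_blank" href="http://example.com/s/blog_0001.html">First</a>'
    '<a title="Second" target="_blank" href="http://example.com/s/blog_0002.html">Second</a>'
)

links = extract_article_links(sample)
print(links)
print(filename_from_url(links[0]))
```

Run on the sample, this prints the two `example.com` links followed by `blog_0001.html`. Keeping these steps free of network calls means the slicing offsets can be checked on a saved copy of the index page before any article is downloaded.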

This article is from the “子夜” blog. Please do not repost.

Original: http://5939540.blog.51cto.com/5929540/1587065
