Understanding How Web Crawlers Work

Date: 2019-03-25 13:36:03

1. How a crawler works

  A crawler sends a request to a website, retrieves the page resources, and extracts the data it needs.
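The three steps (send a request, receive the page, extract data) can be sketched with the standard library alone. This is only an illustration under assumptions: the names `TitleParser`, `fetch`, `extract_title`, and `crawl` are hypothetical, and the page title stands in for "the data it needs".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Steps 1 and 2: send the request and return the page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Step 3: pull the wanted data (here, the page title) out of the HTML."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def crawl(url, fetcher=fetch):
    """Run the whole pipeline; `fetcher` can be swapped out for offline use."""
    return extract_title(fetcher(url))
```

Passing a stand-in fetcher, for example `crawl(url, fetcher=lambda u: "<title>News</title>")`, exercises the extraction step without touching the network.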

2. The crawler development process

  (1) How a browser works

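    The browser sends a request to the target site; the server processes the request and returns a response, which the browser then displays.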
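The request and response cycle can be observed end to end on one machine. The sketch below is an assumption-laden toy, not how any real site is built: `HelloHandler` and `round_trip` are hypothetical names, the server is a throwaway local one, and `http.client` plays the role of the browser.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: turn each GET request into a small HTML response."""

    def do_GET(self):
        page = f"<html><body><h1>You asked for {self.path}</h1></body></html>"
        body = page.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def round_trip(path="/news"):
    """Start a local server, send one GET request, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)          # client side: send the request
        response = conn.getresponse()      # client side: receive the response
        result = response.status, response.read().decode("utf-8")
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, html = round_trip()
    print(status)
    print(html)
```

A crawler takes the client's place in this exchange: it sends the same kind of request, but parses the returned HTML instead of rendering it.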

  (2) Fetching site data with the requests library

import requests

url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
res = requests.get(url)
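A bare `requests.get` call neither checks for HTTP errors nor sets a timeout. A slightly more defensive wrapper might look like the following; `fetch_html` and its parameters are hypothetical names, and the UTF-8 default is an assumption about the target site.

```python
import requests


def fetch_html(url, encoding="utf-8", timeout=10, session=None):
    """Download a page and return its decoded text.

    `session` is optional: pass a requests.Session to reuse connections.
    """
    http = session if session is not None else requests
    res = http.get(url, timeout=timeout)
    res.raise_for_status()    # raise requests.HTTPError on 4xx/5xx responses
    res.encoding = encoding   # assumption: the site serves UTF-8 pages
    return res.text


if __name__ == "__main__":
    html = fetch_html("http://news.gzcc.cn/html/xiaoyuanxinwen/")
    print(html[:200])
```

Without the `timeout` argument, a request to an unresponsive server can block indefinitely.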

  (3) HTML code

<html>
 <body>
  <h1 id="title">Hello</h1>
  <a href="#" class="link"> This is link1</a><a href="# link2" class="link" qao=123> This is link2</a>
  <p id="info">This is info</p>
 </body>
</html>

  Parsing the page with Beautiful Soup:

  Find the HTML element whose tag is 'h1'

t = soup2.select('h1')[0].text
print(t)

  Find the HTML elements whose class name is 'link'

for link in soup2.select('.link'):
    print(link.text)

  Find the HTML element with a specific id

info = soup2.select('#info')[0].text
print(info)
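The selector snippets above use `soup2` without showing how it is created. A self-contained sketch, assuming `soup2` is built from the sample HTML with Python's built-in `html.parser` (the variable `html_sample` is a hypothetical name, and the markup is a lightly tidied copy of the sample):

```python
from bs4 import BeautifulSoup

# Assumption: a tidied copy of the sample page, stored in a string.
html_sample = """
<html>
 <body>
  <h1 id="title">Hello</h1>
  <a href="#" class="link"> This is link1</a>
  <a href="# link2" class="link"> This is link2</a>
  <p id="info">This is info</p>
 </body>
</html>
"""

soup2 = BeautifulSoup(html_sample, "html.parser")

# Tag selector: the first <h1> element.
heading = soup2.select_one("h1").get_text(strip=True)

# Class selector: every element with class="link".
links = [a.get_text(strip=True) for a in soup2.select(".link")]

# Id selector: the element with id="info".
info = soup2.select_one("#info").get_text(strip=True)

print(heading)  # Hello
print(links)    # ['This is link1', 'This is link2']
print(info)     # This is info
```

`select_one` returns the first match or `None`, which avoids the `IndexError` that `select(...)[0]` raises when nothing matches.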

3. Extracting the title, publication time, and publishing unit of a campus news article

import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0322/11042.html'
res = requests.get(url)
res.encoding = 'utf-8'  # decode the response body as UTF-8
soup = BeautifulSoup(res.text, 'html.parser')
title = soup.select('.show-title')[0].text  # article title
info = soup.select('.show-info')[0].text    # info line with the publication time and publishing unit
print(title, info)
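The `.show-info` text is a single string, so the publication time and publishing unit still have to be separated from it. The sketch below assumes the line contains fields labelled `发布时间：` (publication time) and `来源：` (source); the source does not show the actual text, so the labels, the `sample` string, and the name `parse_show_info` are all hypothetical and must be adapted to the real page.

```python
import re
from datetime import datetime


def parse_show_info(info_text):
    """Split an info line into (publication datetime, publishing unit).

    Assumption: fields look like '发布时间：2019-03-22 16:31:51' and '来源：<unit>'.
    Either value is None when its field is not found.
    """
    time_match = re.search(
        r"发布时间[:：]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", info_text
    )
    source_match = re.search(r"来源[:：]\s*(\S+)", info_text)

    published = None
    if time_match:
        published = datetime.strptime(time_match.group(1), "%Y-%m-%d %H:%M:%S")
    source = source_match.group(1) if source_match else None
    return published, source


if __name__ == "__main__":
    # Hypothetical sample line, made up for illustration only.
    sample = "发布时间：2019-03-22 16:31:51  作者：某某  来源：某某学院"
    print(parse_show_info(sample))
```

Converting the timestamp to a `datetime` object makes it straightforward to sort or filter articles by date later.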


Original: https://www.cnblogs.com/pybblog/p/10592967.html
