Web Crawlers: Case Studies

Posted: 2019-12-18 00:31:29

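1. What is a crawler?

2. What to pay attention to in the HTTP protocol

2.1 What to watch in the request

URL: where the request is sent

Method:

GET: data is carried in the query string, url?key=value&key=value

POST: data is carried in the request body, which may be form data, a file upload, or JSON

Headers:

Cookie: carries the user's login state

User-Agent: tells the server what client you are

Referer: tells the server which page you came from

Custom fields required by the server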
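The request parts listed above can be sketched with the standard library alone. This is a minimal illustration rather than a recommended client: the header values, the form fields, and the http://httpbin.org/post endpoint are assumptions made for the example, and the requests are only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical query parameters, used only for illustration.
query = {"key1": "value1", "key2": "value2"}

# GET: the data is appended to the URL as a query string.
get_url = "http://httpbin.org/get?" + urlencode(query)

# Headers that servers commonly inspect (all values are made up).
headers = {
    "User-Agent": "Mozilla/5.0 (example)",
    "Referer": "http://httpbin.org/",
    "Cookie": "sessionid=abc123",
}
get_request = Request(get_url, headers=headers, method="GET")

# POST with form data: the body is the URL-encoded fields as bytes.
form_body = urlencode({"user": "alice", "pwd": "secret"}).encode("utf-8")
form_request = Request(
    "http://httpbin.org/post",  # assumed endpoint, not named in the notes
    data=form_body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

print(get_request.get_method(), get_request.full_url)
print(form_request.get_method(), form_request.data)
```

Building the objects without sending them makes it easy to see which part of the request each piece of data ends up in.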

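2.2 What to watch in the response

Status code:

2xx: the request succeeded, but not necessarily. The code is whatever the backend developer chose to return, so it cannot be the only test of success.

3xx: redirect

Response headers:

Location: the redirect target

Set-Cookie: sets a cookie

Custom fields defined by the server

Response body:

1. HTML code (HTML, CSS, JS)

2. JSON

3. Binary data (images, video, audio)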
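Because a 2xx status alone does not prove success, a crawler often checks the body as well. The sketch below assumes a hypothetical API convention in which a JSON field named `code` equals 0 on success; real sites use their own field names and values, so treat this only as a pattern.

```python
import json

def is_success(status_code: int, body_text: str) -> bool:
    """Return True only if the HTTP status and the JSON payload both report success."""
    if not 200 <= status_code < 300:
        return False
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Not JSON (for example an HTML page): fall back to the status code alone.
        return True
    if not isinstance(payload, dict):
        return True
    # Hypothetical convention: "code" == 0 means the backend accepted the request.
    return payload.get("code", 0) == 0

print(is_success(200, '{"code": 0, "data": []}'))
print(is_success(200, '{"code": 1001, "msg": "login required"}'))
print(is_success(302, ""))
```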

3. Common request libraries, parsing tools, and databases

3.1 Common request libraries (test site: http://httpbin.org/get)

The requests library

Install: pip install requests

Usage:

Request:

(1) GET request:

    response = requests.get(...)

    Parameters:
        url
        headers = {}
        cookies = {}        # lower priority than a Cookie field set in headers
        params = {}
        proxies = {'http': 'http://ip:port'}
        timeout = 0.5
        allow_redirects = True

(2) POST request:

    response = requests.post(...)

    Parameters:
        url
        headers = {}
        cookies = {}
        data = {}
        json = {}
        files = {'file': open(..., 'rb')}
        timeout = 0.5
        allow_redirects = False
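To see what these parameters do without touching the network, a request can be prepared but not sent. The sketch below uses `requests.Request(...).prepare()`; the header values, the field names, and the http://httpbin.org/post endpoint are assumptions for illustration. It also shows the point made above that a Cookie field in `headers` wins over the `cookies` argument.

```python
import json
import requests

# Prepare the requests without sending them, so no network access is needed.
get_req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"key": "value", "page": 1},
    headers={"User-Agent": "my-crawler/0.1", "Cookie": "from=headers"},
    cookies={"from": "cookies"},
).prepare()

json_req = requests.Request(
    "POST",
    "http://httpbin.org/post",  # assumed endpoint, not named in the notes
    json={"user": "alice"},
).prepare()

print(get_req.url)                       # params end up in the query string
print(get_req.headers["Cookie"])         # the explicit Cookie header is kept
print(json_req.headers["Content-Type"])  # set automatically for json=
print(json.loads(json_req.body))         # the serialized JSON body
```

In a real crawl the same keyword arguments go straight to `requests.get(...)` or `requests.post(...)`.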

Requests that keep cookies automatically:

    session = requests.Session()
    r = session.get(...)
    r = session.post(...)

Extra: saving cookies to a local file

    import http.cookiejar as cookielib
    session.cookies = cookielib.LWPCookieJar()
    session.cookies.save(filename='1.txt')    # write the cookies to disk

    session.cookies.load(filename='1.txt')    # read them back later
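The save and load round trip can be tried with the standard library alone. In this sketch the cookie name, value, and domain are made up, and the file goes into a temporary directory. Note that session cookies are skipped on save and load unless `ignore_discard=True` is passed.

```python
import http.cookiejar as cookielib
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (a real crawl receives these from the server)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Save: write the jar to disk, including cookies without an expiry date.
jar = cookielib.LWPCookieJar(filename=cookie_file)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: a fresh jar reads the same file back.
restored = cookielib.LWPCookieJar(filename=cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)
restored_cookies = {cookie.name: cookie.value for cookie in restored}
print(restored_cookies)
```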

Response:

    r.url
    r.text               # commonly used
    r.encoding = 'gbk'   # commonly used
    r.content            # commonly used
    r.json()             # commonly used
    r.status_code        # used less often
    r.headers
    r.cookies
    r.history
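The relationship between `r.content`, `r.encoding`, and `r.text` comes down to decoding bytes with the right codec. A small standard-library sketch, with a made-up two-character payload standing in for a GBK-encoded page:

```python
# Bytes as a GBK site would send them (this is what r.content would hold).
raw = "中文".encode("gbk")

# Decoding with the wrong codec produces replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the right codec recovers the text; setting r.encoding = 'gbk'
# makes r.text do the equivalent of this.
right = raw.decode("gbk")

print(repr(wrong))
print(repr(right))
```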

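3.2 Common parsing syntax

CSS selectors

1. Class selector

.class-name

2. ID selector

#id-value

3. Tag selector

tag-name

4. Descendant selector

selector1 selector2

5. Child selector

selector1>selector2

6. Attribute selectors

[attr]  # matches every element that has this attribute

[attr=value]  # matches every element whose attribute equals the value exactly

<h1 class="xxx yyy " ></h1>

[class="xxx yyy "]

[attr^=value]  # the attribute value starts with value

[attr$=value]  # the attribute value ends with value

[attr*=value]  # the attribute value contains value

7. Group selector (OR)

selector1,selector2

8. Compound selector (AND)

selector1selector2

p[pro="xxx"]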
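The four attribute operators are easy to confuse, so here is a small pure-Python sketch of their string semantics. It is not a selector engine: it compares one attribute value at a time, and the sample elements are invented for the example.

```python
def attr_matches(actual, op, expected):
    """Mimic the CSS attribute operators =, ^=, $= and *= on a single value."""
    if actual is None:
        # The element does not have the attribute at all.
        return False
    if op == "=":
        return actual == expected
    if expected == "":
        # In CSS, an empty value never matches with ^=, $= or *=.
        return False
    if op == "^=":
        return actual.startswith(expected)
    if op == "$=":
        return actual.endswith(expected)
    if op == "*=":
        return expected in actual
    raise ValueError(f"unsupported operator: {op}")

# Hypothetical elements, reduced to a tag name and an optional class attribute.
elements = [
    {"tag": "h1", "class": "xxx yyy "},
    {"tag": "p", "class": "xxx"},
    {"tag": "p"},
]

exact = [e["tag"] for e in elements if attr_matches(e.get("class"), "=", "xxx yyy ")]
prefix = [e["tag"] for e in elements if attr_matches(e.get("class"), "^=", "xxx")]
print(exact, prefix)
```

The exact match only succeeds when the whole attribute string is identical, trailing space included, which is why `[class="xxx yyy "]` is written that way.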

XPath selectors

Omitted.

3.3 The excellent requests-html

Install: pip install requests-html

Usage:

Request:

    from requests_html import HTMLSession

    session = HTMLSession(
        browser_args=[
            '--no-sandbox',
            '--user-agent=XXXXX',
        ]
    )

    response = session.request(...)
    response = session.get(...)
    response = session.post(...)

The parameters are exactly the same as in the requests module.

Response:

    r.url

The attributes are exactly the same as in the requests module.

Parsing:

Attributes of the html object:

    r.html.absolute_links    # /xx/yy  -->  http://www....../xx/yy
          .links             # paths exactly as written in the page
          .base_url          # the site's base URL
          .html              # decoded response content, equivalent to r.text
          .text
          .encoding = 'gbk'  # controls how r.html.html is decoded
          .raw_html          # equivalent to r.content
          .pq                # PyQuery object

Methods of the html object:

    r.html.find('css selector')                # [Element, Element, ...]
          .find('css selector', first=True)    # a single Element
          .xpath('xpath selector')
          .xpath('xpath selector', first=True)
          .search('template')                  # a Result object (first match only)
               ('xxx{}yyy{}')[0]
               ('xxx{name}yyy{pwd}')['name']
          .search_all('template')              # all matches: [Result, Result, ...]
          .render(...)                         # replaces r.html.html with the rendered page

          Parameters of render:
              script = """ () => {
                          JS code
                          JS code
                      }
                      """
              scrolldown = n
              sleep = n
              keep_page = True/False

          Bypassing a site's webdriver detection:
              '''
              () => {
                  Object.defineProperties(navigator, {
                      webdriver: {
                          get: () => undefined
                      }
                  })
              }
              '''
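The difference between `search` (first match, fields read by index or name) and `search_all` (every match) has a close analogue in the standard `re` module. The sketch below is that analogue, not requests-html itself: the HTML snippet and the field names `uid` and `name` are invented, and named regex groups stand in for `{name}` style template fields.

```python
import re

# Hypothetical page fragment.
html = '<a href="/user/42">alice</a><a href="/user/7">bob</a>'

# Each named group plays the role of a {field} in a search template.
pattern = re.compile(r'<a href="/user/(?P<uid>\d+)">(?P<name>\w+)</a>')

# Like search(): only the first match, with fields read by name.
first = pattern.search(html)
first_name = first.group("name")

# Like search_all(): every match, collected into a list.
all_matches = [match.groupdict() for match in pattern.finditer(html)]

print(first_name)
print(all_matches)
```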

Element object methods and attributes

    element.absolute_links
           .links
           .text
           .html
           .attrs
           .find('css selector')
           .search('template')
           .search_all('template')

Interacting with the browser: r.html.page.XXX

r.html.page is available after r.html.render(keep_page=True).

    async def xxx():
        await r.html.page.XXX

    session.loop.run....(xxx())

    .screenshot({'path': path, 'clip': {'x': 1, 'y': 1, 'width': 100, 'height': 100}})
    .evaluate('''() => { JS code }''')
    .cookies()
    .type('css selector', 'text', {'delay': 100})
    .click('css selector', {'button': 'left', 'clickCount': 1, 'delay': 0})
    .focus('css selector')
    .hover('css selector')
    .waitForSelector('css selector')
    .waitFor(1000)
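The page methods are coroutines, so they have to be awaited inside an `async def` and driven by an event loop. The sketch below shows only that calling pattern. `FakePage` is a hypothetical stand-in that records calls instead of driving a browser, the selectors are invented, and `run_until_complete` is used here as one standard way to run a coroutine to completion.

```python
import asyncio

class FakePage:
    """Hypothetical stand-in for r.html.page; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    async def type(self, selector, text, options=None):
        self.calls.append(("type", selector, text))

    async def click(self, selector, options=None):
        self.calls.append(("click", selector))

async def fill_and_submit(page):
    # Each page method is a coroutine, so every call is awaited in order.
    await page.type("#user", "alice", {"delay": 100})
    await page.click("#submit", {"button": "left"})
    return len(page.calls)

page = FakePage()
loop = asyncio.new_event_loop()
try:
    steps = loop.run_until_complete(fill_and_submit(page))
finally:
    loop.close()

print(steps, page.calls)
```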

Keyboard events: r.html.page.keyboard.XXX

    .down('Shift')
    .up('Shift')
    .press('ArrowLeft')
    .type('I like you', {'delay': 100})

Mouse events: r.html.page.mouse.XXX

    .click(x, y, {
        'button': 'left',
        'clickCount': 1,
        'delay': 0
    })
    .down({'button': 'left'})
    .up({'button': 'left'})
    .move(x, y, {'steps': 1})

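Common databases

MongoDB 4.0:

Download: https://www.mongodb.com/

Installation: omitted

Note: before first use, edit the configuration file mongodb.cfg in the bin directory and delete the 'mp' field on its last line.

1. Starting and stopping the service

net start mongodb

net stop mongodb

2. Creating an administrator user

mongo

use admin

db.createUser({user:"yxp",pwd:"997997",roles:["root"]})

3. Connecting to MongoDB with a username and password

mongo -u adminUserName -p userPassword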

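4. Databases

List databases

show dbs

Switch database

use db_name

Create a database

db.table1.insert({'a':1})  // a database is created once you switch to it and insert a collection with data

Drop a database

db.dropDatabase()  // switch to the database before dropping it

5. Collections (tables)

Switch to the target database first.

List collections

show tables  // lists every collection

Create a collection

db.table1.insert({'b':2})  // the collection is created if it does not exist

Drop a collection

db.table1.drop()    // drops the collection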

Documents (data)

Insert documents

db.test.insert(user0)    // insert one document
db.user.insertMany([user1,user2,user3,user4,user5])   // insert several documents

Delete documents

db.user.deleteOne({'age': 8})   // delete the first match
db.user.deleteMany({'addr.country': 'China'})  // delete every match
db.user.deleteMany({})  // delete everything

Query documents

db.user.find({'name':'alex'})   // field == value
db.user.find({'name':{"$ne":'alex'}})   // field != value
db.user.find({'_id':{'$gt':2}})    // field > value
db.user.find({"_id":{"$gte":2}})  // field >= value
db.user.find({'_id':{'$lt':3}})  // field < value
db.user.find({"_id":{"$lte":2}})  // field <= value

Update documents

db.user.update({'_id':2},{"$set":{"name":"WXX"}})   // update the matching document
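To make the comparison operators concrete, here is a small pure-Python sketch of how such a filter could be evaluated against in-memory documents. It is only an illustration of the semantics, not how MongoDB works internally: it ignores dotted paths, indexes, and type ordering, and the sample users are invented.

```python
import operator

# Comparison operators from the queries above, mapped to Python functions.
OPERATORS = {
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def holds(op, value, arg):
    """Evaluate one operator; a missing field only satisfies $ne."""
    if value is None and op != "$ne":
        return False
    return OPERATORS[op](value, arg)

def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            # {"$gt": 1, "$lte": 2} style: every operator must hold.
            if not all(holds(op, value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Plain value: equality test.
            return False
    return True

# Hypothetical documents.
users = [
    {"_id": 1, "name": "alex"},
    {"_id": 2, "name": "wxx"},
    {"_id": 3, "name": "egon"},
]

def find_ids(query):
    return [user["_id"] for user in users if matches(user, query)]

print(find_ids({"_id": {"$gt": 2}}))
print(find_ids({"name": {"$ne": "alex"}}))
```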

pymongo

import pymongo

client = pymongo.MongoClient(host=host, port=port, username=username, password=password)
db = client["db_name"]    # switch database
table = db['collection_name']
table.insert_one({})    # insert a document
table.delete_many({})   # delete documents
table.update_one({'_id': 2}, {"$set": {"name": "WXX"}})   # update a document
table.find({})    # query documents

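The history of the contest between crawlers and anti-crawlers

[Figure not preserved]

Common anti-crawler measures

1. Inspecting browser headers

2. IP bans

3. Image CAPTCHAs

4. Slider CAPTCHAs

5. JS encryption algorithms

6. JS tracking of mouse trajectories

7. Front-end anti-debugging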

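Small crawler projects

1. Scraping campus-belle photos (模仿校花的都得死)

2. Scraping Douban movies

3. A campus-belle movie as m3u8 (凉凉夜色)

4. Scraping Tmall

Anti-crawling: using technical means to keep crawler programs out.

False positives: the anti-crawler technique mistakes ordinary users for crawlers. If the false-positive rate is too high, the technique is unusable no matter how effective it is.

Cost: the manpower and machine cost that anti-crawling requires.

Interception: successfully blocking crawlers. In general, the higher the interception rate, the higher the false-positive rate.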
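The trade-off between interception and false positives can be put into numbers. The sketch below assumes the usual definitions (blocked crawlers over all crawlers, blocked real users over all real users), which the notes do not spell out, and the traffic figures are invented.

```python
def rates(blocked_crawlers, total_crawlers, blocked_users, total_users):
    """Return (interception rate, false-positive rate) for one anti-crawler rule."""
    interception_rate = blocked_crawlers / total_crawlers
    false_positive_rate = blocked_users / total_users
    return interception_rate, false_positive_rate

# Hypothetical traffic: 1,000 crawler requests and 10,000 requests from real users.
lenient = rates(blocked_crawlers=600, total_crawlers=1000,
                blocked_users=10, total_users=10000)
strict = rates(blocked_crawlers=950, total_crawlers=1000,
               blocked_users=400, total_users=10000)

print("lenient rule:", lenient)
print("strict rule:", strict)
```

In this made-up scenario the stricter rule blocks more crawlers but also turns away forty times as many real users, which is the tension described above.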

5. Analyzing Tencent Video URLs

Interface: https://p2p.1616jx.com/api/api.php?url=<address of the VIP video>

Basic mitmproxy usage:

import mitmproxy.http

class VipFilm:
    def request(self, flow: mitmproxy.http.HTTPFlow):
        # called for every captured request
        pass

    def response(self, flow: mitmproxy.http.HTTPFlow):
        # called for every captured response
        pass

addons = [
    VipFilm()
]

flow.request.headers ---- the request headers
flow.request.url ---- the request URL
flow.response.get_text() ---- read the response body
flow.response.set_text() ---- replace the response body

Run: mitmdump -s script.py

Player:
<script src="//imgcache.qq.com/open/qcloud/video/vcplayer/TcPlayer-2.3.1.js" charset="utf-8"></script>
<div class="mod_player" id="mod_player" r-notemplate="true"></div>

Passing values to the player:
var player = new TcPlayer('mod_player', {
    "m3u8": m3u8_url,   // the m3u8 playback address
    "autoplay": true,   // Safari on iOS and most mobile browsers do not allow video autoplay
    "width": '100%',    // display width; use the video's native width where possible
    "height": '100%'    // display height; use the video's native height where possible
})

6. Logging in to Zhihu

Save the cookies locally.

Using jsdom:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;

7. Hongshu novels (JS injection)

script = '''
var span_list = document.getElementsByTagName("span")
for (var i = 0; i < span_list.length; i++) {
    var content = window.getComputedStyle(
        span_list[i], ':before'
    ).getPropertyValue('content');
    span_list[i].innerText = content.replace('"', "").replace('"', "");
}
'''


Source: https://www.cnblogs.com/Gaimo/p/12057459.html
