
路飞学城 Python Crawler Bootcamp - Chapter 1

Posted: 2018-07-05 19:30:25

1 What Is a Crawler

  Baidu Baike: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

  Python makes it quick to write a crawler that fetches the resources at a given URL. With just the requests and bs4 modules you can already scrape a great many resources.

2 requests

  The two most commonly used methods of requests are get and post.

  Since most URL access on the web happens through these two HTTP methods, they are enough to fetch the majority of web resources.

  Their main parameters are as follows:

    url: the link of the resource you want to fetch.

    headers: the request headers. Many sites deploy anti-crawler measures, so a well-disguised headers value keeps the site from recognizing that a machine is making the request.

    json: include this when the request needs to carry a JSON body.

    data: include this when the request needs to carry form data; the username and password for a site login usually go in data.

    cookies: used to identify the user. Scraping a static site does not need them, but sites that require login do.

    params: query parameters. Some URLs carry strings like id=1&user=starry; these can be written into params instead.

 

    timeout: the maximum time to wait for a response; if no resource arrives within this window, the request is aborted.

    allow_redirects: some URLs redirect to another URL; set this to False to keep the request from following the redirect.

    proxies: configure a proxy.

  These are the parameters you will use most often.
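To see how these parameters combine without touching the network, requests lets you build and inspect a request via `Request.prepare()`; the URL below is just a placeholder:

```python
import requests

# Build a request the way requests.get would, but stop before sending it,
# so the effect of `params` and `headers` can be inspected directly.
req = requests.Request(
    method="GET",
    url="https://example.com/search",
    params={"id": 1, "user": "starry"},     # merged into the query string
    headers={"User-Agent": "Mozilla/5.0"},  # a disguised request header
)
prepared = req.prepare()

print(prepared.url)                    # https://example.com/search?id=1&user=starry
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```

This is also a handy way to debug what a crawler will actually send before pointing it at a real site.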

3 Logging In to Chouti and Upvoting Automatically

1. First, visit https://dig.chouti.com to obtain a cookie:

import requests

r1 = requests.get(
    url="https://dig.chouti.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    }
)
r1_cookie_dict = r1.cookies.get_dict()

2. Visit https://dig.chouti.com/login, carrying the account, the password, and the cookie:

response_login = requests.post(
    url="https://dig.chouti.com/login",
    data={
        "phone": "xxx",
        "password": "xxx",
        "oneMonth": "1"
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)

3. Find the URL that upvotes a given post; requesting it with the cookie attached is all it takes:

rst = requests.post(
    url="https://dig.chouti.com/link/vote?linksId=20639606",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)
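As a side note, the manual cookie bookkeeping in the three steps above can also be handed to `requests.Session`, which stores and resends cookies automatically. A sketch of the same flow, with the credentials still placeholders:

```python
import requests


def chouti_upvote(phone, password, links_id):
    """Same three steps as above, but the Session carries the cookies."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    )
    # Step 1: the first visit seeds the session's cookie jar.
    session.get("https://dig.chouti.com/")
    # Step 2: log in; the seeded cookies are sent along automatically.
    session.post(
        "https://dig.chouti.com/login",
        data={"phone": phone, "password": password, "oneMonth": "1"},
    )
    # Step 3: upvote, again with cookies handled by the session.
    return session.post(
        "https://dig.chouti.com/link/vote?linksId={}".format(links_id)
    )
```

Usage would be `chouti_upvote("xxx", "xxx", 20639606)`; as with the original, real credentials are required.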

4 Logging In to Lagou and Updating Profile Information

1. To guard against automated access, Lagou requires two extra headers, X-Anit-Forge-Code and X-Anit-Forge-Token (the "Anit" misspelling is the site's own). Both values can be obtained from the login page:

import re
import requests

all_cookies = {}

r1 = requests.get(
    url="https://passport.lagou.com/login/login.html",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Host": "passport.lagou.com"
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]

2. Then post to the login URL, carrying the account, the password, and the two values above. The headers need to be fairly complete here:

r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        "Host": "passport.lagou.com",
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())
21 all_cookies.update(r2.cookies.get_dict())

3. Even with the cookies from a successful login, the profile cannot be changed yet; the user must also be granted authorization, which is obtained by visiting https://passport.lagou.com/grantServiceTicket/grant.html:

r3 = requests.get(
    url="https://passport.lagou.com/grantServiceTicket/grant.html",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())

4. The authorization triggers a chain of redirects, and we need to collect the cookies from every page in the chain. Each redirect target sits in the Location header of the previous response.

r4 = requests.get(
    url=r3.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print(r4, r4.headers["Location"])
r5 = requests.get(
    url=r4.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print(r5, r5.headers["Location"])
r6 = requests.get(
    url=r5.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print(r6, r6.headers["Location"])
r7 = requests.get(
    url=r6.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())
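The four near-identical hops above can also be collapsed into a loop that keeps following the Location header until a response arrives without one. A sketch, assuming the chain is finite and every hop answers to a plain GET:

```python
import requests

USER_AGENT = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}


def follow_redirects(response, cookies, max_hops=10):
    """Walk the redirect chain manually, merging each hop's cookies."""
    for _ in range(max_hops):
        location = response.headers.get("Location")
        if location is None:
            break  # no Location header left: the chain is finished
        response = requests.get(
            url=location,
            headers=USER_AGENT,
            allow_redirects=False,
            cookies=cookies,
        )
        cookies.update(response.cookies.get_dict())
    return response, cookies
```

With this helper, the r4 through r7 block reduces to a single call starting from r3 and all_cookies.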

5. What remains is updating the profile. The update URL must be given submitCode and submitToken; analysis shows that both can be read from the response of https://gate.lagou.com/v1/neirong/account/users/0/.

Fetching submitCode and submitToken:

r6 = requests.get(
    url="https://gate.lagou.com/v1/neirong/account/users/0/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "X-L-REQ-HEADER": "{deviceType:1}",
        "Origin": "https://account.lagou.com",
        "Host": "gate.lagou.com",
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()

Updating the profile:

r7 = requests.put(
    url="https://gate.lagou.com/v1/neirong/account/users/0/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Origin": "https://account.lagou.com",
        "Host": "gate.lagou.com",
        "X-Anit-Forge-Code": r6_json["submitCode"],
        "X-Anit-Forge-Token": r6_json["submitToken"],
        "X-L-REQ-HEADER": "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": ..., "introduce": ...}
)

The complete code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
@author: Starry
@file: 5.修改个人信息.py
@time: 2018/7/5 11:43
"""

import re
import requests


all_cookies = {}
######################### 1. Fetch the login page #########################
r1 = requests.get(
    url="https://passport.lagou.com/login/login.html",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Host": "passport.lagou.com"
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]
print(X_Anit_Forge_Token, X_Anti_Forge_Code)
######################### 2. Log in with username and password #########################
r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        "Host": "passport.lagou.com",
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())

######################### 3. Obtain user authorization #########################
r3 = requests.get(
    url="https://passport.lagou.com/grantServiceTicket/grant.html",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())

######################### 4. Follow the authentication redirects #########################
print(r3, r3.headers["Location"])
r4 = requests.get(
    url=r3.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print(r4, r4.headers["Location"])
r5 = requests.get(
    url=r4.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print(r5, r5.headers["Location"])
r6 = requests.get(
    url=r5.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print(r6, r6.headers["Location"])
r7 = requests.get(
    url=r6.headers["Location"],
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())

######################### 5. Check the personal resume page #########################

r5 = requests.get(
    url="https://www.lagou.com/resume/myresume.html",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=all_cookies
)

print("李港" in r5.text)


######################### 6. Fetch submitCode and submitToken #########################
r6 = requests.get(
    url="https://gate.lagou.com/v1/neirong/account/users/0/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        "X-L-REQ-HEADER": "{deviceType:1}",
        "Origin": "https://account.lagou.com",
        "Host": "gate.lagou.com",
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()
print(r6.json())
# ######################### 7. Update the profile #########################
r7 = requests.put(
    url="https://gate.lagou.com/v1/neirong/account/users/0/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Origin": "https://account.lagou.com",
        "Host": "gate.lagou.com",
        "X-Anit-Forge-Code": r6_json["submitCode"],
        "X-Anit-Forge-Token": r6_json["submitToken"],
        "X-L-REQ-HEADER": "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": ..., "introduce": ...}
)
print(r7.text)

5 Homework: Log In to GitHub Automatically and Fetch Account Information

Sample output (screenshot not preserved).

Source code:

# -*- coding: utf-8 -*-

"""
@author: Starry
@file: github_login.py
@time: 2018/6/22 16:22
"""

import requests
from bs4 import BeautifulSoup
import bs4


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "github.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}


def login(user, password):
    """
    Log in to GitHub with the given account and password, and return the account's cookies.
    :param user: login account
    :param password: login password
    :return: cookie dict
    """
    r1 = requests.get(
        url="https://github.com/login",
        headers=headers
    )
    cookies_dict = r1.cookies.get_dict()
    soup = BeautifulSoup(r1.text, "html.parser")
    authenticity_token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
    # print(authenticity_token)
    res = requests.post(
        url="https://github.com/session",
        data={
            "commit": "Sign in",
            "utf8": "✓",
            "authenticity_token": authenticity_token,
            "login": user,
            "password": password
        },
        headers=headers,
        cookies=cookies_dict
    )
    cookies_dict.update(res.cookies.get_dict())
    return cookies_dict


def view_information(cookies_dict):
    """
    Display the GitHub account's profile information.
    :param cookies_dict: the account's cookies
    :return: None
    """
    url_setting_profile = "https://github.com/settings/profile"
    html = requests.get(
        url=url_setting_profile,
        cookies=cookies_dict,
        headers=headers
    )
    soup = BeautifulSoup(html.text, "html.parser")
    user_name = soup.find("input", attrs={"id": "user_profile_name"}).get("value")
    user_email = []
    select = soup.find("select", attrs={"class": "form-select select"})
    for index, child in enumerate(select):
        if index == 0:
            continue
        if isinstance(child, bs4.element.Tag):
            user_email.append(child.get("value"))
    bio = soup.find("textarea", attrs={"id": "user_profile_bio"}).text
    blog = soup.find("input", attrs={"id": "user_profile_blog"}).get("value")
    company = soup.find("input", attrs={"id": "user_profile_company"}).get("value")
    location = soup.find("input", attrs={"id": "user_profile_location"}).get("value")
    print("""
    Username: {0}
    Linked email addresses: {1}
    Bio: {2}
    Homepage: {3}
    Company: {4}
    Location: {5}
    """.format(user_name, user_email, bio, blog, company, location))
    html = requests.get(
        url="https://github.com/settings/repositories",
        cookies=cookies_dict,
        headers=headers
    )
    html.encoding = html.apparent_encoding
    soup = BeautifulSoup(html.text, "html.parser")
    div = soup.find(name="div", attrs={"class": "listgroup"})
    # print(div)
    tplt = "{0:^20}\t{1:^20}\t{2:^20}"
    print(tplt.format("Repository", "Link", "Size"))
    for child in div.children:
        if isinstance(child, bs4.element.Tag):
            a = child.find("a", attrs={"class": "mr-1"})
            small = child.find("small")
            print(tplt.format(a.text, a.get("href"), small.text))


if __name__ == "__main__":
    user = "xxx"
    password = "xxx"
    cookies_dict = login(user, password)
    view_information(cookies_dict)

 6 Summary 

  This lesson has let me fetch information from a range of sites. Before, I could only scrape static pages and had no idea how to reach resources behind a login; now I can retrieve resources from sites that require logging in, and I have become much more fluent with the requests module. Most importantly, I learned a great deal about how to analyze a page before crawling it.


Original post: https://www.cnblogs.com/xingkongyihao/p/9267733.html
