路飞学城 Python Web Scraping Intensive Training, Chapter 2

Date: 2018-07-07 11:07:39

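I have finished Chapter 2. Because I lacked background in web development and Flask, it was fairly tough going. That is not really a problem, though: I am learning in order to use these skills, and now that the course has shown me how necessary this background is, I will make time to catch up on it. Our class advisor is very conscientious and even sent us a set of Flask tutorial videos. My Chapter 1 assignment was praised by the teacher in the group chat and scored 90, which made me quite happy. For me, the biggest differences between studying alone and studying with a teacher are these: first, the teacher's feedback and corrections show me the weaknesses in my programs promptly, so my code becomes more disciplined; second, I am no longer studying blindly, but can proceed step by step from the basics to the harder material, and at the same time pick up the most important thing of all, a method for learning.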

Notes on Chapter 2:

When a site uses image hotlink protection, add the following to the request headers:

—Referer

—Cookie

—Host
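
As a minimal sketch of the point above, the following builds an image request carrying those three headers using only the standard library. The helper name `build_image_request`, the example URLs, and the cookie value are all hypothetical; which headers a given site actually checks varies, so treat this as a starting point rather than a recipe.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_image_request(image_url, page_url, cookie=None):
    """Build (but do not send) a request that looks like it came from page_url."""
    headers = {
        "Referer": page_url,                  # the page that embeds the image
        "Host": urlsplit(image_url).netloc,   # the host serving the image
    }
    if cookie:
        headers["Cookie"] = cookie            # session cookie, if the site requires one
    return Request(image_url, headers=headers)


req = build_image_request(
    "http://img.example.com/pics/1.jpg",      # hypothetical image URL
    "http://www.example.com/gallery.html",    # hypothetical referring page
    cookie="sessionid=abc123",                # hypothetical cookie value
)
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be run offline.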

BeautifulSoup4 built-in methods:
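
The snippets in these notes all use a `soup` object that is never defined. A minimal sketch of one way to create it, assuming the `bs4` package is installed. The HTML below is a made-up sample built around the names the notes refer to (`sister`, `link1`, `Lacie`, `Tillie`); the page actually used in the course is not shown in the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the HTML used in the course is not given in these notes.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# "html.parser" ships with Python; lxml or html5lib could be used instead if installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)          # text of the <title> tag
print(len(soup.find_all("a")))    # number of <a> tags in the sample
```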

1. name: the tag name

tag = soup.find('a')
name = tag.name       # get the tag name
print(name)
tag.name = 'span'     # set it (renames the tag)
print(soup)

2. attrs: the tag's attributes

tag = soup.find('a')
attrs = tag.attrs             # get all attributes as a dict
print(attrs)
tag.attrs = {'ik': 123}       # replace the whole attribute dict
tag.attrs['id'] = 'iiiii'     # set a single attribute
print(soup)
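
To make the attribute handling concrete, here is a small sketch on a hypothetical one-tag document (the markup is invented for illustration). It also shows two details worth knowing: `class` comes back as a list because it is a multi-valued attribute, and `tag.get(...)` returns `None` instead of raising when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, just to have something to inspect.
soup = BeautifulSoup('<a href="/x" class="sister big" id="link1">x</a>', "html.parser")
a = soup.find("a")

print(a.attrs)          # class is multi-valued, so it comes back as a list
a["id"] = "first"       # shorthand for a.attrs["id"] = "first"
del a["href"]           # remove a single attribute
print(a.get("title"))   # get() returns None for a missing attribute
print(a)
```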

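3. children: all direct children

from bs4.element import Tag
body = soup.find('body')
children = list(body.children)    # .children is a one-shot iterator, so materialise it once
print(children)                   # includes the newline text nodes
for ele in children:
    if isinstance(ele, Tag):      # skip the newline text nodes
        print(ele)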

4. descendants: all descendants (children, grandchildren, and so on)

body = soup.find('body')
v = body.descendants      # generator over every nested node: tags, their child tags, and text
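
The difference between `children` and `descendants` is easiest to see side by side. The sketch below uses an invented fragment; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with one level of nesting inside <p>.
soup = BeautifulSoup("<div><p>one <b>two</b></p><span>three</span></div>", "html.parser")
div = soup.find("div")

# Direct children only: <p> and <span>.
child_tags = [c.name for c in div.children if isinstance(c, Tag)]

# Every nested node in document order; here we separate tags from text nodes.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in div.descendants if not isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
print(descendant_strings)
```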

5. clear: remove everything inside a tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

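6. decode: convert to a string (including the current tag); decode_contents: the same, excluding the current tag

body = soup.find('body')
v = body.decode()             # string including the <body> tag itself
print(v)
v = body.decode_contents()    # string of the contents only
print(v)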

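7. encode: convert to bytes (including the current tag); encode_contents: the same, excluding the current tag

body = soup.find('body')
v = body.encode()             # bytes including the <body> tag itself
print(v)
v = body.encode_contents()    # bytes of the contents only
print(v)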
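
A short sketch of what the four methods in items 6 and 7 return, using an invented one-line fragment so the results are easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, small enough to compare the four results by eye.
soup = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")
p = soup.find("p")

with_tag = p.decode()                  # str, including the <p> tag itself
without_tag = p.decode_contents()      # str, only what is inside <p>
with_tag_bytes = p.encode()            # bytes (UTF-8 by default), including <p>
without_tag_bytes = p.encode_contents()

print(with_tag)
print(without_tag)
print(with_tag_bytes)
print(without_tag_bytes)
```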

8. find: get the first matching tag

tag = soup.find('a')
print(tag)
# text= is the older name of the string= argument (renamed in bs4 4.4.0)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')    # same filter via class_
print(tag)

9. find_all: get all matching tags

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)      # limit caps how many matches are returned
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')    # same filter via class_
print(tags)

# ####### A list means "or" #######
v = soup.find_all(name=['a', 'div'])    # name is 'a' or name is 'div'
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))

v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

# ####### Regular expressions #######
import re
rep = re.compile('p')                   # tag names containing "p"
rep = re.compile('^p')                  # tag names starting with "p"
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)           # class values containing "sister"
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

# ####### Filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')     # has both a class and an id attribute
v = soup.find_all(name=func)
print(v)

# ## get: read a tag attribute
tag = soup.find('a')
v = tag.get('id')
print(v)
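
The three kinds of filter (list, regular expression, function) can be compared on one small document. The HTML below is invented for illustration. One detail that is easy to miss: bs4 applies a compiled pattern with `search()`, so `re.compile('p')` matches any tag name containing "p", while `re.compile('^p')` matches only names that start with it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with a mix of tag names and attributes.
html = '<div><p id="a" class="x">1</p><span>2</span><pre>3</pre><a id="b">4</a></div>'
soup = BeautifulSoup(html, "html.parser")

# List filter: the name is "a" OR "span" (results come back in document order).
names_list = [t.name for t in soup.find_all(name=["a", "span"])]

# Regex filter: matched with search(), so "p" also hits "span" and "pre".
names_contains_p = [t.name for t in soup.find_all(name=re.compile("p"))]
names_starts_p = [t.name for t in soup.find_all(name=re.compile("^p"))]


def has_class_and_id(tag):
    """Function filter: keep tags that carry both a class and an id."""
    return tag.has_attr("class") and tag.has_attr("id")


both = [t.name for t in soup.find_all(has_class_and_id)]

print(names_list, names_contains_p, names_starts_p, both)
```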

10. get_text: get the text inside a tag

tag = soup.find('a')
v = tag.get_text()      # an optional first argument is a separator string, not an attribute name
print(v)
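
The first positional argument of `get_text` is a separator placed between the text fragments, and `strip=True` trims each fragment before joining. A small sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment whose text is split across three text nodes.
soup = BeautifulSoup("<p>Hello, <b>world</b> !</p>", "html.parser")
p = soup.find("p")

plain = p.get_text()                    # fragments joined with no separator
piped = p.get_text("|")                 # "|" inserted between fragments
stripped = p.get_text("|", strip=True)  # each fragment stripped before joining

print(repr(plain))
print(repr(piped))
print(repr(stripped))
```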

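11. index: position of a child within its parent tag

tag = soup.find('body')
v = tag.index(tag.find('div'))    # the element must be a direct child of tag, else ValueError
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):       # iterating a tag yields its direct children
    print(i, v)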
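
`index` only looks at direct children, which matters because `find` searches the whole subtree. The sketch below, on an invented fragment, shows both the normal case and the `ValueError` raised for a nested element.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div> is a direct child of <body>, <span> is nested deeper.
soup = BeautifulSoup("<body><h1>t</h1><div><span>s</span></div></body>", "html.parser")
body = soup.find("body")
div = body.find("div")
span = body.find("span")

div_position = body.index(div)    # position of <div> among body's direct children

try:
    body.index(span)              # <span> is a grandchild, not a direct child
    nested_found = True
except ValueError:
    nested_found = False

print(div_position, nested_found)
```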

12. is_empty_element: whether the tag is an empty (void) or self-closing tag

True for tags such as: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')
v = tag.is_empty_element
print(v)
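
A brief sketch of how `is_empty_element` behaves with the built-in `html.parser`, on an invented fragment. Note that a `<p></p>` with nothing inside is still not an "empty element" in this sense, because `p` is not a void tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one void tag, one <p> without content, one <p> with content.
soup = BeautifulSoup("<div><br/><p></p><p>text</p></div>", "html.parser")
br = soup.find("br")
empty_p, full_p = soup.find_all("p")

print(br.is_empty_element)       # void tag
print(empty_p.is_empty_element)  # no content, but <p> is not a void tag
print(full_p.is_empty_element)   # has content
```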

13. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")            # a space searches all descendants; "> a" searches direct children only
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")  # "#" selects by id, here id=link1
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")           # an <a> tag whose id is link2
soup.select('a[href]')           # an <a> tag that has an href attribute
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')    # ^= means the value starts with this URL
soup.select('a[href$="tillie"]')                 # $= means the value ends with this
soup.select('a[href*=".com/el"]')
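
To show what a few of these selectors return, and how `select_one` differs from `select`, here is a sketch on a made-up sample page (the course's own HTML is not included in these notes). It assumes a bs4 version with CSS selector support through Soup Sieve, which recent releases install by default.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with three sibling links inside one paragraph.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Helper for this sketch: reduce a result list to the tags' id values."""
    return [t["id"] for t in tags]


second_link = ids(soup.select("p > a:nth-of-type(2)"))    # second <a> child of <p>
later_siblings = ids(soup.select("#link1 ~ .sister"))     # all following siblings
next_sibling = ids(soup.select("#link1 + .sister"))       # only the adjacent sibling
ends_with = ids(soup.select('a[href$="tillie"]'))         # href ends with "tillie"

first_match = soup.select_one(".sister")    # a single Tag, not a list
no_match = soup.select_one("table")         # None when nothing matches

print(second_link, later_siblings, next_sibling, ends_with)
print(first_match["id"], no_match)
```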

Original: https://www.cnblogs.com/shajing/p/9276632.html