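Method 1:
Example: here I use the site "http://www.simple-style.com/page/1" as the crawl target.
$ scrapy shell http://www.simple-style.com/page/1
Once inside the interactive shell, I want to find every src attribute on the current page.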
>>> response.xpath('//@src').extract()
['http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
 'http://www.simple-style.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1',
 'http://www.simple-style.com/wp-content/plugins/to-top/public/js/to-top-public.js?ver=1.0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
 '//v.qq.com/iframe/player.html?vid=e0386mjreck&tiny=0&auto=0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/Yuanghua-Chen-01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/01-alicephoebelou.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/06/02-Tim_Gao_Photography_Invisible_Theatre_17.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
 'http://www.simple-style.com/wp-content/uploads/2016/05/01-Remona.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/Nbr-h000-1.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/0501.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/01.jpg',
 'http://www.simple-style.com/wp-content/plugins/smartideo/static/smartideo.js?ver=2.2.5',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/skip-link-focus-fix.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/navigation.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/global.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/jquery.scrollTo.js?ver=2.1.2',
 'http://www.simple-style.com/wp-includes/js/wp-embed.min.js?ver=4.7.3']
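That returns a lot of src values. If I only want the src of the jpg files uploaded under "/2017/03", a regular expression can do the filtering.
Note that the object returned by xpath() should not be passed through extract() here: calling re() on it directly returns a list of strings. Calling extract() first yields a plain Python list, which has no re() method, so the call raises an error.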
>>> response.xpath('//@src').re(r'.*/2017/03/.*\.jpg')
['http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg']
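If extract() has already been called and only a plain list is available, the same filter can be applied with Python's standard re module instead. The following is a minimal sketch, not Scrapy's own API: the list `srcs` is a shortened, hand-copied sample of the values shown above, and the helper name `filter_srcs` is hypothetical.

```python
import re

# Hypothetical sample: a few of the src values from the output above.
srcs = [
    'http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
    'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
    'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
    'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
]


def filter_srcs(values, pattern=r'.*/2017/03/.*\.jpg'):
    """Return the values in which the pattern is found (hypothetical helper)."""
    regex = re.compile(pattern)
    return [value for value in values if regex.search(value)]


matched = filter_srcs(srcs)
print(matched)
```

Only the jpg under /2017/03/ survives: the gif in the same folder fails the `\.jpg` part, and the 2016 jpg fails the path part.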
Method 2:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <li class="item-"><a href="link.html">first item</a></li>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</body>
</html>
"""
# Build a response from the HTML string so it can be queried offline.
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# re:test() applies a regular expression inside the XPath expression itself.
# Note: \d* also matches zero digits, so class="item-" is selected as well.
ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
print(ret)
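To see which class values a pattern such as `item-\d*` accepts, it can be tried on the class strings with the standard re module. This is only a sketch under the assumption that re:test() behaves like an unanchored search (as `re.search` does); the list `classes` is copied by hand from the HTML above, and the stricter `\d+` variant is an illustration that is not part of the original code.

```python
import re

# Class values taken from the three <li> elements in the sample HTML.
classes = ['item-', 'item-0', 'item-1']

# \d* allows zero digits, so every class value passes.
loose = [c for c in classes if re.search(r'item-\d*', c)]

# \d+ requires at least one digit, which excludes the bare "item-".
strict = [c for c in classes if re.search(r'item-\d+', c)]

print(loose)
print(strict)
```

If the goal is to skip the bare "item-" entry, the `\d+` form is the one to put inside re:test().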
Original post: http://www.cnblogs.com/Garvey/p/6697162.html