
Using re with XPath in Scrapy

Date: 2017-04-12 01:38:00

Method 1:

 

Example: here I crawl the site "http://www.simple-style.com/page/1".

$ scrapy shell http://www.simple-style.com/page/1

Once inside the interactive shell, I want to find every src attribute on the current page:

>>> response.xpath('//@src').extract()
['http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
 'http://www.simple-style.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1',
 'http://www.simple-style.com/wp-content/plugins/to-top/public/js/to-top-public.js?ver=1.0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
 '//v.qq.com/iframe/player.html?vid=e0386mjreck&tiny=0&auto=0',
 'http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/Yuanghua-Chen-01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/07/01-alicephoebelou.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/06/02-Tim_Gao_Photography_Invisible_Theatre_17.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
 'http://www.simple-style.com/wp-content/uploads/2016/05/01-Remona.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/05/Nbr-h000-1.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/0501.jpg',
 'http://www.simple-style.com/wp-content/uploads/2016/04/01.jpg',
 'http://www.simple-style.com/wp-content/plugins/smartideo/static/smartideo.js?ver=2.2.5',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/skip-link-focus-fix.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/navigation.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/global.js?ver=1.0',
 'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/jquery.scrollTo.js?ver=2.1.2',
 'http://www.simple-style.com/wp-includes/js/wp-embed.min.js?ver=4.7.3']

That returns a lot of src values. To keep only the jpg files uploaded under "/2017/03" (March 2017), use a regular expression.

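Do not call extract() on the result of xpath() here. re() is a method of the selector list and itself returns a list of strings. Calling it after extract() raises an error, because extract() returns a plain Python list, which has no re() method.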

>>> response.xpath('//@src').re(r'.*/2017/03/.*\.jpg')
['http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
 'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg']
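If extract() has already been called and only the plain list of strings is left, the same filter can be applied with Python's standard re module. The following is a minimal sketch, not part of the original post: the names srcs, pattern and march_jpgs are made up for illustration, and the list holds just four of the src values shown in the listing above.

```python
import re

# Four of the src values from the listing above, as extract() would return them.
srcs = [
    "http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif",
    "http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg",
    "http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg",
    "http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4",
]

# Same pattern as in the shell session; the raw string keeps the backslash intact.
pattern = re.compile(r".*/2017/03/.*\.jpg")

# Search each string and keep the matched text of those that match.
march_jpgs = [m.group(0) for m in map(pattern.search, srcs) if m]
print(march_jpgs)
# ['http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg']
```

The .gif uploaded in the same month is dropped because the pattern requires a ".jpg" suffix, and the 2017/02 image is dropped because the path does not contain "/2017/03/".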

 

Method 2:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <li class="item-"><a href="link.html">first item</a></li>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</body>
</html>
"""
# Build a response object from the HTML string so it can be queried offline.
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# Regex selector: re:test() keeps the <li> elements whose class matches the pattern.
# \d* also matches zero digits, so class="item-" is selected as well.
ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
print(ret)  # the three hrefs: link.html, link1.html, link2.html

 


Original post: http://www.cnblogs.com/Garvey/p/6697162.html
