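On the afternoon of day two and through day three, I finished a fairly simple crawler that scrapes just one official US site. It is not very robust. Using XPath for the extraction left me a little lost, because the site's markup is somewhat messy. Or perhaps I simply lack experience; I will keep filling in my knowledge in this area.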
<!-- quite a lot of markup -->
<!-- http://www.hugoboss.com/uk/extra-slim-fit-jacket-%27ryan_cyl%27-in-a-new-wool-blend/hbeu50275029.html?cgid=21600&dwvar_hbeu50275029_color=410_Dark%20Blue -->
<html>
 <head></head>
 <body>
  <div class="product-tabs tabs">
   <h2 class="visually-hidden">Additional Information</h2>
   <ul class="tabs-menu clearfix">
    <li class="active">
     <div class="outerContainer">
      <div class="innerContainer">
       <div class="element">
        <h2><a href="#tab1">Description</a></h2>
       </div>
      </div>
     </div>
    </li>
    <li>
     <div class="outerContainer">
      <div class="innerContainer">
       <div class="element">
        <h2><a href="#tab2">Details</a></h2>
       </div>
      </div>
     </div>
    </li>
    <li>
     <div class="outerContainer">
      <div class="innerContainer">
       <div class="element">
        <h2><a href="#tab3">Material & care</a></h2>
       </div>
      </div>
     </div>
    </li>
   </ul>
   <div style="height: 100px;" id="tab1" class="tab-content active">
    <a class="print-page button">Print</a>
    Sophisticated jacket (extra slim-fit) by BOSS made from a new-wool blend in a plain design. This classic, single-breasted business jacket fits perfectly thanks to the 2 side back vents, 2 waist darts at the front and 3 panel seams at the back. The top-quality finish of its design includes decorative stitching and fine felt at
    <span style="display: inline;" class="showMore">(Show more…)</span>
    <span style="display: none;" class="fullText"> the undercollar of the notch lapels, and is an exciting item in the Create your Look series. The classic styling of the 2-button jacket, with 1 breast welt pocket and 2 piped pockets with tucked-down flaps, makes simple and more complex style options possible with a variety of suit trousers or waistcoats. In this way, a totally personal look is created that can be adapted to suit different occasions. </span>
   </div>
   <div style="height: 100px;" id="tab2" class="tab-content" itemprop="description">
    <a class="print-page button">Print</a>
    Extra slim fit
    <br />New-wool blend with polyamide and elastane
    <br />Plain design
    <br />2-button jacket, 2 side vents
    <span class="showMore">(Show more…)</span>
    <span style="display: none;" class="fullText">
     <br />Single-breasted<br />Notch lapel with decorative stitching and fine felt on the undercollar<br />1 breast welt pocket, 2 piped pockets with turned down flaps<br />2 waist darts at the front, 3 panel seams at the back<br />4 kissing buttons on the cuff<br />Back length: 72 cm in size 48<br />Delivered in a HUGO BOSS garment bag
    </span>
   </div>
   <div style="height: 100px;" id="tab3" class="tab-content">
    <div class="material-info-text">
     <p class="productinfo-text"> Material information: 85% Virgin wool, 11% Polyamid, 4% Elastane, Lining: 52% Acetate, 48% Viscose, Sleeve lining: 51% Viscose, 49% Acetate </p>
     <p class="productinfo-text"> Do Not Wash, Iron Low Heat, Do Not Bleach, Reduced Dryclean P, Do Not Tumble Dry </p>
    </div>
   </div>
  </div>
 </body>
</html>
What needs to be scraped here is the description and the details. Never mind, let's go straight to the code.
@classmethod
def fetch_description(cls, response, region=None, spider=None):
    """
    Return the product description, with lines separated by '\\r'.

    The description contains a "Print" label and possibly a "Show more" label,
    so all of the text is extracted first and those labels are then replaced.

    :param response:
    :param spider:
    :return:
    """
    # Selector is Scrapy's selector class (from scrapy.selector import Selector)
    sel = Selector(response)
    description = None
    # The XPath depends on the region
    if region == 'cn':
        description_node = sel.xpath('//div[@id="lyr1"][contains(@class,"description")]')
    else:
        description_node = sel.xpath('//div[contains(@class, "product-detail")]//div[@id="tab1"]')
    if description_node:
        try:
            # Grab the full text first, then strip the labels one by one
            description = '\r'.join(cls.reformat(val) for val in description_node.xpath('.//text()').extract())
            print_node = description_node.xpath('.//*[contains(@class, "print-page")]/text()').extract()[0]
            if print_node:
                print_node = cls.reformat(print_node)
                description = description.replace(print_node, '')
            show_more_node = description_node.xpath('.//*[contains(@class, "showMore")]/text()').extract()[0]
            if show_more_node:
                show_more_node = cls.reformat(show_more_node)
                description = description.replace(show_more_node, '')
        except (TypeError, IndexError):
            pass
    description = cls.reformat(description)
    return description
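The overall flow: first check the region, since the XPath used to select the node differs by region. If the node exists, execution continues. Because xpath(...).extract() returns a list, taking a value means indexing the first element with [0]. But the list may be empty, and [0] on an empty list raises IndexError. So try ... except ... is used to catch the exception; it needs no handling here, and execution simply carries on. The crux of the problem is the code inside the try block. I revised it three times before it worked properly. The earliest version looked like this: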
if description_node:
    try:
        print_node = description_node.xpath('.//*[contains(@class, "print-page")]/text()').extract()[0]
        show_more_node = description_node.xpath('.//*[contains(@class, "showMore")]/text()').extract()[0]
        description = '\r'.join(cls.reformat(val) for val in description_node.xpath('.//text()').extract())
        if print_node:
            print_node = cls.reformat(print_node)
            description = description.replace(print_node, '')
        if show_more_node:
            show_more_node = cls.reformat(show_more_node)
            description = description.replace(show_more_node, '')
    except (TypeError, IndexError):
        pass
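It is not hard to see that this code has a serious problem. Reaching the try block means the description node exists. But if the XPath for print_node or show_more_node matches nothing, extract() returns an empty list, [0] raises IndexError, the rest of the try block is skipped, and control jumps straight into the except block, so description is never assigned. After the fix: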
if description_node:
    try:
        description = '\r'.join(cls.reformat(val) for val in description_node.xpath('.//text()').extract())
        print_node = description_node.xpath('.//*[contains(@class, "print-page")]/text()').extract()[0]
        show_more_node = description_node.xpath('.//*[contains(@class, "showMore")]/text()').extract()[0]
        if print_node:
            print_node = cls.reformat(print_node)
            description = description.replace(print_node, '')
        if show_more_node:
            show_more_node = cls.reformat(show_more_node)
            description = description.replace(show_more_node, '')
    except (TypeError, IndexError):
        pass
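Now, whenever the HTML contains a description, it is always captured. But the text still includes "Print" and possibly "Show more". When I ran it, the word "Print" sometimes showed up in the result and sometimes did not. That seemed odd at first. Then I thought the HTML might vary from page to page, so that the XPath failed to extract print_node. But debugging with scrapy shell url showed that "Print" could be extracted. Stepping through the code, I found that right after the show_more_node line, execution jumped straight into the except block. Then it dawned on me: this description had no "Show more", so the remaining replacement code never ran. So I changed the code again: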
if description_node:
    try:
        description = '\r'.join(cls.reformat(val) for val in description_node.xpath('.//text()').extract())
        print_node = description_node.xpath('.//*[contains(@class, "print-page")]/text()').extract()[0]
        if print_node:
            print_node = cls.reformat(print_node)
            description = description.replace(print_node, '')
        show_more_node = description_node.xpath('.//*[contains(@class, "showMore")]/text()').extract()[0]
        if show_more_node:
            show_more_node = cls.reformat(show_more_node)
            description = description.replace(show_more_node, '')
    except (TypeError, IndexError):
        pass
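Another point to watch is the order of the statements inside try. The main goal of this block is to capture description. If it exists, the "Print" node may be present and "Show more" may be present too, but "Print" always appears before "Show more", so the order has to be: description -> print_node -> show_more_node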
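To make the effect of that ordering concrete, here is a small sketch that replays the second and the final version of the try block on plain lists. It is only an illustration under assumptions: the lists stand in for what xpath(...).extract() would return, str.strip() stands in for cls.reformat (whose body is not shown in this post), and the function names are made up for the example.

```python
def strip_labels_v2(texts, print_texts, show_more_texts):
    """Second version: both [0] lookups happen before any replacement."""
    description = None
    try:
        description = "\r".join(t.strip() for t in texts)
        print_label = print_texts[0]        # IndexError if the list is empty
        show_more = show_more_texts[0]      # IndexError if the list is empty
        description = description.replace(print_label.strip(), "")
        description = description.replace(show_more.strip(), "")
    except (TypeError, IndexError):
        pass
    return description


def strip_labels_v3(texts, print_texts, show_more_texts):
    """Final version: each label is replaced right after its own lookup."""
    description = None
    try:
        description = "\r".join(t.strip() for t in texts)
        print_label = print_texts[0]
        description = description.replace(print_label.strip(), "")
        show_more = show_more_texts[0]      # a failure here no longer skips "Print"
        description = description.replace(show_more.strip(), "")
    except (TypeError, IndexError):
        pass
    return description


# A description that has a "Print" link but no "Show more" label
texts = ["Print", " Some text "]
print(repr(strip_labels_v2(texts, ["Print"], [])))  # "Print" is still in the text
print(repr(strip_labels_v3(texts, ["Print"], [])))  # "Print" has been removed
```

In the second version the missing "Show more" label aborts the block before either replacement runs, which matches the "sometimes there, sometimes gone" behaviour described above.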
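Of course, this also comes down to how the code is written. If you use if to check whether the extracted list is empty, there is no need for try exception handling at all.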
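A minimal sketch of that if-based alternative, again on plain lists rather than real selectors. The helper names first_or_none and strip_labels are hypothetical, and str.strip() is only a stand-in for cls.reformat. Newer Scrapy versions also offer extract_first() and get() on selector lists, which return None when nothing matches and serve the same purpose as the helper here.

```python
def first_or_none(values):
    """Return the first item of a list, or None when the list is empty."""
    return values[0] if values else None


def strip_labels(texts, print_texts, show_more_texts):
    """Join the text nodes, then remove whichever labels are actually present."""
    description = "\r".join(t.strip() for t in texts)
    for label in (first_or_none(print_texts), first_or_none(show_more_texts)):
        if label:  # skip labels that were not found on the page
            description = description.replace(label.strip(), "")
    return description.strip()


# Each label is handled independently, so the order no longer matters
print(repr(strip_labels(["Print", "Body"], ["Print"], [])))
print(repr(strip_labels(["Body", "(Show more)"], [], ["(Show more)"])))
```

Because a missing label is just skipped, one absent node can no longer prevent the other replacement from running.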
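# When using
try:
    pass
    # Pay close attention to the order of the statements in here.
    # Once an exception is raised, the rest of this block is skipped.
except:
    pass
# So try has to be used with care.
# There is still some distance between "knowing" this and "really feeling" it.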
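I kept going back and forth over whether to say "use it with care" or "be careful when using it". Both seem right, yet neither seems quite precise. Haha.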