Scrapy ships with a set of command-line tools; startproject, which we used earlier to create the project, is one of them. The other tools each provide quite useful functionality as well.
$ scrapy
Scrapy 0.14.4 - project: lawson

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Start crawling from a spider or URL
  deploy        Deploy project in Scrapyd target
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  server        Start Scrapyd server for this project
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
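Running this drops you into a Python interpreter, where we can experiment freely and, together with a tool such as Firebug, build a prototype of the program.
The shell initializes a selector variable named sel in advance, so we can work with sel directly at the prompt.
Selectors have four basic methods: xpath(), css(), extract() and re(). The session below uses xpath(), extract() and re():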
In [1]: sel.xpath('//title')  # selects every <title> element, wherever it is in the document
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: sel.xpath('//title/text()')  # selects the text nodes inside <title>
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: sel.xpath('//title/text()').re('(\w+):')  # regex match; returns a list of unicode strings
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
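To see why the .re() call returns those four strings, the same pattern can be tried with Python's standard re module. This is only a minimal sketch and not Scrapy's implementation: it assumes .re() behaves roughly like re.findall applied to the extracted text, and the variable names are made up for illustration.

```python
import re

# The text of the <title> node, as shown in Out[4].
title_text = "Open Directory - Computers: Programming: Languages: Python: Books"

# (\w+): captures each run of word characters that is immediately followed by a colon.
words = re.findall(r"(\w+):", title_text)

print(words)
```

"Open", "Directory" and "Books" are not followed by a colon, so they are left out of the result.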
After running scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/" you can also type the following at the prompt:
response.body     # the HTML of the page
response.headers  # the response headers
Another example
Suppose the page we want to process is made up of markup like this:
<fieldset class="fieldcap">
  <legend>See also:</legend>
  <ul class="directory">
    <li>
      <a href="/Computers/Programming/Languages/Python/Resources/">Computers: Programming: Languages: Python: Resources</a>
      <em>(5)</em>
    </li>
    <li>
      <a href="/Computers/Programming/Languages/Ruby/Books/">Computers: Programming: Languages: Ruby: Books</a>
      <em>(7)</em>
    </li>
  </ul>
</fieldset>
<fieldset class="fieldcap fieldcapN">
  <legend>This category in other languages:</legend>
  <ul class="language">
    <li>
      <a href="/World/Deutsch/Computer/Programmieren/Sprachen/Python/B%C3%BCcher/">German</a>
      <em>(7)</em>
    </li>
    <li>
      <a href="/World/Russian/%D0%9A%D0%BE%D0%BC%D0%BF%D1%8C%D1%8E%D1%82%D0%B5%D1%80%D1%8B/%D0%9F%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5/%D0%AF%D0%B7%D1%8B%D0%BA%D0%B8/Python/%D0%9A%D0%BD%D0%B8%D0%B3%D0%B8/">Russian</a>
      <em>(3)</em>
    </li>
  </ul>
</fieldset>
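Inspecting the page, we find that the information we want sits inside the <ul> elements, or more precisely inside the <li> elements beneath them, so we can select it like this:
sel.xpath('//ul/li')
The descriptions can be extracted like this:
sel.xpath('//ul/li/text()').extract()
The titles can be extracted like this:
sel.xpath('//ul/li/a/text()').extract()
The links can be extracted like this:
sel.xpath('//ul/li/a/@href').extract()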
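Each of the four commands above returns a list of results.
In fact .xpath() returns a list of selectors, so we can call .xpath() again on each item and write the code more concisely: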
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
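The per-item loop can also be tried without Scrapy installed. The sketch below is a stand-in, not Scrapy's API: it assumes the standard library's xml.etree.ElementTree is good enough for this well-formed snippet. Its limited XPath support has no text() or @href steps, so the link text and href are read through .text and .get() instead. SAMPLE is a shortened copy of the "See also" fieldset from the example page, and the names rows and count are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the "See also" markup from the example page.
SAMPLE = """
<fieldset class="fieldcap">
  <legend>See also:</legend>
  <ul class="directory">
    <li>
      <a href="/Computers/Programming/Languages/Python/Resources/">Computers: Programming: Languages: Python: Resources</a>
      <em>(5)</em>
    </li>
    <li>
      <a href="/Computers/Programming/Languages/Ruby/Books/">Computers: Programming: Languages: Ruby: Books</a>
      <em>(7)</em>
    </li>
  </ul>
</fieldset>
"""

root = ET.fromstring(SAMPLE.strip())

rows = []
# './/ul/li' plays the role of '//ul/li' in the Scrapy version.
for li in root.findall(".//ul/li"):
    anchor = li.find("a")        # relative lookup, like site.xpath('a/...')
    title = anchor.text          # stands in for 'a/text()'
    link = anchor.get("href")    # stands in for 'a/@href'
    count = li.find("em").text   # the "(5)" / "(7)" counter next to the link
    rows.append((title, link, count))

for row in rows:
    print(row)
```

Note that in this sample the only text directly inside each <li> is whitespace, so a 'text()' query on the <li> would probably return nothing but whitespace here; the visible counters live in the <em> elements.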
Putting the same loop into a spider's parse() method gives:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector  # needed for Selector(response) below


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Wrap the response so it can be queried with XPath.
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            # Paths are relative to the current <li>.
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc
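The shell can be invoked not only directly from the command line but also from inside your own code, dropping you straight into it so you can inspect and debug the program: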
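from scrapy.shell import inspect_response

class LawsonSpider(BasePoiSpider):
    ...
    def parse_geo(self, response):
        # Pause here and open the Scrapy shell with this response loaded.
        # Newer Scrapy releases expect inspect_response(response, self).
        inspect_response(response)

    def parse_store_list(self, response):
        ...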
When the spider reaches parse_geo, it drops into the shell, where you can debug further.
Original post: http://blog.csdn.net/yixiantian7/article/details/20862959