Nutch 2.2.1 Crawl Workflow

Date: 2014-08-15 22:36:09

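I. Crawl Workflow Overview
1. The Nutch crawl workflow
When a crawl is run with the crawl command, the basic steps are:
(1) InjectorJob
Start of the first iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the second iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the third iteration
...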
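
The ordering of the jobs can be sketched as a small dry-run script. It only prints the bin/nutch sub-command behind each job instead of running it, so it needs no Nutch installation. All arguments are left out, and plan_crawl is a name made up for this illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the sub-command behind each job in the order the
# article lists them. Nothing is executed and arguments are omitted on purpose.
# plan_crawl is a hypothetical helper, not part of Nutch.
plan_crawl() {
  iterations=$1
  echo "bin/nutch inject"          # (1) InjectorJob runs once, before the loop
  i=1
  while [ "$i" -le "$iterations" ]; do
    echo "# iteration $i"
    # (2)-(6) repeat in this order on every iteration
    for step in generate fetch parse updatedb solrindex; do
      echo "bin/nutch $step"
    done
    i=$((i + 1))
  done
}

plan_crawl 2
```

Only the injection sits outside the loop; every later iteration starts again at the generate step.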

2. Crawl log
When crawling with the crawl command, the console prints a log like the following:

InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching : 
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
-finishing thread FetcherThread45, activeThreads=46
-finishing thread FetcherThread44, activeThreads=45
-finishing thread FetcherThread43, activeThreads=44
-finishing thread FetcherThread42, activeThreads=43
-finishing thread FetcherThread41, activeThreads=42
-finishing thread FetcherThread40, activeThreads=41
-finishing thread FetcherThread39, activeThreads=40
-finishing thread FetcherThread38, activeThreads=39
-finishing thread FetcherThread37, activeThreads=38
-finishing thread FetcherThread36, activeThreads=37
-finishing thread FetcherThread35, activeThreads=36
-finishing thread FetcherThread34, activeThreads=35
-finishing thread FetcherThread33, activeThreads=34
-finishing thread FetcherThread32, activeThreads=33
-finishing thread FetcherThread31, activeThreads=32
-finishing thread FetcherThread30, activeThreads=31
-finishing thread FetcherThread29, activeThreads=30
-finishing thread FetcherThread48, activeThreads=29
-finishing thread FetcherThread27, activeThreads=29
-finishing thread FetcherThread26, activeThreads=28
-finishing thread FetcherThread25, activeThreads=27
-finishing thread FetcherThread24, activeThreads=26
-finishing thread FetcherThread23, activeThreads=25
-finishing thread FetcherThread22, activeThreads=24
-finishing thread FetcherThread21, activeThreads=23
-finishing thread FetcherThread20, activeThreads=22
-finishing thread FetcherThread19, activeThreads=21
-finishing thread FetcherThread18, activeThreads=20
-finishing thread FetcherThread17, activeThreads=19
-finishing thread FetcherThread16, activeThreads=18
-finishing thread FetcherThread15, activeThreads=17
-finishing thread FetcherThread14, activeThreads=16
-finishing thread FetcherThread13, activeThreads=15
-finishing thread FetcherThread12, activeThreads=14
-finishing thread FetcherThread11, activeThreads=13
-finishing thread FetcherThread10, activeThreads=12
-finishing thread FetcherThread9, activeThreads=11
-finishing thread FetcherThread8, activeThreads=10
-finishing thread FetcherThread7, activeThreads=9
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread49, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread28, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing : 
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching : 
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
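
The batch ids in this log, such as 1404787293-26339, look like a Unix timestamp in seconds followed by a random number. That reading is an assumption drawn from the log itself, where the first part lines up with the "Iteration 1 of 5" timestamp. A minimal sketch that checks it with plain shell arithmetic (batch_id_time is a hypothetical helper, and the fixed UTC+8 offset for CST is a choice made for this example):

```shell
#!/bin/sh
# Sketch: treat the part of a batch id before the dash as epoch seconds and
# print the time of day at a fixed UTC offset.
# batch_id_time is a hypothetical helper, not part of Nutch.
batch_id_time() {
  batch_id=$1
  offset_hours=$2
  epoch=${batch_id%%-*}                               # text before the first dash
  secs=$(( (epoch + offset_hours * 3600) % 86400 ))   # seconds since local midnight
  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

batch_id_time 1404787293-26339 8   # prints 10:41:33, the "Iteration 1 of 5" time
batch_id_time 1404787338-30453 8   # prints 10:42:18, the "Iteration 2 of 5" time
```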


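II. Crawling Step by Step with Individual Commands

crawlDb, linkDb, a set of segments.
1. InjectorJob
This step initializes the crawl by injecting the URLs listed in seed.txt into the crawl queue.
(1) Basic command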
[root@jediael local]# bin/nutch inject urls/
InjectorJob: starting at 2014-08-15 21:17:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 2
InjectorJob: total number of urls injected after normalization and filtering: 3
Injector: finished at 2014-08-15 21:17:06, elapsed: 00:00:05
The contents of urls/seed.txt are:
http://money.163.com/ 
http://www.hexun.com/
http://www.gw.com.cn/
(2) Inspecting the injected URLs
The step above creates a new table in HBase named test_1_webpage, and the data for each URL is written to it.

hbase(main):007:0> scan 'test_1_webpage'
ROW                    COLUMN+CELL
 cn.com.gw.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 cn.com.gw.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 cn.com.gw.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 cn.com.gw.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 cn.com.gw.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 cn.com.gw.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.163.money:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.163.money:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.163.money:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.163.money:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.hexun.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.hexun.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.hexun.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.hexun.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
3 row(s) in 0.1100 seconds
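
The row keys in the scan output are the seed URLs with the host name reversed and the scheme moved behind it, for example cn.com.gw.www:http/ for http://www.gw.com.cn/. A minimal sketch that rebuilds such a key follows. It is written only against the three keys shown here, so it handles plain scheme://host/path URLs and ignores ports; url_to_row_key is a hypothetical helper, not a Nutch command.

```shell
#!/bin/sh
# Sketch: rebuild a row key like cn.com.gw.www:http/ from a URL.
# Handles only scheme://host/path; ports and user info are not handled.
# url_to_row_key is a hypothetical helper written for this example.
url_to_row_key() {
  url=$1
  scheme=${url%%://*}          # e.g. "http"
  rest=${url#*://}             # e.g. "www.gw.com.cn/"
  host=${rest%%/*}             # e.g. "www.gw.com.cn"
  path=${rest#"$host"}         # e.g. "/" (empty if the URL has no path)
  reversed=
  old_ifs=$IFS
  IFS=.
  for label in $host; do       # split the host name on dots
    reversed=$label${reversed:+.$reversed}   # prepend each label
  done
  IFS=$old_ifs
  echo "$reversed:$scheme$path"
}

url_to_row_key http://www.gw.com.cn/    # prints cn.com.gw.www:http/
url_to_row_key http://money.163.com/    # prints com.163.money:http/
```

Reversing the host name keeps pages from the same domain next to each other in HBase's sorted row order, which is a common reason for this key layout.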

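(3) About the **_webpage table
Every crawl job gets its own table named crawlId_webpage, which stores the information for all of its URLs, fetched or not.
A URL that has not been fetched yet has only a few columns in its row. Once it has been fetched, the fetched data, such as the page content, is stored in the same row.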
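
To show how little an un-fetched row holds, the f:fi cell from the scan output can be decoded by hand. Its four bytes are \x00, a quote character (byte 0x27), \x8D and \x00, and reading them as one big-endian integer gives a number of seconds. Treating that number as the fetch interval is an assumption based on the column name and on the value coming out at exactly 30 days.

```shell
#!/bin/sh
# Sketch: decode the 4-byte f:fi value seen in the scan output.
# Bytes: 00 27 8D 00 (the quote character in the output is byte 0x27).
fi_seconds=$((0x00278D00))
fi_days=$((fi_seconds / 86400))
echo "$fi_seconds seconds = $fi_days days"   # prints 2592000 seconds = 30 days
```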

2. GeneratorJob
(1) Basic command
[root@jediael local]# bin/nutch generate -crawlId test_2

GeneratorJob: starting at 2014-08-15 21:24:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-15 21:24:55, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1408109089-403376773
(2) Command options
[root@jediael local]# bin/nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on, default: storage.crawl.id)");
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true

-adddays - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
-batchId - the batch id
----------------------
Please set the params.
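
The -adddays option is easiest to see with numbers. The sketch below assumes the generator selects a URL when its next fetch time is no later than "now plus numDays". That rule is inferred from the help text rather than taken from the Nutch source, and is_due is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: is a URL due for fetching, given its next fetch time (epoch
# seconds), the current time, and the -adddays value?
# is_due is a hypothetical helper; the selection rule is an assumption.
is_due() {
  next_fetch=$1
  now=$2
  add_days=$3
  cutoff=$((now + add_days * 86400))   # -adddays moves "now" into the future
  [ "$next_fetch" -le "$cutoff" ]
}

now=1408109089                      # epoch part of the batch id generated above
next_fetch=$((now + 30 * 86400))    # a page fetched just now, assuming a 30-day interval

is_due "$next_fetch" "$now" 0  && echo "due with -adddays 0"
is_due "$next_fetch" "$now" 31 && echo "due with -adddays 31"
```

Only the second line is printed: with the default of 0 the page is not due for another 30 days, while -adddays 31 moves the cut-off past its next fetch time.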
3. FetcherJob
(1) Basic command
[root@jediael local]# bin/nutch fetch -all -crawlId test_2

FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 3 records. Hit by time limit :0
fetching http://www.gw.com.cn/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.hexun.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
fetching http://money.163.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread9, activeThreads=3
-finishing thread FetcherThread1, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 3 pages, 0 errors, 0.6 1 pages/s, 307 307 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
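
When these steps are scripted, the fetcher's summary line is a convenient place to check for failures. A minimal sketch that pulls the page and error counts out of that line with sed follows. The pattern is written against the log format shown here and may need adjusting for other versions; fetch_counts is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: extract "<pages> <errors>" from the fetcher summary line.
# fetch_counts is a hypothetical helper; it reads log text on stdin and
# prints nothing for lines that are not a summary line.
fetch_counts() {
  sed -n 's/.* \([0-9][0-9]*\) pages, \([0-9][0-9]*\) errors.*/\1 \2/p'
}

echo '0/0 spinwaiting/active, 3 pages, 0 errors, 0.6 1 pages/s, 307 307 kb/s, 0 URLs in 0 queues' \
  | fetch_counts    # prints 3 0
```

Applied to the summary line in the crawl log from section I, the same function prints 2 1, which reflects the one failed fetch of http://www.itpub.net/.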
4. ParserJob
(1) Basic command
[root@jediael local]# bin/nutch parse  -all -crawlId test_2

ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://www.gw.com.cn/
Parsing http://money.163.com/
Parsing http://www.hexun.com/
ParserJob: success
(2) Command arguments
[root@jediael local]# bin/nutch parse 

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
<batchId> - symbolic batch ID created by Generator
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)
-all - consider pages from all crawl jobs
-resume - resume a previous incomplete job
-force - force re-parsing even if a page is already parsed

5. DbUpdaterJob
(1) Basic command
[root@jediael local]# bin/nutch updatedb

DbUpdaterJob: starting
DbUpdaterJob: done
6. SolrIndexerJob
(1) Basic command
[root@jediael local]# bin/nutch solrindex http://182.92.160.44:8583/solr/ -crawlId test_2

SolrIndexerJob: starting
SolrIndexerJob: done.
(2) Command arguments
[root@jediael local]# bin/nutch solrindex 
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]



Original article: http://blog.csdn.net/jediael_lu/article/details/38591067
