I. Indexed Fields: schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <!-- core fields -->
    <field name="batchId" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>
    <!-- fields for index-basic plugin -->
    <field name="host" type="url" stored="false" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="long" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="date" stored="true" indexed="true"/>
    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>
    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    <field name="author" type="string" stored="true" indexed="true"/>
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>
    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for tld plugin -->
    <field name="tld" type="string" stored="false" indexed="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>

Analyzing the file above, it mainly specifies the following:
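One way to see which fields the schema makes searchable (indexed="true") and which it can return in results (stored="true") is to read the attributes programmatically. The following is a minimal sketch, not part of Nutch or Solr: the helper name summarize_fields is hypothetical, and SCHEMA_FRAGMENT is a shortened copy of a few field definitions from the schema.xml above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of a few <field> definitions from the schema.xml above.
SCHEMA_FRAGMENT = """
<schema name="nutch" version="1.5">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="batchId" type="string" stored="true" indexed="false"/>
    <field name="host" type="url" stored="false" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="tld" type="string" stored="false" indexed="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
</schema>
"""


def summarize_fields(schema_xml):
    """Hypothetical helper: group field names by their indexed/stored flags."""
    root = ET.fromstring(schema_xml)
    summary = {"indexed": [], "stored": []}
    for field in root.iter("field"):
        name = field.get("name")
        if field.get("indexed") == "true":
            summary["indexed"].append(name)  # can be searched on
        if field.get("stored") == "true":
            summary["stored"].append(name)   # can be returned in results
    return summary


summary = summarize_fields(SCHEMA_FRAGMENT)
print("indexed:", summary["indexed"])
print("stored:", summary["stored"])
```

In this fragment, content is indexed but not stored, so it can be searched but its text is not returned with the results, while batchId is stored but cannot be searched on.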
./bin/crawl urls csdnitpub 5
crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
<crawlID>: the ID of this crawl job
InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index ->
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup ->
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
Crawling the Web is already explained above. You can add more URLs to the seed.txt file and crawl them in the same way.
When a user invokes a crawling command in Apache Nutch 1.x, Apache Nutch generates CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x, CrawlDB is not present. Instead, Apache Nutch keeps all the crawling data directly in the database. In our case, we have used Apache HBase, so all crawling data goes into Apache HBase. The following are details of how each function of crawling works.
A crawling cycle has four steps, each of which is implemented as a Hadoop MapReduce job:
- GeneratorJob
- FetcherJob
- ParserJob (optionally done while fetching using 'fetch.parse')
- DbUpdaterJob
Additionally, the following processes need to be understood:
- InjectorJob
- Invertlinks
- Indexing with Apache Solr
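The way these jobs chain together can be sketched with a toy in-memory model. This is only an illustration under loud assumptions: a Python dict stands in for the web table, FAKE_WEB is a made-up link graph, and none of the function names below are Nutch APIs. The real jobs run as Hadoop MapReduce jobs against the storage backend.

```python
# Hypothetical link graph standing in for the real web: URL -> outlinks.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": [],
    "http://b.example/": ["http://a.example/"],
}


def inject(table, urls):
    """Add a row for every URL that is not yet known."""
    for url in urls:
        table.setdefault(url, {"status": "unfetched", "batch": None, "outlinks": []})


def generate(table, batch_id, top_n):
    """Mark up to top_n unfetched rows with a batch id and return their URLs."""
    selected = [u for u, row in table.items() if row["status"] == "unfetched"][:top_n]
    for url in selected:
        table[url]["batch"] = batch_id
    return selected


def fetch(table, batch_id):
    """'Download' every row of the batch from the fake web."""
    for url, row in table.items():
        if row["batch"] == batch_id:
            row["content"] = FAKE_WEB.get(url)  # None models a failed fetch
            row["status"] = "fetched" if row["content"] is not None else "failed"


def parse(table, batch_id):
    """Extract outlinks from the fetched content of the batch."""
    for row in table.values():
        if row["batch"] == batch_id and row["status"] == "fetched":
            row["outlinks"] = list(row["content"])


def update_db(table, batch_id):
    """Turn newly discovered outlinks into new unfetched rows."""
    discovered = []
    for row in table.values():
        if row["batch"] == batch_id:
            discovered.extend(row["outlinks"])
    inject(table, discovered)


def crawl(seeds, rounds, top_n=50000):
    table = {}
    inject(table, seeds)
    for i in range(1, rounds + 1):
        batch_id = f"batch-{i}"
        if not generate(table, batch_id, top_n):
            break  # nothing left that is due for fetching
        fetch(table, batch_id)
        parse(table, batch_id)
        update_db(table, batch_id)
    return table


table = crawl(["http://a.example/"], rounds=5)
for url in sorted(table):
    print(url, table[url]["status"], table[url]["batch"])
```

Each round only touches rows carrying the current batch id, which is the role the batchId plays in the log output above.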
First of all, the job of the Injector is to populate the initial rows of the web table. The InjectorJob initializes crawlDB with the URLs that we provide: we run the InjectorJob with a set of seed URLs, which are then inserted into crawlDB.
Then the GeneratorJob takes these injected URLs and works on them. The table used as input and output for these jobs is called webpage, in which every row is a URL (web page). The row key is stored as a URL with reversed host components, so that URLs from the same TLD and domain are kept together and form a group. In most NoSQL stores, row keys are kept sorted, which is an advantage here: with row key filtering, scanning over a subset is faster than scanning over the entire table. The following are examples of row keys:
- org.apache.www:http/
- org.apache.gora:http/
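The reversed-host row key and the benefit of sorted keys can be sketched as follows. This is an approximation written for illustration, not Nutch's own code: reverse_url and scan_prefix are hypothetical names, and appending a non-default port after the scheme is an assumption that may not match what Nutch actually stores.

```python
import bisect
from urllib.parse import urlsplit


def reverse_url(url):
    """Hypothetical helper: build a row key with reversed host components."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"  # assumption: explicit port follows the scheme
    key += parts.path or "/"
    if parts.query:
        key += "?" + parts.query
    return key


def scan_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix, reading only the matching range."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        matches.append(key)
    return matches


urls = [
    "http://www.apache.org/",
    "http://www.csdn.net/",
    "http://gora.apache.org/",
    "http://www.itpub.net/",
]
row_keys = sorted(reverse_url(u) for u in urls)
apache_keys = scan_prefix(row_keys, "org.apache.")
print(row_keys)
print(apache_keys)
```

Because all apache.org hosts share the prefix org.apache., they sit next to each other in the sorted key space, so the scan can stop at the first key that no longer matches.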
Let's define each step in depth so that we can understand crawling step-by-step.
Apache Nutch 1.x uses three main directories: crawlDB, linkdb, and a set of segments. crawlDB is the directory that contains information about every URL known to Apache Nutch. If a URL has been fetched, crawlDB also records when it was fetched. The link database, or linkdb, contains all the links to each URL, including the source URL and the anchor text of each link. A segment is a set of URLs that is fetched as a unit. Each segment directory contains the following subdirectories:
- crawl_generate names the set of URLs to be fetched
- crawl_fetch contains the status of fetching each URL
- content contains the raw content retrieved from each URL
Now let's understand each job of crawling in detail.
(1) Enter the HBase shell
[root@jediael44 hbase-0.90.4]# ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011
hbase(main):001:0> list
4 row(s) in 0.7620 seconds
hbase(main):002:0> describe '20140710_webpage'
The row names (row keys) can be obtained with scan '20140710_webpage'.