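Problem:
By default, Nutch uses the batch ID -all during the fetch step. As a result, every record that carries a batch ID gets re-fetched and re-parsed. With a large number of records this is very slow, and it is unnecessary.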
Code:
FetcherJob
    public Map<String,Object> run(Map<String,Object> args) throws Exception {
      checkConfiguration();
      String batchId = (String)args.get(Nutch.ARG_BATCH);
      Integer threads = (Integer)args.get(Nutch.ARG_THREADS);
      Boolean shouldResume = (Boolean)args.get(Nutch.ARG_RESUME);
      Integer numTasks = (Integer)args.get(Nutch.ARG_NUMTASKS);

      if (threads != null && threads > 0) {
        getConf().setInt(THREADS_KEY, threads);
      }
      if (batchId == null) {
        batchId = Nutch.ALL_BATCH_ID_STR;
      }
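So when no batch ID is passed in, batchId falls back to -all. I changed that assignment to batchId = getConf().get(GeneratorJob.BATCH_ID); so that only the URLs marked by GeneratorJob are fetched.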
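To make the difference concrete, here is a minimal standalone sketch that does not depend on Nutch. The class BatchIdSketch, its method names, and the plain Map used in place of the Hadoop Configuration are all hypothetical. The two string constants are assumed stand-ins for Nutch.ALL_BATCH_ID_STR and GeneratorJob.BATCH_ID, and the selection rule in shouldFetch is my assumption about how the fetcher compares a row's generate mark with the batch ID, not code copied from Nutch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: models the batch ID fallback outside of Nutch.
class BatchIdSketch {
    // Assumed values standing in for Nutch.ALL_BATCH_ID_STR and GeneratorJob.BATCH_ID.
    static final String ALL_BATCH_ID_STR = "-all";
    static final String GENERATOR_BATCH_ID_KEY = "generate.batch.id";

    // Default behaviour: a missing batch ID falls back to "-all".
    static String resolveDefault(String argBatchId) {
        return argBatchId != null ? argBatchId : ALL_BATCH_ID_STR;
    }

    // Patched behaviour: a missing batch ID falls back to the one stored by the generator.
    // Returns null if the configuration has no such key.
    static String resolvePatched(String argBatchId, Map<String, String> conf) {
        return argBatchId != null ? argBatchId : conf.get(GENERATOR_BATCH_ID_KEY);
    }

    // Assumed selection rule: a row is fetched if it has a mark and either
    // the batch ID is "-all" or the mark equals the batch ID.
    static boolean shouldFetch(String mark, String batchId) {
        if (mark == null) {
            return false;
        }
        return ALL_BATCH_ID_STR.equals(batchId) || mark.equals(batchId);
    }

    // Counts how many rows (url -> generate mark) would be selected for a batch ID.
    static int countSelected(Map<String, String> marks, String batchId) {
        int selected = 0;
        for (String mark : marks.values()) {
            if (shouldFetch(mark, batchId)) {
                selected++;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://example.com/a", "batch-1"); // marked in an earlier round
        marks.put("http://example.com/b", "batch-2"); // marked in the latest round
        marks.put("http://example.com/c", null);      // never generated

        Map<String, String> conf = new HashMap<>();
        conf.put(GENERATOR_BATCH_ID_KEY, "batch-2");

        System.out.println("default: " + countSelected(marks, resolveDefault(null)));
        System.out.println("patched: " + countSelected(marks, resolvePatched(null, conf)));
    }
}
```

In this sketch the default fallback selects both marked rows, while the patched fallback selects only the row from the latest generate round. If the configuration has no generator batch ID, resolvePatched returns null and the sketch selects nothing.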
Original post: http://www.cnblogs.com/fengjiaoan/p/3567509.html