Every map task can produce a large amount of local output. The combiner's job is to pre-aggregate the map-side output before it leaves the node, reducing the volume of data transferred between map and reduce nodes and improving network I/O. It is one of MapReduce's optimization techniques. The sample input file (data.txt) used below is:

hello tom
hello kittty
hello jerry
hello cat
hello tom
hello tom
hello kittty
hello jerry
hello cat
hello tom
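
To see what the combiner buys us, consider the ten-line sample file above (a hand-worked illustration, not output from an actual run). Without a combiner, the mapper emits one (word, 1) pair per word, 20 records in total, all of which must be shuffled to the reducer:

(hello,1) x10   (tom,1) x4   (kittty,1) x2   (jerry,1) x2   (cat,1) x2

With the combiner running on the map side, those 20 records collapse to one record per distinct word before the shuffle:

(hello,10)   (tom,4)   (kittty,2)   (jerry,2)   (cat,2)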
The Mapper splits each line into words and emits a (word, 1) pair for every word:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for each word
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
The Reducer sums the partial counts for each word. Because its input and output types are identical (Text and LongWritable in both positions) and summation is commutative and associative, the same class can also serve as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word and emit the total
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Finally, the driver assembles and submits the job. Note the job.setCombinerClass(WordCountReducer.class) call, which is all it takes to enable the combiner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRClient {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // Class containing the job's main method, used to locate the jar
        job.setJarByClass(MRClient.class);
        // Map phase
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, "c:/data.txt");
        // Reduce phase
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("c:/out"));
        // Combiner: reuse the reducer to pre-aggregate map output
        job.setCombinerClass(WordCountReducer.class);
        // Submit the job; true means progress is printed to the console
        job.waitForCompletion(true);
    }
}
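
For completeness, the same driver can be written with Hadoop's Tool/ToolRunner idiom, which picks up generic options (-D properties, -files, and so on) from the command line. The sketch below assumes the input and output paths are passed as args[0] and args[1] instead of being hard-coded; the class name WordCountTool is ours, not from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCountTool.class);
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setCombinerClass(WordCountReducer.class);
        // Paths come from the command line: args[0] = input, args[1] = output
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
    }
}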
The combiner feature, then, requires only one extra line when assembling the job:

// Set the combiner
job.setCombinerClass(WordCountReducer.class);
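
Reusing the reducer as the combiner is safe here only because summing counts is commutative and associative, and because the combiner's output types match the reducer's input types. When the combine logic needs to differ from the reduce logic, you would write a separate class. A minimal sketch of what such a class looks like (the name WordCountCombiner is hypothetical, not from the original post):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer whose output types match the real
// reducer's input types (Text, LongWritable here). The framework may
// run it zero, one, or several times on a map task's output, so its
// logic must never change the final result.
public class WordCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

It would then be wired in with job.setCombinerClass(WordCountCombiner.class) instead of reusing the reducer.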
The final counts are unchanged after adding the combiner, but its effect is visible in the counters the job prints when it finishes (look for Combine input records and Combine output records in the Map-Reduce Framework group).

Here we have a single input file smaller than 128 MB, so only one map task processes it. After the combiner runs, the record count drops from 3,519 to 6 (the sample used for that run contained only six distinct words). In this use case the combiner drastically reduces the amount of data that has to be transferred.
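
If you want to verify the effect programmatically rather than by scanning the console output, the combine counters can be read from the Job object after it completes. A minimal sketch, assuming it is appended to the end of main in MRClient after waitForCompletion returns (add these imports at the top of the file):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.TaskCounter;

// Read the built-in combine counters after the job has finished
Counters counters = job.getCounters();
long combineIn = counters.findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
long combineOut = counters.findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue();
System.out.println("Combine input records:  " + combineIn);
System.out.println("Combine output records: " + combineOut);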
Original post: https://www.cnblogs.com/TiePiHeTao/p/a709b1bf188ace34ae512fd3f01031c6.html