首页 > 其他 > 详细

hadoop输出lzo文件并添加索引

时间:2016-08-15 17:09:49      阅读:172      评论:0      收藏:0      [点我收藏+]
public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        conf.set("mapred.job.tracker", Constants.HADOOP_MAIN_IP + Constants.MAO_HAO + Constants.HADOOP_MAIN_PORT);

        if (args.length != 3) {
            System.err.println("Usage: Data Deduplication <in> <out> <reduceNum>");
            System.exit(2);
        }
        Job job = new Job(conf, "ETLTld Job");
        job.setJarByClass(ETLTldMain.class);

        job.setMapperClass(ETLTldMapper.class);
        job.setReducerClass(ETLTldReducer.class);

        job.setInputFormatClass(LzoTextInputFormat.class);

        job.setNumReduceTasks(Integer.parseInt(args[2]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
        int result = job.waitForCompletion(true) ? 0 : 1;

        if (result == 0) {
            LzoIndexer lzoIndexer = new LzoIndexer(conf);
            lzoIndexer.index(new Path(args[1]));
            System.exit(result);
        } else if(result == 1){
            System.exit(result);
        }

    }

 

如果已经有了lzo文件,可以采用如下方法添加索引:

bin/yarn jar /module/cloudera/parcels/GPLEXTRAS-5.4.0-1.cdh5.4.0.p0.27/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.4.0.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/cndns.db/ods_cndns_log/dt=20160803/node=alicn/part-r-00000.lzo

 

lzo格式默认是不支持splitable的,需要为其添加索引文件,才能支持多个map并行对lzo文件进行处理。

 

【参考】http://blog.csdn.net/wisgood/article/details/17080361

hadoop输出lzo文件并添加索引

原文:http://www.cnblogs.com/zhzhang/p/5773473.html

(0)
(0)
   
举报
评论 一句话评论(0
关于我们 - 联系我们 - 留言反馈 - 联系我们:wmxa8@hotmail.com
© 2014 bubuko.com 版权所有
打开技术之扣,分享程序人生!