To run MapReduce programs on Windows you need to download both Hadoop and winutils.
Hadoop download: https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
winutils, the native Windows binaries Hadoop needs on Windows: https://github.com/steveloughran/winutils/archive/master.zip
Pay attention to version compatibility between the two.
Extract the hadoop-3.3.0.tar.gz archive. Here I extract it to D:\Apache\hadoop (this directory is important: all later steps are based on it).
Extract the winutils-master.zip archive. Here I also extract it to D:\Apache\hadoop.
Copy the file D:\Apache\hadoop\winutils-master\hadoop-3.0.0\bin\hadoop.dll to C:\Windows\system32.
Copy the file D:\Apache\hadoop\winutils-master\hadoop-3.0.0\bin\winutils.exe to D:\Apache\hadoop\hadoop-3.3.0\bin.
Create a system environment variable HADOOP_HOME set to D:\Apache\hadoop\hadoop-3.3.0.
Add %HADOOP_HOME%\bin to the Path environment variable.
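With the environment variables in place, a quick sanity check from a fresh Command Prompt confirms the setup (this assumes the paths used above, and that JAVA_HOME is already configured):

```shell
:: The variable should resolve to the Hadoop install directory
echo %HADOOP_HOME%
:: winutils.exe should print its usage text instead of a missing-DLL error
%HADOOP_HOME%\bin\winutils.exe
:: hadoop should report version 3.3.0 (Path now contains %HADOOP_HOME%\bin)
hadoop version
```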
Create a Maven project with the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>net.sunmonkey</groupId>
        <artifactId>hadoop-demo</artifactId>
        <version>1.0</version>
    </parent>
    <artifactId>mapreduce-demo</artifactId>
    <packaging>jar</packaging>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>3.3.0</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>mapreduce-demo</finalName>
    </build>
</project>
We will use MapReduce to count how many times each word occurs in a text file: the classic Word Count example (so not quite a hello world this time).
Create the file D:\Apache\hadoop\data\test.txt with the following content:
hello world tom
hello tom world
tom hello world
how are you
Write a WCMapper class that extends org.apache.hadoop.mapreduce.Mapper, implemented as follows:
package net.sunmonkey.mapreduce.mapper;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line in the file; value is the line itself
        Text keyOut = new Text();
        IntWritable valueOut = new IntWritable();
        String[] arr = value.toString().split(" ");
        for (String str : arr) {
            // emit a (word, 1) pair for every word on the line
            keyOut.set(str);
            valueOut.set(1);
            context.write(keyOut, valueOut);
        }
    }
}
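The map step above only tokenizes a line and emits a 1 for every token. That logic can be sanity-checked in plain Java without Hadoop on the classpath (the class and method names below are just for illustration):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapDemo {
    // Mirrors WCMapper.map: split on spaces, emit a (word, 1) pair per token
    public static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // First line of test.txt
        System.out.println(mapLine("hello world tom")); // prints [hello=1, world=1, tom=1]
    }
}
```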
Write a WCReducer class that extends org.apache.hadoop.mapreduce.Reducer, implemented as follows:
package net.sunmonkey.mapreduce.reducer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values holds every 1 emitted for this word; summing them gives its count
        int count = 0;
        for (IntWritable intWritable : values) {
            count += intWritable.get();
        }
        context.write(key, new IntWritable(count));
    }
}
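Between map and reduce, the framework groups the (word, 1) pairs by key, so reduce receives each word together with all of its 1s and simply sums them. The end-to-end effect can be reproduced in a few lines of plain Java (a local sketch of the data flow, not how Hadoop actually executes it):

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // In-memory equivalent of map + shuffle + reduce:
    // tokenize each line, group by word, sum the ones
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like the reducer output
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum); // the reduce step: count += 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four lines of test.txt
        String[] lines = {"hello world tom", "hello tom world", "tom hello world", "how are you"};
        count(lines).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```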
Write a WCApplication class whose main method submits the MapReduce job; the implementation is below.
When running the main method, two arguments are required. The first is the absolute path of the document whose words are to be counted; here we use D:\Apache\hadoop\data\test.txt. The second is the path the results are written to; this path must not already exist, or the job fails with an error. Here we use D:\Apache\hadoop\data\result.
package net.sunmonkey.mapreduce;

import net.sunmonkey.mapreduce.mapper.WCMapper;
import net.sunmonkey.mapreduce.reducer.WCReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WCApplication {

    public static void main(String[] args) throws Exception {
        // Reads the configuration files shipped inside the Hadoop jars; defaults to local mode
        Configuration configuration = new Configuration();
        if (args.length == 2) {
            // Delete any leftover output directory (recursive = true, since the
            // result directory contains part files); the output path must not exist
            FileSystem.get(configuration).delete(new Path(args[1]), true);
        }
        // Get a job instance
        Job job = Job.getInstance(configuration);
        // Set the job name
        job.setJobName("WCApplication");
        // Set the class that locates the job jar
        job.setJarByClass(WCApplication.class);
        // Set the job input format
        job.setInputFormatClass(TextInputFormat.class);
        // Set the job output format
        job.setOutputFormatClass(TextOutputFormat.class);
        // Set the input path to read from
        TextInputFormat.addInputPath(job, new Path(args[0]));
        // Set the output path; results are written as text
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the number of reduce tasks
        job.setNumReduceTasks(1);
        // Set the map class
        job.setMapperClass(WCMapper.class);
        // Set the reduce class
        job.setReducerClass(WCReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
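Because the output path must not exist, the driver deletes any leftover result directory before submitting the job. On the local filesystem the same cleanup can be sketched with java.nio (the class and method names here are illustrative, not part of the job):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class OutputPathGuard {
    // Recursively delete a previous result directory so it can be reused as a job output path
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()) // delete children before their parent
                 .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path result = Files.createTempDirectory("result");
        Files.writeString(result.resolve("part-r-00000"), "hello\t3\n");
        deleteRecursively(result);
        System.out.println(Files.exists(result)); // prints false: safe to reuse as output path
    }
}
```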
Run the main method, and the results computed by the MapReduce job appear in the D:\Apache\hadoop\data\result directory.
The results are in the part-r-00000 file in that directory, as follows:
are	1
hello	3
how	1
tom	3
world	3
you	1
Original article (Chinese): https://www.cnblogs.com/chengwenqin/p/13673530.html