MapReduce的InputFormat过程的学习

时间：2015-12-14 01:28:09 阅读：281 评论：0 收藏：0 [点我收藏+]

转自：http://blog.csdn.net/androidlushangderen/article/details/41114259

昨天经过几个小时的学习，把MapReduce的第一个阶段的过程学习了一下，也就是最最开始的时候从文件中的Data到key-value的映射，也就是InputFormat的过程。虽说过程不是很难，但是也存在很多细节的。也很少会有人对此做比较细腻的研究，学习。今天，就让我来为大家剖析一下这段代码的原理。我还为此花了一点时间做了几张结构图，便于大家理解。在这里先声明一下，我研究的MapReduce主要研究的是旧版的API，也就是mapred包下的。

InputFormat最最原始的形式就是一个接口。后面出现的各种Format都是他的衍生类。结构如下，只包含最重要的2个方法:

[java] view plain copy print ?

public interface InputFormat<K, V> {
/**
* Logically split the set of input files for the job.
*
* Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.
*
* Note: The split is a logical split of the inputs and the
* input files are not physically split into chunks. For e.g. a split could
* be <input-file-path, start, offset> tuple.
*
* @param job job configuration.
* @param numSplits the desired number of splits, a hint.
* @return an array of {@link InputSplit}s for the job.
*/
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
/**
* Get the {@link RecordReader} for the given {@link InputSplit}.
*
* It is the responsibility of the <code>RecordReader</code> to respect
* record boundaries while processing the logical split to present a
* record-oriented view to the individual task.
*
* @param split the {@link InputSplit}
* @param job the job that this split belongs to
* @return a {@link RecordReader}
*/
RecordReader<K, V> getRecordReader(InputSplit split,
JobConf job,
Reporter reporter) throws IOException;
}

所以后面讲解，我也只是会围绕这2个方法进行分析。当然我们用的最多的是从文件中获得输入数据，也就是FileInputFormat这个类。继承关系如下:

[java] view plain copy print ?

public abstract class FileInputFormat<K, V> implements InputFormat<K, V>

我们看里面的1个主要方法:

[java] view plain copy print ?

public InputSplit[] getSplits(JobConf job, int numSplits)

返回的类型是一个InputSpilt对象，这是一个抽象的输入Spilt分片概念。结构如下:

[java] view plain copy print ?

public interface InputSplit extends Writable {
/**
* Get the total number of bytes in the data of the <code>InputSplit</code>.
*
* @return the number of bytes in the input split.
* @throws IOException
*/
long getLength() throws IOException;
/**
* Get the list of hostnames where the input split is located.
*
* @return list of hostnames where data of the <code>InputSplit</code> is
* located as an array of <code>String</code>s.
* @throws IOException
*/
String[] getLocations() throws IOException;
}

提供了与数据相关的2个方法。后面这个返回的值会被用来传递给RecordReader里面去的。在想理解getSplits方法之前还有一个类需要理解，FileStatus，里面包装了一系列的文件基本信息方法:

[java] view plain copy print ?

public class FileStatus implements Writable, Comparable {
private Path path;
private long length;
private boolean isdir;
private short block_replication;
private long blocksize;
private long modification_time;
private long access_time;
private FsPermission permission;
private String owner;
private String group;

.....

看到这里你估计会有点晕了，下面是我做的一张小小类图关系:

技术分享

可以看到，FileSpilt为了兼容新老版本，继承了新的抽象类InputSpilt，同时附上旧的接口形式的InputSpilt。下面我们看看里面的getspilt核心过程:

[java] view plain copy print ?

/** Splits files returned by {@link #listStatus(JobConf)} when
* they‘re too big.*/
@SuppressWarnings("deprecation")
public InputSplit[] getSplits(JobConf job, int numSplits)
throws IOException {
//获取所有的状态文件
FileStatus[] files = listStatus(job);
// Save the number of input files in the job-conf
//在job-cof中保存文件的数量
job.setLong(NUM_INPUT_FILES, files.length);
long totalSize = 0;
// compute total size,计算文件总的大小
for (FileStatus file: files) { // check we have valid files
if (file.isDir()) {
//如果是目录不是纯文件的直接抛异常
throw new IOException("Not a file: "+ file.getPath());
}
totalSize += file.getLen();
}
//用户期待的划分大小，总大小除以spilt划分数目
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
//获取系统的划分最小值
long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
minSplitSize);
// generate splits
//创建numSplits个FileSpilt文件划分量
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job);
long length = file.getLen();
//获取此文件的block的位置列表
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
//如果文件系统可划分
if ((length != 0) && isSplitable(fs, path)) {
//计算此文件的总的block块的大小
long blockSize = file.getBlockSize();
//根据期待大小，最小大小，得出最终的split分片大小
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
long bytesRemaining = length;
//如果剩余待划分字节倍数为划分大小超过1.1的划分比例，则进行拆分
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
//获取提供数据的splitHost位置
String[] splitHosts = getSplitHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
//添加FileSplit
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
splitHosts));
//数量减少splitSize大小
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
//添加刚刚剩下的没划分完的部分，此时bytesRemaining已经小于splitSize的1.1倍了
splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkLocations.length-1].getHosts()));
}
} else if (length != 0) {
//不划分，直接添加Spilt
String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
splits.add(new FileSplit(path, 0, length, splitHosts));
} else {
//Create empty hosts array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
}
//最后返回FileSplit数组
LOG.debug("Total # of splits: " + splits.size());
return splits.toArray(new FileSplit[splits.size()]);
}

里面有个computerSpiltSize方法很特殊，考虑了很多情况，总之最小值不能小于系统设定的最小值。要与期待值，块大小，系统允许最小值:

[java] view plain copy print ?

protected long computeSplitSize(long goalSize, long minSize,
long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}

上述过程的相应流程图如下:

技术分享

3种情况3中年执行流程。

处理完getSpilt方法然后，也就是说已经把数据从文件中转划到InputSpilt中了，接下来就是给RecordRead去取出里面的一条条的记录了。当然这在FileInputFormat是抽象方法，必须由子类实现的，我在这里挑出了2个典型的子类SequenceFileInputFormat，和TextInputFormat。他们的实现RecordRead方法如下：

[java] view plain copy print ?