[Repost] Custom RecordReader in Hadoop

Posted: 2017-01-10 23:40:50

Reposted from: http://www.linuxidc.com/Linux/2012-04/57831.htm

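Hadoop's default LineRecordReader passes each line to the mapper with the line's byte offset in the file as the input key and the line's content as the input value. By default a line ends at a carriage return, a line feed, or both.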
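To make the default record shape concrete, here is a minimal plain-JDK sketch of how (byte offset, line) pairs fall out of a small input. It is an illustration only, not Hadoop's LineRecordReader; the class name OffsetLineDemo and the method records are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-JDK illustration only; this is not Hadoop's LineRecordReader.
class OffsetLineDemo {
    // Maps the byte offset of each line start to the line text.
    // A line ends at \n, \r, or \r\n.
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int lineStart = 0;
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                out.put((long) lineStart,
                        new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8));
                // Treat \r\n as a single terminator.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                lineStart = i + 1;
            }
            i++;
        }
        // Emit a final line that has no terminator.
        if (lineStart < data.length) {
            out.put((long) lineStart,
                    new String(data, lineStart, data.length - lineStart, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nhadoop\r\nworld".getBytes(StandardCharsets.UTF_8);
        records(data).forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

For the three-line input in main, the keys are 0, 6, and 14: the offsets at which each line begins.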

Here we want the map to receive a different input <key,value> pair: the key is the file's path (or file name), and the value is the file's entire content.

To get that, we need to override both InputFormat and RecordReader, because the InputFormat is what creates the RecordReader. The RecordReader is where the real work happens.

First, the custom InputFormat:

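import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class chDicInputFormat extends FileInputFormat<Text, Text>
        implements JobConfigurable {
    // Initialized in configure(); not consulted yet because files are never split.
    private CompressionCodecFactory compressionCodecs = null;

    @Override
    public void configure(JobConf conf) {
        compressionCodecs = new CompressionCodecFactory(conf);
    }

    /**
     * Never split a file: each file must be processed as a whole.
     *
     * @param fs   the file system that holds the file
     * @param file the input file
     * @return always false
     */
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // One split per file, even when a file is larger than 64 MB (one Hadoop block).
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new chDicRecordReader(job, (FileSplit) genericSplit);
    }
}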

Next, the custom RecordReader:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class chDicRecordReader implements RecordReader<Text, Text> {
    private static final Log LOG = LogFactory.getLog(chDicRecordReader.class.getName());

    private long start;
    private long pos;
    private long end;
    private byte[] buffer;
    private String keyName;
    private FSDataInputStream fileIn;

    public chDicRecordReader(Configuration job, FileSplit split) throws IOException {
        // isSplitable() returns false, so this split covers one whole file.
        start = split.getStart();
        end = start + split.getLength();
        // The whole file is buffered in a single byte array, which caps its size.
        if (end - start > Integer.MAX_VALUE) {
            throw new IOException("File too large to buffer in memory: " + split.getPath());
        }
        final Path path = split.getPath();
        keyName = path.toString();
        LOG.info("filename in hdfs is : " + keyName);
        final FileSystem fs = path.getFileSystem(job);
        fileIn = fs.open(path);
        fileIn.seek(start);
        buffer = new byte[(int) (end - start)];
        pos = start;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return pos;
    }

    @Override
    public float getProgress() {
        if (start == end) {
            return 0.0f;
        }
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
        // The whole file is a single record, so this returns true at most once.
        if (pos >= end) {
            return false;
        }
        key.set(keyName);
        // Positioned read: fills the buffer starting at file offset pos.
        fileIn.readFully(pos, buffer);
        value.set(buffer);
        pos += buffer.length;
        LOG.info("end is : " + end + " pos is : " + pos);
        return true;
    }

    @Override
    public void close() throws IOException {
        if (fileIn != null) {
            fileIn.close();
        }
    }
}
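The reader above yields exactly one record per file: the first call to next() fills the key and value and returns true, and every later call returns false. The following plain-JDK sketch mimics that contract with a byte array in place of an HDFS stream. WholeFileReaderDemo, its drive method, and the sample path are hypothetical stand-ins, not Hadoop classes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the reader above; it reads from a byte array, not HDFS.
class WholeFileReaderDemo {
    private final String keyName;
    private final byte[] content;
    private boolean consumed = false;

    WholeFileReaderDemo(String keyName, byte[] content) {
        this.keyName = keyName;
        this.content = content;
    }

    // Mirrors next(key, value): fills both holders once, then reports exhaustion.
    boolean next(StringBuilder key, StringBuilder value) {
        if (consumed || content.length == 0) {
            return false;
        }
        key.setLength(0);
        key.append(keyName);
        value.setLength(0);
        value.append(new String(content, StandardCharsets.UTF_8));
        consumed = true;
        return true;
    }

    float getProgress() {
        return consumed ? 1.0f : 0.0f;
    }

    // Loops the way a map task does and counts how often map() would be called.
    static int drive(WholeFileReaderDemo reader) {
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        int calls = 0;
        while (reader.next(key, value)) {
            calls++; // a real job would call map(key, value, ...) here
        }
        return calls;
    }

    public static void main(String[] args) {
        WholeFileReaderDemo reader = new WholeFileReaderDemo(
                "hdfs://example/dict/a.txt",
                "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("map calls: " + drive(reader));
    }
}
```

As in the original reader, an empty file produces no record at all, so the mapper is never called for it.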

With the code above in place, set this InputFormat class on the job in the main function, and the job will read its input in the new format.
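The original post does not show the main function, so the following is only a sketch of what that wiring might look like with the old org.apache.hadoop.mapred API. The driver class name chDicDriver, the job name, and the use of args[0] and args[1] as input and output paths are assumptions made for illustration. It needs the Hadoop libraries on the classpath.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver; the class name and job name are illustrative only.
public class chDicDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(chDicDriver.class);
        conf.setJobName("whole-file-input");

        // Use the custom format so each map call receives (file path, file content).
        conf.setInputFormat(chDicInputFormat.class);

        // No mapper or reducer is set, so the old API's identity classes
        // pass the (Text, Text) records straight through.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Assumption: args[0] is the input directory, args[1] the output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

A real job would normally also call conf.setMapperClass with a mapper that takes Text keys and Text values.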


Original: http://www.cnblogs.com/YangtzeYu/p/6271211.html
