<Spark><Programming><Key/Value Pairs><RDD>

Date: 2017-05-08 21:34:18

Working with Key/Value Pairs

Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or to regroup data across the network.
E.g., pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
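
To make the two methods concrete, here is a minimal sketch written as a script (e.g. for spark-shell or a similar script runner). The sample data, the variable names, and the local master setting are assumptions made for illustration, and it assumes Spark 1.3 or later, where the pair RDD methods are available without extra implicit imports.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell an `sc` already exists
// and getOrCreate() simply returns it.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pair-rdd-sketch"))

// Hypothetical sample data: (store, sales) and (store, city).
val sales  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val cities = sc.parallelize(Seq(("a", "Paris"), ("b", "Oslo")))

// reduceByKey() combines the values of each key separately.
val totals = sales.reduceByKey(_ + _)   // RDD[(String, Int)]

// join() pairs up elements of both RDDs that share the same key.
val joined = totals.join(cities)        // RDD[(String, (Int, String))]

// Bring the (small) result back to the driver to inspect it.
val result = joined.collect().toMap
```

Here `result` maps each store to its summed sales and its city; keys that appear in only one of the two RDDs would be dropped by join().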

Many of the formats we load data from directly return pair RDDs for their key/value data.
In other cases, a regular RDD can be turned into a pair RDD with the map() function, for example by using the first word of each line as the key:

val pairs = lines.map(x => (x.split(" ")(0), x))

We can pass functions to Spark here as well, but because pair RDDs contain tuples, the functions we pass must operate on tuples. In fact, pair RDDs are simply RDDs of Tuple2 objects.
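
As a rough illustration of a function that operates on tuples, the sketch below filters a pair RDD by the length of its value. The input lines, the 20-character threshold, and the local master setting are assumptions chosen for the example; it is written as a script for spark-shell or a similar runner.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell an `sc` already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("tuple-functions-sketch"))

// Hypothetical input; the first word of each line becomes the key.
val lines = sc.parallelize(Seq(
  "spark is fast",
  "hadoop map reduce is a batch framework"))
val pairs = lines.map(x => (x.split(" ")(0), x))

// The function passed to filter() receives a whole (key, value) tuple.
// Pattern matching gives the two fields readable names.
val shortLines = pairs.filter { case (key, value) => value.length < 20 }

// The same filter using the Tuple2 accessors _1 (key) and _2 (value).
val shortLines2 = pairs.filter(t => t._2.length < 20)

val shortKeys  = shortLines.keys.collect().toList
val shortKeys2 = shortLines2.keys.collect().toList
```

Both forms are equivalent; the pattern-matching version tends to read better once the function body uses both the key and the value.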

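1. reduceByKey() runs a reduce operation in parallel for each key in the dataset.
2. reduceByKey() is a transformation, not an action: it returns a new RDD rather than a value to the driver program. This is because a dataset can have a very large number of keys.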
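
The two points above can be sketched as follows, contrasting reduceByKey() with the plain reduce() action. The word list and the local master setting are assumptions for illustration; the explicit type annotations are there only to show what each call returns.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Local context for illustration only; in spark-shell an `sc` already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("reduce-by-key-sketch"))

// Hypothetical input: one (word, 1) pair per word.
val words = sc.parallelize(Seq("a", "b", "a", "c", "a"))
val ones  = words.map(w => (w, 1))

// Transformation: the result is another RDD, and nothing is computed yet.
val counts: RDD[(String, Int)] = ones.reduceByKey(_ + _)

// By contrast, reduce() is an action and returns a single value to the driver.
val total: Int = ones.map(_._2).reduce(_ + _)

// An action such as collect() materializes the per-key results on the driver.
// This is only reasonable here because the number of keys is tiny.
val countsMap = counts.collect().toMap
```

With many keys, the result of reduceByKey() would normally stay distributed and be processed further or written out, rather than collected.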

Original post: http://www.cnblogs.com/wttttt/p/6827870.html
