
Spark: Usage of the reduceByKey Function

Posted: 2017-10-28 21:33:06

The reduceByKey function API:

def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]

This function uses the supplied function to merge (reduce) all the values V associated with each key K.

The parameters are:
- func: the reduce function, defined by the user as needed;
- partitioner: the partitioner used to partition the output RDD;
- numPartitions: the number of partitions; when only a partition count is given, the default partitioner HashPartitioner is used.

Return value: an RDD of (K, V) key-value pairs, with all the values for each key reduced to a single value. The sketch below illustrates each overload.
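
The signatures listed above are from the Java API (JavaPairRDD); the Scala side (PairRDDFunctions) offers analogous overloads, which the example below uses. This is a minimal sketch, assuming a live SparkContext named sc (as in the spark-shell session that follows); the variable names are illustrative and not part of the original post:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// func only: keeps the parent RDD's partition count and the default HashPartitioner
val byFunc = pairs.reduceByKey(_ + _)

// func + numPartitions: result is hash-partitioned into the given number of partitions
val byFuncAndNum = pairs.reduceByKey(_ + _, 4)

// explicit partitioner + func
val byPartitioner = pairs.reduceByKey(new HashPartitioner(4), _ + _)

byFunc.collect.foreach(println)   // expected: (a,2) and (b,1)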

Usage example:

linux:/$ spark-shell
...
17/10/28 20:33:54 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
17/10/28 20:33:55 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
17/10/28 20:33:57 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR env not found!
17/10/28 20:33:58 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR env not found!
SQL context available as sqlContext.

scala> val x = sc.parallelize(List(
     |     ("a", "b", 1),
     |     ("a", "b", 1),
     |     ("c", "b", 1),
     |     ("a", "d", 1))
     | )
x: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val byKey = x.map({case (id,uri,count) => (id,uri)->count})
byKey: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[1] at map at <console>:23

scala> val reducedByKey = byKey.reduceByKey(_ + _)
reducedByKey: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[2] at reduceByKey at <console>:25


scala> reducedByKey.collect.foreach(println)
((c,b),1)
((a,d),1)
((a,b),2)
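
For completeness, the same reduction can also be run with the numPartitions overload to control the partitioning of the result. The following is a sketch rather than output captured from the session above:

// continuing with the byKey RDD defined earlier (a sketch, not from the captured session)
val reducedInto4 = byKey.reduceByKey(_ + _, 4)
println(reducedInto4.partitions.length)   // expected: 4
reducedInto4.collect.foreach(println)     // same ((id,uri),count) pairs as printed above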

 


Original: http://www.cnblogs.com/yy3b2007com/p/7748074.html
