
SparkSQL Built-in Function -- countDistinct

Posted: 2020-09-14 12:47:15
[root@centos00 ~]$ cd hadoop-2.6.0-cdh5.14.2/
[root@centos00 hadoop-2.6.0-cdh5.14.2]$ sbin/hadoop-daemon.sh start namenode
[root@centos00 hadoop-2.6.0-cdh5.14.2]$ sbin/hadoop-daemon.sh start datanode
[root@centos00 hadoop-2.6.0-cdh5.14.2]$ sbin/yarn-daemon.sh start resourcemanager
  
[root@centos00 ~]$ cd /opt/cdh5.14.2/hive-1.1.0-cdh5.14.2/
[root@centos00 hive-1.1.0-cdh5.14.2]$ bin/hive --service metastore &
  
[root@centos00 ~]$ cd /opt/cdh5.14.2/spark-2.2.1-cdh5.14.2/
[root@centos00 spark-2.2.1-cdh5.14.2]$ sbin/start-master.sh
[root@centos00 spark-2.2.1-cdh5.14.2]$ sbin/start-slaves.sh

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val arr = Array(("a", "20"), ("a", "30"), ("b", "20"), ("a", "20"))
arr: Array[(String, String)] = Array((a,20), (a,30), (b,20), (a,20))

scala> val df = sc.parallelize(arr).toDF("id", "age")
df: org.apache.spark.sql.DataFrame = [id: string, age: string]

scala> df.show(false)
+---+---+
|id |age|
+---+---+
|a  |20 |
|a  |30 |
|b  |20 |
|a  |20 |
+---+---+
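Before the Spark aggregation itself, the semantics of a per-group distinct count can be sketched in plain Scala collections (no Spark needed; `arr` is the same array as above, and the variable name `distinctAge` is just illustrative):

```scala
// Plain-Scala sketch of what countDistinct computes per id group
val arr = Array(("a", "20"), ("a", "30"), ("b", "20"), ("a", "20"))

val distinctAge: Map[String, Int] =
  arr.groupBy(_._1)                                   // group rows by id
     .map { case (id, rows) =>
       id -> rows.map(_._2).distinct.size             // count distinct ages in each group
     }
```

Here `distinctAge` maps "a" to 2 (ages 20 and 30, with the duplicate 20 counted once) and "b" to 1, matching the DataFrame results below.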


scala> df.groupBy('id).agg(countDistinct('age) as 'distinctAge).show(false)
+---+-----------+
|id |distinctAge|
+---+-----------+
|b  |1          |
|a  |2          |
+---+-----------+


scala> df.groupBy("id").agg(countDistinct("age") as "distinctAge").show(false)
+---+-----------+                                                               
|id |distinctAge|
+---+-----------+
|b  |1          |
|a  |2          |
+---+-----------+
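The same aggregation can also be written as a plain SQL query by registering the DataFrame as a temporary view. This is a sketch assuming the `spark` session and `df` from the transcript above; the view name `t` is arbitrary:

```scala
// Equivalent SQL form of the countDistinct aggregation
df.createOrReplaceTempView("t")
spark.sql("SELECT id, COUNT(DISTINCT age) AS distinctAge FROM t GROUP BY id").show(false)
```

Note that `countDistinct` also accepts multiple columns, e.g. `countDistinct("id", "age")` counts distinct (id, age) pairs.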

  


Original: https://www.cnblogs.com/ji-hf/p/13665911.html
