scala> val rdd1 = sc.textFile("file:///Users/***/spark/test_data/word.txt")
scala> rdd1.filter(x=>x.contains("huahua")) foreach println
huahua hadoop spark
huahua hadoop
You can also define the function in advance:
scala> val func:String=>Boolean = {x:String=>x.contains("huahua")}
scala> rdd1.filter(func) foreach println
huahua hadoop spark
huahua hadoop
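The same pattern works on ordinary Scala collections, which is a convenient way to try it without a Spark cluster. Below is a minimal sketch using a plain `List` in place of the RDD; the sample lines are hypothetical stand-ins for the contents of word.txt:

```scala
// Sketch only: a plain List stands in for rdd1; sample lines are made up.
object FilterSketch extends App {
  val lines = List("huahua hadoop spark", "kylin hbase", "huahua hadoop")

  // The same predefined function value as in the shell session
  val func: String => Boolean = { x: String => x.contains("huahua") }

  // filter accepts the named function exactly like an inline lambda
  lines.filter(func).foreach(println)
}
```

Because `func` is just a value of type `String => Boolean`, it can be passed anywhere a predicate is expected, whether the receiver is a local collection or an RDD.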
Combined exercise, WordCount:
// flatMap, map, reduceByKey
scala> rdd1.flatMap(_.split(" ")).map((_,1)).reduceByKey((x,y)=>x+y).foreach(println)
// The func in reduceByKey acts only on the values of the PairRDD; reduceByKey((x,y)=>x+y) can be written as reduceByKey(_+_)
(mapreduce,2)
(huahua,2)
(spark,5)
(hadoop,5)
(spark2.2,1)
(spark2.4,2)
(kylin,1)
(hbase,4)

// flatMap, map, groupByKey
scala> rdd1.flatMap(_.split(" ")).map((_,1)).groupByKey().map(t=>(t._1,t._2.sum)) foreach println
// groupByKey produces an org.apache.spark.rdd.RDD[(String, Iterable[Int])]; each element is a tuple, and map acts on each tuple
(mapreduce,2)
(huahua,2)
(spark,5)
(hadoop,5)
(spark2.2,1)
(spark2.4,2)
(kylin,1)
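The two WordCount chains above can likewise be mimicked on a plain Scala collection, which makes the reduceByKey-vs-groupByKey distinction easy to inspect locally. This is a sketch under assumptions: the sample lines are hypothetical, and `groupBy` on a `List` stands in for the RDD grouping (on a real RDD, `reduceByKey` additionally combines values on each partition before shuffling, which is why it is usually preferred):

```scala
// Sketch only: plain List stands in for rdd1; sample lines are made up.
object WordCountSketch extends App {
  val lines = List("huahua hadoop spark", "hadoop spark", "huahua spark")

  // flatMap + map, as in the shell session: one (word, 1) pair per word
  val pairs = lines.flatMap(_.split(" ")).map((_, 1))

  // reduceByKey analogue: group by key, then reduce the values with _+_
  val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

  // groupByKey analogue: group by key, then sum the collection of values
  val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Both paths yield the same word counts
  assert(reduced == grouped)
  reduced.toList.sortBy(_._1).foreach(println)
}
```

Both variants produce identical counts; the difference on a real cluster is purely about where the summing happens.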
Original post: https://www.cnblogs.com/wooluwalker/p/12319144.html