(RDDs and DataFrames are read-only, so their data cannot be modified or deleted in place.)
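There are two ways to convert an RDD into a DataFrame: inferring the schema by reflection, or specifying the schema programmatically.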
Here, instead, a JSON file is read directly into a DataFrame:
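from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Read the JSON file into a DataFrame
df = spark.read.json("/user/hadoop/data.json")
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("data")
# show() returns None, so keep the DataFrame first and display it separately
dataDF = spark.sql("select title from data where title like '%中国%'")
dataDF.show()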
Other statements that can be passed to spark.sql():
select * from data
select title from data where title like '%中国%'
SELECT DISTINCT country FROM data
spark.sql("select AVG(id) from data").show()
spark.sql("select COUNT(id) from data").show()
spark.sql("select COUNT(*) AS nums from data").show()
spark.sql("select FIRST(name) AS name from data where id=1").show()
Functions used in a similar way: LAST, MAX, MIN, SUM
Original: https://www.cnblogs.com/panfengde/p/11434538.html