SparkStreaming

Date: 2019-06-21 00:34:48

1. RDD Basics

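The RDD.scala source lists five properties of an RDD (quoted at the end of this section). The driver builds the RDDs and ships the resulting tasks to each executor. An RDD is best understood as a description of operations: apart from RDDs created with sc.parallelize(), which carry their data, an RDD generally holds no concrete data, only things such as the location of the files to read and the lineage (DAG).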
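
A minimal sketch of the "RDD as a description of operations" idea, assuming a local Spark setup. The object name `LazyRddDemo` and the accumulator name are made up for illustration. An accumulator counts how often the `map` function runs, to show that the transformation is only recorded in the lineage until an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddDemo {
  // Returns (map calls seen before the action, map calls seen after it).
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    // parallelize() is the special case: this RDD carries the driver-side data.
    val doubled = sc.parallelize(1 to 5, numSlices = 2).map { x =>
      calls.add(1) // runs only when a task actually computes the partition
      x * 2
    }
    val before = calls.value.longValue // no action yet, so nothing has run
    println(doubled.toDebugString)     // prints the recorded lineage
    doubled.collect()                  // action: tasks are scheduled now
    val after = calls.value.longValue
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-rdd-demo")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

In a local run without task retries, the first value should be 0 and the second should equal the number of elements.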

KafkaUtils.createDirectStream produces a KafkaRDD for each batch. Its partitions correspond one-to-one to the partitions of the subscribed Kafka topics.
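
A rough sketch of creating a direct stream, assuming the spark-streaming-kafka-0-10 integration is on the classpath. The broker address `localhost:9092`, the topic `events`, the group id `demo-group`, and the object name are placeholders, not values from the original notes.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamDemo {
  // Consumer settings; every value passed in here is a placeholder.
  def kafkaParams(brokers: String, groupId: String): Map[String, Object] =
    Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("direct-stream-demo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val stream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](
          Seq("events"), kafkaParams("localhost:9092", "demo-group"))
      )

    // Each batch RDD has one partition per consumed Kafka topic partition.
    stream.foreachRDD { rdd =>
      println(s"partitions in this batch: ${rdd.getNumPartitions}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```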

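The receiver-based approach produces a BlockRDD. By default, received data is cut into a new block every 200 ms (spark.streaming.blockInterval), and the blocks are managed by the BlockManager. The number of partitions depends on the number of blocks, not on the number of Kafka partitions. Offsets are managed in ZooKeeper.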
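
The relationship between block interval and partition count can be sketched as simple arithmetic. This is only an approximation for a single receiver with data arriving steadily; the object and function names are hypothetical.

```scala
object ReceiverPartitions {
  // Approximate number of blocks, and therefore BlockRDD partitions, that one
  // receiver contributes to a batch: batch interval / block interval.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long = 200L): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    println(blocksPerBatch(2000L))      // 2 s batches, default 200 ms blocks
    println(blocksPerBatch(2000L, 50L)) // same batches, 50 ms blocks
  }
}
```

With 2-second batches, the default 200 ms interval gives about 10 partitions per receiver, and lowering the interval to 50 ms gives about 40. The interval itself is set through the spark.streaming.blockInterval property.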

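Put the processing logic inside foreachRDD. This turns the job into ordinary Spark Core programming, which makes it easier to validate the data and reprocess it after a failure.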
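
One possible shape for this pattern, assuming a direct stream from the spark-streaming-kafka-0-10 integration with string keys and values. The object name, the `describe` helper, and the trivial "count non-empty values" logic are illustrative only. Recording the offset range of every batch is what makes later validation or reprocessing possible.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

object ForeachRddDemo {
  // Readable summary of one offset range, e.g. for an audit log.
  def describe(r: OffsetRange): String =
    s"${r.topic}-${r.partition}: [${r.fromOffset}, ${r.untilOffset}) count=${r.count()}"

  def process(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
    stream.foreachRDD { rdd =>
      // Read the offsets from the original KafkaRDD, before any transformation.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // From here on this is plain Spark Core code on an RDD.
      val nonEmpty = rdd
        .filter(record => Option(record.value).exists(_.nonEmpty))
        .count()

      ranges.foreach(r => println(describe(r)))
      println(s"non-empty records in batch: $nonEmpty")

      // Commit only after the work for this batch has finished.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
    }
}
```

If a batch fails before the commit, the logged ranges identify exactly which offsets need to be checked or replayed.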

 * Internally, each RDD is characterized by five main properties:
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
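
These properties can be inspected on a live RDD. The sketch below assumes a local Spark setup and uses made-up names (`RddPropertiesDemo`, `inspect`). The fifth item in practice, the compute function, is the `compute` method that each RDD subclass implements, so it is not printed here.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

object RddPropertiesDemo {
  // Returns (partition count, has a shuffle dependency, has a partitioner).
  def inspect(sc: SparkContext): (Int, Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
    val reduced = pairs.reduceByKey(new HashPartitioner(4), _ + _)

    val numPartitions = reduced.partitions.length       // list of partitions
    val hasShuffleDep = reduced.dependencies            // dependencies on parent RDDs
      .exists(_.isInstanceOf[ShuffleDependency[_, _, _]])
    val hasPartitioner = reduced.partitioner.isDefined  // optional Partitioner
    // Preferred locations for one split; often empty for in-memory data.
    println(reduced.preferredLocations(reduced.partitions.head))

    (numPartitions, hasShuffleDep, hasPartitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties-demo")
    val sc = new SparkContext(conf)
    try println(inspect(sc)) finally sc.stop()
  }
}
```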

Source: https://www.cnblogs.com/csyusu/p/11062210.html