Symptom:
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|totalCount|January|February|March|April|  May|June|July|August|September|October|November|December|totalMileage|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|     33808|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
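The table is currently pre-split into 10 regions.
Going by the current month's data, the test table holds 33798 rows in total.
The total count in HBase itself is also 33798.
The odd part: the count returned by Spark SQL querying HBase is 33808.
The SQL statement at the time was: select count(1) from orderData
This is strange, because the SQL query returned 10 more rows than actually exist.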
============================================================
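Cause:
Spark SQL reads HBase through the newAPIHadoopRDD API.
By default, newAPIHadoopRDD with TableInputFormat launches one task per region.
As it happens, my table has 10 regions, and the crucial mistake was that I needlessly added one extra setting:
TableInputFormat.SCAN_BATCHSIZE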
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import scala.collection.mutable.ArrayBuffer

    lazy val buildScan = {
      // Scan configuration handed to TableInputFormat
      val hbaseConf = HBaseConfiguration.create()
      hbaseConf.set("hbase.zookeeper.quorum", GlobalConfigUtils.hbaseQuorem)
      hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
      hbaseConf.set(TableInputFormat.SCAN_COLUMNS, queryColumns)
      hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)
      hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, endRowKey)
      hbaseConf.set(TableInputFormat.SCAN_BATCHSIZE, "10000") // TODO: this setting caused the inconsistent count
      hbaseConf.set(TableInputFormat.SCAN_CACHEDROWS, "10000")
      hbaseConf.set(TableInputFormat.SHUFFLE_MAPS, "1000")

      // One partition (and therefore one task) per region
      val hbaseRdd = sqlContext.sparkContext.newAPIHadoopRDD(
        hbaseConf,
        classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
        classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
        classOf[org.apache.hadoop.hbase.client.Result]
      )

      // Convert every Result into a Spark SQL Row, one value per table field
      val rs: RDD[Row] = hbaseRdd.map(tuple => tuple._2).map(result => {
        val values = new ArrayBuffer[Any]()
        hbaseTableFields.foreach { field =>
          values += Resolver.resolve(field, result)
        }
        Row.fromSeq(values.toSeq)
      })
      rs
    }
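I added it intending to control how many records each scan fetches from HBase at a time. Strictly speaking, that is what TableInputFormat.SCAN_CACHEDROWS (scanner caching) controls; SCAN_BATCHSIZE sets the batch on the Scan, which caps how many cells are returned in a single Result.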
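In the HBase client, the batch setting on a Scan limits how many cells go into one Result, not how many rows are fetched per round trip. The toy model below is plain Scala, not HBase code, and every name in it (ScanBatchModel, ModelRow, ModelResult) is made up for illustration. It only sketches why the number of Results can differ from the number of rows once a batch is set. With a batch of 10000 and narrow rows this kind of splitting would not normally trigger, so it is not offered as a confirmed explanation of the 10 extra records.

```scala
object ScanBatchModel {
  // Hypothetical stand-ins: a stored row and one returned result.
  final case class ModelRow(rowKey: String, cells: Seq[String])
  final case class ModelResult(rowKey: String, cells: Seq[String])

  /** Without a batch limit, each row comes back as exactly one result. */
  def scanWithoutBatch(rows: Seq[ModelRow]): Seq[ModelResult] =
    rows.map(r => ModelResult(r.rowKey, r.cells))

  /** With a batch limit, a row's cells are cut into chunks of at most `batch` cells. */
  def scanWithBatch(rows: Seq[ModelRow], batch: Int): Seq[ModelResult] = {
    require(batch > 0, "batch must be positive")
    rows.flatMap(r => r.cells.grouped(batch).map(chunk => ModelResult(r.rowKey, chunk)))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ModelRow("r1", Seq("c1", "c2", "c3")), // three cells
      ModelRow("r2", Seq("c1"))              // one cell
    )
    println(s"results without batch: ${scanWithoutBatch(rows).size}")
    println(s"results with batch=2:  ${scanWithBatch(rows, 2).size}")
  }
}
```

Counting results instead of rows is exactly what select count(1) does on top of an RDD of Result objects, which is why the two numbers are only guaranteed to match when each row maps to one Result.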
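But precisely because of this setting, every query returned one rowkey a second time in each partition. With 10 partitions that is exactly 10 duplicates, which is what made the counts inconsistent.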
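One way to check for this kind of duplication is to compare the number of returned records with the number of distinct row keys, and to list the keys that occur more than once. The sketch below does that on an in-memory Seq standing in for the RDD. ScannedRecord and the sample keys are hypothetical; on the real hbaseRdd the row key would come from each Result (for example via getRow).

```scala
// Hypothetical stand-in for one scanned record: a row key plus the cells
// that came back with it (qualifier -> value).
final case class ScannedRecord(rowKey: String, cells: Map[String, String])

object DuplicateRowKeyCheck {
  /** Row keys that appear in more than one record, with their occurrence counts. */
  def duplicatedKeys(records: Seq[ScannedRecord]): Map[String, Int] =
    records.groupBy(_.rowKey).collect {
      case (key, group) if group.size > 1 => key -> group.size
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      ScannedRecord("order-001", Map("mileage" -> "12.5")),
      ScannedRecord("order-002", Map("mileage" -> "3.0")),
      ScannedRecord("order-002", Map("mileage" -> "3.0")), // same key returned twice
      ScannedRecord("order-003", Map("mileage" -> "7.1"))
    )
    val distinctKeys = records.map(_.rowKey).distinct.size
    println(s"records = ${records.size}, distinct row keys = $distinctKeys")
    println(s"duplicated keys = ${duplicatedKeys(records)}")
  }
}
```

If the record count and the distinct key count differ, the duplicated keys show which rows were counted more than once.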
Fix:
Remove the TableInputFormat.SCAN_BATCHSIZE setting.
Query result after removing it:
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|totalCount|January|February|March|April|  May|June|July|August|September|October|November|December|totalMileage|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|     33798|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
Problem solved.
Original post: https://www.cnblogs.com/niutao/p/10824749.html