在 aws emr 上，将 hbase table A 的数据，对 key 做 hash，写到另外一张 table B

时间：2019-02-16 10:17:40 阅读：359 评论：0 收藏：0 [点我收藏+]

先 scan 原表，然后 bulkload 到新表。

采坑纪录
1. bulkload 产生 hfile 前，需要先对 hash(key) 做 repartition，在 shuffle 的 read 阶段，产生了以下错误

org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 3623878656, max: 3635150848)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
...
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 3623878656, max: 3635150848)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
...

原因：在 shuffle 的 read 阶段，会申请一个跟 block（或partition）一样大小的内存，因为每个分区过大，内存不够了
相关说明：https://issues.apache.org/jira/browse/SPARK-13510

因为netty默认使用了offheap memory，所以报了这个错误。

2. bulkload 产生 hfile 的时候，多次发生因 executor 被 killed，导致 application 失败。通过观察，发现是 executor 往本地写文件的时候，本地空间不够了。
相关问题：https://stackoverflow.com/questions/29131449/why-does-hadoop-report-unhealthy-node-local-dirs-and-log-dirs-are-bad

于是增加 yarn 集群机器，使用 hdfs balancer 均衡数据分布。

============= yarn-nodemanager =============
2019-02-15 10:18:45,562 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection (DiskHealthMonitor-Timer): Directory /mnt/yarn error, used space above threshold of 90.0%, removing from list of valid directories
2019-02-15 10:18:45,563 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection (DiskHealthMonitor-Timer): Directory /var/log/hadoop-yarn/containers error, used space above threshold of 90.0%, removing from list of valid directories
2019-02-15 10:18:45,563 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService (DiskHealthMonitor-Timer): Disk(s) failed: 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
2019-02-15 10:18:45,563 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService (DiskHealthMonitor-Timer): Most of the disks failed. 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
2019-02-15 10:18:45,789 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService (AsyncDispatcher event handler): Cache Size Before Clean: 589300919, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2019-02-15 10:18:46,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (AsyncDispatcher event handler): Container container_1549968021090_0114_01_000006 transitioned from RUNNING to KILLING
2019-02-15 10:18:46,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (AsyncDispatcher event handler): Container container_1549968021090_0114_01_000016 transitioned from RUNNING to KILLING


============= yarn-resourcemanager.log =============
019-02-15 10:18:45,664 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl (AsyncDispatcher event handler): Node ip-10-6-43-89.ap-south-1.compute.internal:8041 reported UNHEALTHY with details: 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
2019-02-15 10:18:45,664 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl (AsyncDispatcher event handler): ip-10-6-43-89.ap-south-1.compute.internal:8041 Node Transitioned from RUNNING to UNHEALTHY
2019-02-15 10:18:45,664 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl (ResourceManager Event Processor): container_1549968021090_0114_01_000006 Container Transitioned from RUNNING to KILLED

3. emr的配置中，spark 的本地文件缓存路径为 /mnt/yarn/usercache/hadoop/appcache/application_1549968021090_0135/blockmgr-535fd27a-4b80-4116-b855-17ab7be68f1c。与 hdfs 在一个硬盘上。

在 aws emr 上，将 hbase table A 的数据，对 key 做 hash，写到另外一张 table B

原文：https://www.cnblogs.com/keepthinking/p/10386811.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)