My cluster: 1 master, 11 slaves, each node has 6 GB memory.
My settings:
spark.executor.memory=4g, Dspark.akka.frameSize=512
Here is the problem:
First, I read some data (2.19 GB) from HDFS to RDD:
val imageBundleRDD = sc.newAPIHadoopFile(...)
Second, do something on this RDD:
val res = imageBundleRDD.map(data => {
val desPoints = threeDReconstruction(data._2, bg)
(data._1, desPoints)
})
Last, output to HDFS:
res.saveAsNewAPIHadoopFile(...)
When I run my program it shows:
.....
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:24 as TID 33 on executor 9: Salve7.Hadoop (NODE_LOCAL)
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:24 as 30618515 bytes in 210 ms
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:36 as TID 34 on executor 2: Salve11.Hadoop (NODE_LOCAL)
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:36 as 30618515 bytes in 449 ms
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since ‘akka.jvm-exit-on-fatal-error‘ is enabled for ActorSystem[spark]
spark.executor.memory=6g
. Make sure you‘re using as much memory as possible by checking the UI (it will say how much mem you‘re using)spark.storage.memoryFraction
. If you don‘t use cache()
or persist
in your code, this might as well be 0. It‘s default is 0.6, which means you only get 0.4 * 4g memory for your heap. IME reducing the mem frac often makes OOMs go away. UPDATE: From spark 1.6 apparently we will no longer need to play with these values, spark will determine them automatically.String
and heavily nested structures (like Map
and nested case classes). If possible try to only use primitive types and index all non-primitives especially if you expect a lot of duplicates. Choose WrappedArray
over nested structures whenever possible. Or even roll out your own serialisation - YOU will have the most information regarding how to efficiently back your data into bytes, USE IT!Dataset
to cache your structure as it will use more efficient serialisation. This should be regarded as a hack when compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache-space by 100x or 1000x, whereas all a Dataset
will likely give is 2x - 5x in memory and 10x compressed (parquet) on disk.http://spark.apache.org/docs/1.2.1/configuration.html
EDIT: (So I can google myself easier) The following is also indicative of this problem:
java.lang.OutOfMemoryError : GC overhead limit exceeded
Answer2:
Have a look at the start up scripts a Java heap size is set there, it looks like you‘re not setting this before running Spark worker.
# Set SPARK_MEM if it isn‘t already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
You can find the documentation to deploy scripts here.
原文:http://www.cnblogs.com/warmingsun/p/6551461.html