Some Spark articles worth reading in depth:

spark-notes

  1. leave 1 core per node for Hadoop/YARN/OS daemons
  2. leave 1 GB of memory and 1 executor for the YARN ApplicationMaster
  3. use 3-5 cores per executor for good HDFS throughput
Total memory requested from YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = max(384 MB, 7% of spark.executor.memory)

So if we request 15 GB per executor, we actually get 15 GB + 7% * 15 GB ≈ 16 GB.
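As a quick check, here is the same arithmetic as a Scala sketch (the 15 GB figure and the 7% / 384 MB rule are the ones quoted above):

```scala
// Sketch of the container-size arithmetic above.
val executorMemoryMb = 15 * 1024                                 // spark.executor.memory = 15g
val overheadMb = math.max(384, (0.07 * executorMemoryMb).toInt)  // spark.yarn.executor.memoryOverhead
val containerMb = executorMemoryMb + overheadMb                  // 16435 MB, i.e. ~16 GB requested from YARN
```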

Worked example (accounting for memory overhead):

4 nodes
8 cores per node
50 GB RAM per node


1. 5 cores per executor: --executor-cores = 5
2. cores available per node: 8 - 1 = 7
3. total cores available in the cluster: 7 * 4 = 28
4. possible executors (total cores / cores per executor): 28 / 5 ≈ 5
5. leave one executor for the YARN ApplicationMaster: --num-executors = 5 - 1 = 4
6. executors per node: 4 / 4 = 1
7. memory per executor: 50 GB / 1 = 50 GB
8. leave ~7% for memory overhead: 50 GB * (1 - 7%) ≈ 46 GB, --executor-memory = 46G

Result: 4 executors with 46 GB and 5 cores each
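A sketch of how this configuration could be set in code (the application name is a placeholder; the same values map to the spark-submit flags --num-executors, --executor-cores and --executor-memory):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the app name is a placeholder, the values come from the sizing above.
val spark = SparkSession.builder()
  .appName("executor-sizing-example")
  .config("spark.executor.instances", "4")   // --num-executors 4
  .config("spark.executor.cores", "5")       // --executor-cores 5
  .config("spark.executor.memory", "46g")    // --executor-memory 46G
  .getOrCreate()
```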

The same steps with 3 cores per executor:

1. 3 cores per executor: --executor-cores = 3
2. cores available per node: 8 - 1 = 7
3. total cores available in the cluster: 7 * 4 = 28
4. possible executors (total cores / cores per executor): 28 / 3 ≈ 9
5. leave one executor for the YARN ApplicationMaster: --num-executors = 9 - 1 = 8
6. executors per node: 8 / 4 = 2
7. memory per executor: 50 GB / 2 = 25 GB
8. leave ~7% for memory overhead: 25 GB * (1 - 7%) ≈ 23 GB, --executor-memory = 23G

Result: 8 executors with 23 GB and 3 cores each
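The same recipe can be wrapped in a small helper. This is only a sketch of the arithmetic in these notes (1 core per node for daemons, 1 executor for the ApplicationMaster, ~7% memory overhead); the names are illustrative, not a Spark API:

```scala
// Illustrative helper reproducing the sizing steps above; not part of any Spark API.
case class ExecutorSizing(numExecutors: Int, coresPerExecutor: Int, heapGb: Int)

def sizeExecutors(nodes: Int, coresPerNode: Int, memPerNodeGb: Int, coresPerExecutor: Int): ExecutorSizing = {
  val usableCoresPerNode = coresPerNode - 1              // 1 core per node for Hadoop/YARN/OS daemons
  val totalCores = usableCoresPerNode * nodes            // cores available in the cluster
  val executors = totalCores / coresPerExecutor - 1      // minus 1 executor for the YARN ApplicationMaster
  val executorsPerNode = executors / nodes
  val memPerExecutorGb = memPerNodeGb / executorsPerNode
  val heapGb = (memPerExecutorGb * 0.93).toInt           // leave ~7% for memory overhead
  ExecutorSizing(executors, coresPerExecutor, heapGb)
}

sizeExecutors(4, 8, 50, 5)  // ExecutorSizing(4, 5, 46) -> 4 executors, 46 GB, 5 cores
sizeExecutors(4, 8, 50, 3)  // ExecutorSizing(8, 3, 23) -> 8 executors, 23 GB, 3 cores
```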

Spark + Cassandra, All You Need to Know: Tips and Optimizations

  1. Spark on HDFS is low cost and covers most use cases
  2. Running Spark and Cassandra in the same cluster gives the best throughput and the lowest latency
  3. Deploy Spark with an Apache Cassandra cluster
  4. Spark Cassandra Connector
  5. Cassandra Optimizations for Apache Spark

Spark Optimizations

  1. prefer narrow transformations over wide transformations
  2. minimize data shuffles
  3. filter data as early as possible
  4. set the right number of partitions, roughly 4x the number of cores
  5. avoid data skew
  6. use broadcast joins for small tables (see the sketch after this list)
  7. repartition before expensive or multiple joins
  8. repartition before writing to storage
  9. remember that repartition is an expensive operation
  10. set the right number of executors, cores and memory
  11. replace Java serialization with Kryo serialization
  12. Minimize data shuffles and maximize data locality
  13. Use the DataFrame or Dataset high-level APIs to take advantage of Spark's optimizations
  14. Apache Spark Internals: Tips and Optimizations
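A short Scala sketch of items 6, 7-9 and 11 (paths, table and column names are placeholders, not taken from the articles above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Placeholders throughout: paths, table and column names are illustrative only.
val spark = SparkSession.builder()
  .appName("optimizations-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // item 11: Kryo instead of Java serialization
  .getOrCreate()

val large = spark.read.parquet("/data/large_table")  // hypothetical path
val small = spark.read.parquet("/data/small_table")  // hypothetical path

// Item 6: broadcast the small table so the join avoids shuffling the large one.
val joined = large.join(broadcast(small), Seq("id"))

// Items 7-8: repartition once before an expensive join or before writing;
// item 9: repartition itself triggers a full shuffle, so do it sparingly.
joined
  .repartition(200, joined("id"))
  .write
  .parquet("/data/output")                           // hypothetical path
```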