Some Spark articles worth reading in depth:

spark-notes

  1. leave 1 core per node for Hadoop/YARN/OS daemons
  2. leave 1 GB of memory and 1 executor for the YARN ApplicationMaster
  3. use 3-5 cores per executor for good HDFS throughput
Total memory requested from YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = max(384 MB, 7% of spark.executor.memory)

So if we request 15 GB per executor, we actually get 15 GB + 7% * 15 GB ≈ 16 GB.
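As a quick check, here is the same arithmetic as a Scala sketch (the 15 GB figure and the 7% / 384 MB rule are the ones quoted above):

```scala
// Sketch of the container-size arithmetic above.
val executorMemoryMb = 15 * 1024                                 // spark.executor.memory = 15g
val overheadMb = math.max(384, (0.07 * executorMemoryMb).toInt)  // spark.yarn.executor.memoryOverhead
val containerMb = executorMemoryMb + overheadMb                  // 16435 MB, i.e. ~16 GB requested from YARN
```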

Worked example (accounting for memory overhead):

4 nodes
8 cores per node
50 GB RAM per node


1. 5 cores per executor: --executor-cores = 5
2. cores available per node: 8 - 1 = 7
3. total cores available in the cluster: 7 * 4 = 28
4. possible executors (total cores / cores per executor): 28 / 5 ≈ 5
5. leave one executor for the YARN ApplicationMaster: --num-executors = 5 - 1 = 4
6. executors per node: 4 / 4 = 1
7. memory per executor: 50 GB / 1 = 50 GB
8. leave ~7% for memory overhead: 50 GB * (1 - 7%) ≈ 46 GB, --executor-memory = 46G

Result: 4 executors with 46 GB and 5 cores each
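A sketch of how this configuration could be set in code (the application name is a placeholder; the same values map to the spark-submit flags --num-executors, --executor-cores and --executor-memory):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the app name is a placeholder, the values come from the sizing above.
val spark = SparkSession.builder()
  .appName("executor-sizing-example")
  .config("spark.executor.instances", "4")   // --num-executors 4
  .config("spark.executor.cores", "5")       // --executor-cores 5
  .config("spark.executor.memory", "46g")    // --executor-memory 46G
  .getOrCreate()
```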

The same steps with 3 cores per executor:

1. 3 cores per executor: --executor-cores = 3
2. cores available per node: 8 - 1 = 7
3. total cores available in the cluster: 7 * 4 = 28
4. possible executors (total cores / cores per executor): 28 / 3 ≈ 9
5. leave one executor for the YARN ApplicationMaster: --num-executors = 9 - 1 = 8
6. executors per node: 8 / 4 = 2
7. memory per executor: 50 GB / 2 = 25 GB
8. leave ~7% for memory overhead: 25 GB * (1 - 7%) ≈ 23 GB, --executor-memory = 23G

Result: 8 executors with 23 GB and 3 cores each
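The same recipe can be wrapped in a small helper. This is only a sketch of the arithmetic in these notes (1 core per node for daemons, 1 executor for the ApplicationMaster, ~7% memory overhead); the names are illustrative, not a Spark API:

```scala
// Illustrative helper reproducing the sizing steps above; not part of any Spark API.
case class ExecutorSizing(numExecutors: Int, coresPerExecutor: Int, heapGb: Int)

def sizeExecutors(nodes: Int, coresPerNode: Int, memPerNodeGb: Int, coresPerExecutor: Int): ExecutorSizing = {
  val usableCoresPerNode = coresPerNode - 1              // 1 core per node for Hadoop/YARN/OS daemons
  val totalCores = usableCoresPerNode * nodes            // cores available in the cluster
  val executors = totalCores / coresPerExecutor - 1      // minus 1 executor for the YARN ApplicationMaster
  val executorsPerNode = executors / nodes
  val memPerExecutorGb = memPerNodeGb / executorsPerNode
  val heapGb = (memPerExecutorGb * 0.93).toInt           // leave ~7% for memory overhead
  ExecutorSizing(executors, coresPerExecutor, heapGb)
}

sizeExecutors(4, 8, 50, 5)  // ExecutorSizing(4, 5, 46) -> 4 executors, 46 GB, 5 cores
sizeExecutors(4, 8, 50, 3)  // ExecutorSizing(8, 3, 23) -> 8 executors, 23 GB, 3 cores
```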

Spark + Cassandra, All You Need to Know: Tips and Optimizations

  1. Spark on HDFS is low cost and covers most use cases
  2. Running Spark and Cassandra in the same cluster gives the best throughput and the lowest latency
  3. Deploy Spark with an Apache Cassandra cluster
  4. Spark Cassandra Connector
  5. Cassandra Optimizations for Apache Spark

Spark Optimizations

  1. prefer narrow transformations over wide transformations
  2. minimize data shuffles
  3. filter data as early as possible
  4. set the right number of partitions, roughly 4x the number of cores
  5. avoid data skew
  6. use broadcast joins for small tables (see the sketch after this list)
  7. repartition before expensive or multiple joins
  8. repartition before writing to storage
  9. remember that repartition is an expensive operation
  10. set the right number of executors, cores and memory
  11. replace Java serialization with Kryo serialization
  12. Minimize data shuffles and maximize data locality
  13. Use the DataFrame or Dataset high-level APIs to take advantage of Spark's optimizations
  14. Apache Spark Internals: Tips and Optimizations
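A short Scala sketch of items 6, 7-9 and 11 (paths, table and column names are placeholders, not taken from the articles above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Placeholders throughout: paths, table and column names are illustrative only.
val spark = SparkSession.builder()
  .appName("optimizations-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // item 11: Kryo instead of Java serialization
  .getOrCreate()

val large = spark.read.parquet("/data/large_table")  // hypothetical path
val small = spark.read.parquet("/data/small_table")  // hypothetical path

// Item 6: broadcast the small table so the join avoids shuffling the large one.
val joined = large.join(broadcast(small), Seq("id"))

// Items 7-8: repartition once before an expensive join or before writing;
// item 9: repartition itself triggers a full shuffle, so do it sparingly.
joined
  .repartition(200, joined("id"))
  .write
  .parquet("/data/output")                           // hypothetical path
```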