Spark Notes
Some Spark articles are worth deep reading:
spark-notes
- leave 1 core per node for Hadoop/YARN/OS daemons
- leave 1GB of memory and 1 executor for the YARN ApplicationMaster
- use 3-5 cores per executor for good HDFS throughput
Total memory requested from YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)
So, if we request 15GB per executor, we actually get 15GB + 7% * 15GB = ~16GB from YARN.
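The overhead rule above can be checked with a short Python sketch (the 384MB floor and the 7% factor come from the notes; the function name is ours):

```python
def yarn_request_gb(executor_memory_gb):
    """Total memory YARN reserves per executor:
    spark.executor.memory plus overhead = max(384MB, 7% of the heap)."""
    overhead_gb = max(384 / 1024, 0.07 * executor_memory_gb)
    return executor_memory_gb + overhead_gb

# Requesting 15GB per executor actually reserves ~16GB from YARN.
print(round(yarn_request_gb(15), 2))  # → 16.05
```

For small executors the 384MB floor dominates: a 4GB heap still pays 384MB of overhead, not 7% (280MB).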
Example cluster:
- 4 nodes
- 8 cores per node
- 50GB RAM per node
1. 5 cores per executor: --executor-cores = 5
2. num cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: total cores / cores per executor = 28/5 = 5 (rounded down)
5. leave one executor for Yarn ApplicationMaster: --num-executors = 5-1 = 4
6. number of executors per node: 4/4 = 1
7. memory per executor: 50GB/1 = 50GB
8. deduct heap overhead: 50GB * (1 - 7%) = ~46GB, --executor-memory=46GB
Result: 4 executors with 46GB memory and 5 cores each
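The eight steps above generalize to any cluster size. A minimal Python sketch (integer division mirrors the rounding-down in the walkthrough; the 1-core reservation and 7% overhead factor come from the notes):

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb, cores_per_executor):
    """Derive --num-executors, --executor-memory (GB) and --executor-cores
    following the sizing steps in the notes."""
    usable_cores_per_node = cores_per_node - 1          # 1 core for OS/Hadoop daemons
    total_cores = usable_cores_per_node * nodes
    executors = total_cores // cores_per_executor - 1   # minus 1 for the YARN AM
    executors_per_node = executors // nodes
    mem_per_executor_gb = mem_per_node_gb // executors_per_node
    heap_gb = int(mem_per_executor_gb * (1 - 0.07))     # deduct 7% memory overhead
    return executors, heap_gb, cores_per_executor

print(size_executors(4, 8, 50, 5))  # → (4, 46, 5)
```

Running it with `cores_per_executor=3` reproduces the second configuration below.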
1. 3 cores per executor: --executor-cores = 3
2. num cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: total cores / cores per executor = 28/3 = 9 (rounded down)
5. leave one executor for Yarn ApplicationMaster: --num-executors = 9-1 = 8
6. number of executors per node: 8/4 = 2
7. memory per executor: 50GB/2 = 25GB
8. deduct heap overhead: 25GB * (1 - 7%) = ~23GB, --executor-memory=23GB
Result: 8 executors with 23GB memory and 3 cores each
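Put together, the second configuration maps onto spark-submit flags like the following (the application class and jar are hypothetical placeholders, not from the notes):

```shell
# Sketch only: com.example.MyApp / my-app.jar are placeholders.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 3 \
  --executor-memory 23G \
  --class com.example.MyApp \
  my-app.jar
```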
Spark + Cassandra, All You Need to Know: Tips and Optimizations
- Spark on HDFS is low cost and covers most use cases
- Spark and Cassandra deployed in the same cluster give the best throughput and lowest latency
- Deploy Spark with an Apache Cassandra cluster
- Spark Cassandra Connector
- Cassandra Optimizations for Apache Spark
Spark Optimizations
- prefer narrow transformations over wide transformations
- minimize data shuffles
- filter data as early as possible
- set the right number of partitions: roughly 4x the number of cores
- avoid data skew
- broadcast for small table joins
- repartition before expensive or multiple joins
- repartition before writing to storage
- remember that repartition is itself an expensive operation
- set right number of executors, cores and memory
- replace the default Java serialization with Kryo serialization
- maximize data locality
- use the DataFrame or Dataset high-level APIs to take advantage of Spark's built-in optimizations
- Apache Spark Internals: Tips and Optimizations
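Several of the tips above are plain configuration switches. A hedged sketch of the corresponding spark-submit options (the partition count assumes the 4-executor x 5-core layout from the first example, i.e. 4 * 20 = 80; the trailing application arguments are omitted):

```shell
# Sketch: Kryo serialization and the ~4x-cores partition rule as --conf flags.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=80 \
  --master yarn \
  my-app.jar   # placeholder application jar
```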