Spark Notes
Some Spark articles worth a close read:
spark-notes
- leave 1 core per node for Hadoop/Yarn/OS daemons
- leave 1G + 1 executor for Yarn ApplicationMaster
- 3-5 cores per executor for good HDFS throughput
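Full memory requested from YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)
(7% is the figure used in the article; recent Spark releases name the setting spark.executor.memoryOverhead and default to 10%.)
So if we request 15GB per executor, YARN actually allocates 15GB + 7% * 15GB = ~16GB.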
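As a quick check of the formula above, here is a minimal shell sketch that works in whole megabytes. The function name `yarn_container_mb` is made up for illustration, and the 7% factor and 384MB floor are simply the values quoted in these notes, not read from any Spark installation.

```shell
# Hypothetical helper: memory requested from YARN per executor, in MB.
# Uses the 7% overhead factor and the 384MB floor quoted in the notes.
yarn_container_mb() {
  executor_mb=$1
  overhead_mb=$(( executor_mb * 7 / 100 ))   # integer division rounds down
  if [ "$overhead_mb" -lt 384 ]; then
    overhead_mb=384                          # the floor wins for small executors
  fi
  echo $(( executor_mb + overhead_mb ))
}

yarn_container_mb 15360   # 15GB executor
yarn_container_mb 2048    # 2GB executor, where the 384MB floor applies
```

The first call lands just above 16GB, which matches the ~16GB estimate; the second shows that small executors pay the fixed 384MB instead of 7%.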

Example cluster:
- 4 nodes
- 8 cores per node
- 50GB memory per node
Option 1, 5 cores per executor:
1. 5 cores per executor: --executor-cores 5
2. cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: total cores / cores per executor, 28/5 = 5 (rounded down)
5. leave one executor for the YARN ApplicationMaster: 5-1 = 4, so --num-executors 4
6. executors per node: 4/4 = 1
7. memory per executor: 50GB/1 = 50GB
8. subtract the memory overhead: 50GB - 7% * 50GB = 46GB (rounded down), so --executor-memory 46G
Result: 4 executors, 46GB and 5 cores each
Option 2, 3 cores per executor:
1. 3 cores per executor: --executor-cores 3
2. cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: total cores / cores per executor, 28/3 = 9 (rounded down)
5. leave one executor for the YARN ApplicationMaster: 9-1 = 8, so --num-executors 8
6. executors per node: 8/4 = 2
7. memory per executor: 50GB/2 = 25GB
8. subtract the memory overhead: 25GB * (1-7%) = 23GB (rounded down), so --executor-memory 23G
Result: 8 executors, 23GB and 3 cores each
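The eight steps can be sketched as one small shell function. This is only an illustration of the arithmetic in these notes: `size_executors` is a hypothetical name, all divisions round down, the 7% overhead is the figure used above, and there is no input validation (a cluster too small to fit one executor per node would divide by zero).

```shell
# Hypothetical helper that follows the eight steps from the notes.
# Arguments: nodes, cores per node, memory per node (GB), cores per executor.
size_executors() (
  nodes=$1; cores_per_node=$2; mem_per_node_gb=$3; executor_cores=$4
  usable_cores=$(( (cores_per_node - 1) * nodes ))        # 1 core per node for daemons
  num_executors=$(( usable_cores / executor_cores - 1 ))  # 1 executor for the ApplicationMaster
  per_node=$(( num_executors / nodes ))                   # executors per node
  mem_gb=$(( mem_per_node_gb / per_node ))                # memory per executor
  heap_gb=$(( mem_gb * 93 / 100 ))                        # leave 7% for memory overhead
  printf '%s\n' "--num-executors $num_executors --executor-cores $executor_cores --executor-memory ${heap_gb}G"
)

size_executors 4 8 50 5
size_executors 4 8 50 3
```

The printed flags can be pasted into a spark-submit command line; the two calls reproduce the two results worked out by hand.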
Spark + Cassandra, All You Need to Know: Tips and Optimizations
- Spark on HDFS is low cost and is the usual choice
- Running Spark and Cassandra in the same cluster gives the best throughput and the lowest latency
- Deploy Spark with an Apache Cassandra cluster
- Spark Cassandra Connector
- Cassandra Optimizations for Apache Spark
Spark Optimizations
- prefer narrow transformations over wide transformations
- minimize data shuffles
- filter data as early as possible
- set the right number of partitions, about 4x the number of cores
- avoid data skew
- broadcast the small table in joins
- repartition before expensive or multiple joins
- repartition before writing to storage
- remember that repartition is itself an expensive operation
- set the right number of executors, cores and memory
- replace Java serialization with Kryo serialization
- minimize data shuffles and maximize data locality
- use the high-level DataFrame or Dataset APIs to take advantage of Spark's optimizations
- Apache Spark Internals: Tips and Optimizations
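Two of the tips above translate directly into spark-submit settings: Kryo serialization and the "4x the number of cores" partition rule. The sketch below only prints the flags; `spark_tuning_flags` is a hypothetical helper, and applying the 4x rule to both `spark.default.parallelism` and `spark.sql.shuffle.partitions` is an assumption of this example, not something the articles prescribe.

```shell
# Hypothetical helper: print spark-submit --conf flags for a given total core count.
spark_tuning_flags() (
  total_cores=$1
  partitions=$(( total_cores * 4 ))   # rule of thumb from the list: 4 partitions per core
  printf '%s %s %s\n' \
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    "--conf spark.default.parallelism=$partitions" \
    "--conf spark.sql.shuffle.partitions=$partitions"
)

# 4 executors with 5 cores each gives 20 cores in total
spark_tuning_flags 20
```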
~/.forward
echo '[email protected]' > ~/.forward
This makes smtpd forward your incoming mail to the specified address. On AWS EC2, SES can be used to forward the mail on to your Gmail.
Diff Patch Notes
- Create patch file:
diff -u file1 file2 > name.patch, or git diff > name.patch
- Apply patch file:
patch [-u] < name.patch
- Back up before applying patch:
patch -b < name.patch
- Validate patch without applying:
patch --dry-run < name.patch
- Reverse applied patch:
patch -R < name.patch
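A small round trip in a scratch directory shows the commands working together. The file contents are made up for the demo, and the target file is passed to patch explicitly (`patch file1 < name.patch`) so that it does not have to guess between file1 and file2 from the patch header.

```shell
workdir=$(mktemp -d)
printf 'hello\nworld\n' > "$workdir/file1"
printf 'hello\nbrave new world\n' > "$workdir/file2"
cp "$workdir/file1" "$workdir/before"   # reference copy for the final comparison

cd "$workdir" || exit 1
# diff exits with status 1 when the files differ, so that is not an error here
diff -u file1 file2 > name.patch || true

patch --dry-run file1 < name.patch      # validate only, file1 is not modified
patch -b file1 < name.patch             # apply, keeping the old file1 as file1.orig
cmp -s file1 file2 && applied=yes || applied=no

patch -R file1 < name.patch             # reverse the patch again
cmp -s file1 before && reverted=yes || reverted=no
echo "applied=$applied reverted=$reverted"
```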
8:24
R.I.P. KOBE
audit.sh
Log every shell command for auditing: use PROMPT_COMMAND for bash, and precmd for zsh.
mkdir -p /var/log/.audit
touch /var/log/.audit/audit.log
chown nobody:nobody /var/log/.audit/audit.log
chmod 002 /var/log/.audit/audit.log
chattr +a /var/log/.audit/audit.log
Save to /etc/profile.d/audit.sh:
HISTSIZE=500000
HISTTIMEFORMAT=" "
export HISTTIMEFORMAT
export HISTORY_FILE=/var/log/.audit/audit.log
export PROMPT_COMMAND='{ curr_hist=`history 1|awk "{print \\$1}"`;last_command=`history 1| awk "{\\$1=\"\" ;print}"`;user=`id -un`;user_info=(`who -u am i`);real_user=${user_info[0]};login_date=${user_info[2]};login_time=${user_info[3]};curr_path=`pwd`;login_ip=`echo $SSH_CONNECTION | awk "{print \\$1}"`;if [ ${login_ip}x == x ];then login_ip=- ; fi ;if [ ${curr_hist}x != ${last_hist}x ];then echo -E `date "+%Y-%m-%d %H:%M:%S"` $user\($real_user\) $login_ip [$login_date $login_time] [$curr_path] $last_command ;last_hist=$curr_hist;fi; } >> $HISTORY_FILE'
For zsh, send the commands to syslog (facility local6) and log them from precmd:
echo "local6.* /var/log/commands.log" > /etc/rsyslog.d/commands.conf
systemctl restart rsyslog.service
precmd() { eval 'RETURN_VAL=$?;logger -p local6.debug "$(whoami) [$$]: $(history | tail -n1 | sed "s/^[ ]*[0-9]\+[ ]*//" ) [$RETURN_VAL]"' }
GitHub Actions Canceled Unexpectedly
By default, GitHub cancels all in-progress jobs in a matrix as soon as any matrix job fails. Set fail-fast: false under the job's strategy to turn this off.
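For reference, a minimal workflow fragment showing where the setting goes. The job name `test`, the OS list and the `make test` step are placeholders for this example, not taken from any real project.

```yaml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false          # keep the other matrix jobs running when one fails
      matrix:
        os: [ubuntu-latest, macos-latest]
    steps:
      - uses: actions/checkout@v4
      - run: make test          # placeholder command
```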
How to Activate Noise Cancellation with One AirPod
On AirPods Pro: Settings - Accessibility - AirPods, toggle on Noise Cancellation with One AirPod.
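[Repost] The Evolution of Server-Side High-Concurrency Distributed Architecture
Original article: The Evolution of Server-Side High-Concurrency Distributed Architecture. Using Taobao as the example, it walks through how a server-side architecture evolves from a hundred concurrent requests to tens of millions, and lists the technologies encountered at each stage, to give readers an overall picture of how architectures evolve. The article closes with a summary of architecture design principles.
Architecture design principles:
- N+1 design. No component in the system should be a single point of failure;
- Rollback design. Make sure the system stays forward compatible and that there is a way to roll back to the previous version during an upgrade;
- Disable design. Provide configuration that controls whether a specific feature is available, so the feature can be taken offline quickly when the system fails;
- Monitoring design. Think about how the system will be monitored at design time;
- Multi-active data center design. If the system needs extremely high availability, consider running active data centers in multiple locations, so the system stays available at least when one site loses power;
- Use mature technology. Newly developed or open-source technology often hides many bugs, and a failure without commercial support can be a disaster;
- Resource isolation design. Avoid letting a single business line take up all resources;
- The architecture should scale horizontally. Only a system that scales horizontally can effectively avoid bottlenecks;
- Buy what is not core. If a non-core feature would take a large amount of R&D resources to build, consider buying a mature product;
- Use commercial hardware. Commercial-grade hardware effectively lowers the probability of hardware failure;
- Iterate quickly. Develop small feature modules fast and put them online for validation as early as possible; finding problems early greatly reduces delivery risk;
- Stateless design. Service interfaces should be stateless: a call to an interface must not depend on the state left by the previous call.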
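The "disable design" principle is the easiest one to illustrate in a few lines. The sketch below assumes a made-up convention where a feature is on unless an environment variable `FEATURE_<NAME>` is set to `off`; the function name and the RECOMMENDATIONS and SEARCH features are hypothetical. Because it uses `eval`, the feature name must come from trusted code, never from user input.

```shell
# Hypothetical convention: a feature is on unless FEATURE_<NAME>=off is set.
feature_enabled() {
  eval "feature_state=\${FEATURE_$1:-on}"
  [ "$feature_state" != "off" ]
}

FEATURE_RECOMMENDATIONS=off   # flipped by an operator during an incident

if feature_enabled RECOMMENDATIONS; then
  echo "recommendations: serving"
else
  echo "recommendations: disabled, serving fallback"
fi
```

In a real system the switch would normally live in a configuration service so it can be changed without a restart; an environment variable just keeps the example self-contained.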
Google Code Review Guide
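- Principle: give technical advice, not personal preference
- Writing a good commit:
  - First line: a short summary of the change
  - Blank line
  - Detailed commit message
- Small changes, frequent commits
  - Easier to review/merge/roll back
  - Encourages good code design and reduces bugs
- What to look for in a code review?
  - Design
  - Whether the functionality is implemented correctly, and its complexity
  - Tests
  - Naming, comments, code style, documentation, etc.
- Review early, review quickly
- Good code review comments
  - Be kind
  - Only point out the problem and let the developer decide how to fix it
  - Encourage developers to simplify code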
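The commit layout described above (summary line, blank line, details) is simple enough to check mechanically. Below is a rough sketch of such a check; `commit_msg_ok` and the sample message are invented for illustration, and the check looks only at the first three lines.

```shell
# Hypothetical check: summary on line 1, blank line 2, details from line 3.
commit_msg_ok() {
  [ -n "$(sed -n '1p' "$1")" ] &&   # line 1: non-empty summary
  [ -z "$(sed -n '2p' "$1")" ] &&   # line 2: blank
  [ -n "$(sed -n '3p' "$1")" ]      # line 3: start of the detailed message
}

msg_file=$(mktemp)
printf '%s\n' \
  'Fix off-by-one in pagination' \
  '' \
  'The last page was dropped when the total was a multiple of the page size.' \
  > "$msg_file"

commit_msg_ok "$msg_file" && echo "commit message layout ok"
```

A function like this could be called from a commit-msg hook, which receives the message file path as its first argument.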
