Spark Notes

Some Spark articles are worth deep reading:

spark-notes

  1. leave 1 core per node for Hadoop/Yarn/OS daemons
  2. leave 1G + 1 executor for Yarn ApplicationMaster
  3. 3-5 cores per executor for good HDFS throughput
Full memory requested to yarn per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = Max(384MB, 7% of spark.executor.memory)

So, if we request 15GB per executor, actually we got 15GB + 7% * 15GB = ~16G

MemoryOverhead

4 nodes
8 cores per node
50GB per node


1. 5 cores per executor: --executor-cores = 5
2. num cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: (total cores/num-cores-per-executor), 28/5 = 5
5. leave one executor for Yarn ApplicationMaster: --num-executors = 5-1 = 4
6. number of executors per node: 4/4 = 1
7. memory per executor: 50GB/1 = 50GB
8. cut heap overhead: 50GB - 7%*50GB = 46GB, --executor-memory=46GB

4 executors, 46GB and 5 cores each

1. 3 cores per executor: --executor-cores = 3
2. num cores available per node: 8-1 = 7
3. total available cores in cluster: 7 * 4 = 28
4. available executors: (total cores/num-cores-per-executor), 28/3 = 9
5. leave one executor for Yarn ApplicationMaster: --num-executors = 9-1 = 8
6. number of executors per node: 8/4 = 2
7. memory per executor: 50GB/2 = 25GB
8. cut heap overhead: 25GB * (1-7%) = 23GB, --executor-memory=23GB

8 executors, 23GB and 3 cores each

Spark + Cassandra, All You Need to Know: Tips and Optimizations

  1. Spark on HDFS has low cost, used in most cases
  2. Spark with Cassandra in same cluster, will have best performance in throughput and low latency
  3. Deploy Spark with an Apache Cassandra cluster
  4. Spark Cassandra Connector
  5. Cassandra Optimizations for Apache Spark

Spark Optimizations

  1. Narrow transformations than Wide transformations
  2. minimize data shuffles
  3. filter data as early as possible
  4. set the right number of partitions, 4x of partitions to the number of cores
  5. avoid data skew
  6. broadcast for small table joins
  7. repartition before expensive or multiple joins
  8. repartition before writing to storage
  9. be remember that repartition is an expensive operation
  10. set right number of executors, cores and memory
  11. get rid of the the Java Serialization, use Kryo Serialization
  12. Minimize data shuffles and maximize data locality
  13. Use Data Frames or Data Sets high level APIs to take advantages of the Spark optimizations
  14. Apache Spark Internals: Tips and Optimizations

~/.forward

echo '[email protected]' > ~/.forward

This will make smtpd forwards email to the special address. On AWS EC2, SES can be used to forward email to your Gmail.

AWS Data Transfer Costs

Diff Patch Notes

  1. Create patch file: diff -u file1 file2 > name.patch, or git diff > name.patch
  2. Apply path file: patch [-u] < name.patch
  3. Backup before apply patch: patch -b < name.patch
  4. Validate patch without apply: patch --dry-run < name.patch
  5. Reverse applied path: patch -R < name.patch

8:24

R.I.P. KOBE

8:24

audit.sh

Use PROMPT_COMMAND for bash, and precmd for zsh.

mkdir -p /var/log/.audit
touch /var/log/.audit/audit.log
chown nobody:nobody /var/log/.audit/audit.log
chmod 002 /var/log/.audit/audit.log
chattr +a /var/log/.audit/audit.log

Save to /etc/profile.d/audit.sh

HISTSIZE=500000
HISTTIMEFORMAT=" "
export HISTTIMEFORMAT
export HISTORY_FILE=/var/log/.audit/audit.log
export PROMPT_COMMAND='{ curr_hist=`history 1|awk "{print \\$1}"`;last_command=`history 1| awk "{\\$1=\"\" ;print}"`;user=`id -un`;user_info=(`who -u am i`);real_user=${user_info[0]};login_date=${user_info[2]};login_time=${user_info[3]};curr_path=`pwd`;login_ip=`echo $SSH_CONNECTION | awk "{print \\$1}"`;if [ ${login_ip}x == x ];then login_ip=- ; fi ;if [ ${curr_hist}x != ${last_hist}x ];then echo -E `date "+%Y-%m-%d %H:%M:%S"` $user\($real_user\) $login_ip [$login_date $login_time] [$curr_path] $last_command ;last_hist=$curr_hist;fi; } >> $HISTORY_FILE'
echo "local6.*  /var/log/commands.log" > /etc/rsyslog.d/commands.conf
systemctl restart rsyslog.service
precmd() { eval 'RETURN_VAL=$?;logger -p local6.debug "$(whoami) [$$]: $(history | tail -n1 | sed "s/^[ ]*[0-9]\+[ ]*//" ) [$RETURN_VAL]"' }

GitHub Actions Canceled Unexpectedly

How to Activate Noise Cancellation with One AirPod

Settings - Accessibility - AirPods, toggle on Noise Cancellation with One AirPod.

AirPods Pro 开启单只降噪:设置 - 辅助功能 - AirPods,打开 一只 AirPod 入耳时使用降噪

[转]服务端高并发分布式架构演进之路

原文 服务端高并发分布式架构演进之路,本文以淘宝作为例子,介绍从一百个并发到千万级并发情况下服务端的架构的演进过程,同时列举出每个演进阶段会遇到的相关技术,让大家对架构的演进有一个整体的认知,文章最后汇总了一些架构设计的原则。

  1. 0
  2. 1
  3. 2
  4. 3
  5. 4
  6. 5
  7. 6
  8. 7
  9. 8
  10. 9
  11. 10
  12. 11
  13. 12
  14. 13
  15. 14

架构设计的原则:

  1. N+1设计。系统中的每个组件都应做到没有单点故障;
  2. 回滚设计。确保系统可以向前兼容,在系统升级时应能有办法回滚版本;
  3. 禁用设计。应该提供控制具体功能是否可用的配置,在系统出现故障时能够快速下线功能;
  4. 监控设计。在设计阶段就要考虑监控的手段;
  5. 多活数据中心设计。若系统需要极高的高可用,应考虑在多地实施数据中心进行多活,至少在一个机房断电的情况下系统依然可用;
  6. 采用成熟的技术。刚开发的或开源的技术往往存在很多隐藏的bug,出了问题没有商业支持可能会是一个灾难;
  7. 资源隔离设计。应避免单一业务占用全部资源;
  8. 架构应能水平扩展。系统只有做到能水平扩展,才能有效避免瓶颈问题;
  9. 非核心则购买。非核心功能若需要占用大量的研发资源才能解决,则考虑购买成熟的产品;
  10. 使用商用硬件。商用硬件能有效降低硬件故障的机率;
  11. 快速迭代。系统应该快速开发小功能模块,尽快上线进行验证,早日发现问题大大降低系统交付的风险;
  12. 无状态设计。服务接口应该做成无状态的,当前接口的访问不依赖于接口上次访问的状态。

Google Code Review Guide

  1. 原则:给出技术上的建议,而不是个人偏好
  2. 写一个好的 commit:
    1. 第一行,改动的简短摘要
    2. 空行
    3. 详细提交信息
  3. 小修改,多提交
    1. 方便 review/merge/roll back
    2. 利于好的代码设计,减少 bug
  4. Code review 看什么?
    1. 设计
    2. 功能实现是否正确,以及复杂度
    3. 测试
    4. 命名,注释,代码风格,文档等
  5. 尽早 review,尽快 review
  6. 好的 code review comment
    1. Be kind
    2. 只指出问题,让开发人员自己决定怎么修改
    3. Encourage developers to simplify code