Over the years, improving existing systems has become one of my biggest
passions, and not only because I mainly work with infrastructure and
operations. Understanding a system’s complexity and then improving it can be as challenging
as black magic. I’d like to share some of my experiences in this area.
Step 1 – Overview. Before you start searching Google for
“tuning tcp/ip on tomcat” or whatever you need to improve, create a holistic
view, either top down or bottom up, from the user to the database, covering every component
used in the workflow. You might find later that something you least expected is
causing the latency. For example, a busy web cluster does millions of DNS
lookups for the same backend server because it has no DNS cache. But the delay is not
caused by the lookups themselves; it is caused by the firewall between the layers randomly dropping
UDP packets, so some lookups stall until the resolver times out and retries.
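To make the DNS example concrete, here is a minimal sketch of an in-process lookup cache with a time-to-live. The class name `CachingResolver`, the 30-second TTL, and the hostname `backend.internal` are my own illustrative assumptions, not part of any real setup. Note that a cache like this only reduces how often you are exposed to dropped packets; it does not fix the firewall.

```python
import socket
import time


class CachingResolver:
    """Cache hostname lookups for `ttl` seconds to avoid repeated DNS queries.

    Hypothetical helper for illustration; `resolve` and `clock` are injectable
    so the behaviour can be exercised without touching the network.
    """

    def __init__(self, ttl=30.0, resolve=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl
        self._resolve = resolve
        self._clock = clock
        self._cache = {}  # hostname -> (address, expiry time)
        self.misses = 0   # number of lookups that actually hit the resolver

    def lookup(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no DNS query is sent
        self.misses += 1
        address = self._resolve(hostname)
        self._cache[hostname] = (address, now + self.ttl)
        return address


if __name__ == "__main__":
    # Stand-in resolver so the demo runs offline; a real one would query DNS.
    resolver = CachingResolver(resolve=lambda host: "10.0.0.5")
    for _ in range(1000):
        resolver.lookup("backend.internal")
    print(f"1000 lookups, {resolver.misses} real DNS queries")
```

With the default `resolve=socket.gethostbyname`, the same object would perform real lookups and send at most one query per hostname per TTL window.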
Step 2 – Measure. Tuning a complex system can drain enormous
resources, and you might feel hopeless or clueless from time to time. The key
to successful tuning is measurement, not only end to end but also between the
services and applications. If you cannot measure it, you cannot improve it and,
more importantly, you won’t know how much you’ve improved. Review the overview and
measure the interesting paths. The more data you collect, the easier it is to find the
bottleneck.
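As a rough illustration of measuring between components and not only end to end, here is a minimal sketch that times each stage of a request. The stage names and the `time.sleep` calls are placeholders I made up to stand in for real work; in practice you would wrap your actual DNS lookup, backend call, and query, and ship the numbers to something like graphite.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Collected durations in seconds, keyed by stage name.
timings = defaultdict(list)


@contextmanager
def measure(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


def slowest_stage(exclude=("end_to_end",)):
    """Return the stage with the largest total time, ignoring `exclude`."""
    candidates = {s: sum(v) for s, v in timings.items() if s not in exclude}
    return max(candidates, key=candidates.get)


def handle_request():
    """Hypothetical request path; the sleeps stand in for real work."""
    with measure("end_to_end"):
        with measure("dns_lookup"):
            time.sleep(0.001)
        with measure("backend_call"):
            time.sleep(0.02)
        with measure("db_query"):
            time.sleep(0.005)


if __name__ == "__main__":
    for _ in range(5):
        handle_request()
    for stage, samples in sorted(timings.items()):
        avg_ms = sum(samples) / len(samples) * 1000
        print(f"{stage}: n={len(samples)} avg={avg_ms:.1f} ms")
    print("slowest stage:", slowest_stage())
```

Keeping the end-to-end number next to the per-stage numbers shows how much of the total each stage accounts for, which is what tells you whether an improvement actually mattered.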
Step 3 – Bias Free. When performance becomes an issue,
people tend to blame unknowns to protect themselves. We also love to attack the
symptom rather than look a couple of steps further for the root cause. Treating the symptom might be
a perfectly reasonable action, but the real gain usually comes from resolving the root
cause. In reality, we need both short-term mitigation and long-term resolution. Bias
will blind your judgment and instinct. In many real-world cases, the
symptom is only a reflection of the real problem.
Last but not least, you need a great toolbox to complete
your mission. Depending on the situation, you might need different tools. Here are
some of my favorites.
- logstash – a great central logging system
- tcpdump, dsniff, wireshark – packet sniffing
- graphite – graph tool, perfect for time-series data
- ab, siege – http load test tools
- new relic, tracelytics – full stack performance insight tools
- sysstat, vmstat, iostat, htop – OS level monitoring tools
- strace, valgrind, systemtap – deep OS level troubleshooting tools
- iperf – TCP/UDP bandwidth measurement tool
- charles – client side web debugging proxy