Software Design for Performance
Est. read time: 2 minutes | Last updated: September 23, 2024 by John Gentile
Contents
- Overview
- Memory Hierarchy
- High-Performance Networking
- Concurrency & Asynchronous Programming
- Real-Time
- Intrinsics
- GPU Processing
- Profiling, Tracing and Benchmarking Tools
- References
Overview
To write high-performance software (SW), you should understand computer architecture.
Latency numbers everyone should know:
0.5 ns - CPU L1 dCACHE reference
1 ns - light (a photon) travels 1 ft (30.5 cm)
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within the same datacenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
(Digit groups, read right to left: ns, us, ms.)
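Some of these figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the helper names are hypothetical, "2K" is assumed to mean 2,000 bytes and "1 MB" to mean 10^6 bytes, and propagation delay and protocol overhead are ignored.

```python
def wire_time_ns(num_bytes: int, link_bits_per_sec: float) -> float:
    """Time to serialize num_bytes onto a link, in nanoseconds.

    Assumption: ignores propagation delay, framing, and protocol overhead.
    """
    return num_bytes * 8 / link_bits_per_sec * 1e9


def implied_throughput_mb_per_s(num_bytes: int, latency_ns: float) -> float:
    """Throughput (MB/s, with 1 MB = 10^6 bytes) implied by moving num_bytes in latency_ns."""
    return (num_bytes / 1e6) / (latency_ns / 1e9)


if __name__ == "__main__":
    # 2,000 bytes over 1 Gbps: roughly 16,000 ns, the same order as the 20,000 ns listed.
    print(f"{wire_time_ns(2_000, 1e9):,.0f} ns")

    # 1 MB sequential reads, using the latencies from the list.
    for name, latency_ns in [("memory", 250_000), ("network", 10_000_000), ("disk", 30_000_000)]:
        print(f"{name}: ~{implied_throughput_mb_per_s(1_000_000, latency_ns):,.0f} MB/s")
```

Read this way, the list implies roughly 4 GB/s for sequential memory reads, about 100 MB/s for the network, and about 33 MB/s for disk, which is why the ratios matter more than the absolute values on any particular machine.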
Memory Hierarchy
- What Every Programmer Should Know About Memory
- Cache coherence
- Why does the speed of memcpy() drop dramatically every 4KB? - StackOverflow
- Rust zerocopy crate
High-Performance Networking
- Data Plane Development Kit (DPDK)
- How to receive a million packets per second
- High Performance Browser Networking by Ilya Grigorik
- A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC - Paper on AWS Scalable Reliable Datagram (SRD)
Packet FEC in lieu of Retransmission
When latency is critical (the receiver can’t block waiting for a retransmission) on lossy networks (e.g. WAN, intermittent links), Forward Error Correction (FEC) techniques, similar to those used at the physical layer, can be applied at the network layer. For instance, in SD-WAN FEC, the sender transmits extra “parity” packets for every $N$ data packets, so the receiver can recover lost packets on the link without a retransmission. See more details on Information Theory.
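As a rough illustration of the parity idea, the sketch below uses the simplest possible scheme: one XOR parity packet per group of $N$ equal-length packets, which can repair at most one loss per group. This is an assumption made for illustration, not a description of what any particular SD-WAN product implements (real deployments may use stronger codes), and the function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Parity packet for a group: the XOR of all N data packets."""
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild a group in which at most one packet (marked None) was lost."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair at most one loss")
    repaired = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the parity with every surviving packet yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


if __name__ == "__main__":
    group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]  # N = 4
    parity = make_parity(group)
    damaged = [group[0], None, group[2], group[3]]  # packet 1 lost in transit
    print(recover(damaged, parity) == group)
```

The trade-off is bandwidth for latency: every group costs one extra packet (overhead of $1/N$), but a single loss is repaired as soon as the rest of the group arrives, with no round trip to the sender.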
Concurrency & Asynchronous Programming
Real-Time
Real-Time Operating Systems (RTOS)
- FreeRTOS
- The Power of Ten - Rules for Developing Safety Critical Code, NASA/JPL
- GN&C Fault Protection Fundamentals, JPL & CalTech
Intrinsics
ISA Guides & Reference
- AMD Developer Guides, Manuals & ISA Documents
- Intel 64 and IA-32 Architectures Software Developer Manuals
GPU Processing
References
- NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax
Profiling, Tracing and Benchmarking Tools
- Flame Graphs
- KUtrace: Low-overhead tracing of all Linux kernel-user transitions, for serious performance analysis. Includes kernel patches, loadable module, and post-processing software.
References
- Numpy CPU/SIMD Optimizations
- Cache Prefetching
- Intel Tuning Guides and Performance Analysis Papers
- Brendan Gregg’s Website