Software Design for Performance
Est. read time: 8 minutes | Last updated: March 31, 2025 by John Gentile
Contents
- Overview
- Optimizations for Performant Software
- Concurrency & Asynchronous Programming
- Real-Time & Embedded
- Assembly, SIMD & Intrinsics
- GPU/Other-Accelerator Processing
- Performance Tuning & Tools
- References
Overview
To write high-performance software (SW), you should understand computer architecture.
Latency numbers everyone should know:

```
        0.5 ns - CPU L1 dCACHE reference
          1 ns - speed of light (a photon) travels 1 ft (30.5 cm)
          5 ns - CPU L1 iCACHE branch mispredict
          7 ns - CPU L2 CACHE reference
         71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
        100 ns - MUTEX lock/unlock
        100 ns - own DDR MEMORY reference
        135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
        202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000 ns - Compress 1K bytes with Zippy PROCESS
     20,000 ns - Send 2K bytes over 1 Gbps NETWORK
    250,000 ns - Read 1 MB sequentially from MEMORY
    500,000 ns - Round trip within the same DataCenter
 10,000,000 ns - DISK seek
 10,000,000 ns - Read 1 MB sequentially from NETWORK
 30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
```
Optimizations for Performant Software
Memory & Caching
`lscpu -C` can show `COHERENCY-SIZE`, the “minimum amount of data in bytes transferred from memory to cache”.
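One practical consequence of the coherency size is false sharing: two hot variables that land on the same cache line will ping-pong that line between cores even though the threads never touch each other's data. A minimal C++ sketch of the fix (assuming a C++17 compiler that provides `std::hardware_destructive_interference_size`; 64 bytes is a common value):

```cpp
#include <atomic>
#include <new>     // std::hardware_destructive_interference_size (C++17)
#include <thread>

// Bad: both counters share one cache line, so each core's write
// invalidates the other core's cached copy ("false sharing").
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Better: pad/align each counter to the coherency size so the two
// hot counters live on separate cache lines.
struct Padded {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> a{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<long> b{0};
};

int main() {
    Padded p;
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) p.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) p.b++; });
    t1.join();
    t2.join();
}
```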
- What Every Programmer Should Know About Memory
- Gallery of Processor Cache Effects
- Why does the speed of memcpy() drop dramatically every 4KB? - StackOverflow
- Rust zerocopy crate
- The Mechanism behind Measuring Cache Access Latency
- Low Latency Optimization: Understanding Huge Pages (Part 1) - Hudson River Trading
- Locality of reference - Wikipedia
- Cache coherence - Wikipedia
- A Bounded SPSC queue for Rust
- Why is ringbuf crate so fast? - Reddit
Branchless Programming
- Unrolling your loops can improve branch prediction – Daniel Lemire’s blog
- Vectorizing a loop by making it branchless
- Fastest Branchless Binary Search - mhdm.dev
Branchless Programming: Why “If” is Sloowww… and what we can do about it!
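As a small, hedged illustration of the technique (not taken from any of the links above): replacing a data-dependent branch with arithmetic on the comparison result avoids branch mispredictions and often lets the compiler auto-vectorize the loop.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy: the CPU must predict the comparison for every element;
// random data means ~50% mispredict rate on the if.
int64_t count_less_branchy(const int32_t* v, size_t n, int32_t x) {
    int64_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < x) count++;
    }
    return count;
}

// Branchless: the comparison result (0 or 1) is added unconditionally,
// so there is no data-dependent branch to mispredict.
int64_t count_less_branchless(const int32_t* v, size_t n, int32_t x) {
    int64_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        count += (v[i] < x);  // bool converts to 0 or 1
    }
    return count;
}
```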
Other Improvements & Tricks
Concurrency & Asynchronous Programming
You mainly use concurrency in an application to separate concerns and/or to gain performance. Approaches to concurrency:
- Multiple Processes: separate processes can use OS Interprocess Communication (IPC) features (signals, sockets, files, pipes, etc.) to pass messages/data. A downside is that IPC can be complicated to set up or slow, and there's overhead in running multiple processes (OS resources to manage and start). An advantage is that IPC can be horizontally scalable, since processes can be run across machines on a network (e.g., when using socket IPC).
- Multiple Threads: you can also run multiple threads within a single process, where all threads share the same address space and most data can be accessed directly from all threads. This makes the overhead much smaller than sharing data across processes, but it also means software must be more aware of potential problems between threads operating on data concurrently. Threads can also be launched much more quickly than processes.
- We can further divide up parallelism constructs from here into task parallelism (dividing tasks into multiple, concurrent parts) and data parallelism, where each thread can operate on different parts of data (also leading into SIMD hardware parallelism).
- A thread can be pinned to a specific core (called thread affinity) to avoid latency induced by context switching.
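A hedged, Linux-specific sketch of thread pinning (glibc's `pthread_setaffinity_np` applied to a `std::thread` native handle; the core number is illustrative):

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

int main() {
    std::thread worker([] {
        volatile long sink = 0;
        for (long i = 0; i < 100'000'000; ++i) sink += i;  // stand-in for real work
    });

    // Pin the worker to core 2 so the scheduler won't migrate it
    // (avoiding context-switch/migration latency).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);

    worker.join();
}
```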
In many systems languages, threads can be launched/spawned by pointing to a given function (or to immediate work written as a lambda). When a thread is launched, it immediately starts doing work while program execution continues in the method that spawned it. We can use joining to wait for a thread to finish execution, and care must be taken with the scope/lifetime of a thread.
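For example, in C++ (a minimal sketch using only the standard library):

```cpp
#include <iostream>
#include <thread>

void work(int id) { std::cout << "worker " << id << " running\n"; }

int main() {
    std::thread t1(work, 1);          // spawn from a function (with arguments)
    std::thread t2([] { work(2); });  // or spawn immediate work via a lambda

    // Both threads run concurrently with main() at this point...

    t1.join();  // block until each thread finishes; a std::thread must be
    t2.join();  // joined (or detached) before its object goes out of scope
}
```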
Problems with sharing data between threads come down to the consequences of modifying data across threads; if all shared data were read-only, there would be no issues. Trouble manifests as race conditions, which occur when completing an operation requires modifying two or more distinct pieces of data from competing threads, and the relative timing of those modifications can change at runtime. To mitigate this:
- Locking: one option is to wrap the shared data structure with a protection mechanism to ensure that only the thread performing a modification (write) can see the intermediate states (states that temporarily break the structure's invariants). To all other threads, the modification either has already completed or hasn't started yet.
- Mutex: short for mutual exclusion, this synchronization primitive is locked before modifying the shared data structure it protects and unlocked once the modification is complete. All other threads that try to access the data structure while the mutex is locked must wait until it's unlocked. The downsides: mutexes can run into deadlock, and if careful data protection is not considered (e.g., a method passes out a pointer/reference to the data structure that should be protected by the mutex), the mutex can be accidentally bypassed; so don't pass pointers to protected data outside the scope of a lock!
- Lock-Free: another option is lock-free programming, where modifications on shared data are performed as atomic (indivisible) operations; see the sketch after this list.
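A minimal C++ sketch contrasting the two approaches (illustrative only; the counts and types are arbitrary):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::vector<int> shared_values;          // protected by m

void add_value_locked(int v) {
    std::lock_guard<std::mutex> lock(m); // unlocks automatically at scope exit
    shared_values.push_back(v);          // only one thread can be in here at a time
}

std::atomic<long> counter{0};            // lock-free on mainstream platforms

void add_count() {
    counter.fetch_add(1, std::memory_order_relaxed);  // indivisible increment
}

int main() {
    auto loop = [] {
        for (int i = 0; i < 1000; ++i) { add_value_locked(i); add_count(); }
    };
    std::thread t1(loop), t2(loop);
    t1.join();
    t2.join();
}
```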
Architectures
- Staged Event-Driven Architecture - Wikipedia
- taskflow/taskflow: A General-purpose Parallel and Heterogeneous Task Programming System
- The LMAX Architecture - Martin Fowler: ring buffer/queue model to allow concurrency without needing locks
- iceoryx2: Eclipse iceoryx2™ - true zero-copy inter-process-communication in pure Rust
- CoralRing: an ultra-low-latency, lock-free, garbage-free, batching and concurrent circular queue (ring) in off-heap shared memory for inter-process communication (IPC) in Java across different JVMs using memory-mapped files.
Tools
- You can show custom thread names in `htop` via F2 → Display options → Show custom thread names
- Tool to measure core-to-core latency
References
- C++ Concurrency in Action, Second Edition
- What Every Systems Programmer Should Know About Concurrency
- crossbeam-rs Learning Resources
- Is Parallel Programming Hard, And, If So, What Can You Do About It? (Release v2023.06.11a)
- C++11 threads, affinity and hyperthreading
- Thread pool - Wikipedia
- Communicating sequential processes - Hoare 1978: the classic paper proposing communicating sequential processes as a fundamental structuring method for concurrent programs.
Real-Time & Embedded
Real-Time Operating Systems (RTOS)
- FreeRTOS - Market leading RTOS (Real Time Operating System) for embedded systems with Internet of Things extensions
- The Power of Ten - Rules for Developing Safety Critical Code, NASA/JPL
- GN&C Fault Protection Fundamentals, JPL & CalTech
Assembly, SIMD & Intrinsics
x86-64
- Intel® Intrinsics Guide
- Intel 64 and IA-32 Architectures Software Developer Manuals
- Intel Tuning Guides and Performance Analysis Papers
- uops.info: this website provides more than 700,000 pages with detailed latency, throughput, and port usage data for most instructions on many recent x86 microarchitectures
- AMD Developer Guides, Manuals & ISA Documents
- (Sub-matrix transpose / FFT) Intel® Advanced Vector Extensions 512 (Intel® AVX-512) - Permuting Data Within and Between AVX Registers Technology Guide
AVX512 (1 of 3): Introduction and Overview
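As a minimal taste of the intrinsics documented in the guides above (a sketch assuming an AVX2-capable CPU and appropriate compiler flags, e.g. `-mavx2`):

```cpp
#include <immintrin.h>
#include <cstddef>

// c[i] = a[i] + b[i], processing 8 floats per iteration in 256-bit registers.
void add_f32(const float* a, const float* b, float* c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // unaligned 8-float load
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];    // scalar tail for leftover elements
}
```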
ARM
- ARMv8 AArch64/ARM64 Full Beginner’s Assembly Tutorial - MarioKartWii.com
- ARM Neon
- SIMD ISAs - Neon – Arm Developer
- GitHub - projectNe10/Ne10: An open optimized software library project for the ARM® Architecture
- DLTcollab/sse2neon: A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
General SIMD/ISA Reference
- SIMD for C++ Developers
- Daniel Lemire, Computer Science Professor
- Numpy CPU/SIMD Optimizations
- Understanding SIMD: Infinite Complexity of Trivial Problems
- FFmpeg School of Assembly Language
- NASM Assembly Language Tutorials - asmtutor.com
- Armadillo: C++ library for linear algebra & scientific computing
- blaze-lib / blaze — Bitbucket
- google/highway: Performance-portable, length-agnostic SIMD with runtime dispatch
- xtensor-stack/xsimd: C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)
- kfrlib/kfr: Fast, modern C++ DSP framework, FFT, Sample Rate Conversion, FIR/IIR/Biquad Filters (SSE, AVX, AVX-512, ARM NEON)
- aff3ct/MIPP: MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX and AVX-512.
- ermig1979/Simd: C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, NEON for ARM.
- OTFFT – FFT library using AVX that is faster than FFTW
- vectorclass/version2: Vector class library, latest version
- rust-lang/portable-simd: The testing ground for the future of portable SIMD in Rust
- simd-everywhere/simde
GPU/Other-Accelerator Processing
NVIDIA CUDA
- How CUDA Programming Works - NVIDIA On-Demand
- How GPU Computing Works - NVIDIA On-Demand
- NVIDIA Developer Blog - Technical content: For developers, by developers
- NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit
- CUDA Toolkit Documentation
- CUDA C++ Programming Guide
- Best Practices Guide :: CUDA Toolkit Documentation
- Unified Memory for CUDA Beginners - NVIDIA Developer Blog
- NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax
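A minimal CUDA C++ sketch of the model these resources describe (compiled with `nvcc`; error handling omitted, and it uses the unified/managed memory covered in the beginners post above):

```cpp
#include <cuda_runtime.h>

// Each GPU thread adds one element; the grid of blocks covers the whole array.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory is accessible from both host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 256 threads per block
    cudaDeviceSynchronize();   // wait for the kernel before reading c on the host

    cudaFree(a); cudaFree(b); cudaFree(c);
}
```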
General/Other References
- OpenMP for GPU offloading documentation
- GPU Programming: When, Why and How? documentation
- Learn OpenGL, extensive tutorial resource for learning Modern OpenGL
- Computer Graphics from Scratch - Gabriel Gambetta
- Hands On OpenCL by HandsOnOpenCL
- GitHub - arrayfire/arrayfire: ArrayFire: a general purpose GPU library.
- MAGMA
- BLAS Tutorial
- AMD-Xilinx AI Engine Development Tutorials
- Intel SYCL llvm/GetStartedGuide.md
- KhronosGroup/Vulkan-Samples: One stop solution for all Vulkan samples
- The Best GPUs for Deep Learning in 2023 — An In-depth Analysis
Performance Tuning & Tools
Profiling, Tracing and Benchmarking Tools
- godbolt: Compiler Explorer to examine machine code output for various compiler toolchains, supporting many languages (e.g., C++, D, Rust, and Go). Can also be used to compare the output of compiler autovectorization versus intrinsic usage.
- Flame Graphs
- KUtrace: Low-overhead tracing of all Linux kernel-user transitions, for serious performance analysis. Includes kernel patches, loadable module, and post-processing software.
- HPerf - Linux perf trace visualizer
- janestreet/magic-trace: magic-trace collects and displays high-resolution traces of what a process is doing
- KDAB/hotspot: The Linux perf GUI for performance analysis.
- Fix Performance Bottlenecks with Intel® VTune™ Profiler
Linux Performance Optimizations
Besides the general/obvious things, like stopping unnecessary applications, background services, etc., you can also look into:
- Using `taskset` and `nice` to set a process's CPU affinity (one or more specific core allocation(s)) and process scheduling priority, respectively.
  - You can launch a command/process with both using `$ taskset -c 0,1 nice -n -20 <command>`, which will launch `<command>` on cores `0` and `1` with the highest scheduling priority.
- Use `cpuset` and some other kernel techniques to completely isolate CPU core(s) from the Linux scheduler and/or other interrupts; in this way, you could place processes on that CPU completely isolated from other processes, theoretically uninterrupted. For example, see this SUSE Labs tutorial on CPU Isolation.
  - NOTE: `isolcpus` is now deprecated in the Linux kernel.
- Use hugepages (or transparent hugepage support) to bump the page size up from the default (nominally `4096` bytes) to some larger size (popularly `2MB`, all the way to GB+) so the Translation Lookaside Buffer (TLB) cache needs fewer entries and takes fewer misses for virtual-to-physical memory paging.
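A hedged, Linux-specific sketch of requesting an explicit hugepage with `mmap` (assumes hugepages have been reserved, e.g. via `/proc/sys/vm/nr_hugepages`; the call fails otherwise):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t len = 2 * 1024 * 1024;  // one 2 MB hugepage

    // MAP_HUGETLB requests explicit hugepages, so far fewer TLB entries
    // are needed to cover the same amount of virtual address space.
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");     // e.g. no hugepages reserved
        return 1;
    }
    // ... use the buffer ...
    munmap(p, len);
}
```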
Auto-Vectorization & High-Level Abstractions
- Numba: A High Performance Python Compiler
- taichi-dev/taichi: Productive and portable programming language for high-performance, sparse and differentiable computing on CPUs and GPUs
- Dask - Scale the Python tools you love
- Auto-Vectorization in LLVM — LLVM documentation
- High Performance Data Analytics in Python — HPDA-Python documentation
- Making Python 100x faster with less than 100 lines of Rust
References
- Agner Fog - Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X
- Performance-Aware Programming Series - Casey Muratori
- Performance Engineering of Software Systems - MIT OpenCourseWare, 2018
- Brendan Gregg’s Website
- Algorithms for Modern Hardware - Algorithmica
- dendibakh/perf-ninja: This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
- C++ Design Patterns for Low-latency Applications Including High-frequency Trading
- Performance Analysis and Tuning on Modern CPUs - Denis Bakhvalov
- The Art of Writing Efficient Programs: An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples - Fedor G. Pikus
- Introduction to High Performance Scientific Computing
YouTube Videos
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024
Unlocking Modern CPU Power - Next-Gen C++ Optimization Techniques - Fedor G Pikus - C++Now 2024
CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?”
Branchless Programming in C++ - Fedor Pikus - CppCon 2021
Trading at light speed: designing low latency systems in C++ - David Gross - Meeting C++ 2022