Software Design for Performance
Est. read time: 4 minutes | Last updated: January 17, 2025 by John Gentile
Contents
- Overview
- Memory Hierarchy
- Concurrency & Asynchronous Programming
- Real-Time & Embedded
- SIMD & Intrinsics
- GPU Processing
- Performance Tuning
- References
Overview
To write high-performance software (SW), you should understand computer architecture.
Latency numbers everyone should know:
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
Memory Hierarchy
- What Every Programmer Should Know About Memory
- Cache coherence
- Why does the speed of memcpy() drop dramatically every 4KB? - StackOverflow
- Rust zerocopy crate
- The Mechanism behind Measuring Cache Access Latency
Concurrency & Asynchronous Programming
You mainly use concurrency in an application to separate concerns and/or to gain performance. Approaches to concurrency:
- Multiple Processes: separate processes can use OS Interprocess Communication (IPC) features- like signals, sockets, files, pipes, etc.- to pass messages/data. A downisde is IPC can be complicated to setup or slow, and there’s overhead in running multiple processes (OS resources to manage and start). An advantage is IPC can be horizontally scalable, and processes can be run across machines on a network (e.x. when using socket IPC).
- Multiple Threads: you can also run multiple threads within a single process, where all threads share the same address space, and most data can be accessed directly from all threads. This makes the overhead much smaller than sharing data across processes, but this also means software must be more aware of potential problems between threads operating on data concurrently. Threads can be launched much quicker than processes as well.
- We can further divide up parallelism constructs from here into task parallelism (dividing tasks into multiple, concurrent parts) and data parallelism, where each thread can operate on different parts of data (also leading into SIMD hardware parallelism).
In many system languages, threads can be launched/spawned by pointing to a given function (or immediate work in lambda notation). When a thread is launched, it immediately starts doing work while program execution continues in the method that spawned it. We can use joining to wait for a thread to finish execution, and care must be kept for the scope/lifetime of a thread.
Problems with sharing data between threads comes down to consequences of modifying data across threads; if all shared data was read-only, there would be no issues. This can manifest as race conditions which occurs when completing an operation requires modifications of two or more distinct pieces of data between competing threads, and the relative timing can change at runtime. To mitigate this:
- Locking: one option is to wrap the shared data structure with a protection mechanism to ensure that only the thread performing the modification (write) can see the intermediate states (breaking an invariant type). Other threads see this as modification has either already completed or hasn’t started yet.
- Mutex: short for mutual exclusion, one uses this synchronization primitive to lock the mutex associated with a shared data structure when modifying, then unlock the mutex once complete. All other threads that try to access the data structure while the mutex is locked must wait till unlocked. The downside is mutexes can run into deadlock. Or if careful data protection is not considered (a method passes a pointer/reference to the data structure that should be protected by a mutex), a mutex could accidentally be bypassed (so don’t pass pointers to protected data outside the scope of a lock!).
- Lock-Free: another option is lock-free programming where modifications on shared data are performed as atomic changes (indivisible operations).
Architectures
- Staged Event-Driven Architecture - Wikipedia
- The LMAX Architecture - Martin Fowler: ring buffer/queue model to allow concurrency without needing locks
Tools
lscpu -C
can showCOHERENCY-SIZE
as the “minimum amount of data in bytes transferred from memory to cache”.- Can show thread names in htop by F2 → Display options → Show custom thread names
- Tool to measure core-to-core latency
References
- C++ Concurrency in Action, Second Edition
- What Every Systems Programmer Should Know About Concurrency
- crossbeam-rs Learning Resources
- Is Parallel Programming Hard, And, If So, What Can You Do About It? (Release v2023.06.11a)
- C++11 threads, affinity and hyperthreading
Real-Time & Embedded
Real-Time Operating Systems (RTOS)
- FreeRTOS - Market leading RTOS (Real Time Operating System) for embedded systems with Internet of Things extensions
- The Power of Ten - Rules for Developing Safety Critical Code, NASA/JPL
- GN&C Fault Protection Fundamentals, JPL & CalTech
SIMD & Intrinsics
ISA Guides & Reference
- AMD Developer Guides, Manuals & ISA Documents
- Intel 64 and IA-32 Architectures Software Developer Manuals
- Numpy CPU/SIMD Optimizations
- Understanding SIMD: Infinite Complexity of Trivial Problems
GPU Processing
References
- NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax
Performance Tuning
Profiling, Tracing and Benchmarking Tools
- Flame Graphs
- KUtrace: Low-overhead tracing of all Linux kernel-user transitions, for serious performance analysis. Includes kernel patches, loadable module, and post-processing software.
Linux Performance Optimizations
Besides the general/obvious things like stopping unnecessary applications, background services, etc. you can also look into:
- Using taskset and nice to set a process’s CPU affinity (one or more specific core allocation(s)) and process scheduling priority, respectively.
- You can launch a command/process with both using
$ taskset -c 0,1 nice -20 <command>
, which will launch<command>
on cores0
and1
with highest scheduling priority.
- You can launch a command/process with both using
- Use
cpuset
and some other kernel techniques to completely isolate CPU core(s) from the Linux scheduler and/or other interrupts- in this way, you could place processes on that CPU completely isolated from other processes, theoretically uninterrupted. For example, see this SUSE labs tutorial on CPU Isolation.- NOTE:
isolcpus
is now deprecated in Linux kernel.
- NOTE:
- Use Hugepage (or transparent hugepage support) to bump up the size of pages from the default (nominally
4096
Bytes) to some larger size (popularly2MB
all the way to GB+) to optimize the Translation lookaside bufer (TLB) cache to have less misses/entries for virtual-physical memory paging.