Software Design for Performance
Est. read time: 8 minutes | Last updated: March 31, 2025 by John Gentile
Contents
- Overview
- Optimizations for Performant Software
- Concurrency & Asynchronous Programming
- Real-Time & Embedded
- Assembly, SIMD & Intrinsics
- GPU/Other-Accelerator Processing
- Performance Tuning & Tools
- References
Overview
To write high-performance software (SW), you should understand computer architecture.
Latency numbers everyone should know:

```
        0.5 ns - CPU L1 dCACHE reference
          1 ns - speed of light (a photon) travels 1 ft (30.5 cm)
          5 ns - CPU L1 iCACHE branch mispredict
          7 ns - CPU L2 CACHE reference
         71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
        100 ns - MUTEX lock/unlock
        100 ns - own DDR MEMORY reference
        135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
        202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000 ns - Compress 1K bytes with Zippy PROCESS
     20,000 ns - Send 2K bytes over 1 Gbps NETWORK
    250,000 ns - Read 1 MB sequentially from MEMORY
    500,000 ns - Round trip within the same DataCenter
 10,000,000 ns - DISK seek
 10,000,000 ns - Read 1 MB sequentially from NETWORK
 30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
```
Optimizations for Performant Software
Memory & Caching
`lscpu -C` can show `COHERENCY-SIZE`, the “minimum amount of data in bytes transferred from memory to cache”.
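One practical consequence of the coherency size is false sharing: two hot variables that land on the same cache line will ping-pong that line between cores even though the threads never touch each other's data. A minimal C++ sketch of the fix (assuming a C++17 compiler that provides `std::hardware_destructive_interference_size`; 64 bytes is a common value):

```cpp
#include <atomic>
#include <new>     // std::hardware_destructive_interference_size (C++17)
#include <thread>

// Bad: both counters share one cache line, so each core's write
// invalidates the other core's cached copy ("false sharing").
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Better: pad/align each counter to the coherency size so the two
// hot counters live on separate cache lines.
struct Padded {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> a{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<long> b{0};
};

int main() {
    Padded p;
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) p.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) p.b++; });
    t1.join();
    t2.join();
}
```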
- What Every Programmer Should Know About Memory
- Gallery of Processor Cache Effects
- Why does the speed of memcpy() drop dramatically every 4KB? - StackOverflow
- Rust zerocopy crate
- The Mechanism behind Measuring Cache Access Latency
- Low Latency Optimization: Understanding Huge Pages (Part 1) - Hudson River Trading
- Locality of reference - Wikipedia
- Cache coherence - Wikipedia
- A Bounded SPSC queue for Rust
- Why is ringbuf crate so fast? - Reddit
Branchless Programming
- Unrolling your loops can improve branch prediction – Daniel Lemire’s blog
- Vectorizing a loop by making it branchless
- Fastest Branchless Binary Search - mhdm.dev
Branchless Programming: Why “If” is Sloowww… and what we can do about it!
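As a small, hedged illustration of the technique (not taken from any of the links above): replacing a data-dependent branch with arithmetic on the comparison result avoids branch mispredictions and often lets the compiler auto-vectorize the loop.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy: the CPU must predict the comparison for every element;
// random data means ~50% mispredict rate on the if.
int64_t count_less_branchy(const int32_t* v, size_t n, int32_t x) {
    int64_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < x) count++;
    }
    return count;
}

// Branchless: the comparison result (0 or 1) is added unconditionally,
// so there is no data-dependent branch to mispredict.
int64_t count_less_branchless(const int32_t* v, size_t n, int32_t x) {
    int64_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        count += (v[i] < x);  // bool converts to 0 or 1
    }
    return count;
}
```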
Other Improvements & Tricks
Concurrency & Asynchronous Programming
You mainly use concurrency in an application to separate concerns and/or to gain performance. Approaches to concurrency:
- Multiple Processes: separate processes can use OS Interprocess Communication (IPC) features (signals, sockets, files, pipes, etc.) to pass messages/data. A downside is that IPC can be complicated to set up or slow, and there's overhead in running multiple processes (OS resources to manage and start). An advantage is that IPC can be horizontally scalable, since processes can be run across machines on a network (e.g., when using socket IPC).
- Multiple Threads: you can also run multiple threads within a single process, where all threads share the same address space and most data can be accessed directly from all threads. This makes the overhead much smaller than sharing data across processes, but it also means software must be more aware of potential problems between threads operating on data concurrently. Threads can also be launched much more quickly than processes.
- We can further divide up parallelism constructs from here into task parallelism (dividing tasks into multiple, concurrent parts) and data parallelism, where each thread can operate on different parts of data (also leading into SIMD hardware parallelism).
- A thread can be pinned to a specific core (called thread affinity) to avoid latency induced by context switching.
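A hedged, Linux-specific sketch of thread pinning (glibc's `pthread_setaffinity_np` applied to a `std::thread` native handle; the core number is illustrative):

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

int main() {
    std::thread worker([] {
        volatile long sink = 0;
        for (long i = 0; i < 100'000'000; ++i) sink += i;  // stand-in for real work
    });

    // Pin the worker to core 2 so the scheduler won't migrate it
    // (avoiding context-switch/migration latency).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);

    worker.join();
}
```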
In many systems languages, threads can be launched/spawned by pointing to a given function (or to immediate work written as a lambda). When a thread is launched, it immediately starts doing work while program execution continues in the method that spawned it. We can use joining to wait for a thread to finish execution, and care must be taken with the scope/lifetime of a thread.
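For example, in C++ (a minimal sketch using only the standard library):

```cpp
#include <iostream>
#include <thread>

void work(int id) { std::cout << "worker " << id << " running\n"; }

int main() {
    std::thread t1(work, 1);          // spawn from a function (with arguments)
    std::thread t2([] { work(2); });  // or spawn immediate work via a lambda

    // Both threads run concurrently with main() at this point...

    t1.join();  // block until each thread finishes; a std::thread must be
    t2.join();  // joined (or detached) before its object goes out of scope
}
```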
Problems with sharing data between threads come down to the consequences of modifying data across threads; if all shared data were read-only, there would be no issues. Trouble manifests as race conditions, which occur when completing an operation requires modifying two or more distinct pieces of data from competing threads, and the relative timing of those modifications can change at runtime. To mitigate this:
- Locking: one option is to wrap the shared data structure with a protection mechanism to ensure that only the thread performing a modification (write) can see the intermediate states (states that temporarily break the structure's invariants). To all other threads, the modification either has already completed or hasn't started yet.
- Mutex: short for mutual exclusion, this synchronization primitive is locked before modifying the shared data structure it protects and unlocked once the modification is complete. All other threads that try to access the data structure while the mutex is locked must wait until it's unlocked. The downsides: mutexes can run into deadlock, and if careful data protection is not considered (e.g., a method passes out a pointer/reference to the data structure that should be protected by the mutex), the mutex can be accidentally bypassed; so don't pass pointers to protected data outside the scope of a lock!
- Lock-Free: another option is lock-free programming, where modifications on shared data are performed as atomic (indivisible) operations; see the sketch after this list.
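A minimal C++ sketch contrasting the two approaches (illustrative only; the counts and types are arbitrary):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::vector<int> shared_values;          // protected by m

void add_value_locked(int v) {
    std::lock_guard<std::mutex> lock(m); // unlocks automatically at scope exit
    shared_values.push_back(v);          // only one thread can be in here at a time
}

std::atomic<long> counter{0};            // lock-free on mainstream platforms

void add_count() {
    counter.fetch_add(1, std::memory_order_relaxed);  // indivisible increment
}

int main() {
    auto loop = [] {
        for (int i = 0; i < 1000; ++i) { add_value_locked(i); add_count(); }
    };
    std::thread t1(loop), t2(loop);
    t1.join();
    t2.join();
}
```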
Architectures
- Staged Event-Driven Architecture - Wikipedia
- taskflow/taskflow: A General-purpose Parallel and Heterogeneous Task Programming System
- The LMAX Architecture - Martin Fowler: ring buffer/queue model to allow concurrency without needing locks
- iceoryx2: Eclipse iceoryx2™ - true zero-copy inter-process-communication in pure Rust
- CoralRing: an ultra-low-latency, lock-free, garbage-free, batching and concurrent circular queue (ring) in off-heap shared memory for inter-process communication (IPC) in Java across different JVMs using memory-mapped files.
Tools
- You can show custom thread names in `htop` via F2 → Display options → Show custom thread names
- Tool to measure core-to-core latency
References
- C++ Concurrency in Action, Second Edition
- What Every Systems Programmer Should Know About Concurrency
- crossbeam-rs Learning Resources
- Is Parallel Programming Hard, And, If So, What Can You Do About It? (Release v2023.06.11a)
- C++11 threads, affinity and hyperthreading
- Thread pool - Wikipedia
- Communicating sequential processes - Hoare 1978: the classic paper proposing communicating sequential processes as a fundamental structuring method for concurrent programs.
Real-Time & Embedded
Real-Time Operating Systems (RTOS)
- FreeRTOS - Market leading RTOS (Real Time Operating System) for embedded systems with Internet of Things extensions
- The Power of Ten - Rules for Developing Safety Critical Code, NASA/JPL
- GN&C Fault Protection Fundamentals, JPL & CalTech
Assembly, SIMD & Intrinsics
x86-64
- Intel® Intrinsics Guide
- Intel 64 and IA-32 Architectures Software Developer Manuals
- Intel Tuning Guides and Performance Analysis Papers
- uops.info: this website provides more than 700,000 pages with detailed latency, throughput, and port usage data for most instructions on many recent x86 microarchitectures
- AMD Developer Guides, Manuals & ISA Documents
- (Sub-matrix transpose / FFT) Intel® Advanced Vector Extensions 512 (Intel® AVX-512) - Permuting Data Within and Between AVX Registers Technology Guide
AVX512 (1 of 3): Introduction and Overview
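As a minimal taste of the intrinsics documented in the guides above (a sketch assuming an AVX2-capable CPU and appropriate compiler flags, e.g. `-mavx2`):

```cpp
#include <immintrin.h>
#include <cstddef>

// c[i] = a[i] + b[i], processing 8 floats per iteration in 256-bit registers.
void add_f32(const float* a, const float* b, float* c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // unaligned 8-float load
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];    // scalar tail for leftover elements
}
```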
ARM
- ARMv8 AArch64/ARM64 Full Beginner’s Assembly Tutorial - MarioKartWii.com
- ARM Neon
- SIMD ISAs - Neon – Arm Developer
- GitHub - projectNe10/Ne10: An open optimized software library project for the ARM® Architecture
- DLTcollab/sse2neon: A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation
General SIMD/ISA Reference
- SIMD for C++ Developers
- Daniel Lemire, Computer Science Professor
- Numpy CPU/SIMD Optimizations
- Understanding SIMD: Infinite Complexity of Trivial Problems
- FFmpeg School of Assembly Language
- NASM Assembly Language Tutorials - asmtutor.com
- Armadillo: C++ library for linear algebra & scientific computing
- blaze-lib / blaze — Bitbucket
- google/highway: Performance-portable, length-agnostic SIMD with runtime dispatch
- xtensor-stack/xsimd: C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)
- kfrlib/kfr: Fast, modern C++ DSP framework, FFT, Sample Rate Conversion, FIR/IIR/Biquad Filters (SSE, AVX, AVX-512, ARM NEON)
- aff3ct/MIPP: MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX and AVX-512.
- ermig1979/Simd: C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, NEON for ARM.
- OTFFT – FFT library using AVX that is faster than FFTW
- vectorclass/version2: Vector class library, latest version
- rust-lang/portable-simd: The testing ground for the future of portable SIMD in Rust
- simd-everywhere/simde
GPU/Other-Accelerator Processing
NVIDIA CUDA
- How CUDA Programming Works - NVIDIA On-Demand
- How GPU Computing Works - NVIDIA On-Demand
- NVIDIA Developer Blog - Technical content: For developers, by developers
- NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit
- CUDA Toolkit Documentation
- CUDA C++ Programming Guide
- Best Practices Guide :: CUDA Toolkit Documentation
- Unified Memory for CUDA Beginners - NVIDIA Developer Blog
- NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax
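A minimal CUDA C++ sketch of the model these resources describe (compiled with `nvcc`; error handling omitted, and it uses the unified/managed memory covered in the beginners post above):

```cpp
#include <cuda_runtime.h>

// Each GPU thread adds one element; the grid of blocks covers the whole array.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory is accessible from both host and device.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 256 threads per block
    cudaDeviceSynchronize();   // wait for the kernel before reading c on the host

    cudaFree(a); cudaFree(b); cudaFree(c);
}
```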
General/Other References
- OpenMP for GPU offloading documentation
- GPU Programming: When, Why and How? documentation
- Learn OpenGL, extensive tutorial resource for learning Modern OpenGL
- Computer Graphics from Scratch - Gabriel Gambetta
- Hands On OpenCL by HandsOnOpenCL
- GitHub - arrayfire/arrayfire: ArrayFire: a general purpose GPU library.
- MAGMA
- BLAS Tutorial
- AMD-Xilinx AI Engine Development Tutorials
- Intel SYCL llvm/GetStartedGuide.md
- KhronosGroup/Vulkan-Samples: One stop solution for all Vulkan samples
- The Best GPUs for Deep Learning in 2023 — An In-depth Analysis
Performance Tuning & Tools
Profiling, Tracing and Benchmarking Tools
- godbolt: Compiler Explorer to examine machine code output for various compiler toolchains, supporting many languages (e.g., C++, D, Rust, and Go). Can also be used to compare the output of compiler autovectorization versus intrinsic usage.
- Flame Graphs
- KUtrace: Low-overhead tracing of all Linux kernel-user transitions, for serious performance analysis. Includes kernel patches, loadable module, and post-processing software.
- HPerf - Linux perf trace visualizer
- janestreet/magic-trace: magic-trace collects and displays high-resolution traces of what a process is doing
- KDAB/hotspot: The Linux perf GUI for performance analysis.
- Fix Performance Bottlenecks with Intel® VTune™ Profiler
Linux Performance Optimizations
Besides the general/obvious things, like stopping unnecessary applications, background services, etc., you can also look into:
- Using `taskset` and `nice` to set a process's CPU affinity (one or more specific core allocation(s)) and process scheduling priority, respectively.
  - You can launch a command/process with both using `$ taskset -c 0,1 nice -n -20 <command>`, which will launch `<command>` on cores `0` and `1` with the highest scheduling priority.
- Use `cpuset` and some other kernel techniques to completely isolate CPU core(s) from the Linux scheduler and/or other interrupts; in this way, you could place processes on that CPU completely isolated from other processes, theoretically uninterrupted. For example, see this SUSE Labs tutorial on CPU Isolation.
  - NOTE: `isolcpus` is now deprecated in the Linux kernel.
- Use hugepages (or transparent hugepage support) to bump the page size up from the default (nominally `4096` bytes) to some larger size (popularly `2MB`, all the way to GB+) so the Translation Lookaside Buffer (TLB) cache needs fewer entries and takes fewer misses for virtual-to-physical memory paging.
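A hedged, Linux-specific sketch of requesting an explicit hugepage with `mmap` (assumes hugepages have been reserved, e.g. via `/proc/sys/vm/nr_hugepages`; the call fails otherwise):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t len = 2 * 1024 * 1024;  // one 2 MB hugepage

    // MAP_HUGETLB requests explicit hugepages, so far fewer TLB entries
    // are needed to cover the same amount of virtual address space.
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");     // e.g. no hugepages reserved
        return 1;
    }
    // ... use the buffer ...
    munmap(p, len);
}
```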
Auto-Vectorization & High-Level Abstractions
- Numba: A High Performance Python Compiler
- taichi-dev/taichi: Productive and portable programming language for high-performance, sparse and differentiable computing on CPUs and GPUs
- Dask - Scale the Python tools you love
- Auto-Vectorization in LLVM — LLVM documentation
- High Performance Data Analytics in Python — HPDA-Python documentation
- Making Python 100x faster with less than 100 lines of Rust
References
- Agner Fog - Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X
- Performance-Aware Programming Series - Casey Muratori
- Performance Engineering of Software Systems - MIT OpenCourseWare, 2018
- Brendan Gregg’s Website
- Algorithms for Modern Hardware - Algorithmica
- dendibakh/perf-ninja: This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
- C++ Design Patterns for Low-latency Applications Including High-frequency Trading
- Performance Analysis and Tuning on Modern CPUs - Denis Bakhvalov
- The Art of Writing Efficient Programs: An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples - Fedor G. Pikus
- Introduction to High Performance Scientific Computing
YouTube Videos
When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - CppCon 2024
Unlocking Modern CPU Power - Next-Gen C++ Optimization Techniques - Fedor G Pikus - C++Now 2024
CppCon 2017: Carl Cook “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”
CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?”
Branchless Programming in C++ - Fedor Pikus - CppCon 2021
Trading at light speed: designing low latency systems in C++ - David Gross - Meeting C++ 2022