Est. read time: 4 minutes | Last updated: January 17, 2025 by John Gentile


Contents

Overview

To write high-performance software (SW), you should understand computer architecture.

Latency numbers everyone should know:

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

Memory Hierarchy

Concurrency & Asynchronous Programming

You mainly use concurrency in an application to separate concerns and/or to gain performance. Approaches to concurrency:

  • Multiple Processes: separate processes can use OS Interprocess Communication (IPC) features- like signals, sockets, files, pipes, etc.- to pass messages/data. A downisde is IPC can be complicated to setup or slow, and there’s overhead in running multiple processes (OS resources to manage and start). An advantage is IPC can be horizontally scalable, and processes can be run across machines on a network (e.x. when using socket IPC).
  • Multiple Threads: you can also run multiple threads within a single process, where all threads share the same address space, and most data can be accessed directly from all threads. This makes the overhead much smaller than sharing data across processes, but this also means software must be more aware of potential problems between threads operating on data concurrently. Threads can be launched much quicker than processes as well.
    • We can further divide up parallelism constructs from here into task parallelism (dividing tasks into multiple, concurrent parts) and data parallelism, where each thread can operate on different parts of data (also leading into SIMD hardware parallelism).

In many system languages, threads can be launched/spawned by pointing to a given function (or immediate work in lambda notation). When a thread is launched, it immediately starts doing work while program execution continues in the method that spawned it. We can use joining to wait for a thread to finish execution, and care must be kept for the scope/lifetime of a thread.

Problems with sharing data between threads comes down to consequences of modifying data across threads; if all shared data was read-only, there would be no issues. This can manifest as race conditions which occurs when completing an operation requires modifications of two or more distinct pieces of data between competing threads, and the relative timing can change at runtime. To mitigate this:

  • Locking: one option is to wrap the shared data structure with a protection mechanism to ensure that only the thread performing the modification (write) can see the intermediate states (breaking an invariant type). Other threads see this as modification has either already completed or hasn’t started yet.
    • Mutex: short for mutual exclusion, one uses this synchronization primitive to lock the mutex associated with a shared data structure when modifying, then unlock the mutex once complete. All other threads that try to access the data structure while the mutex is locked must wait till unlocked. The downside is mutexes can run into deadlock. Or if careful data protection is not considered (a method passes a pointer/reference to the data structure that should be protected by a mutex), a mutex could accidentally be bypassed (so don’t pass pointers to protected data outside the scope of a lock!).
  • Lock-Free: another option is lock-free programming where modifications on shared data are performed as atomic changes (indivisible operations).

Architectures

Tools

  • lscpu -C can show COHERENCY-SIZE as the “minimum amount of data in bytes transferred from memory to cache”.
  • Can show thread names in htop by F2 → Display options → Show custom thread names
  • Tool to measure core-to-core latency

References

Real-Time & Embedded

Real-Time Operating Systems (RTOS)

SIMD & Intrinsics

ISA Guides & Reference

GPU Processing

References

  • NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax

Performance Tuning

Profiling, Tracing and Benchmarking Tools

Linux Performance Optimizations

Besides the general/obvious things like stopping unnecessary applications, background services, etc. you can also look into:

  • Using taskset and nice to set a process’s CPU affinity (one or more specific core allocation(s)) and process scheduling priority, respectively.
    • You can launch a command/process with both using $ taskset -c 0,1 nice -20 <command>, which will launch <command> on cores 0 and 1 with highest scheduling priority.
  • Use cpuset and some other kernel techniques to completely isolate CPU core(s) from the Linux scheduler and/or other interrupts- in this way, you could place processes on that CPU completely isolated from other processes, theoretically uninterrupted. For example, see this SUSE labs tutorial on CPU Isolation.
    • NOTE: isolcpus is now deprecated in Linux kernel.
  • Use Hugepage (or transparent hugepage support) to bump up the size of pages from the default (nominally 4096 Bytes) to some larger size (popularly 2MB all the way to GB+) to optimize the Translation lookaside bufer (TLB) cache to have less misses/entries for virtual-physical memory paging.

References