Est. read time: 2 minutes | Last updated: September 23, 2024 by John Gentile



Overview

To write high-performance software (SW), you should understand computer architecture.

Latency numbers everyone should know:

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - local DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within the same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  (scale: 1,000 ns = 1 us; 1,000,000 ns = 1 ms)
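
The cache and memory reference numbers above can be ballpark-measured on your own machine with a dependent-load ("pointer chasing") microbenchmark: because each load's address comes from the previous load, the hardware prefetcher cannot hide the latency. The C++ sketch below is a minimal illustration, not a rigorous benchmark; the working-set sizes, iteration count, and use of std::chrono are assumptions chosen for readability.

    // Minimal pointer-chasing sketch: walking a random single-cycle permutation
    // makes every load depend on the previous one, so the average time per step
    // approximates load-to-use latency for whichever level of the memory
    // hierarchy the working set fits in.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    static volatile std::size_t g_sink; // keeps the chase from being optimized away

    static double ns_per_load(std::size_t n_elems, std::size_t iters) {
        // Sattolo's algorithm builds a random permutation that is one big cycle,
        // so the chase visits every element before repeating.
        std::vector<std::size_t> next(n_elems);
        std::iota(next.begin(), next.end(), std::size_t{0});
        std::mt19937_64 rng{42};
        for (std::size_t i = n_elems - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        std::size_t idx = 0;
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < iters; ++i)
            idx = next[idx]; // serially dependent loads
        const auto t1 = std::chrono::steady_clock::now();
        g_sink = idx;

        return std::chrono::duration<double, std::nano>(t1 - t0).count() /
               static_cast<double>(iters);
    }

    int main() {
        // 8-byte elements: ~32 KiB fits a typical L1 dCACHE, ~256 MiB spills to DRAM.
        std::printf("L1-sized working set:   %5.1f ns/load\n",
                    ns_per_load(4 * 1024, 10'000'000));
        std::printf("DRAM-sized working set: %5.1f ns/load\n",
                    ns_per_load(32 * 1024 * 1024, 10'000'000));
        return 0;
    }

Built with optimizations (e.g. g++ -O2), the small working set should land in the low single-digit ns/load (L1 latency plus loop overhead) and the large one near the ~100 ns main-memory figure above; exact values depend on the CPU and compiler.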

Memory Hierarchy

High-Performance Networking

Packet FEC in lieu of Retransmission

When latency is critical (you cannot afford to wait or block on packet loss) over lossy networks (e.g. WANs, intermittent links, etc.), Forward Error Correction (FEC) techniques, similar to those used at the physical layer, can be applied at the network layer. For instance, in SD-WAN FEC, lost packets can be recovered on a link by sending an extra “parity” packet for every $N$ data packets. See Information Theory for more details.
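
As a concrete illustration, the C++ sketch below shows the simplest form of the idea: one XOR “parity” packet per group of $N$ data packets, letting the receiver rebuild any single lost packet in the group without a retransmission. This is a toy model under stated assumptions (equal-length packets, at most one loss per group), not a description of any particular SD-WAN implementation; production schemes typically use stronger erasure codes such as Reed-Solomon.

    // Toy XOR-parity FEC over a group of equal-length packets. One extra
    // parity packet per group allows recovery of at most one lost data packet
    // from that group.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Packet = std::vector<std::uint8_t>;

    // Sender side: parity is the byte-wise XOR of all data packets in the group.
    Packet make_parity(const std::vector<Packet>& group) {
        Packet parity(group.front().size(), 0);
        for (const Packet& p : group)
            for (std::size_t i = 0; i < parity.size(); ++i)
                parity[i] ^= p[i];
        return parity;
    }

    // Receiver side: if exactly one data packet was lost, XOR-ing the parity
    // with every packet that did arrive cancels them out, leaving the missing one.
    Packet recover_missing(const std::vector<Packet>& received, const Packet& parity) {
        Packet missing = parity;
        for (const Packet& p : received)
            for (std::size_t i = 0; i < missing.size(); ++i)
                missing[i] ^= p[i];
        return missing;
    }

The trade-off is a fixed bandwidth overhead of one extra packet per $N$ (plus up to one group of buffering delay at the receiver) in exchange for never blocking a full round trip on a retransmission.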

Concurrency & Asynchronous Programming

Real-Time

Real-Time Operating Systems (RTOS)

Intrinsics

ISA Guides & Reference

GPU Processing

References

  • NVIDIA MatX: An efficient C++17 GPU numerical computing library with Python-like syntax

Profiling, Tracing and Benchmarking Tools

References