There is a useful distinction between low latency and stable latency. Low latency can be a benchmark artifact, a number measured on an unloaded system under ideal conditions. Stable latency is a property of the system under variation: it accounts for queue behavior, cache misses, dependency slowdowns, and the cost of state transitions that only appear under pressure.
In high-concurrency paths, response time is often shaped less by raw compute and more by contention management. If too many requests compete for the same inventory state, the question stops being “How fast is the endpoint?” and becomes “Where is this system allowed to say not yet?”
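To make that concrete, here is a minimal sketch of one way a hot path can say "not yet": a fixed-size token channel acting as a semaphore, where a request that cannot acquire a slot immediately fails fast instead of stacking up behind contended state. The names here (`InventoryGate`, `Reserve`, `ErrNotYet`) are illustrative, not a reference to any particular codebase.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrNotYet = errors.New("at capacity, try again later")

// InventoryGate bounds how many requests may touch the shared
// inventory state concurrently.
type InventoryGate struct {
	tokens chan struct{}
}

func NewInventoryGate(limit int) *InventoryGate {
	return &InventoryGate{tokens: make(chan struct{}, limit)}
}

// Reserve runs fn only if a slot is free right now; otherwise it
// fails fast so the caller can retry, degrade, or surface backpressure.
func (g *InventoryGate) Reserve(fn func() error) error {
	select {
	case g.tokens <- struct{}{}: // acquired a slot
		defer func() { <-g.tokens }() // release on return
		return fn()
	default: // contended: say "not yet" instead of queueing
		return ErrNotYet
	}
}

func main() {
	gate := NewInventoryGate(2)
	err := gate.Reserve(func() error {
		fmt.Println("decrementing inventory")
		return nil
	})
	if err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The interesting design choice is the `default` branch: it is the explicit place where the system is allowed to refuse work, rather than letting refusal emerge implicitly as a timeout somewhere downstream.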
Latency is a systems property
I tend to think about latency as the visible surface of a deeper behavior model:
- Concurrency control decides whether work proceeds, waits, or fails.
- Queues decide how burst traffic is absorbed or deferred.
- Caches decide whether repeated requests hit memory, storage, or coordination cost.
Once you look at the system this way, latency optimization becomes less about shaving milliseconds in isolation and more about shaping how the system behaves when demand stops being smooth.
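As one illustration of the queue layer, here is a sketch of a bounded queue that absorbs bursts up to a fixed depth and sheds the rest. The names and sizes (`BoundedQueue`, a depth of 4, two workers) are invented for the example; the point is that the full-queue branch is where the system's burst policy lives.

```go
package main

import (
	"fmt"
	"time"
)

type Job func()

// BoundedQueue absorbs bursts up to cap(jobs); beyond that,
// Submit reports failure so the caller can shed or defer work.
type BoundedQueue struct {
	jobs chan Job
}

func NewBoundedQueue(depth, workers int) *BoundedQueue {
	q := &BoundedQueue{jobs: make(chan Job, depth)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range q.jobs {
				job()
			}
		}()
	}
	return q
}

// Submit enqueues without blocking; a full queue means the burst
// has exceeded what the system agreed to absorb.
func (q *BoundedQueue) Submit(j Job) bool {
	select {
	case q.jobs <- j:
		return true
	default:
		return false
	}
}

func main() {
	q := NewBoundedQueue(4, 2)
	for i := 0; i < 10; i++ {
		n := i
		if !q.Submit(func() {
			time.Sleep(10 * time.Millisecond)
			fmt.Println("done", n)
		}) {
			fmt.Println("shed", n)
		}
	}
	time.Sleep(100 * time.Millisecond)
}
```

An unbounded queue would accept all ten jobs and convert the burst into tail latency; the bounded one converts it into an explicit, observable decision.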
Stable feels faster than fast
Users experience consistency before they experience theoretical peak performance. A system that responds in 50ms most of the time and then collapses unpredictably at peak traffic does not feel fast. It feels unreliable. A system that stays within an expected response envelope, even while shedding or sequencing work, feels much better in practice.
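A common way to hold that envelope is a per-request deadline: if a dependency cannot answer within budget, return a degraded but fast response instead of an unbounded wait. The sketch below assumes a hypothetical `fetchQuote` dependency and a 50ms budget chosen to match the figure above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchQuote stands in for a dependency of variable speed.
func fetchQuote(ctx context.Context) (string, error) {
	select {
	case <-time.After(80 * time.Millisecond): // slower than the budget today
		return "live quote", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func handle() string {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	quote, err := fetchQuote(ctx)
	if err != nil {
		// The envelope holds: serve a cached or degraded answer
		// on time rather than a perfect answer late.
		return "cached quote (degraded)"
	}
	return quote
}

func main() {
	start := time.Now()
	fmt.Println(handle(), "in", time.Since(start).Round(time.Millisecond))
}
```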
That is why I care about pressure paths as much as happy paths. Good performance work is not just about speed. It is about preserving confidence in how the system behaves when the load is real.