High Concurrency System Design

High concurrency is not a special architecture style. It is what happens when ordinary systems meet enough traffic, enough fan-out, or enough bad assumptions.
The reason many “high concurrency design” articles feel unhelpful is that they turn the topic into a glossary dump:
- load balancers
- caches
- queues
- sharding
- circuit breakers
- auto-scaling
All real things, but usually presented as if listing them were design.
In practice, high-concurrency work is mostly about answering four questions:
- what actually breaks first?
- what load can we afford at each layer?
- how do we slow the system down safely instead of letting it collapse?
- how do we know our assumptions were wrong before production tells us?
That is the frame for the rest of this article.
Start with bottlenecks, not components
Every system has a throughput ceiling somewhere. The mistake is pretending you can solve concurrency generically before finding where the pressure concentrates.
Common first bottlenecks:
- database connection pool exhaustion
- hot rows or hot partitions
- cache miss storms
- synchronous fan-out to multiple upstreams
- CPU saturation on serialization, compression, crypto, or template work
- lock contention inside application code
- message consumer lag
- disk I/O on logging, indexing, or persistence paths
Do not ask “should we add Kafka?” before you can answer “what blocks today under 10x traffic?”
A useful habit is to trace one high-value request end to end:
- ingress
- auth
- cache
- database
- internal RPC calls
- external APIs
- async side effects
- response serialization
Then ask where the request can queue, block, retry, or amplify load.
Capacity budgets are more useful than confidence
A lot of outages come from teams that know their average traffic but have never written down their actual capacity budget.
You want rough numbers for at least:
- sustained requests per second
- burst tolerance
- p95 / p99 latency targets
- maximum acceptable queue depth
- maximum DB queries per request
- maximum upstream calls per request
- connection pool sizes
- memory headroom
- cache hit-rate assumptions
These numbers do not need false precision. They do need to exist.
For example, if one request can trigger:
- 3 database reads
- 2 cache lookups
- 1 upstream call
then 2,000 requests per second is not “2,000 RPS.” It is a demand shape across several layers:
- 6,000 DB reads/s
- 4,000 cache ops/s
- 2,000 upstream calls/s
If retries kick in, that multiplication gets worse fast.
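The demand-shape arithmetic above can be sketched in a few lines. The per-request counts are the ones from the example; the retry multiplier is a hypothetical amplification factor, not a measured value.

```python
# Sketch: turn a per-request dependency profile into per-layer demand.
# The counts match the example above; retry_factor is a hypothetical
# amplification (e.g. 1.5 means half of all calls get retried once).

def demand_shape(rps, per_request, retry_factor=1.0):
    """Return operations/sec per layer for a given request rate."""
    return {layer: rps * count * retry_factor
            for layer, count in per_request.items()}

profile = {"db_reads": 3, "cache_ops": 2, "upstream_calls": 1}

print(demand_shape(2000, profile))
# {'db_reads': 6000, 'cache_ops': 4000, 'upstream_calls': 2000}

# With 1.5x retry amplification the same traffic looks much heavier:
print(demand_shape(2000, profile, retry_factor=1.5))
```

Even this toy version makes the point: the budget conversation is about layers, not a single RPS number.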
Queues are useful, but only if they absorb load instead of hiding it
Queues are one of the most abused concurrency tools.
They are useful for:
- smoothing bursts
- decoupling slower workers from synchronous request paths
- absorbing temporary downstream slowness
- moving non-interactive work off the critical path
They are dangerous when used as denial.
A queue is not a fix if:
- producers can outpace consumers indefinitely
- nobody monitors lag
- downstream systems still cannot handle the eventual workload
- retries re-enqueue poison messages forever
- users still expect synchronous latency semantics
The right question is not “should we queue this?” but:
- what work can leave the request path safely?
- how much lag is acceptable?
- what is the failure mode when the queue grows?
- how do we stop accepting work when the backlog becomes meaningless?
A queue with no backpressure policy is just deferred failure.
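A bounded queue with an explicit rejection path is the minimal version of that policy. This sketch uses Python's standard `queue.Queue`; the bound of 3 is purely illustrative.

```python
# Sketch: a bounded queue that rejects work instead of growing forever.
# queue.Queue with maxsize gives a built-in bound; put_nowait raises
# queue.Full, which becomes an explicit "shed this work" decision.
import queue

jobs = queue.Queue(maxsize=3)  # bound chosen for illustration only

def try_enqueue(job):
    """Accept the job if there is room; otherwise report rejection."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False  # caller can shed, degrade, or push back

accepted = [try_enqueue(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

The interesting part is not the queue; it is that `False` forces the caller to decide what overflow means.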
Backpressure is not optional
If a system accepts work faster than it can complete it, one of two things happens:
- you apply backpressure deliberately
- the system applies it for you by timing out, OOMing, or collapsing
Deliberate backpressure can look like:
- bounded queue sizes
- per-tenant rate limits
- concurrency caps
- admission control
- producer throttling
- rejecting low-priority work early
- reducing fan-out under load
This is not a sign of weakness. It is what keeps the service useful for some requests instead of broken for all requests.
One of the most important design decisions is choosing where to shed load:
- at the edge
- at an internal work queue
- per dependency
- per customer tier
- per feature class
If you do not choose, production will choose badly.
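A concurrency cap is one of the simplest deliberate forms. This sketch uses a non-blocking semaphore acquire as admission control; the limit of 2 is illustrative.

```python
# Sketch: a concurrency cap as admission control. A non-blocking
# semaphore acquire either admits the request or rejects it up front,
# instead of letting unbounded work pile up behind a blocked thread.
import threading

class ConcurrencyCap:
    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_admit(self):
        """Return True if a slot was taken; caller must release() later."""
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

cap = ConcurrencyCap(limit=2)
admitted = [cap.try_admit() for _ in range(3)]
print(admitted)       # [True, True, False]: the third request is shed early
cap.release()         # a finished request frees a slot
print(cap.try_admit())  # True: the freed slot admits new work
```

Rejecting at admission time is cheap; rejecting after the work is half-done is not.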
Load shedding should protect the important path first
Under pressure, not all work deserves equal treatment.
A sensible system distinguishes between:
- user-facing core actions
- eventually consistent side effects
- analytics or logging extras
- expensive enrichments
- admin or low-priority background tasks
When concurrency rises, you often want to degrade in this order:
- drop optional enrichments
- defer non-critical side effects
- reduce expensive cross-service fan-out
- reject low-priority traffic
- preserve the smallest possible core path
This is much better than letting all request classes drown together.
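The degradation order above can be encoded as an explicit policy rather than an accident. The work classes and the load-level scale in this sketch are hypothetical.

```python
# Sketch: map a load level to the degradation order described above.
# Work classes and the load-level scale are hypothetical; the point is
# that the shedding order is a written-down, ordered policy.

# Ordered from "drop first" to "preserve last".
DEGRADATION_ORDER = [
    "optional_enrichment",
    "non_critical_side_effect",
    "cross_service_fanout",
    "low_priority_traffic",
    "core_path",
]

def allowed_work(load_level):
    """load_level 0 = calm; each step drops one more class of work."""
    keep = max(1, len(DEGRADATION_ORDER) - load_level)
    return DEGRADATION_ORDER[-keep:]

print(allowed_work(0))  # everything runs
print(allowed_work(3))  # only low-priority traffic and the core path remain
print(allowed_work(9))  # the core path is preserved last, never dropped
```

Making the order data rather than scattered `if` statements also makes it reviewable.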
Idempotency becomes mandatory once retries enter the picture
High-concurrency systems retry constantly:
- clients retry
- load balancers retry
- workers retry
- humans retry by refreshing pages or pressing buttons again
If the operation is not idempotent, concurrency pressure turns small glitches into data corruption or duplicate side effects.
You should think explicitly about idempotency for:
- payment-like actions
- create operations
- state transitions
- message consumers
- webhook receivers
- scheduled jobs
Typical tools:
- idempotency keys
- deduplication windows
- unique business identifiers
- compare-and-swap semantics
- exactly-once claims avoided in favor of at-least-once plus safe handlers
A retry-safe system is usually more important than a theoretically elegant one.
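An idempotency key is the simplest of those tools. In this sketch the in-memory dict stands in for a durable store; a real system would persist keys with a TTL (the deduplication window), and the handler name is invented.

```python
# Sketch: an idempotency-key guard for a state-changing handler.
# The dict stands in for a durable store keyed by idempotency key;
# a retry replays the stored result instead of repeating the side effect.

_results = {}  # idempotency_key -> stored result

def charge(idempotency_key, amount):
    """Perform the charge once; replay the stored result on retries."""
    if idempotency_key in _results:
        return _results[idempotency_key]  # retry: no duplicate side effect
    result = {"charged": amount}  # the actual side effect happens here
    _results[idempotency_key] = result
    return result

first = charge("order-123", 50)
retry = charge("order-123", 50)  # e.g. the client retried after a timeout
print(first is retry)  # True: the retry observed the original outcome
```

The key must come from the caller (or a stable business identifier), because the caller is the one who retries.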
The database is still where many high-concurrency designs die
No amount of “modern architecture” changes the fact that a database under pressure becomes the truth-telling layer.
What usually hurts first:
- too many concurrent connections
- hot index pages
- repeated small reads that should have been cached
- writes serialized on a single key or row
- transaction scope too wide
- slow queries that are tolerable at low concurrency and fatal at high concurrency
- lock contention amplified by retries
Things that help:
- fewer round trips per request
- tighter indexes
- reducing transaction duration
- avoiding read-after-write patterns unless needed
- caching truly hot reads
- separating analytical paths from transactional paths
- moving non-critical writes async
Do not jump to sharding as a reflex. Most systems still have plenty of headroom before that step, especially while query shape and workload discipline are sloppy.
Partitioning is a trade-off, not an upgrade badge
At some scale, partitioning or sharding is necessary. But it buys headroom by introducing new complexity:
- uneven key distribution
- hot partitions
- harder rebalancing
- more awkward cross-partition queries
- trickier transactions
- operational tooling requirements
Good partition keys spread load and keep common access paths local.
Bad partition keys create:
- one overloaded partition
- painful migrations
- weird query workarounds
- emergency repartitioning under traffic
Partitioning helps when the workload shape fits it. It does not help just because the table got large.
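The difference between a good and a bad partition key is easy to demonstrate. This sketch uses a toy byte-sum hash and an invented tenant mix where one tenant dominates traffic.

```python
# Sketch: why partition key choice matters. Hashing a well-spread key
# balances load; hashing a skewed key (one dominant tenant) piles
# traffic onto a single partition. The tenant mix is invented.
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(key):
    # Stable toy hash for the demo; real systems use a proper hash.
    return sum(key.encode()) % NUM_PARTITIONS

# 90% of events come from one hot tenant.
events = ["tenant-A"] * 90 + [f"tenant-{i}" for i in range(10)]

by_tenant = Counter(partition_for(k) for k in events)
by_event_id = Counter(partition_for(f"event-{i}") for i in range(100))

print("keyed by tenant:  ", dict(by_tenant))    # one partition dominates
print("keyed by event id:", dict(by_event_id))  # roughly even spread
```

Same data, same hash; only the key choice changed, and with it the load shape.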
Cache behavior under load matters more than having a cache at all
A cache is not automatically a concurrency solution. A badly behaving cache can create worse failure modes than no cache.
Watch for:
- cache stampedes on key expiry
- very low hit rate despite operational complexity
- hot key overload
- stale data assumptions that are never written down
- falling back to the database too aggressively on misses
- expensive recomputation on miss
Useful techniques:
- request coalescing
- staggered expiry
- jittered TTLs
- background refresh for known hot keys
- local plus shared cache layering where appropriate
- explicit stale-while-revalidate behavior
A cache should reduce pressure, not synchronize it into bursts.
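Jittered TTLs are the cheapest of those techniques. A fixed base TTL means keys written together expire together and recompute together; this sketch spreads expiry with random jitter (the base and jitter fraction are illustrative).

```python
# Sketch: jittered TTLs to avoid a synchronized expiry stampede.
# Adding random jitter to a fixed base TTL spreads expiry (and the
# recomputation that follows it) over a window instead of one instant.
import random

BASE_TTL = 300  # seconds; illustrative

def jittered_ttl(base=BASE_TTL, jitter_fraction=0.2):
    """Return a TTL in [base, base * (1 + jitter_fraction)]."""
    return base + random.uniform(0, base * jitter_fraction)

ttls = [jittered_ttl() for _ in range(5)]
print([round(t) for t in ttls])  # spread between 300 and 360, not all 300
```

The same idea applies at write time or refresh time; what matters is that expiry stops being synchronized.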
Tail latency is what users notice
Average latency is comforting and often misleading.
In distributed systems, p95 and p99 are where bad concurrency decisions show up:
- one slow shard
- one overloaded dependency
- one GC pause
- one queue backlog
- one lock convoy
- one expensive serialization path
If one request fans out to several internal dependencies, the chance that one of them is slow rises quickly. That is how “normal” systems end up with ugly tail behavior under load.
Design choices that help tail latency:
- reducing fan-out
- tighter timeouts
- partial response strategies where acceptable
- hedging only with care
- better queue discipline
- protecting hot dependencies from broad retries
- avoiding giant payloads and giant query results
Users rarely care about your median. They care about whether the system sometimes feels broken.
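The fan-out arithmetic behind tail latency is worth seeing with numbers. If each of n dependencies is independently slow with probability p, the whole request is slow with probability 1 - (1 - p)^n; the p and n values below are illustrative.

```python
# Sketch: fan-out amplifies tail latency. With n independent
# dependencies each slow with probability p, the request is slow
# with probability 1 - (1 - p)^n.

def p_request_slow(p_dep_slow, fanout):
    return 1 - (1 - p_dep_slow) ** fanout

# A dependency that is slow only 1% of the time (its p99)...
for n in (1, 5, 10, 30):
    print(f"fan-out {n:2d}: {p_request_slow(0.01, n):.1%} of requests hit a slow call")
```

At a fan-out of 30, roughly a quarter of requests touch at least one slow call, even though every single dependency looks healthy at p99.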
Async workflows are for moving work, not escaping responsibility
A common design move is:
- request arrives
- write event
- do the hard work later
Sometimes that is exactly right. Sometimes it just hides the SLA.
Async workflows are a strong fit when:
- the user does not need immediate completion
- the work is slow or bursty
- downstream dependencies are flaky
- partial completion is acceptable
- the result can be polled or pushed later
But async design still requires:
- clear operation state
- retry policy
- deduplication
- poison message handling
- visibility into lag
- expiration / compensation rules
If you move work out of band, you still own it.
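"Clear operation state" usually means a small, explicit state machine. The states and transitions in this sketch are a hypothetical minimal lifecycle, not a prescribed one.

```python
# Sketch: explicit operation state for an async job, so work moved off
# the request path still has an owner. States are a minimal example.

VALID_TRANSITIONS = {
    "pending": {"running", "expired"},
    "running": {"succeeded", "failed"},
    "failed": {"pending", "dead"},  # retry, or park as a poison message
}

def transition(state, new_state):
    """Allow only legal moves; anything else is a bug, not a shrug."""
    if new_state not in VALID_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

s = "pending"
for step in ("running", "failed", "pending", "running", "succeeded"):
    s = transition(s, step)
print(s)  # succeeded, after one retry cycle
```

An illegal transition raising loudly is exactly the visibility that silently re-enqueued work lacks.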
Observability needs to explain pressure, not just display dashboards
For high-concurrency systems, the observability goal is not “we have metrics.” It is “we can tell why latency or error rate changed under load.”
At minimum, instrument:
- request rate
- success / error / timeout rates
- queue depth and message age
- DB pool usage and wait time
- cache hit/miss and hot-key patterns
- upstream latency and failure rate
- worker concurrency
- p50 / p95 / p99 latencies
- retry rate
- shed load count / rejected requests
And make sure logs and traces can answer:
- which dependency slowed down?
- where did queueing start?
- was the error from overload, timeout, or bad input?
- did retries amplify the incident?
Dashboards are easy. Explanatory observability is harder and much more valuable.
Resilience testing should target your assumptions
You do not learn much by only load-testing the happy path.
Better tests include:
- dependency slowdown, not only dependency failure
- cache disabled or low hit-rate scenarios
- queue consumer lag
- hot-key concentration
- skewed tenant traffic
- duplicate message delivery
- partial network partition behavior
- thundering herd after recovery
- bad rollout during peak traffic
The point is to test the assumptions your architecture quietly depends on.
Examples:
- “the cache will absorb this”
- “the queue can catch up later”
- “the DB pool size is enough”
- “retry storms will not happen”
- “one tenant cannot dominate the shard”
Those assumptions should be attacked before production does it for you.
Common anti-patterns
Fan-out in the request path without a budget
Easy to build, ugly under load.
Unlimited queues
You are not increasing capacity; you are storing disappointment.
Retrying everything everywhere
This turns slowness into self-inflicted DDoS.
Treating sharding as a first move
Often a sign that query and schema discipline were skipped.
No idempotency for state-changing operations
Works fine until the first real retry storm.
Cache-as-faith
If nobody knows hit rate, stampede risk, or invalidation behavior, the cache is not a plan.
Autoscaling without bottleneck awareness
If the bottleneck is the database, scaling app instances can make things worse, not better.
Final take
Designing for high concurrency is not about collecting patterns. It is about protecting the system from its own demand shape.
The practical sequence is usually:
- find the real bottleneck
- define capacity budgets
- move non-critical work off the hot path
- apply bounded queues and backpressure
- protect the database and cache from amplification
- make retries safe with idempotency
- measure tail latency, not just averages
- test failure and skew, not just clean load
If you do that, “high concurrency” stops being a dramatic architecture label and becomes what it should be: disciplined systems engineering under pressure.