Hacker's Handbook

Isolation Fails in Isolation
Why Systems Collapse Under Load (and How to Fix It)
Posted: 2026-03-29

The Gatekeeper Gnome

A recent essay, "The Isolation Trap", argues that Erlang's isolation model has structural limits: deadlocks from circular calls, unbounded mailboxes, and escape hatches like ETS. The HN discussion added production experience:

"All the problems I've had with Erlang have been related to full mailboxes or having one process type handling too many kinds of different messages."

The critique has merit: circular gen_server:call chains can deadlock, which strict archetypes prevent by directing calls one way, down the supervision tree. But the production failures worth solving are more specific.

In The Gnome Village I wrote: "Isolation contains damage. Sharing spreads it." That holds for crashes. Isolation does nothing about slowness. When hundreds of processes hit the same degraded API, they fail independently and simultaneously. Each gnome is an island. Islands that share a coastline flood together when the tide comes in.

Three things break:

  • Correlated timeouts. A slow dependency causes hundreds of processes to time out at once.
  • Retry amplification. Each timeout spawns a retry, multiplying load on an already degraded system.
  • Mailbox flooding. Processes accumulate messages faster than they drain them.

Once the storm hits, even isolated islands flood, and the illusion of perfect isolation shatters.

Three Gatekeepers

Consider a standard payment system under heavy load. To survive the flood, it needs a Gatekeeper at each failure point. Gatekeepers sit at domain boundaries where unbounded input meets bounded capacity. (See Process Archetypes for the full role taxonomy.)

You can't stop the tide, but you can decide where the water is allowed to flow.

A circuit breaker sits in front of the fraud API. When the dependency turns unreliable, the breaker trips after three consecutive failures, closes the gate, and starts exponential backoff. Callers get a fast, clear {error, breaker_blocked} instead of another cascading timeout.
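The breaker's state transitions can be sketched as a small pure module. This is a minimal illustration, not the post's actual implementation: the module name fraud_breaker is hypothetical, and in production this state would live inside a gen_server fronting the fraud API.

```erlang
-module(fraud_breaker).
-export([new/0, call/2]).

-define(THRESHOLD, 3).        %% trip after three consecutive failures
-define(BASE_BACKOFF, 1000).  %% initial backoff in milliseconds

new() ->
    #{status => closed, failures => 0, backoff => ?BASE_BACKOFF}.

%% Gate one request: a closed breaker forwards the call, an open
%% breaker rejects immediately instead of letting callers time out.
call(B = #{status := open}, _Fun) ->
    {{error, breaker_blocked}, B};
call(B = #{status := closed}, Fun) ->
    case Fun() of
        {ok, _} = Ok ->
            {Ok, B#{failures := 0, backoff := ?BASE_BACKOFF}};
        {error, _} = Err ->
            {Err, trip_maybe(B)}
    end.

%% Open the gate and double the backoff once the threshold is hit.
trip_maybe(B = #{failures := F, backoff := W}) when F + 1 >= ?THRESHOLD ->
    B#{status := open, failures := F + 1, backoff := W * 2};
trip_maybe(B = #{failures := F}) ->
    B#{failures := F + 1}.
```

After three failed calls the breaker is open, and the fourth caller gets {error, breaker_blocked} without touching the dependency at all.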

A rate limiter on the worker pool uses a token bucket to cap concurrent requests, keeping the inflow below what the system can drain. When tokens are exhausted, callers immediately receive {error, rate_limited}, avoiding queue buildup and unbounded memory growth.
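The token bucket itself is a few lines of state. A sketch, assuming a hypothetical pool_limiter module; a real limiter would refill on a timer inside a Gatekeeper process guarding the pool.

```erlang
-module(pool_limiter).
-export([new/2, try_acquire/1, refill/1]).

%% Capacity is the maximum burst; Refill is tokens added per tick.
new(Capacity, Refill) ->
    #{tokens => Capacity, capacity => Capacity, refill => Refill}.

%% Take one token if available; otherwise reject immediately so the
%% caller gets backpressure instead of a growing queue.
try_acquire(B = #{tokens := T}) when T > 0 ->
    {ok, B#{tokens := T - 1}};
try_acquire(_B) ->
    {error, rate_limited}.

%% Add tokens on each tick, never exceeding the bucket's capacity.
refill(B = #{tokens := T, capacity := C, refill := R}) ->
    B#{tokens := min(C, T + R)}.
```

The key design choice is the immediate {error, rate_limited}: rejecting at the gate keeps the mailbox bounded, while queueing the request would just move the flood inside.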

A sentinel (Observer) watches the water level. It monitors the worker group, and when failure rates cross a threshold, it alerts the payment coordinator to switch to a fallback path before the system collapses.
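A sentinel needs nothing beyond plain message passing. A sketch with a hypothetical payment_sentinel module and a fixed 100-result window: workers report outcomes, and when the failure rate in a window crosses the threshold, the coordinator is told to switch to the fallback path.

```erlang
-module(payment_sentinel).
-export([start/2]).

-define(WINDOW, 100).  %% results per observation window

start(Coordinator, Threshold) ->
    spawn(fun() -> loop(Coordinator, Threshold, 0, 0) end).

%% Tally each worker result as it arrives.
loop(Coord, Threshold, Ok, Fail) ->
    receive
        {result, ok}    -> check(Coord, Threshold, Ok + 1, Fail);
        {result, error} -> check(Coord, Threshold, Ok, Fail + 1)
    end.

%% Close the window: alert if too many failures, then start over.
check(Coord, Threshold, Ok, Fail) when Ok + Fail >= ?WINDOW ->
    Rate = Fail / (Ok + Fail),
    Rate > Threshold andalso (Coord ! {use_fallback, Rate}),
    loop(Coord, Threshold, 0, 0);
check(Coord, Threshold, Ok, Fail) ->
    loop(Coord, Threshold, Ok, Fail).
```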

None of these patterns require shared mutable state. They use OTP primitives that already exist: gen_server state machines, supervisor restart budgets, and process monitors.

The Serialization Objection

The essay claims that when a process mailbox becomes a bottleneck, teams invariably reach for ETS, reintroducing the shared state that isolation was supposed to prevent.

But mailbox bottlenecks usually signal role confusion, not an inherent flaw in message passing. A Resource Owner that only holds state and answers questions can handle messages in microseconds. It serves thousands of Workers without a backlog.

It is only when that same process also calls databases, formats responses, and logs metrics that it backs up. Split those roles, and the serialized path stays fast. If demand still exceeds capacity, a Gatekeeper bounds the load long before anyone needs to reach for ETS.
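What a state-only Resource Owner looks like can be shown in one small gen_server. The module name account_owner is hypothetical; the point is that handle_call does nothing but a map lookup, so the serialized path stays fast, while database calls, formatting, and logging belong in Workers.

```erlang
-module(account_owner).
-behaviour(gen_server).
-export([start_link/0, get/2, put/3]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, #{}, []).

get(Pid, Key)        -> gen_server:call(Pid, {get, Key}).
put(Pid, Key, Value) -> gen_server:cast(Pid, {put, Key, Value}).

init(State) -> {ok, State}.

%% Answer questions from in-memory state only: microseconds per call.
handle_call({get, Key}, _From, State) ->
    {reply, maps:get(Key, State, undefined), State}.

%% Accept updates asynchronously; no I/O happens in this process.
handle_cast({put, Key, Value}, State) ->
    {noreply, State#{Key => Value}}.
```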

ETS, persistent_term, and atomics certainly have legitimate uses. A counter for metrics collection is a controlled, pragmatic relaxation of the model. As one HN commenter put it:

"Message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time."

Full Circle

When a well-designed system faces the same traffic spike, the outcome changes: it stays responsive, and the failure remains local instead of cascading.

If you are building systems that handle money, or any other critical load, follow these three rules:

  • Put a circuit breaker in front of every external dependency.
  • Limit concurrency on every worker pool that calls external services.
  • Define backpressure contracts at every domain boundary.

Isolation gives you crash containment for free, but nothing is truly isolated. To survive slowness and load, you need a good architecture and solid building blocks. That is exactly what domains, flows, and process archetypes provide.

Golden Gate Bridge

Under the Golden Gate, the water moves fast because the channel is narrow. Systems behave the same way. Constrain the flow, and you get predictable movement. Remove the constraint, and the same water spreads out, slows down, and piles up elsewhere.

References

Series: Gnomes, Domains, and Flows

  1. The Gnome Village — processes, isolation, scheduling
  2. Supervisors Are Managers — restart strategies, supervision trees
  3. Gnomes, Domains, and Flows — processes + domains + flows
  4. Domains Own Code and Data — domain boundaries, failure domain grouping
  5. Flows Keep Work Moving — backpressure contracts, four flow types
  6. Putting It Together — payments walkthrough

Series: Process Archetypes

  1. Process Archetypes — the five roles
  2. Workers — one job, pools, fan-out
  3. Resource Owners — entity owners, aggregators
  4. Routers — direction, no flow control
  5. Gatekeepers — circuit breakers, rate limiters, flow controllers
  6. Observers — sentinels, supervisors

For a deeper look at how the BEAM implements processes, heaps, message passing, scheduling, and garbage collection, see The BEAM Book.

- Happi



Happi Hacking AB
KIVRA: 556912-2707
106 31 Stockholm