Learning Scala: Idempotent Operations in Distributed Systems

Jul 1

A language model can answer a question. An action model can do something about it. That distinction is going to matter more than most commerce platforms are ready for. Whether the term is "agent," "large action model," "task-specific AI," or something else that marketing teams have not finished sanding down yet, the direction is clear enough: software will not merely describe commercial options to buyers, operators, partners, and employees. It will increasingly select actions, call tools, negotiate constraints, and attempt to move workflows forward.

New to this series?

Catch up on earlier posts to follow along with the Functional Programming Isn’t Just for Academics series:

Post 1: Why Functional Programming Matters for the Systems We Build Today

Post 2: Immutability by Default and the Foundation of Reliable Systems

Post 3: Pure Functions: Your First Step Toward Bug-Free Concurrency

Each post in this series explores how teams use Scala to build applications that stay clean, testable, and easy to scale.

Picture an unlock command sent to a door: the reply never arrives, so the caller sends the command again. Whether that second command is harmless or a small disaster was decided long ago, not by the network, but by whoever modeled the operation. And the same question hides inside far more than the state of a lock. Provision a server, and a lost reply means you pay for two. Issue a customer the license they bought, dispatch a shipment, send a notification, publish a post, grant a permission… Every one is the same operation: an effect that leaves your control and cannot be pulled back, reached across a channel that can lose the receipt.

Why Retries Are the Rational Response to Silence

Underneath all of these is a single fact, and it has nothing to do with what the operation was about. Over an unreliable channel, the caller cannot tell “the operation failed” apart from “the operation succeeded and the response was lost.” From the outside the two are identical. You spoke into the dark and nothing came back.

The only safe response to “I don’t know whether it worked” is to do it again. Retries are not a malfunction. They are the rational answer to silence.

A person who gets no confirmation might refresh once and give up; an automated caller (a partner integration, a job runner, an agent acting on someone’s behalf) retries instantly and relentlessly, the moment a reply goes missing. So the retry is a given. The only real question is what the operation does when the retry arrives. There are two kinds of answer, and the difference between them is the whole point of this piece.

Two Ways to Handle a Retry

Reaching Outward

One answer reaches outward. Put a dedupe table in front of the operation; route everything through a broker that promises to deliver each message exactly once; bolt an idempotency layer onto the gateway. Infrastructure remembers on the operation’s behalf. The most respectable version of reaching outward is to wrap the whole thing in a transaction… and when the entire effect lives inside one database, you should, because a database can roll back, so a retry redoes nothing or redoes it cleanly. But that is the lucky case: the one place the effect can be taken back. The world cannot roll back. You cannot un-unlock a door, un-send an email, un-ship a box, un-provision a server. The instant an effect has happened out in the world, rollback leaves the table, and with it every “just make it transactional” answer, including two-phase commit, which needs every participant to be a thing that can hold a change in suspense and undo it on command. Alas, most can’t do that.

Reaching Inward

The other answer reaches inward. Instead of surrounding the operation with machinery that remembers, it changes what the operation is, so that running it twice is the same as running it once. It then stops mattering how many times the retry fires, because the second call produces nothing new to clean up. This is the functional move, though it rarely gets called that: make an effect behave like a function of its intent. A function, given the same input, returns the same result and changes nothing the second time you call it. An operation with that property doesn’t need a platform to remember for it. The remembering is built into its type.

What an Idempotency Key Actually Is

For an operation to behave like a function of its intent, it first has to know what its intent is: that this attempt and that earlier one mean the same act. That is all an idempotency key is, though the term makes it sound like a billing feature. It is a name for one specific intended act, generated by the caller and carried with the request: not “an unlock” but this unlock, for this tenant at this door at this moment.

scala

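opaque type IntentId  = String   // names one intended act; generated by the caller
opaque type AccountId = String   // placeholder representation (assumed; not shown in the original)
opaque type Scope     = String   // placeholder representation (assumed; not shown in the original)

final case class GrantAccess(
  intent:  IntentId,   // carries the same value on every retry of this act
  account: AccountId,
  scope:   Scope
)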

The caller coins the name once, when the intent forms, and reuses it on every retry of that same intent. Two attempts at one act carry one name; two different acts carry two. Coin it in the wrong place (on the server, on receipt, for example) and every retry looks new, because the server has stamped each attempt as unique. The name has to originate with the intent and travel with it.
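A minimal sketch of that rule, under stated assumptions: LossyChannel and retry are hypothetical helpers invented for illustration, and IntentId is a plain case class here so the snippet stands alone. The point is only where the key is coined: outside the retry loop, once per intent.

```scala
import java.util.UUID

final case class IntentId(value: String)

// Hypothetical channel: the request always arrives, but the reply is lost `drops` times.
final class LossyChannel(drops: Int):
  private var calls = 0
  var received: Vector[IntentId] = Vector.empty
  def send(id: IntentId): Either[String, String] =
    calls += 1
    received = received :+ id
    if calls <= drops then Left("no reply") else Right("ack")

// Retry up to `max` attempts; gives back the last failure if none succeeds.
def retry[A](max: Int)(attempt: () => Either[String, A]): Either[String, A] =
  attempt() match
    case Left(_) if max > 1 => retry(max - 1)(attempt)
    case other              => other

def unlockWithRetries(channel: LossyChannel): Either[String, String] =
  val intent = IntentId(UUID.randomUUID().toString) // coined once, when the intent forms
  retry(max = 5)(() => channel.send(intent))        // every attempt carries the same name
```

Had the UUID been generated inside the attempt, the receiver would see a different name on every try and could never recognize a retry.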

Modeling the Outcome as a Type

Now the act itself, and the move that makes it functional. What does it return? The naive answer is “the thing it produced, or an error.” That misses the case this whole piece is about. An act can succeed, an act can fail, and an act can be asked to do something it has already done. And the third is not an error. It is the operation working correctly under a retry. So make it a value:

scala

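// Signature sketch. IO[E, A] is a two-parameter effect type (typed failure E,
// success A), e.g. ZIO's IO; GrantError and License are domain types defined elsewhere.
enum Outcome[A]:
  case Done(result: A)         // this attempt performed the act
  case AlreadyDone(result: A)  // an earlier attempt performed it; here is what it produced

def grant(cmd: GrantAccess): IO[GrantError, Outcome[License]]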

The operation looks up the intent’s name. Never seen it: perform the act, record the result against the name, return Done. Seen it: do nothing at all, return AlreadyDone carrying the license the first attempt produced. The retry receives the same license. Not a new one, not nothing. One grant, one credential, one open door.
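The lookup described above might be sketched like this. It is an illustration, not the article's implementation: a plain Either stands in for the IO effect type so the snippet needs no library, a synchronized in-memory map stands in for a durable store, and Grants, UnknownAccount, and the LIC-n serial format are all invented for the example.

```scala
import scala.collection.mutable

final case class IntentId(value: String)
final case class License(serial: String)
final case class GrantAccess(intent: IntentId, account: String, scope: String)

enum Outcome[A]:
  case Done(result: A)
  case AlreadyDone(result: A)

enum GrantError:
  case UnknownAccount(account: String)

// In-memory stand-in for a durable store keyed by intent (assumption).
final class Grants(knownAccounts: Set[String]):
  private val byIntent = mutable.Map.empty[IntentId, License]

  def grant(cmd: GrantAccess): Either[GrantError, Outcome[License]] = synchronized {
    byIntent.get(cmd.intent) match
      case Some(license) =>
        Right(Outcome.AlreadyDone(license))            // seen it: do nothing, return the first result
      case None if !knownAccounts.contains(cmd.account) =>
        Left(GrantError.UnknownAccount(cmd.account))   // a real failure, nothing recorded
      case None =>
        val license = License(s"LIC-${byIntent.size + 1}")
        byIntent(cmd.intent) = license                 // record the result against the name
        Right(Outcome.Done(license))
  }

  def issued: Int = synchronized(byIntent.size)
```

Calling grant twice with the same command yields Done and then AlreadyDone, both carrying the same license, and the store holds one entry.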

That is the whole idea, and it is worth saying plainly: an effectful operation has been made to behave like a function of its intent. Same intent in, same result out; the second call hands back the same license as the first and performs no second effect. That is referential transparency, earned for an operation that moves the real world. And the signature is now honest in a way the usual one never is. The usual one, def unlock(door: DoorId): Unit, is a lie: it hides that there is an effect, that it can fail, and that “success” has structure. IO[GrantError, Outcome[License]] tells the truth (effectful, typed in its failure, a sum in its success), and because AlreadyDone is a case the compiler’s exhaustiveness check expects you to handle, the retry path and the first-attempt path cannot quietly diverge. The obvious way to handle it, give the customer their license, is also the correct one.
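On the caller's side, handling both cases could look like the sketch below. licenseOf and auditLine are hypothetical names. If either case were left out of a match, the Scala 3 compiler would report the match as possibly non-exhaustive (a warning by default, an error if warnings are made fatal).

```scala
final case class License(serial: String)

enum Outcome[A]:
  case Done(result: A)
  case AlreadyDone(result: A)

// Both paths hand the customer the same thing: their license.
def licenseOf(outcome: Outcome[License]): License = outcome match
  case Outcome.Done(license)        => license
  case Outcome.AlreadyDone(license) => license

// Only the bookkeeping differs between a first attempt and a recognized retry.
def auditLine(outcome: Outcome[License]): String = outcome match
  case Outcome.Done(license)        => s"granted ${license.serial}"
  case Outcome.AlreadyDone(license) => s"retry recognized, returned ${license.serial}"
```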

This is why recognition is so often a panicked retrofit: the operation was modeled as fire-and-forget, def notify(user: UserId): Unit. A Unit has nowhere to put “I have seen this before.” There is no seam. The retrofit becomes a tangle of pre-flight existence checks racing the real act, which is the outward answer at its worst: it guards the request and misses the duplicate that arrives by a retry the guard never saw.

The Atomic Write Requirement

The inward redesign has no such gap, but it rests on one detail: the record that says “this intent is done, here is what it produced” must be written as a single indivisible step with the local result you own. Do the act, then separately record it, crash in between, and the next retry won’t recognize the name and will act again. The proof has to be as durable as the deed. This, and not spanning the door and the carrier and the gateway, is the one job a transaction actually does here: bind “done” to its result in the single resource you control. Small, local, ordinary.
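To make the gap concrete, here is a small sketch contrasting the two orderings. Everything in it is an assumption made for illustration: Store is an in-memory stand-in for the single resource you control, the crashBeforeRecord flag simulates a process dying between the deed and its record, and ConcurrentHashMap.computeIfAbsent (which the JDK documents as atomic per key) plays the part a local database transaction would play in a real system.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

final case class IntentId(value: String)
final case class License(serial: String)

final class Store:
  private val byIntent = new ConcurrentHashMap[IntentId, License]()
  private val minted   = new AtomicInteger(0)

  def mintedCount: Int = minted.get()

  // Unsafe ordering: the deed and its proof are two separate steps.
  def grantTwoStep(intent: IntentId, crashBeforeRecord: Boolean): Option[License] =
    Option(byIntent.get(intent)).orElse {
      val license = License(s"LIC-${minted.incrementAndGet()}") // the deed
      if crashBeforeRecord then None                             // simulated crash: no record, no reply
      else
        byIntent.put(intent, license)                            // the proof, written too late
        Some(license)
    }

  // Safer ordering: mint and record under the intent in one indivisible step.
  def grantAtomic(intent: IntentId): License =
    byIntent.computeIfAbsent(intent, _ => License(s"LIC-${minted.incrementAndGet()}"))
```

With the two-step version, a crash followed by a retry mints two licenses for one intent. With the atomic version, any number of calls for the same intent mints one.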

Idempotency Across Services You Do Not Own

The discipline so far protects one operation. Your world is made of many, scattered across services you own, services another team owns, and services another company runs in a cloud you have never logged into. The principle carries unchanged; what varies is how much of it you can enforce.

Start with the services you own: make each of them idempotent the way described above, and you have made yourself the one thing nobody else has to defend against. Anything that calls you should be able to ask twice and get one effect and the same answer back. You cannot govern how others behave. You can be the service that is always safe to ask again.

When you call someone else’s services, there are three situations:

When It Supports Idempotency and Accepts a Key

It supports idempotency and it accepts a key. Derive a stable key from your intent and send it on every attempt. The only trap is minting a fresh key per call; a retry has to carry the same one, so coin it once, store it with your intent, and reuse it. Their machinery does the rest.

When It Doesn’t

It doesn’t support idempotency. First, try to reshape the call into one that is idempotent whether the other side intended it or not. “Set the shipping address to X” survives repetition where “add an address” does not; creating a resource under a client-chosen unique id turns the second attempt into an “already exists” you can treat as success. A surprising number of non-idempotent APIs have an idempotent shape hiding in them if you reach for it. When none does, accept that you cannot make them safe, and plan to recover rather than prevent.
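The two reshapings can be sketched against a made-up API. RemoteApi, CreateResult, and ensureCreated are hypothetical and modeled in memory; no real service is implied.

```scala
final case class Address(line: String)

enum CreateResult:
  case Created, AlreadyExists
  case Rejected(reason: String)

// Hypothetical third-party API, modeled in memory for illustration.
final class RemoteApi:
  private var addresses = Vector.empty[Address]
  private var shipping  = Option.empty[Address]
  private var resources = Map.empty[String, String]

  def addAddress(a: Address): Unit = { addresses = addresses :+ a }  // repeats pile up
  def setShippingAddress(a: Address): Unit = { shipping = Some(a) }  // repeats change nothing
  def create(id: String, body: String): CreateResult =
    if body.isEmpty then CreateResult.Rejected("empty body")
    else if resources.contains(id) then CreateResult.AlreadyExists
    else
      resources = resources.updated(id, body)
      CreateResult.Created

  def addressCount: Int = addresses.size
  def shippingAddress: Option[Address] = shipping

// With an id we chose ourselves, "already exists" means an earlier attempt got through.
def ensureCreated(api: RemoteApi, id: String, body: String): Either[String, Unit] =
  api.create(id, body) match
    case CreateResult.Created | CreateResult.AlreadyExists => Right(())
    case CreateResult.Rejected(reason)                     => Left(reason)
```

Treating AlreadyExists as success is only sound if the id is derived from the intent, so that nothing else could have created a resource under it.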

When You Don’t Know

You don’t know. Assume it does not, because that is the assumption that fails safe. You can’t make their operation idempotent, but you can make your calling of it idempotent: record the intent before you call, record the result after it returns, and on a retry consult your own record before reaching out again. You have wrapped an effect you don’t trust in a boundary you do. A window remains, between their effect and your record of it, which is the same atomic-write problem as before, now stretched across a wire. You shrink it, and you reconcile whatever slips through.
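One way to sketch that boundary, assuming an in-memory ledger in place of a durable one: GuardedCaller and CallState are hypothetical names, and the choice to refuse a retry while the outcome is unknown (rather than call again) is one possible policy, not the only one.

```scala
import scala.collection.mutable

final case class IntentId(value: String)

enum CallState:
  case Started                   // intent recorded; outcome not yet known
  case Finished(result: String)  // result recorded

// Wraps a remote call we do not trust in a record we do (in-memory here; assumption).
final class GuardedCaller(remote: String => String):
  private val ledger = mutable.Map.empty[IntentId, CallState]

  def call(intent: IntentId, payload: String): Either[String, String] =
    ledger.get(intent) match
      case Some(CallState.Finished(result)) =>
        Right(result)                                    // answered from our own record
      case Some(CallState.Started) =>
        Left("outcome unknown: reconcile before calling again")
      case None =>
        ledger(intent) = CallState.Started               // record the intent before calling
        val result = remote(payload)                     // the effect we do not control
        ledger(intent) = CallState.Finished(result)      // record the result after it returns
        Right(result)
```

If the remote call throws or the process dies mid-call, the ledger is left at Started. That entry is the window described above, made visible so a reconciliation job can find it.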

Orchestrating Across Systems You Only Partly Control

When the operation spans two or more of these at once, you are orchestrating, and the constraint is the one you already know: no transaction reaches across clouds and operators. So model the orchestration as something you can stop and resume. Record what has happened as you go, so a restart after a crash picks up where the last run left off instead of starting over… the running state is data you keep, not a position you lose. Keep every step idempotent, by the rules above, so resuming is always safe. Do the irreversible step last, after the steps you can still take back, so an early failure costs nothing permanent. And when something fails after the irreversible step, you do not undo it; you recover forward, with a deliberate compensating act, because the thing has happened and the only honest response is the next thing you choose to do about it.

None of this left the discipline. Each step is still an operation modeled as a function of its intent; the orchestration is still a value you fold to know where you stand, and a decision you can take again without harm. Correctness across a system you only partly control is the same property as correctness in a single operation, built into how you modeled it, not bought from the wire between the parts.
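As a rough sketch of "a value you fold," the orchestration below keeps its progress as a log of completed steps. The step names are invented for illustration, the log is an in-memory Vector where a real system would persist it, and run is assumed to be idempotent per step, as required above.

```scala
// Hypothetical steps; Dispatch is the irreversible one, so it is ordered last.
enum Step:
  case ReserveStock, PrintLabel, Dispatch

// Fold the recorded log into the set of steps already completed.
def completed(log: Vector[Step]): Set[Step] =
  log.foldLeft(Set.empty[Step])(_ + _)

// The next step is the first one, in declaration order, that the log does not contain.
def nextStep(log: Vector[Step]): Option[Step] =
  val done = completed(log)
  Step.values.find(s => !done.contains(s))

// Run the remaining steps, extending the log after each success; stop at the first failure.
// Calling resume again with the returned log continues from where the run stopped.
def resume(log: Vector[Step], run: Step => Boolean): Vector[Step] =
  nextStep(log) match
    case Some(step) if run(step) => resume(log :+ step, run)
    case _                       => log
```

A run that fails at Dispatch returns a log holding the first two steps. Feeding that log back into resume executes only Dispatch. Compensation after an irreversible step is left out of the sketch.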

Exactly once is a story we tell because it is the behavior we want: do the thing, one time, done. Over any channel that can lose a reply (every channel?) it is a lie, and not only when the thing is money. It is a lie for a door, a license, a shipment, a server, a grant of authority. What you cannot take back, you can only recognize… and recognition is not something you buy and bolt on. It is something you model in. The reliability engineer reaches outward, for a platform that will remember on the code’s behalf. The functional answer reaches inward and makes the operation a function of its intent, so that asking twice and asking once arrive at the same value. Do that, and the retry stops being a second door flung open. It becomes what it should have been all along: the same answer, delivered twice.

This is Part 17 in an ongoing series. If you found this useful, Part 16 looks at how large action models need deliberately constrained action surfaces, and why a governed capability layer matters more than a broad pile of endpoints. Read "Large Action Models Need Small Action Surfaces"

Frequently Asked Questions

What is idempotency in distributed systems?

An operation is idempotent if running it multiple times produces the same result as running it once. In distributed systems, where network failures can prevent acknowledgments from reaching the caller, idempotency ensures that retrying a request does not cause unintended duplicate effects.

What is an idempotency key?

An idempotency key is a unique identifier, generated by the caller, that names one specific intended act. It travels with the request so the server can recognize whether a retry refers to an earlier attempt and return the same result without performing the operation again. The key must be generated once, when the intent forms, and reused on every retry of that same intent.

How do you model idempotent operations in Scala?

Model the operation’s result as a sum type with two cases: Done, for operations that performed the act on this attempt, and AlreadyDone, for operations that recognize a prior attempt and return the same result. The function signature IO[GrantError, Outcome[License]] makes both paths explicit and forces the caller to handle each one, so the retry path and the first-attempt path cannot quietly diverge.

Why does exactly-once delivery not exist?

Over any network channel that can lose a reply, the caller cannot distinguish between a failed operation and a succeeded operation whose acknowledgment was lost. The only safe response is to retry, which means operations are attempted at least once. What looks like exactly-once is at-least-once delivery combined with an operation that recognizes repeats: a property you model into the operation, not one a network channel can guarantee.

How do you handle idempotency when calling third-party APIs that do not support idempotency keys?

First, try to find an idempotent shape for the call: setting a value rather than adding one, or creating a resource under a client-chosen unique ID. If no idempotent shape exists, record your intent before calling and your result after it returns. On a retry, consult your own record before reaching out again. This wraps an effect you do not control inside a boundary you do.

What is the relationship between idempotency and referential transparency?

A referentially transparent expression can be replaced by its value without changing what the program does: same input, same result, and no further observable effect from evaluating it again. An operation modeled as a function of its intent achieves the same property for effectful code: same intent in, same result out, regardless of how many times it is called. This is referential transparency earned for code that moves the real world.

Scala, Functional Programming, Learning Scala

Tony Moores