Idempotency, retries, and correlation IDs — the habits I wish I'd learned before my first production message loss.
I remember the first time I was responsible for losing a message in a production queue. A payment confirmation from an upstream provider came in, our handler threw on a null field we didn't expect, the message was acked anyway, and nobody noticed for about four hours — right up until the support ticket showed up.
That single bug taught me more about event-driven architecture than any book had. The failure modes are sneaky because most of the time everything works. And then one day, it doesn't, and you're trying to reconstruct what happened from half a log file.
Here's what I wish I'd known earlier.
At-least-once isn't a problem, it's the deal
Most queues and event buses give you at-least-once delivery. People tend to read that as a limitation. It's not. It's a contract. You can either design around duplicates or pretend they won't happen, and only one of those strategies actually works.
Every consumer I write now assumes it will be called more than once with the same message. Usually that means a unique idempotency key — the upstream event ID, a composite of tenant plus operation, or something similarly stable. The handler checks whether that key has already been processed before doing anything side-effectful. If it has, the handler returns successfully without repeating the work.
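To make that concrete, here's a minimal sketch of an idempotent handler. The names (`handle_payment_confirmed`, the `processed` table) are hypothetical, and I'm using an in-memory SQLite table as the key store — in practice this would be any store with a unique constraint (Postgres, DynamoDB, Redis `SET NX`):

```python
import sqlite3

# Hypothetical processed-key store. The unique constraint does the heavy
# lifting: inserting the key is both the duplicate check and the claim.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")

def handle_payment_confirmed(event: dict) -> str:
    # Stable idempotency key: here, the upstream provider's event ID.
    key = event["event_id"]
    try:
        db.execute("INSERT INTO processed (idempotency_key) VALUES (?)", (key,))
        db.commit()
    except sqlite3.IntegrityError:
        # Already processed: return successfully without repeating the work.
        return "duplicate-skipped"
    # ...side-effectful work goes here (record payment, send receipt)...
    return "processed"
```

Call it twice with the same event and the second call is a no-op — which is exactly the property that makes redelivery (and deliberate replay) harmless.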
The subtle win is that this also makes replay safe. Which matters a lot when you need it.
Write the intent down before you do the work
The other habit I've picked up: persist the intent of an event before calling any external system.
That sounds like extra work, but it's the difference between "we crashed mid-operation and now we're in an unknown state" and "we crashed mid-operation and can pick up exactly where we left off." Writing intent first means your database is always the source of truth about what's supposed to happen, and the external call becomes something you can retry until it succeeds.
This pattern — sometimes called the outbox pattern, sometimes just "do it right" — is the single biggest reliability upgrade I've shipped in async systems.
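A sketch of the shape, again with hypothetical names and SQLite standing in for your real database: the intent is committed first as a `pending` outbox row, and a relay retries the external call until it succeeds. A crash between the two steps leaves a pending row, not an unknown state:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)")

def record_intent(payload: dict) -> int:
    # Step 1: persist what is SUPPOSED to happen, in the same transaction as
    # any business-state change. If we crash after this commit, the intent
    # survives and the relay can pick up exactly where we left off.
    cur = db.execute(
        "INSERT INTO outbox (payload, status) VALUES (?, 'pending')",
        (json.dumps(payload),),
    )
    db.commit()
    return cur.lastrowid

def relay_pending(send) -> int:
    # Step 2: a relay loop calls the external system for each pending intent.
    # `send` is the (possibly flaky) external call; rows stay 'pending' on
    # failure and are retried on the next pass.
    delivered = 0
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending'"
    ).fetchall()
    for row_id, payload in rows:
        try:
            send(json.loads(payload))
        except Exception:
            continue  # leave pending; retry on the next pass
        db.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (row_id,))
        db.commit()
        delivered += 1
    return delivered
```

Note the relay gives you at-least-once delivery to the external system (a crash after `send` but before the status update means a resend) — which is fine, because the receiving side is designed for duplicates, as above.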
Correlation IDs are free superpowers
When something goes wrong in an async pipeline, the hardest part isn't usually the fix. It's figuring out what actually happened. A message came in, it got processed, something downstream tripped — which hop was responsible?
Tag every event with a correlation ID the moment it enters your system and propagate it everywhere: logs, traces, downstream API calls, retries. The cost is basically zero. The payoff is being able to answer "what happened to this specific transaction?" in seconds instead of hours.
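One way to make propagation automatic, sketched with Python's standard library (`contextvars` plus a logging filter — the `X-Correlation-ID` header name is a common convention, not a standard):

```python
import contextvars
import logging
import uuid

# The correlation ID travels with the logical flow of one message,
# not with a particular thread.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    # Stamps the current correlation ID onto every log record.
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

log = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
log.addHandler(handler)

def handle_event(event: dict) -> str:
    # Assign (or propagate) the ID the moment the message enters the system.
    cid = event.get("correlation_id") or str(uuid.uuid4())
    correlation_id.set(cid)
    log.info("processing event")
    # Downstream calls would forward cid, e.g. as an X-Correlation-ID header.
    return cid
```

Once the filter is in place, every log line in the pipeline carries the ID without any handler having to remember to include it.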
A quiet rule I live by
If I can't explain in one paragraph what happens when a message is delivered twice, I'm not done designing the system. That one question forces you to think about idempotency, ordering, side effects, and observability all at once. Until you can answer it cleanly, every new feature you bolt on is going to inherit the ambiguity.
Async systems reward the teams that think these things through up front. They punish the ones that don't.