Resilient Webhooks in Multi-Tenant Systems

Paul Cushing
Senior Developer

Published 3/25/2026

The boring infrastructure decisions that separate webhooks that work from webhooks that wake you up at 2 AM.

Webhooks are one of those things that look trivially easy on a whiteboard. You send an HTTP POST. The other side responds 200. Done.

Then you run it in production for a month, and reality shows up. The receiver was slow. A retry storm took down a downstream service. One particularly active tenant hogged the worker pool and everybody else's webhooks stopped going out. A customer complained that their integration received the same event nine times.

I've debugged enough of these to stop trusting the "it's just an HTTP POST" framing. Here's what I now consider table stakes for webhook infrastructure that actually has to work.

Isolate tenants from each other

If your product is multi-tenant, the first question about webhooks isn't "how do we deliver them," it's "how do we stop one tenant from ruining delivery for everyone else."

In practice that means: partition the delivery work. At minimum, apply per-tenant concurrency limits so no single customer can exhaust the worker pool. At scale, you probably want per-tenant queues, so a backed-up receiver on one tenant doesn't slow the pipeline for anybody else.

The wrong version of this is "we'll just add more workers." That's a tax on everybody else for one tenant's problem. Isolation fixes the root cause.

Sign everything, rotate the secrets

Every webhook payload I send goes out with a signature. The receiver verifies it. If the signature doesn't match, the event gets dropped. This isn't complicated to implement — HMAC over the body with a shared secret is enough for most cases — and the value is enormous.
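A minimal version of that scheme, assuming HMAC-SHA256 over the raw body and a header name of your choosing:

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    # HMAC-SHA256 over the raw request body; the sender puts this hex
    # digest in a header (the header name is up to you).
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest does a constant-time comparison, avoiding a timing
    # side channel when checking the tag.
    return hmac.compare_digest(sign(secret, body), signature)
```

The receiver recomputes the digest over the exact bytes it received and drops the event if `verify` returns False.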

The piece people miss is rotation. Secrets get leaked. Engineers leave. Laptops go missing. A webhook signing secret that hasn't been rotated in two years is a loaded weapon pointing at your customers.

Build key rotation into the system from the start. Support multiple active signing keys during the rotation window so receivers can update on their own schedule. It's a small amount of extra work up front that saves an embarrassing incident response later.
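Supporting multiple active keys is a small extension of signature verification: during the rotation window, accept a payload signed with any currently active secret. A sketch, with the function name as an assumption:

```python
import hashlib
import hmac

def verify_any(active_secrets: list[bytes], body: bytes, signature: str) -> bool:
    # During rotation both the old and new secret are active; a payload
    # signed with either one is accepted. Once receivers have migrated,
    # the old secret is removed from the list and stops working.
    return any(
        hmac.compare_digest(
            hmac.new(secret, body, hashlib.sha256).hexdigest(), signature
        )
        for secret in active_secrets
    )
```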

Design for replay, not just delivery

The thing I most want in a webhook system at 2 AM is the ability to replay. A downstream integration went down for an hour. Ten thousand events got dropped. Now what?

If the answer is "now we ask engineering to write a one-off script," you're in for a long night. If the answer is "now we open the operator tool, filter by tenant and time range, and click replay," you're home in twenty minutes.

That distinction is worth the week it takes to build the replay tooling. Dead letter queues, replay windows, a simple admin UI — none of these are exciting features, but they're the difference between a routine operational task and an incident.
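The core of that operator tool is a simple query over the dead letter store. A sketch of the "filter by tenant and time range" step, with the record fields and names purely illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical dead-letter record; field names are illustrative.
@dataclass
class DeadLetter:
    tenant_id: str
    event_id: str
    failed_at: datetime
    payload: bytes

def select_for_replay(dead_letters, tenant_id: str,
                      start: datetime, end: datetime) -> list[DeadLetter]:
    # The operator-tool query: filter by tenant and time window,
    # oldest first, then hand the result back to the delivery queue.
    return sorted(
        (d for d in dead_letters
         if d.tenant_id == tenant_id and start <= d.failed_at <= end),
        key=lambda d: d.failed_at,
    )
```

Re-enqueueing the selected records in order preserves the event sequence the receiver would have seen originally.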

Monitor delivery like it's revenue

The last thing: treat webhook delivery health as a top-line metric. Failed deliveries, unacked events, queue depth per tenant, p99 delivery latency — these are not nerd stats. For any integration-heavy product, they're customer experience metrics.
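A toy version of the alerting side, evaluating per-tenant failure rate and queue depth against thresholds. The stats shape and threshold values are assumptions; tune them per product.

```python
def delivery_alerts(stats: dict, max_failure_rate: float = 0.05,
                    max_queue_depth: int = 1_000) -> list[tuple[str, str]]:
    # stats: {tenant_id: {"sent": int, "failed": int, "queue_depth": int}}
    # Returns (tenant_id, alert_kind) pairs for every threshold breached.
    alerts = []
    for tenant, s in stats.items():
        total = s["sent"] + s["failed"]
        if total and s["failed"] / total > max_failure_rate:
            alerts.append((tenant, "failure_rate"))
        if s["queue_depth"] > max_queue_depth:
            alerts.append((tenant, "queue_depth"))
    return alerts
```

In a real system these numbers would come from your metrics pipeline rather than an in-memory dict, but the per-tenant breakdown is the part that matters: a global failure rate hides the one tenant whose integration is down.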

A webhook that fails silently is worse than one that fails loudly, because the customer finds out from their users instead of from you. Build the alerting before you need it.

Webhooks are simple to ship and hard to run well. Invest in the boring parts.

Systems Design, Boise Tech, Architecture, Treasure Valley
