Building a Scalable Multi-Channel Notification Service
A practical architecture for a notification service that delivers across email, SMS, push, and in-app channels. It covers user preferences, rate-limiting, synchronous and batch delivery, queueing with retries, high availability, and the trade-offs between latency, cost, and reliability.
Why notifications are harder than they look
A "send a notification" feature starts as one line of code and ends as a distributed system. The moment you support more than one channel, more than a handful of users, and any expectation of reliability, you are building a notification service — with its own data model, queues, retries, and failure modes.
This post walks through the architecture of a scalable notification service that delivers across email, SMS, push, and in-app channels. We cover user preferences, rate-limiting, synchronous vs. batch delivery, queueing with retries, high availability, and the trade-offs between latency, cost, and reliability that shape every decision.
A channel-agnostic core
The first design rule: callers should not know or care how a notification is delivered. A service emitting order.shipped should not contain SMS API keys or email templates. Instead, producers publish a logical notification intent:
```json
{
  "event": "order.shipped",
  "user_id": "u_12345",
  "data": { "order_id": "A-1001", "eta": "2026-06-04" },
  "priority": "normal"
}
```
The notification service then resolves which channels to use, which template to render, and whether the user even wants it. This separation lets you add a new channel (say, WhatsApp) without touching a single producer.
Internally, the service is built around a few clean abstractions:
- Notification: the logical intent above.
- Channel adapter: a uniform interface (`send(rendered, recipient)`) implemented per provider.
- Template: channel-specific rendering of the same event.
- Recipient resolver: maps `user_id` to concrete addresses (email, phone, device tokens).
```text
producer → intent → [preferences] → [rate-limit] → [template] → queue → channel adapter → provider
```
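To make that flow concrete, here is a minimal Python sketch of the pipeline. `Intent`, `Job`, `run_pipeline`, and the dict-based templates and address book are hypothetical stand-ins for the abstractions above; a real service would look addresses up through the recipient resolver and push each job onto a queue rather than returning a list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Intent:
    event: str
    user_id: str
    data: dict
    priority: str = "normal"


@dataclass
class Job:
    # One unit of queued work: a rendered message bound to one channel and address.
    channel: str
    address: str
    body: str


# A filter stage (preferences, rate limits, ...) may only narrow the channel list.
Stage = Callable[[Intent, list[str]], list[str]]


def run_pipeline(intent: Intent, channels: list[str], stages: list[Stage],
                 templates: dict[tuple[str, str], str],
                 addresses: dict[str, dict[str, str]]) -> list[Job]:
    for stage in stages:
        channels = stage(intent, channels)
    user_addresses = addresses.get(intent.user_id, {})
    jobs = []
    for channel in channels:
        template = templates.get((intent.event, channel))
        if template is None or channel not in user_addresses:
            continue  # no template or no address for this channel: skip it
        jobs.append(Job(channel, user_addresses[channel], template.format(**intent.data)))
    return jobs  # in the real service each job would be enqueued here


def drop_sms(intent: Intent, channels: list[str]) -> list[str]:
    return [c for c in channels if c != "sms"]  # stand-in for the preference stage


# The producer supplies only the intent; channel choice happens inside the service.
intent = Intent("order.shipped", "u_12345", {"order_id": "A-1001", "eta": "2026-06-04"})
templates = {
    ("order.shipped", "email"): "Order {order_id} arrives {eta}",
    ("order.shipped", "push"): "Order {order_id} shipped",
}
addresses = {"u_12345": {"email": "user@example.com", "push": "device-token-1"}}
jobs = run_pipeline(intent, ["email", "push", "sms"], [drop_sms], templates, addresses)
```

Adding a channel here means adding a template and an adapter; `Intent` and its producers stay unchanged.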
Modeling the four channels
Each channel has different constraints, and the architecture must respect them rather than flatten them:
| Channel | Latency expectation | Cost | Failure mode |
|---|---|---|---|
| In-app | Instant | Near-zero | User offline (store + show later) |
| Push | Seconds | Cheap | Stale device token, silent drop |
| Email | Seconds–minutes | Cheap | Bounce, spam folder, deferral |
| SMS | Seconds | Expensive | Carrier rejection, invalid number |
Each channel adapter implements the same interface but encapsulates provider quirks — APNs/FCM for push, an SMTP or transactional-email API for email, a carrier gateway for SMS, and a database write plus websocket push for in-app.
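```python
from __future__ import annotations  # annotations may name types defined elsewhere

from typing import Protocol

class ChannelAdapter(Protocol):
    async def send(self, message: RenderedMessage, recipient: Recipient) -> DeliveryResult:
        ...
```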
A DeliveryResult is normalized: delivered, retryable_failure, or permanent_failure. The orchestration layer never needs provider-specific error handling — the adapter translates carrier-speak into those three outcomes.
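One way to do that translation, sketched for a hypothetical HTTP-based SMS gateway. The mapping (2xx delivered, 429 and 5xx retryable, every other response permanent) is an assumption for illustration; real providers document their own error codes, and each adapter would encode its provider's rules. `Outcome` and `normalize_sms_response` are illustrative names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    DELIVERED = "delivered"
    RETRYABLE = "retryable_failure"
    PERMANENT = "permanent_failure"


@dataclass
class DeliveryResult:
    status: Outcome
    reason: str = ""


def normalize_sms_response(http_status: int,
                           error_code: Optional[str] = None) -> DeliveryResult:
    """Map a hypothetical SMS gateway response onto the three normalized outcomes."""
    if 200 <= http_status < 300:
        return DeliveryResult(Outcome.DELIVERED)
    if http_status == 429 or http_status >= 500:
        # Throttling and server errors are usually transient: retry later.
        return DeliveryResult(Outcome.RETRYABLE, error_code or f"http_{http_status}")
    # Anything else (e.g. a 400 for an invalid number) will fail the same way on retry.
    return DeliveryResult(Outcome.PERMANENT, error_code or f"http_{http_status}")
```

Because `Outcome` subclasses `str`, plain comparisons such as `result.status == "delivered"` keep working in the orchestration layer.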
User preferences come first
Before anything is rendered or queued, the service consults preferences. This is both a product requirement (users hate spam) and a cost lever (you don't pay for SMS nobody wants).
A preference model usually has three layers:
- Category opt-in: does the user want `marketing`, `transactional`, or `security` notifications?
- Channel preference: for this category, which channels are allowed?
- Quiet hours / frequency caps: time windows and digest preferences.
```json
{
  "user_id": "u_12345",
  "categories": {
    "transactional": { "email": true, "push": true, "sms": false },
    "marketing": { "email": true, "push": false, "sms": false }
  },
  "quiet_hours": { "tz": "Africa/Cairo", "from": "22:00", "to": "08:00" }
}
```
A critical rule: transactional and security notifications usually bypass marketing opt-outs but still respect channel-level addressability (you can't SMS a user with no phone number). Encode this as policy, not scattered if statements.
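A minimal sketch of such a policy over the preference document above. It assumes a `security` category that ignores opt-outs and quiet hours, applies quiet hours only to `marketing`, and expects the caller to convert the current time into the user's `tz` first. Transactional messages use their own category flags, so a marketing opt-out never suppresses them. `resolve_channels` and `in_quiet_hours` are hypothetical names.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight, e.g. 22:00-08:00


def resolve_channels(prefs: dict, category: str, addressable: set[str],
                     local_now: time) -> tuple[list[str], bool]:
    """Return (channels, deferred) under a hypothetical policy.

    `addressable` holds channels the recipient resolver found an address for;
    `local_now` must already be converted to the user's quiet_hours tz.
    """
    if category == "security":
        # Security notices ignore opt-outs and quiet hours but still need an address.
        return sorted(addressable), False
    enabled = prefs.get("categories", {}).get(category, {})
    channels = sorted(ch for ch, on in enabled.items() if on and ch in addressable)
    quiet = prefs.get("quiet_hours")
    if category == "marketing" and quiet and in_quiet_hours(
            local_now, time.fromisoformat(quiet["from"]), time.fromisoformat(quiet["to"])):
        return channels, True  # hold until the quiet window ends
    return channels, False
```

Keeping the rules in one function makes them testable and auditable, which scattered `if` statements in producers are not.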
Rate-limiting and deduplication
Rate-limiting protects three things: the user's attention, your provider quotas, and your bill. There are several limits worth enforcing at distinct scopes:
- Per-user, per-category: e.g. at most 5 marketing pushes per day.
- Per-event dedup: collapse duplicate `order.shipped` events fired twice within a window.
- Global per-provider: stay under the carrier's throughput ceiling.
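A token-bucket-style limiter in Redis handles per-user limits cheaply. This Lua script assumes the redis-cell module, whose `CL.THROTTLE` command implements the generic cell rate algorithm (GCRA), a leaky/token-bucket equivalent:
```lua
-- KEYS[1] = "ratelimit:{user_id}:{category}"; ARGV = max_burst, count, period
local res = redis.call('CL.THROTTLE', KEYS[1], ARGV[1], ARGV[2], ARGV[3])
-- res[1] is 0 when the action is allowed and 1 when it is rate-limited
return res[1] == 0
```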
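To show what a token bucket actually computes, here is an in-process Python sketch. `TokenBucket` is a hypothetical class, and because its state lives in a single process it only illustrates the algorithm; limits shared across workers still need Redis or a similar store.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-key token bucket: bursts up to `capacity`, refills `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.state: dict[str, tuple[float, float]] = {}  # key -> (tokens, last update)

    def allow(self, key: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self.state.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self.state[key] = (tokens, now)
        return allowed


# "At most 5 marketing pushes per day": burst of 5, refilled at 5 tokens per 86,400 s.
fake_now = [0.0]
bucket = TokenBucket(capacity=5, rate=5 / 86400, clock=lambda: fake_now[0])
key = "ratelimit:u_12345:marketing"
results = [bucket.allow(key) for _ in range(6)]  # the sixth call is rejected
```

A token bucket caps the long-run rate, not a calendar day: a full bucket plus a day of refills can admit up to 10 sends in one 24-hour window. If a strict daily cap matters, a per-day counter with an expiry is the simpler tool.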
Deduplication uses an idempotency key derived from (event, user_id, business_id) stored with a TTL. If the key already exists, the notification is dropped before it ever reaches a queue — saving compute and money.
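A sketch of the dedup check, with an in-memory map standing in for Redis. With redis-py the same check-and-set is one atomic call, `r.set(key, 1, nx=True, ex=ttl)`, which returns a truthy value only when the key did not already exist. `Deduplicator` and its key format are hypothetical.

```python
import time
from typing import Callable


class Deduplicator:
    """Drops repeats of (event, user_id, business_id) seen within `ttl` seconds.

    Single-process stand-in for Redis SET with NX and EX; expired entries are
    overwritten on reuse but never purged, which is fine only for a sketch.
    """

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.expiry: dict[str, float] = {}

    @staticmethod
    def key(event: str, user_id: str, business_id: str) -> str:
        return f"dedup:{event}:{user_id}:{business_id}"

    def first_time(self, event: str, user_id: str, business_id: str) -> bool:
        k = self.key(event, user_id, business_id)
        now = self.clock()
        if self.expiry.get(k, 0.0) > now:
            return False  # duplicate inside the window: drop before queueing
        self.expiry[k] = now + self.ttl
        return True
```

A worker calls `first_time(...)` before enqueueing and simply returns when it is `False`; the same key can double as the idempotency key passed to the provider.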
Synchronous vs. batch delivery
Not every notification has the same urgency, and treating them uniformly wastes either latency or money.
Synchronous (real-time) path. Security codes, password resets, and OTPs must go out now. These skip batching, take a high-priority queue, and accept higher per-message cost for low latency. The producer may even await a delivery acknowledgment.
Batch path. Digests, newsletters, and non-urgent updates are accumulated and flushed on a schedule. Batching unlocks real savings: many email providers bill less per message in bulk, and a daily digest of 20 events becomes one email instead of twenty.
```text
priority = high → low-latency queue → immediate worker
priority = bulk → aggregation window → digest builder → batch send
```
The decision is driven by the notification's priority and category, resolved by policy — not hard-coded per producer.
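A sketch of both halves under a hypothetical policy: `security` or `high` priority goes to a real-time queue, `marketing` or `bulk` goes to the digest path, and everything else to a default queue. `route`, `Event`, and `DigestBuilder` are illustrative names, and in practice `flush()` would run on a schedule rather than inline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str
    category: str
    priority: str
    summary: str


def route(event: Event) -> str:
    """Hypothetical policy: pick a queue from category and priority."""
    if event.category == "security" or event.priority == "high":
        return "realtime"
    if event.category == "marketing" or event.priority == "bulk":
        return "digest"
    return "default"


class DigestBuilder:
    """Buffers digest-bound events per user; flush() emits one message per user."""

    def __init__(self) -> None:
        self.pending: dict[str, list[str]] = defaultdict(list)

    def add(self, event: Event) -> None:
        self.pending[event.user_id].append(event.summary)

    def flush(self) -> dict[str, str]:
        # Run on a schedule (e.g. daily); each user's events collapse into one body.
        digests = {
            user: f"{len(items)} updates:\n" + "\n".join(f"- {s}" for s in items)
            for user, items in self.pending.items()
        }
        self.pending.clear()
        return digests


builder = DigestBuilder()
events = [Event("u_12345", "marketing", "bulk", f"Deal #{i}") for i in range(20)]
events.append(Event("u_12345", "security", "high", "Your login code"))
for e in events:
    if route(e) == "digest":
        builder.add(e)
    # realtime/default events would go straight to their queues instead
digests = builder.flush()  # one digest body for u_12345 instead of 20 emails
```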
Queueing with retries
Queues are the backbone of reliability. Putting a durable queue (Kafka, SQS, RabbitMQ, or Redis Streams) between the API and the channel workers decouples accepting a notification from delivering it. The API can return 202 Accepted in milliseconds; delivery happens asynchronously.
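The accept/deliver split can be sketched with an in-process `asyncio.Queue`. It is only a stand-in: unlike Kafka or SQS it is not durable, so it shows the decoupling, not the reliability. `accept`, `worker`, and returning HTTP status codes as plain integers are illustrative assumptions, not any web framework's API.

```python
import asyncio


async def accept(queue: asyncio.Queue, intent: dict) -> int:
    """Validate cheaply, enqueue, and answer at once; delivery happens later."""
    if "event" not in intent or "user_id" not in intent:
        return 400
    await queue.put(intent)
    return 202  # Accepted: queued, not yet delivered


async def worker(queue: asyncio.Queue, delivered: list) -> None:
    while True:
        intent = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for rendering and the provider call
            delivered.append(intent["event"])
        finally:
            queue.task_done()


async def main() -> tuple[list[int], list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    delivered: list[str] = []
    consumer = asyncio.create_task(worker(queue, delivered))
    statuses = [
        await accept(queue, {"event": "order.shipped", "user_id": "u_12345"}),
        await accept(queue, {"user_id": "u_12345"}),  # missing "event": rejected
    ]
    await queue.join()  # demo only: wait until the worker drains the queue
    consumer.cancel()
    return statuses, delivered


statuses, delivered = asyncio.run(main())
```

Both `accept` calls return before any delivery work finishes, which is the property that lets the API answer in milliseconds.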
Retries are where most naive implementations fail. The rules that matter:
- Exponential backoff with jitter: retry at 1s, 2s, 4s, 8s… plus randomness to avoid thundering herds.
- Bounded attempts: give up after N tries and route to a dead-letter queue (DLQ) for inspection.
- Distinguish retryable vs. permanent: a `500` from the SMS gateway is retryable; an "invalid phone number" is not. Retrying permanent failures just burns money.
- Idempotent sends: pass an idempotency key to the provider so a retry after a timeout doesn't double-send.
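```python
import asyncio
import random

MAX_ATTEMPTS = 5  # illustrative; tune per notification class

def backoff_with_jitter(attempt: int) -> float:
    return 2 ** attempt + random.uniform(0, 1)  # 1s, 2s, 4s, 8s... plus up to 1s of jitter

async def deliver(msg, adapter, record_success, dead_letter):
    for attempt in range(MAX_ATTEMPTS):
        result = await adapter.send(msg.rendered, msg.recipient)
        if result.status == "delivered":
            return record_success(msg)
        if result.status == "permanent_failure":
            return dead_letter(msg, result.reason)  # retrying would only burn money
        if attempt < MAX_ATTEMPTS - 1:  # retryable_failure: back off, then retry
            await asyncio.sleep(backoff_with_jitter(attempt))
    return dead_letter(msg, "max_attempts_exhausted")
```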
The DLQ is not a graveyard — it's an operational signal. A spike in dead-lettered SMS often means a misconfigured number format or a carrier outage, and it should page someone.
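A simple way to turn the DLQ into a signal is a sliding-window counter per channel. This sketch assumes a hypothetical `DlqSpikeDetector` fed on every dead-letter write; the threshold and window are placeholders to tune against normal traffic, and the paging hook is left out.

```python
import time
from collections import deque
from typing import Callable


class DlqSpikeDetector:
    """Reports a spike when more than `threshold` messages are dead-lettered
    for one channel within the last `window` seconds."""

    def __init__(self, threshold: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self.events: dict[str, deque] = {}

    def record(self, channel: str) -> bool:
        now = self.clock()
        recent = self.events.setdefault(channel, deque())
        recent.append(now)
        # Drop timestamps that have slid out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        return len(recent) > self.threshold  # True means "page someone"
```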
High availability
A notification service is frequently on the critical path for security flows, so availability matters. Several patterns keep it up:
- Stateless workers: scale horizontally; any worker can process any queue message.
- Multi-region queues and DBs: survive a regional outage; replicate preference data.
- Provider failover: keep a secondary email/SMS provider and fail over when the primary's error rate crosses a threshold. This is where a normalized `DeliveryResult` pays off again.
- Circuit breakers: when a provider is clearly down, stop hammering it; shed to the secondary or pause that channel and let the queue absorb the backlog.
- Backpressure: if workers can't keep up, the durable queue buffers rather than dropping. Monitor queue depth as a leading indicator.
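A sketch of a circuit breaker combined with provider failover. For simplicity it trips after a run of consecutive failures rather than on an error-rate threshold, then lets calls through again after a cooldown to probe whether the provider has recovered (the half-open state). `CircuitBreaker`, `send_with_failover`, and the boolean `send` callables are hypothetical; a real adapter would return the normalized `DeliveryResult` instead of a bool.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; probes again after `cooldown` s."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open after the cooldown: calls go through again as probes.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open; a failed probe restarts the cooldown


def send_with_failover(providers, message) -> str:
    """`providers` is a list of (name, breaker, send) tuples in preference order."""
    for name, breaker, send in providers:
        if not breaker.available():
            continue  # skip providers whose breaker is open
        ok = send(message)
        breaker.record(ok)
        if ok:
            return name
    return "queued"  # nothing succeeded: leave the message on the durable queue
```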
Crucially, degrade gracefully: if push is down, an in-app notification still lands, and the user sees it next time they open the app. No single channel failure should lose the message.
The core trade-off: latency vs. cost vs. reliability
Almost every decision in this system is a point on a triangle:
- Latency. Synchronous delivery, dedicated high-priority queues, and provider redundancy reduce latency — at higher cost.
- Cost. Batching, digesting, deduplication, and aggressive preference filtering cut cost — at the price of latency or completeness.
- Reliability. Retries, DLQs, multi-provider failover, and replication raise reliability — at the cost of complexity and infrastructure spend.
You don't pick one globally; you pick per notification class:
| Class | Latency | Cost | Reliability |
|---|---|---|---|
| OTP / security | Lowest | Highest (accepted) | Highest |
| Transactional | Low | Medium | High |
| Marketing/digest | Relaxed | Lowest | Best-effort |
Encoding this as policy keyed on category and priority is what keeps the service both cheap and trustworthy — the OTP path spends what it must, and the newsletter path spends as little as possible.
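A minimal sketch of that policy as data, keyed on category alone for brevity. `DeliveryPolicy`, the queue names, and the attempt counts are hypothetical values chosen to mirror the table, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryPolicy:
    queue: str         # which queue / worker pool handles this class
    max_attempts: int  # retry budget before dead-lettering
    failover: bool     # fall back to a secondary provider?
    batchable: bool    # may be folded into a digest?


# Hypothetical values mirroring the table above; tune against real traffic and prices.
POLICIES = {
    "security": DeliveryPolicy("realtime", max_attempts=8, failover=True, batchable=False),
    "transactional": DeliveryPolicy("default", max_attempts=5, failover=True, batchable=False),
    "marketing": DeliveryPolicy("digest", max_attempts=2, failover=False, batchable=True),
}


def policy_for(category: str) -> DeliveryPolicy:
    # Unknown categories fall back to the transactional policy: slightly overspending
    # is safer than silently batching something important.
    return POLICIES.get(category, POLICIES["transactional"])
```

Workers then read `policy_for(category)` instead of branching on category themselves, so tightening a budget is a one-line change.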
Putting it together
A scalable notification service is, at its heart, a small pipeline with strong contracts:
- Accept a logical intent and return fast.
- Resolve preferences, addressability, and rate limits.
- Render per channel and route by priority.
- Deliver through durable queues with bounded, idempotent retries.
- Observe everything — delivery rates, DLQ depth, provider error rates, cost per channel.
Start with the channel-agnostic core and the preference model; those two decisions determine how cleanly everything else composes. The queues, retries, and failover are well-understood patterns — the hard part is keeping the boundaries clean so that adding a channel, swapping a provider, or tightening a budget is a localized change, not a rewrite.