
Retries, backoff, and dead-letter queues

Backend Architecture Notes: recover when you can, stop when retrying harms, make failure visible


In event-driven systems, failure is normal.

A consumer may fail to process a message.

A database may be temporarily unavailable.

An external API may time out.

A message may contain invalid data.

A deployment may restart workers halfway through processing.

The question is not whether failures will happen. They will.

The real question is:

What should the system do when processing fails?

Retries, backoff, and dead-letter queues are part of that answer.

They help a system recover from temporary failures, avoid making outages worse, and isolate messages that cannot be processed safely.

The simple idea of retries

A retry means trying the same operation again after it failed.

For example, a consumer receives an event:

PaymentSucceeded

It tries to update the Order Service database.

But the database connection fails.

The consumer can retry later.

This makes sense because many failures are temporary.

Examples:

Database connection timeout
Temporary network issue
External API rate limit
Message broker rebalance
Short deployment interruption
Service restart

If the system gives up immediately, small temporary problems become business failures.

Retries make distributed systems more resilient.

But retries can also be dangerous if they are not designed carefully.

Not every failure should be retried

One mistake is to retry every failure forever.

That sounds safe, but it can cause serious problems.

Some failures are temporary.

Some failures are permanent.

For example, this may be temporary:

Payment provider timeout

But this is probably permanent until the message or code changes:

Message payload is missing a required field

This may be temporary:

Database is overloaded

But this may be permanent:

Event schema version is unsupported

If a message can never be processed successfully, retrying it forever does not help.

It just wastes resources and may block other work.

A good system distinguishes between:

Transient failures
Permanent failures
Unknown failures

Transient failures should usually be retried.

Permanent failures should usually be rejected, moved aside, or sent to a dead-letter queue.

Unknown failures need limits, logging, and visibility.
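One way to make this distinction concrete is a small classification step that runs before any retry decision. The following is only a sketch in Python: the exception classes `InvalidPayloadError` and `UnsupportedSchemaError` are hypothetical names invented for illustration, and a real consumer would map its own driver and client errors.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


# Hypothetical application-level exceptions, defined here for illustration only.
class InvalidPayloadError(Exception):
    """The message is malformed or missing a required field."""


class UnsupportedSchemaError(Exception):
    """The event schema version is not supported by this consumer."""


def classify_failure(error: Exception) -> FailureKind:
    """Map an exception to the failure kind that drives the retry decision."""
    # Built-in timeout and connection errors are treated as temporary.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return FailureKind.TRANSIENT
    # Retrying will not help until the message or the code changes.
    if isinstance(error, (InvalidPayloadError, UnsupportedSchemaError)):
        return FailureKind.PERMANENT
    # Anything else gets limited retries plus logging.
    return FailureKind.UNKNOWN
```

The important design choice is that the default is "unknown", not "transient", so an unrecognized error never gets unlimited retries by accident.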

Immediate retries can make outages worse

Imagine a database is temporarily overloaded.

A consumer tries to process a message and fails.

Then it retries immediately.

It fails again.

Then it retries immediately again.

Now multiply that by thousands of messages and many consumer instances.

The system may create a retry storm.

Database is struggling
Consumers retry aggressively
Database gets even more traffic
More requests fail
More retries happen
The outage gets worse

Retries are meant to help recovery.

Bad retries can prevent recovery.

That is why backoff matters.

Backoff

Backoff means waiting before retrying.

Instead of retrying immediately, the system waits a bit.

For example:

First retry after 1 second
Second retry after 5 seconds
Third retry after 30 seconds
Fourth retry after 2 minutes

This gives the failing dependency time to recover.

It also reduces pressure on the system.

A common approach is exponential backoff:

1 second
2 seconds
4 seconds
8 seconds
16 seconds

In practice, systems often add a maximum delay so retries do not grow forever.

For example:

Retry with exponential backoff up to 5 minutes

Backoff is one of those small operational details that makes systems much more stable.
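As a rough sketch, capped exponential backoff can be computed like this. The function name and the default values (1 second base, 5 minute cap) are assumptions chosen to match the examples above.

```python
def backoff_delay(attempt: int, base_seconds: float = 1.0, max_seconds: float = 300.0) -> float:
    """Delay before retry number `attempt` (1-based): doubles each time, up to a cap."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    delay = min(base_seconds, max_seconds)
    for _ in range(attempt - 1):
        if delay >= max_seconds:
            break  # already at the cap, no need to keep doubling
        delay = min(max_seconds, delay * 2)
    return delay
```

With the defaults, the first five retries wait 1, 2, 4, 8, and 16 seconds, and any later retry waits at most 300 seconds.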

Jitter

If many consumers retry at exactly the same schedule, they may all retry together.

For example:

10,000 messages fail at 10:00:00
All retry after 10 seconds
All of them retry together at 10:00:10

This creates traffic spikes.

Jitter means adding randomness to the retry delay.

Instead of every message retrying after exactly 10 seconds, each message retries after a slightly different delay.

For example:

Retry after 8.3 seconds
Retry after 11.7 seconds
Retry after 9.1 seconds
Retry after 12.4 seconds

This spreads the load.

Jitter is especially useful when many workers are handling many messages at the same time.
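A minimal sketch of jitter, assuming a symmetric plus-or-minus spread around the base delay. The 25% ratio is an arbitrary illustrative value, and other schemes exist (for example, picking a random delay anywhere between zero and the computed backoff).

```python
import random
from typing import Optional


def with_jitter(delay_seconds: float, ratio: float = 0.25,
                rng: Optional[random.Random] = None) -> float:
    """Return the delay moved randomly by up to +/- `ratio` of its value."""
    rng = rng or random.Random()
    low = delay_seconds * (1 - ratio)
    high = delay_seconds * (1 + ratio)
    return rng.uniform(low, high)
```

For a 10 second base delay this yields values between 7.5 and 12.5 seconds, so messages that failed together no longer retry at the same instant.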

Retry limits

Retries should usually have limits.

For example:

Try 5 times
Then move the message to a dead-letter queue

or:

Retry for 30 minutes
Then mark the workflow as failed

The right limit depends on the business process.

For some systems, a few retries are enough.

For others, it may be reasonable to retry for hours.

For example, if a third-party reporting system is down, retrying for a long time may be fine.

If a payment status is unknown, the system may need a more careful process involving status checks and reconciliation.

The important thing is to define the retry policy intentionally.

Do not retry forever by accident.

Dead-letter queues

A dead-letter queue, often called a DLQ, is where messages go when they cannot be processed after retries.

For example:

Consumer receives message
Processing fails
Message is retried several times
Processing still fails
Message is moved to dead-letter queue

The dead-letter queue keeps the bad message from blocking the main flow.

It also gives engineers or operators a place to inspect what went wrong.

A DLQ is not a trash can.

It is an operational tool.

Messages in the DLQ should be visible, investigated, and either fixed, replayed, or intentionally discarded.
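The flow above (retry a bounded number of times, then move the message aside) can be sketched as follows. This is a simplified in-process illustration: `dead_letters` is a plain list standing in for a real dead-letter queue, and the `sleep` parameter is injectable only so the example can be tested without waiting.

```python
import time
from typing import Callable, List


def process_with_retries(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: List[dict],
    max_attempts: int = 5,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the handler succeeded, False if the message was dead-lettered."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as error:  # a real consumer would classify the error first
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff between attempts (jitter omitted for brevity).
                sleep(base_delay_seconds * 2 ** (attempt - 1))
    # Retry budget exhausted: move the message aside with some context.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return False
```

In a broker-based system the waiting usually happens in the broker or a retry queue, not in a sleeping worker, but the decision logic is the same.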

Why dead-letter queues matter

Without a DLQ, one bad message can cause a lot of trouble.

Imagine a consumer receives an event with an invalid payload.

Every time it tries to process the event, it crashes.

If the broker keeps redelivering the same message, the consumer may get stuck.

Other messages may be delayed.

The system may keep producing errors.

The team may get noisy alerts without a clean way to isolate the problem.

A DLQ gives the system a way to say:

This message could not be processed safely.
Move it aside.
Continue processing other messages.
Alert someone.

That is much better than retrying forever.

A dead-letter queue needs ownership

A dead-letter queue is only useful if someone owns it.

A common failure mode is this:

Messages are moved to the DLQ
No one checks the DLQ
The number grows silently
Business processes remain broken

That is dangerous.

If messages are important enough to be placed in a DLQ, they are important enough to monitor.

A team should know:

Who owns this DLQ?
What alert should fire?
How quickly should messages be investigated?
How are messages replayed?
When is it safe to discard a message?

A DLQ without ownership is just hidden data loss.

Poison messages

A poison message is a message that consistently causes processing to fail.

For example:

Invalid schema
Unexpected null value
Unsupported event version
Business invariant violation
Payload too large
Reference to missing data

A poison message will probably not succeed just because it is retried.

It needs a code fix, data fix, schema fix, or manual decision.

Dead-letter queues help isolate poison messages.

But the system should also capture useful context:

Original message
Error message
Stack trace or failure reason
Consumer name
Retry count
First failure time
Last failure time
Correlation ID
Business identifiers

This makes investigation easier.

A DLQ message without enough context can be painful to debug.
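The context listed above could be captured in a small envelope that travels with the dead-lettered message. The class and field names below are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class DeadLetterRecord:
    """Envelope stored in the DLQ alongside the original message."""
    original_message: dict
    consumer_name: str
    error_message: str
    failure_detail: str          # stack trace or short failure reason
    retry_count: int
    first_failure_at: datetime
    last_failure_at: datetime
    correlation_id: Optional[str] = None
    business_ids: Dict[str, str] = field(default_factory=dict)  # e.g. {"orderId": "order_123"}
```

Keeping the original message untouched inside the envelope matters: it is what gets replayed later.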

Retrying external APIs

External APIs are one of the most common sources of retry complexity.

For example:

Payment providers
Email providers
Shipping providers
Fraud detection services
CRM integrations
Analytics platforms

If an external API times out, the result may be unclear.

Did the request fail?

Or did it succeed, but the response was lost?

This distinction matters.

For example, with payments, blindly retrying can be dangerous.

Send charge request
Request times out
Retry charge request
Customer may be charged twice

For this kind of operation, retries must be combined with idempotency keys or status checks.

A safer flow is:

Send payment request with idempotency key
If timeout happens, retry with the same key
Or check payment status by payment ID

The retry strategy depends on the side effect.

Retrying a read request is usually less dangerous than retrying a write request.
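To illustrate the safer flow, here is a sketch with an in-memory stand-in for a payment provider. `FakePaymentProvider` is entirely hypothetical and only simulates a provider that honors idempotency keys; real providers have their own APIs and rules for keys. The key is derived from the order ID on the assumption that it must stay stable across retries and restarts.

```python
class FakePaymentProvider:
    """In-memory stand-in for a provider that honors idempotency keys."""

    def __init__(self):
        self.charges = {}               # idempotency key -> recorded charge
        self.lose_next_response = False

    def charge(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        # The charge is recorded at most once per key.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {"order_id": order_id, "amount_cents": amount_cents}
        # Simulate "it succeeded, but the response was lost".
        if self.lose_next_response:
            self.lose_next_response = False
            raise TimeoutError("response lost after the charge was recorded")
        return self.charges[idempotency_key]


def charge_with_retry(provider, order_id: str, amount_cents: int, max_attempts: int = 3) -> dict:
    """Retry a charge, reusing the same idempotency key on every attempt."""
    idempotency_key = f"charge:{order_id}"  # stable across retries (assumed key scheme)
    last_error = None
    for _ in range(max_attempts):
        try:
            return provider.charge(idempotency_key, order_id, amount_cents)
        except TimeoutError as error:
            last_error = error  # backoff omitted for brevity
    raise last_error
```

Even when the first response is lost, the retry returns the existing charge, so the customer is charged once.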

Retrying database operations

Database failures can often be retried, but not blindly.

For example, retrying after a temporary connection issue may be fine.

Retrying after a unique constraint violation may not be.

A unique constraint violation may mean:

This operation already happened

or:

The input is invalid

In an idempotent consumer, a unique constraint violation on a processed operation key may be expected.

It can mean the message is a duplicate.

That should probably be handled differently from a real database outage.

This is why error classification matters.

Not all errors are equal.
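As an example of treating a unique constraint violation as a duplicate and not as an outage, here is a sketch using Python's built-in `sqlite3` module. The table name `processed_operations` is an assumption for illustration. Note that `sqlite3.IntegrityError` also covers other constraint failures, so a fuller version would check which constraint was violated.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a table of processed operation keys."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_operations (operation_key TEXT PRIMARY KEY)")
    return conn


def record_once(conn: sqlite3.Connection, operation_key: str) -> bool:
    """Return True if this call recorded the key, False if it was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_operations (operation_key) VALUES (?)",
                (operation_key,),
            )
        return True
    except sqlite3.IntegrityError:
        # The key already exists: the message is a duplicate. Do not retry.
        return False
```

A connection failure would surface as a different exception and would follow the normal retry path.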

Retrying messages vs retrying operations

There is a subtle difference between retrying a message and retrying a business operation.

Suppose a consumer receives:

OrderCreated

It needs to call the Payment Service.

If the call fails, should the whole message be retried?

Or should the business process move into a state like:

PaymentAuthorizationPending

and let a workflow retry the payment step?

For simple consumers, message-level retries may be enough.

For important workflows, it can be better to model retries as part of the business process.

For example:

Payment authorization failed temporarily
Retry scheduled for 10 minutes later
Order remains PendingPayment

This makes the state visible.

It also avoids hiding important business progress inside a message broker retry mechanism.
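Modeling the retry as business state might look roughly like this. The `Order` fields, the `PaymentFailed` status, and the limit of five attempts are assumptions for illustration; a separate scheduler or workflow engine would pick up orders whose retry time has passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Order:
    order_id: str
    status: str = "PendingPayment"
    payment_attempts: int = 0
    next_payment_retry_at: Optional[datetime] = None


def record_temporary_payment_failure(
    order: Order,
    now: datetime,
    retry_in: timedelta = timedelta(minutes=10),
    max_attempts: int = 5,
) -> Order:
    """Record a temporary failure and either schedule a retry or give up visibly."""
    order.payment_attempts += 1
    if order.payment_attempts >= max_attempts:
        order.status = "PaymentFailed"      # hypothetical terminal state
        order.next_payment_retry_at = None
    else:
        order.status = "PendingPayment"
        order.next_payment_retry_at = now + retry_in
    return order
```

Because the attempt count and next retry time live on the order, support staff and dashboards can see them without inspecting the broker.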

Delayed retries

Some systems support delayed messages or retry topics.

Instead of immediately redelivering a failed message, the message is placed somewhere until the retry time.

For example:

main queue
retry queue after 1 minute
retry queue after 5 minutes
retry queue after 30 minutes
dead-letter queue

This can be useful because it prevents failing messages from blocking active processing.

It also makes retry stages visible.

The exact mechanism depends on the broker, but the idea is the same:

Do not hammer a failing dependency.
Delay retries.
Escalate after limits.
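The staged layout above can be expressed as a small routing function. The queue names are made up for this sketch, and how a message is actually delayed depends on the broker.

```python
from typing import Tuple

# (queue name, delay in seconds) for each retry stage; names are illustrative.
RETRY_STAGES = [("retry-1m", 60), ("retry-5m", 300), ("retry-30m", 1800)]
DEAD_LETTER_QUEUE = "dead-letter"


def next_destination(failed_attempts: int) -> Tuple[str, int]:
    """Where a message goes after it has failed `failed_attempts` times in total."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be >= 1")
    if failed_attempts <= len(RETRY_STAGES):
        return RETRY_STAGES[failed_attempts - 1]
    # All retry stages used up: escalate.
    return (DEAD_LETTER_QUEUE, 0)
```

The first failure sends the message to the 1 minute stage, the fourth failure sends it to the dead-letter queue.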

Do not hide business failures in technical retries

Retries are useful for technical failures.

But some failures are business outcomes.

For example:

Payment declined
Inventory not available
User not eligible for bonus
Subscription already cancelled

These should not be retried like temporary errors.

A declined payment is not the same as a payment provider timeout.

An out-of-stock result is not the same as an inventory service outage.

A good system separates:

Technical failure
Business rejection
Unknown result

Each needs a different response.

Technical failures may be retried.

Business rejections usually move the process to a clear failure state.

Unknown results may require status checks, reconciliation, or manual review.

Retries require idempotency

Retries and idempotency are connected.

If an operation can be retried, it must be safe to retry.

For example:

Reserve inventory for order_123
Authorize payment for order_123
Grant bonus bonus_456
Send notification notification_789

Each operation should have a key that identifies it.

If the retry runs again, the system should know whether the operation already happened.

Without idempotency, retries can turn temporary failures into duplicate business effects.

That is one of the most important lessons in message-driven systems.

Retries require observability

Retries should be visible.

If a system retries silently, it may hide a real incident.

Useful metrics include:

Retry count per consumer
Retry count per message type
Oldest retry age
Retry success rate
Dead-letter queue size
Dead-letter queue age
Top failure reasons
External API timeout rate
Messages stuck in retry state

Logs should include:

messageId
eventType
correlationId
business identifiers
retry attempt
error type
next retry time

This makes it possible to answer:

Are retries normal?
Are they increasing?
Which dependency is failing?
Are messages eventually succeeding?
Are business workflows stuck?

A retry strategy without observability is incomplete.
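As a sketch, the log fields listed above can be emitted as one structured line per scheduled retry. The logger name and the message shape (`messageId`, `eventType`, and so on as dictionary keys) are assumptions for this example.

```python
import json
import logging
from datetime import datetime

logger = logging.getLogger("consumer.retries")  # illustrative logger name


def retry_log_fields(message: dict, attempt: int, error: Exception,
                     next_retry_at: datetime) -> dict:
    """Build the structured fields for one scheduled retry."""
    return {
        "messageId": message["messageId"],
        "eventType": message["eventType"],
        "correlationId": message.get("correlationId"),
        "businessIds": message.get("businessIds", {}),
        "retryAttempt": attempt,
        "errorType": type(error).__name__,
        "nextRetryTime": next_retry_at.isoformat(),
    }


def log_retry(message: dict, attempt: int, error: Exception,
              next_retry_at: datetime) -> None:
    """Log the retry as a single JSON line so it can be searched and aggregated."""
    fields = retry_log_fields(message, attempt, error, next_retry_at)
    logger.warning("retry scheduled %s", json.dumps(fields))
```

Logging the error type as a separate field makes it easy to group failures by cause.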

Replaying messages from a DLQ

Eventually, someone will want to replay messages from a dead-letter queue.

This can be useful after a bug fix.

For example:

A consumer failed because it did not support eventVersion 2.
The code is fixed.
Now the failed messages can be replayed.

But replaying messages should be done carefully.

Before replaying, ask:

Is the consumer idempotent?
Will replaying cause duplicate side effects?
Has the bad data been fixed?
Was the original failure transient or permanent?
Should all messages be replayed or only selected ones?
Do we need to preserve order?

A replay button is powerful.

It can also be dangerous.

I prefer replay tools that make the operator choose intentionally, rather than automatically dumping everything back into the main queue without context.
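A selective replay tool can be sketched as a function that takes an explicit filter. Here `dead_letters` is a plain list of records with a `"message"` key and `publish` is any callable that sends a message back to the main queue; both are stand-ins for whatever the real broker provides.

```python
from typing import Callable, List


def replay_selected(
    dead_letters: List[dict],
    should_replay: Callable[[dict], bool],
    publish: Callable[[dict], None],
) -> int:
    """Republish only the records the operator selected; keep the rest in the DLQ."""
    remaining = []
    replayed = 0
    for record in dead_letters:
        if should_replay(record):
            publish(record["message"])
            replayed += 1
        else:
            remaining.append(record)
    # Replace the DLQ contents only after every publish succeeded.
    dead_letters[:] = remaining
    return replayed
```

If a publish fails halfway, nothing is removed from the list and some messages may be published twice on the next run, which is one more reason the consumer needs to be idempotent.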

When to alert

Not every retry needs an alert.

If one message fails once and then succeeds, that is normal.

Alerting on every small retry creates noise.

Better alerts are based on symptoms that matter.

For example:

DLQ has messages older than 5 minutes
Retry rate increased sharply
Oldest unprocessed message is too old
Consumer lag is growing
Payment workflows stuck in Pending state
Outbox publisher cannot publish events
External provider timeout rate exceeds threshold

Business alerts are often more useful than purely technical alerts.

For example:

100 orders are stuck waiting for payment confirmation

is more actionable than:

Consumer error count is high

Both can be useful, but business impact should be visible.
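A symptom-based check such as "DLQ has messages older than 5 minutes" can be sketched as a small function. It assumes the timestamps at which messages entered the DLQ are available; in practice this kind of rule usually lives in the monitoring system and not in application code.

```python
from datetime import datetime, timedelta
from typing import Iterable


def dlq_needs_alert(
    dead_lettered_at: Iterable[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> bool:
    """True when the oldest message has been sitting in the DLQ longer than `max_age`."""
    timestamps = list(dead_lettered_at)
    if not timestamps:
        return False  # empty DLQ, nothing to alert on
    return now - min(timestamps) > max_age
```

Alerting on age instead of count means a single stuck message still gets attention.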

A practical retry policy

A simple retry policy might look like this:

Retry once immediately
Retry after 10 seconds
Retry after 1 minute
Retry after 5 minutes
Move to DLQ after 5 attempts
Alert if DLQ is not empty for more than 10 minutes

For another system, the policy may be different.

For example, a payment status check may retry for longer:

Retry after 1 minute
Retry after 5 minutes
Retry after 15 minutes
Retry after 1 hour
Escalate to manual review

The policy should match the business.

Do not copy retry settings blindly.

A notification email, a payment authorization, and an inventory reservation do not all need the same retry behavior.
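Policies like the two above can be written down as data, so each operation gets its own explicit settings. This is only a sketch: the class and constant names are invented, and the first policy reads "try immediately once" as one immediate retry, which gives five attempts in total.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetryPolicy:
    delays_seconds: Tuple[int, ...]   # delay before each retry, in order
    on_exhausted: str                 # e.g. "dead_letter" or "manual_review"

    @property
    def max_attempts(self) -> int:
        """The first attempt plus one attempt per configured retry."""
        return 1 + len(self.delays_seconds)

    def delay_for(self, retry_number: int) -> Optional[int]:
        """Delay before retry `retry_number` (1-based), or None when retries are exhausted."""
        if 1 <= retry_number <= len(self.delays_seconds):
            return self.delays_seconds[retry_number - 1]
        return None


# Assumed encodings of the two example policies.
SIMPLE_CONSUMER = RetryPolicy(delays_seconds=(0, 10, 60, 300), on_exhausted="dead_letter")
PAYMENT_STATUS_CHECK = RetryPolicy(delays_seconds=(60, 300, 900, 3600), on_exhausted="manual_review")
```

Keeping policies as named values makes it obvious in code review when two very different operations share the same settings.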

The interview version

If I had to explain retries, backoff, and dead-letter queues in an interview, I would say:

In event-driven systems, message processing can fail because of temporary issues like database timeouts, service restarts, network problems, or external API failures. Retries help the system recover from those temporary failures.

But retries need limits and backoff. If every consumer retries immediately, a temporary outage can turn into a retry storm and make the dependency even less healthy. So I would use exponential backoff, often with jitter, and a maximum number of attempts.

If a message still cannot be processed after the retry policy is exhausted, I would move it to a dead-letter queue. A DLQ lets the system isolate poison messages, continue processing other work, and give engineers a place to inspect, fix, and replay failed messages.

I would also make sure retries are safe. That means handlers need to be idempotent, especially for operations like payments, inventory reservations, bonuses, invoices, and external API calls. I would monitor retry rates, DLQ size, oldest DLQ message age, consumer lag, and business workflows stuck in pending states.

Final thought

Retries are not just a technical checkbox.

They are part of the reliability design of the system.

A good retry strategy helps the system recover.

A bad retry strategy can overload dependencies, hide business failures, duplicate side effects, or leave important messages stuck forever.

Backoff gives failing systems room to recover.

Dead-letter queues isolate messages that need attention.

Idempotency makes retries safe.

Observability tells you whether the system is healthy.

Together, these patterns make event-driven systems much more production-ready.

The goal is not to retry everything forever.

The goal is to recover when recovery is possible, stop when retrying is harmful, and make failures visible when the system needs human attention.

This post is part of my Backend Architecture Notes series. In the next post, I will look at ordering problems in event-driven systems, and why messages do not always arrive in the order you expect.