Observability in event-driven systems

Backend Architecture Notes: see the flow when there is no single request path


Event-driven systems are harder to debug than traditional synchronous systems.

In a simple synchronous request, you can often follow one path.

A user sends a request. The API handles it. The service calls a database. Maybe it calls another service. Then it returns a response.

The path may still be complicated, but there is usually one visible request flow.

In an event-driven system, the flow is different.

A service publishes an event. One consumer processes it. Another consumer processes it later. A third consumer fails and retries. A fourth consumer updates a projection. Another service publishes a new event. A saga moves to the next state.

The business process is now spread across services, brokers, queues, topics, databases, retries, and background workers.

That is why observability matters.

Without observability, an event-driven system can feel like a black box.

The basic problem

Imagine a user places an order.

The Order Service publishes:

OrderCreated

Then several things should happen:

Inventory should be reserved
Payment should be authorized
The order should be confirmed
A confirmation email should be sent
Analytics should be updated

Now imagine the user contacts support and says:

I placed an order, but it is still pending.

Where do you look?

Did the Order Service publish the event?

Did the broker receive it?

Did the Inventory Service consume it?

Did inventory reservation fail?

Did the Payment Service receive the next event?

Did the payment time out?

Is a message stuck in a retry queue?

Did the order saga stop halfway?

Did the confirmation event get published?

Did a projection fail to update?

In a synchronous system, you may have one failed request.

In an event-driven system, you may have a business process that is stuck somewhere across multiple components.

Observability is how you find where.

Logs are not enough

Logs are useful, but logs alone are not enough.

A log line might say:

Processed OrderCreated

But that does not answer enough questions.

Which order?

Which event?

Which consumer?

Which attempt?

Which correlation ID?

Was this the first time the event was processed or a retry?

What happened next?

A useful log line in an event-driven system needs context.

For example:

eventType=OrderCreated
eventId=evt_123
orderId=order_456
correlationId=corr_789
consumer=inventory-service
attempt=2
result=InventoryReserved

That kind of log line is much more useful.

It lets you connect technical processing to the business process.

Correlation IDs

A correlation ID connects all the work that belongs to the same business flow.

For example, when a user places an order, the system may create:

correlationId = corr_123

That ID should travel with the events, commands, logs, and traces related to that order flow.

For example:

CreateOrder request
OrderCreated event
ReserveInventory command
InventoryReserved event
AuthorizePayment command
PaymentAuthorized event
OrderConfirmed event
EmailSent event

All of these should include the same correlation ID.

Then, when something goes wrong, you can search for that correlation ID and follow the flow across services.

Without correlation IDs, debugging becomes guesswork.

With correlation IDs, the system becomes traceable.
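
As a minimal sketch of that propagation, the helpers below reuse an incoming correlation ID when one exists and stamp it on every new event. The header name `X-Correlation-Id`, the `corr_` and `evt_` prefixes, and the envelope shape are assumptions for illustration, not a standard.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # assumed header name, not a standard


def ensure_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise start a new flow."""
    existing = headers.get(CORRELATION_HEADER)
    if existing:
        return existing
    return f"corr_{uuid.uuid4().hex}"


def new_event(event_type: str, correlation_id: str, payload: dict) -> dict:
    """Build an event envelope that carries the correlation ID with it."""
    return {
        "eventId": f"evt_{uuid.uuid4().hex}",
        "eventType": event_type,
        "correlationId": correlation_id,
        "payload": payload,
    }
```

The point of the design is that only the entry point ever generates a correlation ID. Everything downstream copies it.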

Causation IDs

A causation ID tells you what caused a specific event or command.

For example:

OrderCreated caused ReserveInventory
ReserveInventory caused InventoryReserved
InventoryReserved caused AuthorizePayment
AuthorizePayment caused PaymentAuthorized

The correlation ID groups the whole business flow.

The causation ID links one step to the previous step.

A simple mental model is:

correlationId = the whole story
causationId = the previous chapter

This can be very helpful when reconstructing event chains.

For example:

{
  "eventId": "evt_inventory_reserved",
  "eventType": "InventoryReserved",
  "correlationId": "corr_order_123",
  "causationId": "cmd_reserve_inventory_456",
  "orderId": "order_123"
}

This tells you that the inventory reservation belongs to the order flow and was caused by a specific command.
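
One way to keep both IDs consistent is to derive every new message from the message that caused it. The sketch below assumes a generic envelope with a `messageId` field shared by events and commands. That field name and the helper names are hypothetical.

```python
import uuid


def derive_message(parent: dict, message_type: str, payload: dict) -> dict:
    """Create a message caused by `parent`.

    The correlation ID is copied unchanged (the whole story); the causation ID
    points at the parent's own ID (the previous chapter).
    """
    return {
        "messageId": f"msg_{uuid.uuid4().hex}",
        "messageType": message_type,
        "correlationId": parent["correlationId"],
        "causationId": parent["messageId"],
        "payload": payload,
    }


def causation_chain(messages: list, last_id: str) -> list:
    """Walk causationId links backwards and return message types oldest-first."""
    by_id = {m["messageId"]: m for m in messages}
    chain = []
    current = by_id.get(last_id)
    while current is not None:
        chain.append(current["messageType"])
        # The first message in a flow has no causationId, which ends the walk.
        current = by_id.get(current.get("causationId"))
    chain.reverse()
    return chain
```

Given the messages of one flow, `causation_chain` rebuilds the sequence of steps that led to a specific event.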

Event IDs

Every event should have a unique event ID.

For example:

eventId = evt_123

The event ID is useful for:

Deduplication
Logging
Tracing
Dead-letter queue inspection
Replay
Auditing
Debugging

If a consumer fails to process an event, you should be able to search for that event ID and see what happened.

For example:

Was event evt_123 published?
Which consumers received it?
Which consumer failed?
How many times was it retried?
Was it moved to a dead-letter queue?
Was it replayed?

The event ID gives you a handle on one specific message.

The correlation ID gives you the larger business context.

You usually need both.
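
If every log record carries the event ID, those questions can be answered with a small query. This is a rough sketch over in-memory records. The `action`, `outcome`, `consumer`, and `attempt` field names are assumptions about how the logs might be shaped.

```python
def summarize_event(records: list, event_id: str) -> dict:
    """Summarize what happened to one event from structured log records."""
    mine = [r for r in records if r.get("eventId") == event_id]
    consumers = sorted({r["consumer"] for r in mine if "consumer" in r})
    failed = sorted({r["consumer"] for r in mine if r.get("outcome") == "failed"})
    return {
        "published": any(r.get("action") == "published" for r in mine),
        "consumers": consumers,
        "failedConsumers": failed,
        "maxAttempt": max((r.get("attempt", 0) for r in mine), default=0),
        "deadLettered": any(r.get("action") == "dead_lettered" for r in mine),
    }
```

In a real system the same lookup would usually be a search in the log platform rather than Python code.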

Structured logs

In distributed systems, structured logs are much more useful than plain text logs.

Bad:

Order created successfully

Better:

{
  "message": "Order created successfully",
  "orderId": "order_123",
  "eventId": "evt_456",
  "correlationId": "corr_789",
  "service": "order-service",
  "status": "PendingPayment"
}

Structured logs make it easier to search, filter, group, and build dashboards.

You can ask questions like:

Show me all logs for order_123
Show me all PaymentAuthorized events that failed processing
Show me all messages with correlationId corr_789
Show me all retries from inventory-service

That is much harder when important data is hidden inside unstructured strings.
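
A minimal way to get structured logs with Python's standard `logging` module is a formatter that emits one JSON object per line. The `context` attribute used to carry the identifiers is a convention assumed for this sketch, not something the logging module defines.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "message": record.getMessage(),
            "level": record.levelname,
            "service": record.name,
        }
        # Fields passed via extra={"context": {...}} end up on the record.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger named after the service that writes JSON lines."""
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("order-service").info("Order created successfully", extra={"context": {"orderId": "order_123", "correlationId": "corr_789"}})` then produces a line that can be filtered by any of those fields.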

Metrics

Logs help explain individual cases.

Metrics show system health.

In event-driven systems, useful metrics include:

Messages published per event type
Messages consumed per event type
Consumer lag
Processing duration
Retry count
Dead-letter queue size
Oldest message age
Duplicate detection count
Outbox unpublished count
Outbox publish failures
Inbox deduplication count

These metrics help answer operational questions.

For example:

Are consumers keeping up?
Are retries increasing?
Is a DLQ growing?
Is the outbox draining?
Are messages getting older?
Did a deployment increase failures?

Without metrics, you may only notice problems when users complain.

With metrics, you can detect problems before they become visible to users.
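
Several of these metrics can be captured at the point where a consumer handles a message. The sketch below keeps counts and durations in memory, which is an assumption made to keep it self-contained. A real service would more likely export them to its metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ConsumerMetrics:
    """Records consumed, failed, and retried counts plus processing durations."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)
        self.durations = defaultdict(list)

    @contextmanager
    def track(self, event_type: str, attempt: int = 1):
        start = time.perf_counter()
        if attempt > 1:
            self.counts[("retried", event_type)] += 1
        try:
            yield
        except Exception:
            self.counts[("failed", event_type)] += 1
            raise  # let the caller's retry or DLQ logic see the error
        else:
            self.counts[("consumed", event_type)] += 1
        finally:
            self.durations[event_type].append(time.perf_counter() - start)
```

Wrapping the handler in `with metrics.track("OrderCreated", attempt=2): ...` records the outcome without changing the handler itself.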

Consumer lag

Consumer lag is one of the most important metrics in event-driven systems.

It tells you how far a consumer is behind the newest message in the stream, usually counted in messages or offsets.

For example, if events are being published faster than a consumer can process them, lag grows.

That may mean:

The consumer is too slow
The consumer is failing and retrying
The event volume increased
A downstream dependency is slow
The consumer is under-provisioned
A deployment introduced a bug

Lag is not always bad.

A short spike may be normal.

But growing lag means the system is falling behind.

For business-critical flows, lag can translate directly into delayed payments, delayed orders, stale projections, or slow notifications.

That is why lag should be visible.
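
For a partitioned log, lag can be sketched as the difference between the newest offset and the consumer's committed offset. The dictionaries of offsets and the "growing for three samples" rule below are assumptions for illustration. How you read offsets depends on your broker.

```python
def partition_lag(latest_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: newest offset minus the consumer's committed offset."""
    return {
        # A partition with no committed offset is treated as fully unread.
        partition: max(latest - committed_offsets.get(partition, 0), 0)
        for partition, latest in latest_offsets.items()
    }


def is_falling_behind(lag_samples: list, min_samples: int = 3) -> bool:
    """True when total lag rose between each of the most recent samples."""
    recent = lag_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Checking the trend rather than a single value is what separates a short spike from a consumer that is really falling behind.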

Oldest message age

Message count is useful, but message age is often more important.

For example:

There are 10,000 messages waiting

That sounds bad.

But if they are only a few seconds old and the system is processing them quickly, it may be fine.

Now compare that with:

There is 1 message waiting, and it is 6 hours old

That may be much worse.

An old message can indicate a poison message, blocked partition, stuck workflow, or broken retry process.

For queues, outboxes, retry topics, and dead-letter queues, I like tracking the age of the oldest message.

It often reveals problems that simple counts hide.
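
The comparison above is easy to express in code. This sketch assumes you can obtain the enqueue timestamp of each waiting message. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def oldest_message_age(enqueued_at: list, now: datetime) -> timedelta:
    """Age of the oldest waiting message, or zero if nothing is waiting."""
    if not enqueued_at:
        return timedelta(0)
    return now - min(enqueued_at)


def looks_stuck(enqueued_at: list, now: datetime, max_age: timedelta) -> bool:
    """Flag a queue by the age of its oldest message, regardless of its depth."""
    return oldest_message_age(enqueued_at, now) > max_age
```

With a five-minute threshold, a queue holding 10,000 messages that are a few seconds old passes, while a queue holding one six-hour-old message is flagged.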

Dead-letter queue observability

A dead-letter queue should never be invisible.

Messages in a DLQ usually mean the system has given up on processing something automatically.

That needs attention.

Useful DLQ metrics include:

Number of messages in DLQ
Oldest DLQ message age
Messages added to DLQ per hour
Top event types in DLQ
Top failure reasons
Replay success rate

A DLQ is not an archive.

It is a signal that something needs investigation.

A healthy system may occasionally have DLQ messages, but there should be ownership, alerts, and a process for handling them.

A DLQ that nobody checks is just hidden failure.

Outbox observability

If you use the Outbox pattern, the outbox needs monitoring.

Useful metrics include:

Unpublished event count
Oldest unpublished event age
Publish success rate
Publish failure rate
Retry count
Events stuck in failed state
Publishing delay

The most important question is:

Is the outbox draining?

If the outbox stops publishing, the service may continue accepting writes while the rest of the system stops receiving events.

That can create serious downstream issues.

For example:

Orders are created
But payments never start
Inventory is not reserved
Emails are not sent
Analytics is stale

The service itself may look healthy, but the business process is broken.

Outbox metrics make that visible.
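
If the outbox lives in a relational table, the two most useful numbers can come from one query. The sketch below uses SQLite and assumes a hypothetical `outbox` table with `created_at` (epoch seconds) and `published_at` (NULL until published). Your schema will differ.

```python
import sqlite3


def outbox_stats(conn: sqlite3.Connection, now_epoch: float) -> dict:
    """Count unpublished outbox rows and measure the age of the oldest one."""
    count, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    # MIN() returns NULL when no rows are waiting.
    oldest_age = 0.0 if oldest is None else now_epoch - oldest
    return {"unpublished": count, "oldestAgeSeconds": oldest_age}


def is_draining(stats: dict, max_age_seconds: float) -> bool:
    """Treat the outbox as draining while its oldest unpublished row stays young."""
    return stats["oldestAgeSeconds"] <= max_age_seconds
```

Alerting on `is_draining` rather than on the raw count avoids false alarms during a normal burst of writes.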

Business metrics

Technical metrics are necessary, but they are not enough.

Event-driven systems should also expose business-level metrics.

For example:

Orders stuck in PendingPayment
Payments authorized but orders not confirmed
Inventory reserved but payment failed
Refunds pending longer than expected
Subscriptions cancelled but still active in projection
Emails not sent after OrderConfirmed
Game rounds completed but balance not updated

These metrics are often more useful than low-level broker metrics.

A queue may be healthy, but the business process may still be stuck.

For example, this technical metric is useful:

payment-consumer retry count is high

But this business metric is more meaningful:

87 orders have been waiting for payment confirmation for more than 15 minutes

That tells you the impact.
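
A business metric like that is usually a query over business state, not broker state. Here is a minimal sketch, assuming each order record exposes a `status` and a `statusSince` timestamp. Both field names are hypothetical.

```python
from datetime import datetime, timedelta


def stuck_orders(orders: list, status: str, threshold: timedelta, now: datetime) -> list:
    """Return IDs of orders that have sat in `status` for longer than `threshold`."""
    return [
        order["orderId"]
        for order in orders
        if order["status"] == status and now - order["statusSince"] > threshold
    ]


def describe_impact(orders: list, status: str, threshold: timedelta, now: datetime) -> str:
    """Turn the count into a sentence an on-call engineer can act on."""
    count = len(stuck_orders(orders, status, threshold, now))
    minutes = int(threshold.total_seconds() // 60)
    return f"{count} orders have been in {status} for more than {minutes} minutes"
```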

Tracing

Distributed tracing can help follow a request across services.

In event-driven systems, tracing is more complicated because work is asynchronous.

A trace may start with an HTTP request, continue through event publishing, then continue later when a consumer processes the event.

This requires propagating trace context through message headers or metadata, so the consumer can attach its work to the trace the producer started.

When done well, tracing can show:

Request received
Order stored
OrderCreated published
Inventory consumer processed event
Payment consumer processed event
Order confirmed
Email sent

This is very useful.

But tracing should not be the only observability tool.

Some event-driven flows are long-running. Some steps happen minutes or hours later. Some processes are retried. Some messages are replayed.

Logs, metrics, traces, and business state all need to work together.
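
To make the propagation step concrete, here is a hand-rolled sketch that carries a W3C `traceparent` value in message headers. It only illustrates the idea. The `publish` and `consume` helpers and the message shape are assumptions, and a real system would normally rely on its tracing library instead of building the header by hand.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, 16-byte trace ID, 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, but give the new unit of work its own span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def publish(event: dict, traceparent: str) -> dict:
    """Attach the trace context to the outgoing message headers."""
    return {"headers": {"traceparent": traceparent}, "body": event}


def consume(message: dict) -> str:
    """Continue the producer's trace, or start a new one if no context arrived."""
    parent = message["headers"].get("traceparent")
    return child_traceparent(parent) if parent else new_traceparent()
```

Because the context travels inside the message, the consumer can pick up the same trace even if it processes the event much later.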

Observability in sagas

Sagas need special attention.

A saga is a long-running business process.

That means you should be able to see its state.

For example:

sagaId
orderId
currentStep
currentState
completedSteps
failedStep
retryCount
nextRetryAt
compensationStatus
startedAt
updatedAt

For an order saga, you want to answer:

Is the order waiting for inventory?
Is it waiting for payment?
Did payment fail?
Is compensation running?
Was inventory released?
Does this need manual review?

If the only way to answer those questions is by reading logs, the system is too hard to operate.

Important business workflows deserve visible state.
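
The fields listed above map naturally onto a small record type. The sketch below is one possible shape, with a hypothetical `describe` method that answers "where is this order?" in one sentence.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class SagaState:
    saga_id: str
    order_id: str
    current_step: str
    current_state: str
    started_at: datetime
    updated_at: datetime
    completed_steps: list = field(default_factory=list)
    failed_step: Optional[str] = None
    retry_count: int = 0
    next_retry_at: Optional[datetime] = None
    compensation_status: Optional[str] = None

    def describe(self) -> str:
        """Summarize the saga for support staff without reading logs."""
        if self.compensation_status:
            return f"compensation {self.compensation_status} after {self.failed_step} failed"
        if self.failed_step:
            return f"{self.failed_step} failed, retry {self.retry_count}"
        return f"waiting at {self.current_step}"
```

Persisting a record like this next to the order makes the saga queryable from an admin page or a support tool.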

State transitions should be logged

Whenever a business process changes state, log it.

For example:

Order status changed from PendingPayment to PaymentAuthorized
Order status changed from PaymentAuthorized to Confirmed
Saga state changed from WaitingForInventory to WaitingForPayment

Include identifiers:

orderId
sagaId
eventId
correlationId
previousState
newState
reason

State transition logs are extremely useful when debugging.

They show how the system moved from one state to another.

They also help detect invalid transitions.

For example:

Order changed from Cancelled to Confirmed

That may indicate an out-of-order event, stale message, or missing state validation.
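
Logging and validating transitions can happen in the same place. In this sketch the table of allowed transitions is an assumption based on the order states used in this post, and the log is a plain list so the example stays self-contained.

```python
ALLOWED_TRANSITIONS = {
    "PendingPayment": {"PaymentAuthorized", "Cancelled"},
    "PaymentAuthorized": {"Confirmed", "Cancelled"},
    "Confirmed": set(),
    "Cancelled": set(),
}


def apply_transition(order: dict, new_state: str, reason: str, event: dict, log: list) -> bool:
    """Record every attempted transition and apply it only when it is allowed."""
    previous = order["status"]
    accepted = new_state in ALLOWED_TRANSITIONS.get(previous, set())
    log.append({
        "orderId": order["orderId"],
        "eventId": event["eventId"],
        "correlationId": event["correlationId"],
        "previousState": previous,
        "newState": new_state,
        "reason": reason,
        "accepted": accepted,
    })
    if accepted:
        order["status"] = new_state
    return accepted
```

Rejected transitions are logged too, so an attempt to move an order from Cancelled to Confirmed leaves evidence instead of silently corrupting state.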

Observability for idempotency

Idempotency should also be observable.

If your consumer detects a duplicate event, that is useful information.

A few duplicates may be normal.

A sudden increase may indicate a problem.

Useful metrics include:

Duplicate messages detected
Idempotency key conflicts
Already-processed operations skipped
Duplicate commands received
Replay-related duplicates

These metrics can reveal:

Producer retry problems
Consumer crashes before acknowledgement
Broker redelivery issues
Manual replay side effects
External API timeout problems

Idempotency protects the system from duplicate effects.

Observability tells you why duplicates are happening.
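
A deduplicating consumer can count the duplicates it skips at almost no extra cost. The sketch below keeps processed event IDs in an in-memory set, which is a simplification. A production consumer would typically use a durable store such as an inbox table.

```python
class IdempotentConsumer:
    """Skips events it has already processed and counts how often that happens."""

    def __init__(self, handler) -> None:
        self.handler = handler
        self.processed = set()
        self.duplicates_detected = 0

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["eventId"]
        if event_id in self.processed:
            self.duplicates_detected += 1
            return False
        self.handler(event)
        # Mark as processed only after the handler succeeds.
        self.processed.add(event_id)
        return True
```

Exporting `duplicates_detected` as a metric is what turns silent deduplication into a signal you can watch.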

Observability for ordering problems

Ordering issues should be visible too.

Useful metrics include:

Out-of-order events
Stale events ignored
Version gaps
Invalid state transitions
Events delayed waiting for earlier version
Projection rebuilds

For example:

Incoming version: 7
Current version: 5
Missing version: 6

That should not disappear silently.

The system should either retry, fetch current state, rebuild the projection, or alert, depending on the importance of the data.

Ordering problems are easier to fix when they are detected explicitly.
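
Explicit detection can be as small as comparing the incoming version with the current one. This sketch assumes each aggregate carries a monotonically increasing integer version. The outcome labels are hypothetical.

```python
def classify_version(current: int, incoming: int) -> dict:
    """Decide whether an incoming event is stale, next in line, or ahead of a gap."""
    if incoming <= current:
        return {"outcome": "stale", "missing": []}
    if incoming == current + 1:
        return {"outcome": "apply", "missing": []}
    return {"outcome": "gap", "missing": list(range(current + 1, incoming))}
```

For the example above, `classify_version(5, 7)` reports a gap with version 6 missing, which the caller can count as a metric before deciding whether to wait, refetch, or alert.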

Dashboards should match the business flow

A useful dashboard should not only show infrastructure health.

It should show business flow health.

For an order system, a dashboard might show:

Orders created per minute
Orders waiting for payment
Orders waiting for inventory
Orders confirmed per minute
Orders cancelled per reason
Average time from OrderCreated to OrderConfirmed
Orders stuck longer than expected
Payment failures
Inventory failures
Refunds pending

For a gaming platform, a dashboard might show:

Game sessions started
Game rounds completed
Balance updates processed
Failed balance updates
Bonus grants pending
Events stuck in retry
Consumer lag per provider
Oldest unprocessed transaction

This kind of dashboard helps teams understand whether the system is doing what the business needs.

A CPU graph alone will not tell you that.

Alerts should be actionable

Bad alerts create noise.

For example:

Consumer error occurred

That is too vague.

Better alerts include impact and context:

Payment consumer has 250 failed messages in DLQ.
Oldest message is 18 minutes old.
Top error: Unsupported eventVersion 3.

or:

42 orders have been stuck in PendingPayment for more than 15 minutes.
Payment provider timeout rate is 35%.

An alert should help someone decide what to do.

Good alerts usually include:

What is broken?
How big is the impact?
When did it start?
Which service or workflow is affected?
What is the likely cause?
Where can I inspect details?

Alerting is not about knowing that something happened.

It is about knowing that action is needed.

Replay needs observability

If you support replaying events, replay should be visible.

You should know:

Who started the replay?
Which event types are being replayed?
Which time range is being replayed?
Which consumers are affected?
How many events were replayed?
How many failed?
Were side effects disabled?
Was the replay idempotent?

Replay is powerful.

It can rebuild projections and recover from bugs.

It can also create duplicate side effects if used carelessly.

Observability makes replay safer.
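
One way to make replay visible is to have the replay routine return a report that answers those questions. In this sketch the handler signature, which receives a `side_effects_enabled` flag, and the report field names are assumptions.

```python
def replay(events: list, handler, started_by: str, side_effects_enabled: bool = False) -> dict:
    """Replay events through `handler` and return an audit report of the run."""
    report = {
        "startedBy": started_by,
        "eventTypes": sorted({e["eventType"] for e in events}),
        "sideEffectsEnabled": side_effects_enabled,
        "replayed": 0,
        "failed": 0,
        "failedEventIds": [],
    }
    for event in events:
        report["replayed"] += 1
        try:
            # The handler is expected to skip emails, payments, and similar
            # external calls when side effects are disabled.
            handler(event, side_effects_enabled)
        except Exception:
            report["failed"] += 1
            report["failedEventIds"].append(event["eventId"])
    return report
```

Defaulting `side_effects_enabled` to `False` means a replay has to opt in before it can send emails or trigger payments again.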

Do not rely only on the broker UI

Many brokers provide dashboards or management consoles.

These are useful.

But they usually show technical infrastructure, not full business meaning.

A broker can show:

Queue depth
Consumer count
Message rate
Partition lag

That is important.

But the broker usually cannot tell you:

Which orders are stuck?
Which payments need reconciliation?
Which saga compensations failed?
Which customers are affected?

You need application-level observability too.

The broker knows messages.

Your application knows business meaning.

Both views matter.

Design observability from the start

Observability should not be added only after the first production incident.

By then, the system may already be hard to understand.

When designing an event-driven flow, I like to ask:

How will we know this event was published?
How will we know each consumer processed it?
How will we know the business process completed?
How will we detect stuck workflows?
How will we inspect failed messages?
How will we replay safely?
How will support understand the current state?

These questions should be part of the architecture.

Observability is not separate from design.

It is how the design becomes operable.

The interview version

If I had to explain observability in event-driven systems in an interview, I would say:

Event-driven systems are harder to debug because there is no single synchronous request path. A business flow can move across services, brokers, consumers, retries, outboxes, sagas, and projections. So observability needs to be designed into the system.

I would make sure every event has an event ID, correlation ID, and useful business identifiers like orderId, paymentId, or accountId. Logs should be structured so we can trace a business flow across services.

I would monitor technical metrics like consumer lag, processing time, retry counts, DLQ size, outbox age, and publish failures. But I would also monitor business metrics like orders stuck in pending states, payments authorized but not confirmed, or workflows waiting too long in a saga step.

For important sagas, I would expose the current workflow state instead of forcing engineers to reconstruct everything from logs. Good observability should tell us where the process is stuck, what failed, whether retries are happening, and what business impact it has.

Final thought

Event-driven architecture gives you loose coupling and asynchronous processing.

But it also spreads the business flow across many moving parts.

Without observability, that flexibility becomes confusion.

You need to see the events.

You need to follow the correlation IDs.

You need to know whether consumers are keeping up.

You need to know when retries are happening.

You need to know when messages are dead-lettered.

You need to know when business workflows are stuck.

Logs, metrics, traces, dashboards, and business state all work together.

The goal is not only to know that a message moved through a broker.

The goal is to understand whether the business process completed correctly.

This post is part of my Backend Architecture Notes series. In the next post, I will look at testing event-driven systems, especially how to test duplicates, retries, out-of-order events, and failure paths instead of only testing the happy path.