In a simple application, a business process can often be handled inside one database transaction.
A request comes in. The application validates the input, updates a few tables, commits the transaction, and returns a response.
If something fails, the transaction rolls back.
That model is easy to understand.
But in a distributed system, especially one built with multiple services, things get more complicated.
Each service often owns its own data. The Order Service owns orders. The Payment Service owns payments. The Inventory Service owns stock. The Shipping Service owns shipments.
That means one business process may need to cross multiple service boundaries.
The Saga pattern is one way to handle that.
Imagine a customer places an order.
The system needs to do several things:
Create the order
Reserve inventory
Authorize payment
Confirm the order
Start shipping
Send confirmation
In a monolith with one database, you might be able to wrap part of this in a single transaction.
But in a microservice architecture, these steps may belong to different services:
Order Service
Inventory Service
Payment Service
Shipping Service
Email Service
Each service has its own database and its own business rules.
So the question becomes:
How do you keep the overall business process correct when there is no single transaction around everything?
That is where sagas are useful.
A saga is a way to coordinate a long-running business process across multiple services using a sequence of local transactions.
Each step updates one service.
If all steps succeed, the business process completes.
If one step fails, the system executes compensating actions to undo or correct earlier steps.
The important part is that a saga is not one big technical transaction.
It is a business process.
Instead of this:
Begin global transaction
Create order
Reserve stock
Authorize payment
Commit global transaction
A saga looks more like this:
Create order
Reserve stock
Authorize payment
Confirm order
And if something goes wrong:
Create order
Reserve stock
Payment fails
Release stock
Cancel order
Each step is committed locally. If a later step fails, the system uses compensating actions.
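Here is a minimal sketch of that idea in Python. The `Step` type and the `run_saga` function are names I made up for illustration, and the lambdas stand in for calls to real services. Each step has an action and a compensating action. When a step fails, the steps that already completed are compensated in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # the local transaction
    compensation: Callable[[], None]  # business action that corrects it


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


steps = [
    Step("create order", lambda: log.append("order created"),
         lambda: log.append("order cancelled")),
    Step("reserve stock", lambda: log.append("stock reserved"),
         lambda: log.append("stock released")),
    Step("authorize payment", fail_payment,
         lambda: log.append("authorization voided")),
]

succeeded = run_saga(steps)
# log is now:
# ["order created", "stock reserved", "stock released", "order cancelled"]
```

The failed step itself is not compensated, because it never completed. A real implementation would also have to deal with compensations that fail, which this sketch ignores.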
This is one of the most important parts of the Saga pattern.
A compensating action is not a database rollback.
It is a business action that corrects or reverses something that already happened.
For example:
If inventory was reserved, release the reservation.
If payment was authorized, void the authorization.
If payment was captured, issue a refund.
If an order was created, cancel the order.
If a shipment was created, cancel the shipment if possible.
These are real business operations.
They may have their own rules. They may fail. They may need retries. They may not perfectly undo the original action.
That is why the Saga pattern forces you to think in business terms.
A database rollback makes it look as if nothing ever happened.
A compensating action says something did happen, and now we need to correct it.
That distinction matters a lot in production systems.
Let's use a simple order flow.
A customer buys a product.
The ideal flow is:
1. Create order
2. Reserve inventory
3. Authorize payment
4. Confirm order
5. Start shipping
The state may move like this:
OrderCreated
InventoryReserved
PaymentAuthorized
OrderConfirmed
ShippingStarted
But not every order succeeds.
Payment might fail.
Inventory might be unavailable.
Shipping might not be possible.
A saga needs to define what happens in those cases.
For example:
Order created
Inventory reserved
Payment failed
The system now needs to compensate:
Release inventory
Cancel order
Notify customer
Another example, for a flow that authorizes payment before it reserves inventory:
Order created
Payment authorized
Inventory reservation failed
The compensation might be:
Void payment authorization
Cancel order
Notify customer
The exact behavior depends on the business.
That is why you cannot design a saga only from a technical diagram. You need to understand what the business wants to happen when each step fails.
A saga is made of local transactions.
A local transaction is a transaction inside one service.
For example, the Order Service may do this:
Begin transaction
Insert order
Set status to PendingInventory
Commit transaction
Publish OrderCreated
The Inventory Service may do this:
Begin transaction
Reserve stock
Store reservation
Commit transaction
Publish InventoryReserved
The Payment Service may do this:
Begin transaction
Authorize payment
Store payment state
Commit transaction
Publish PaymentAuthorized
Each service protects its own data.
The saga coordinates the larger process.
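The Order Service steps above can be sketched with Python's built-in `sqlite3` module. The table layout, the `create_order` function, and the `published` list (a stand-in for a message broker) are assumptions for illustration only.

```python
import sqlite3
from typing import List

published: List[dict] = []  # stand-in for a message broker (assumption)


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    # The connection context manager commits on success
    # and rolls back if the block raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?)",
            (order_id, "PendingInventory"),
        )
    # Publish only after the local commit succeeded.
    published.append({"type": "OrderCreated", "orderId": order_id})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
create_order(conn, "order_123")
status = conn.execute(
    "SELECT status FROM orders WHERE id = ?", ("order_123",)
).fetchone()[0]
```

Note that a crash between the commit and the publish would lose the event in this sketch. A production service needs some way to close that gap, for example by storing the event in the same local transaction and publishing it afterwards.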
You might ask:
Why not just use a distributed transaction across all services?
Distributed transactions do exist, for example two-phase commit (2PC). But they are often avoided in microservice architectures because they create tight coupling between services and infrastructure.
They can make services less independent. They can be harder to operate. They can cause availability problems. They can also force all participants to agree on the same transaction protocol.
That does not mean distributed transactions are always wrong. But in many service-oriented systems, sagas are preferred because they match the reality of independent services better.
Each service owns its own data and commits its own changes.
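The cost is that the system becomes eventually consistent: between steps, other parts of the system can see intermediate states, such as an order that exists but has not been paid for yet.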
A saga accepts that and makes the process explicit.
A saga should have clear state.
For example:
Started
OrderCreated
InventoryReserved
PaymentAuthorized
Confirmed
Cancelling
Cancelled
Failed
RequiresManualReview
This is important because long-running workflows can get stuck.
A service might be down.
A message might be lost or fail to be delivered.
A consumer might crash.
An external payment provider might time out.
If the saga state is not visible, debugging becomes painful.
You want to know:
Which step are we in?
Which steps already succeeded?
Which compensating actions still need to run?
How long has the process been stuck?
Can we retry safely?
Does this need manual intervention?
A saga should not be a hidden chain of side effects that only exists in logs.
It should be observable.
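One way to make saga state explicit is a small state machine that records every transition. The sketch below uses the state names from the list above. The transition table and the `SagaRecord` class are my own assumptions, and a real system would persist the record instead of keeping it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class SagaState(Enum):
    STARTED = "Started"
    ORDER_CREATED = "OrderCreated"
    INVENTORY_RESERVED = "InventoryReserved"
    PAYMENT_AUTHORIZED = "PaymentAuthorized"
    CONFIRMED = "Confirmed"
    CANCELLING = "Cancelling"
    CANCELLED = "Cancelled"
    FAILED = "Failed"
    REQUIRES_MANUAL_REVIEW = "RequiresManualReview"


S = SagaState
# Allowed transitions; this particular table is an assumption for illustration.
ALLOWED = {
    S.STARTED: {S.ORDER_CREATED, S.FAILED},
    S.ORDER_CREATED: {S.INVENTORY_RESERVED, S.CANCELLING},
    S.INVENTORY_RESERVED: {S.PAYMENT_AUTHORIZED, S.CANCELLING},
    S.PAYMENT_AUTHORIZED: {S.CONFIRMED, S.CANCELLING},
    S.CANCELLING: {S.CANCELLED, S.REQUIRES_MANUAL_REVIEW},
}


@dataclass
class SagaRecord:
    saga_id: str
    state: SagaState = S.STARTED
    history: List[Tuple[SagaState, datetime]] = field(default_factory=list)

    def transition(self, new_state: SagaState) -> None:
        # Reject moves the table does not list, e.g. Cancelling -> Confirmed.
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.history.append((new_state, datetime.now(timezone.utc)))


saga = SagaRecord("saga_1")
saga.transition(S.ORDER_CREATED)
saga.transition(S.INVENTORY_RESERVED)
saga.transition(S.CANCELLING)
```

The `history` list answers "which steps already succeeded?" and the timestamps answer "how long has the process been stuck?".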
There are two common ways to implement sagas.
The first is choreography.
The second is orchestration.
They both coordinate a business process, but they do it differently.
In a choreography-based saga, there is no central coordinator.
Each service reacts to events and publishes new events.
For example:
Order Service publishes OrderCreated
Inventory Service listens and reserves stock
Inventory Service publishes InventoryReserved
Payment Service listens and authorizes payment
Payment Service publishes PaymentAuthorized
Order Service listens and confirms the order
The flow emerges from the services reacting to each other.
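A toy version of this flow, using an in-process event bus in place of a real message broker, might look like the sketch below. The `subscribe` and `publish` helpers, the handler names, and the `OrderConfirmed` event are assumptions for illustration. Dispatch is synchronous here, which a real broker would not be.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]
handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
trace: List[str] = []  # every event type published, in order


def subscribe(event_type: str, handler: Handler) -> None:
    handlers[event_type].append(handler)


def publish(event_type: str, payload: dict) -> None:
    trace.append(event_type)
    for handler in handlers[event_type]:
        handler(payload)


# Each "service" only knows which event it reacts to and which one it emits.
def inventory_on_order_created(event: dict) -> None:
    publish("InventoryReserved", event)


def payment_on_inventory_reserved(event: dict) -> None:
    publish("PaymentAuthorized", event)


def order_on_payment_authorized(event: dict) -> None:
    publish("OrderConfirmed", event)


subscribe("OrderCreated", inventory_on_order_created)
subscribe("InventoryReserved", payment_on_inventory_reserved)
subscribe("PaymentAuthorized", order_on_payment_authorized)

publish("OrderCreated", {"orderId": "order_123"})
# trace: OrderCreated, InventoryReserved, PaymentAuthorized, OrderConfirmed
```

Notice that no single piece of code describes the whole order process. The sequence only shows up in `trace` after the fact.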
This can work well for simple processes.
It keeps services decoupled and avoids a central workflow component.
But there is a downside.
As the process grows, the business flow becomes harder to see.
The logic is spread across multiple services.
To understand the full order process, you may need to inspect the Order Service, Inventory Service, Payment Service, Shipping Service, and several event handlers.
This can become difficult to reason about.
Choreography is simple at first, but it can become messy when the workflow becomes complex.
In an orchestration-based saga, one component coordinates the workflow.
This can be a service, a workflow engine, or a dedicated saga orchestrator.
For example:
Order Saga Orchestrator creates order
Order Saga Orchestrator asks Inventory Service to reserve stock
Order Saga Orchestrator waits for result
Order Saga Orchestrator asks Payment Service to authorize payment
Order Saga Orchestrator waits for result
Order Saga Orchestrator confirms or cancels the order
The orchestrator knows the process.
It decides what step comes next.
It also decides what compensating action to run when something fails.
This makes the business process easier to understand and monitor.
The downside is that the orchestrator can become too central if it starts owning logic that belongs inside the individual services.
A good orchestrator coordinates.
It should not become a god service.
For complex business-critical workflows, I usually prefer orchestration because failure handling and visibility are easier.
For very simple flows, choreography can be enough.
Sagas often use both commands and events.
The saga sends commands to ask services to do something.
Services publish events to report what happened.
For example:
Command: ReserveInventory
Event: InventoryReserved
Event: InventoryReservationFailed
Command: AuthorizePayment
Event: PaymentAuthorized
Event: PaymentFailed
Command: CancelOrder
Event: OrderCancelled
This keeps responsibilities clear.
The orchestrator asks for an action.
The service that owns the action decides whether it succeeds.
The service then publishes the result.
That result moves the saga forward.
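The command and event loop can be sketched as a lookup from "event received" to "next command to send". The `OrderSagaOrchestrator` class is a toy, and `ConfirmOrder`, `ReleaseInventory`, `InventoryReleased`, and `OrderConfirmed` are names I added to complete the flow. They are not part of the list above.

```python
from typing import Dict, List, Optional

# Which command the orchestrator sends after each event (None = nothing more).
NEXT_COMMAND: Dict[str, Optional[str]] = {
    "OrderCreated": "ReserveInventory",
    "InventoryReserved": "AuthorizePayment",
    "InventoryReservationFailed": "CancelOrder",
    "PaymentAuthorized": "ConfirmOrder",
    "PaymentFailed": "ReleaseInventory",
    "InventoryReleased": "CancelOrder",
    "OrderConfirmed": None,
    "OrderCancelled": None,
}


class OrderSagaOrchestrator:
    def __init__(self) -> None:
        self.sent: List[str] = []  # commands sent so far

    def on_event(self, event_type: str) -> Optional[str]:
        command = NEXT_COMMAND[event_type]
        if command is not None:
            # A real orchestrator would send the command to the owning service.
            self.sent.append(command)
        return command


orchestrator = OrderSagaOrchestrator()
for event in ["OrderCreated", "InventoryReserved", "PaymentFailed",
              "InventoryReleased", "OrderCancelled"]:
    orchestrator.on_event(event)
# orchestrator.sent:
# ["ReserveInventory", "AuthorizePayment", "ReleaseInventory", "CancelOrder"]
```

The whole process, including the compensation path, is visible in one table. The services still decide whether each command succeeds.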
Failure handling is the real reason sagas exist.
A happy path is easy to draw.
The important design work is deciding what happens when something fails.
For each step, ask:
What can go wrong?
Can this step be retried?
Is the operation idempotent?
What previous steps need to be compensated?
What if compensation fails?
When do we stop retrying?
When do we need manual intervention?
For example, if payment authorization fails, maybe the system can cancel the order automatically.
But if a refund fails after payment was captured, you may need manual intervention.
Not every failure can be solved automatically.
A mature system knows the difference between:
Retry this later
Compensate automatically
Stop and alert support
Escalate to manual review
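Those four outcomes can be written down as an explicit policy. The sketch below is hypothetical: the inputs, the `MAX_ATTEMPTS` budget, and the rules themselves are assumptions, and the real rules have to come from the business.

```python
from enum import Enum


class Decision(Enum):
    RETRY_LATER = "retry this later"
    COMPENSATE = "compensate automatically"
    ALERT_SUPPORT = "stop and alert support"
    MANUAL_REVIEW = "escalate to manual review"


MAX_ATTEMPTS = 5  # assumed retry budget


def decide(transient: bool, attempts: int, compensating: bool,
           money_captured: bool) -> Decision:
    """Hypothetical policy; the real rules come from the business."""
    if transient and attempts < MAX_ATTEMPTS:
        return Decision.RETRY_LATER
    if compensating:
        # A compensation that cannot complete has nothing left to fall back on.
        return Decision.MANUAL_REVIEW if money_captured else Decision.ALERT_SUPPORT
    return Decision.COMPENSATE


# Payment authorization was declined: cancel the order automatically.
declined = decide(transient=False, attempts=1, compensating=False,
                  money_captured=False)
# A refund still fails after the retry budget is used up: a human takes over.
refund_failed = decide(transient=True, attempts=5, compensating=True,
                       money_captured=True)
```

Here `declined` is `Decision.COMPENSATE` and `refund_failed` is `Decision.MANUAL_REVIEW`.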
Sagas rely heavily on idempotency.
Messages may be delivered more than once.
Commands may be retried.
External APIs may time out even though the operation succeeded.
Consumers may crash after committing a local transaction but before acknowledging a message.
So each step in a saga should be safe to retry.
For example:
Reserve inventory for order_123
should not reserve the same stock twice if the command is delivered twice.
Authorize payment for order_123
should not place a second authorization on the customer's card.
Cancel order_123
should not produce inconsistent state if called again.
This usually means using business idempotency keys.
For example:
orderId
paymentId
reservationId
sagaId
The system should recognize that it already handled the same business operation.
Without idempotency, retries become dangerous.
And without retries, distributed systems become fragile.
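As a minimal sketch, an idempotent reservation handler can use `orderId` as its business key. The `InventoryService` class below is a toy with in-memory state, not a real service.

```python
from typing import Dict


class InventoryService:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.reservations: Dict[str, int] = {}  # keyed by orderId

    def reserve(self, order_id: str, quantity: int) -> str:
        # Same business key seen before: change nothing.
        if order_id in self.reservations:
            return "already reserved"
        if quantity > self.stock:
            return "insufficient stock"
        self.stock -= quantity
        self.reservations[order_id] = quantity
        return "reserved"


inventory = InventoryService(stock=10)
first = inventory.reserve("order_123", 2)
second = inventory.reserve("order_123", 2)  # duplicate delivery
# inventory.stock is 8, not 6
```

In a real service, the check and the write would need to happen in the same local transaction, or be backed by a unique constraint, so that two concurrent deliveries cannot both pass the check.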
In a saga, not every failure arrives as a clean event.
Sometimes nothing happens.
For example:
Payment provider does not respond
Inventory Service is unavailable
Shipping Service times out
A message is delayed
A saga needs to handle timeouts.
A timeout does not always mean the operation failed.
It may mean the result is unknown.
That is an important difference.
For example, if a payment request times out, you should not blindly retry in a way that might charge the customer twice. You may need to look up the payment status first, or retry with the same idempotency key so the provider can recognize the duplicate request.
Timeouts often lead to states like:
PaymentStatusUnknown
WaitingForPaymentConfirmation
RequiresManualReview
Again, the business process needs to be explicit.
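Here is a sketch of treating a timeout as "unknown" rather than "failed". `FakePaymentProvider` is a stand-in I wrote for this example, not a real provider API. It simulates the awkward case where the authorization succeeds but the response never arrives.

```python
from typing import Dict, Optional


class FakePaymentProvider:
    """Stand-in for an external provider; not a real API."""

    def __init__(self) -> None:
        self.payments: Dict[str, str] = {}  # idempotency key -> status

    def authorize(self, idempotency_key: str) -> str:
        # The authorization succeeds, but the response never reaches the caller.
        self.payments[idempotency_key] = "authorized"
        raise TimeoutError("no response from provider")

    def get_status(self, idempotency_key: str) -> Optional[str]:
        return self.payments.get(idempotency_key)


def authorize_payment(provider: FakePaymentProvider, key: str) -> str:
    try:
        provider.authorize(key)
        return "PaymentAuthorized"
    except TimeoutError:
        # Unknown outcome: look the payment up instead of retrying blindly.
        status = provider.get_status(key)
        if status == "authorized":
            return "PaymentAuthorized"
        if status is None:
            return "PaymentStatusUnknown"
        return "RequiresManualReview"


result = authorize_payment(FakePaymentProvider(), "order_123-payment")
# result == "PaymentAuthorized"
```

A saga left in `PaymentStatusUnknown` would be checked again later. How long to wait and when to give up are business decisions.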
Sometimes teams build sagas without realizing it.
A service publishes an event.
Another service reacts.
Then another service reacts.
Then another one compensates something.
Nobody calls it a saga, but the system contains a long-running distributed workflow.
The danger is that no one owns the full process.
When something breaks, each service looks fine in isolation, but the business process is stuck.
That is why I like to make sagas explicit when the workflow matters.
If a process crosses multiple services, has multiple steps, and needs compensation, it deserves a clear design.
A saga needs good observability.
Useful things to track include:
Saga ID
Current saga state
Current step
Completed steps
Failed steps
Retry count
Compensation status
Time spent in each state
Stuck workflows
Manual intervention queue
Logs should include correlation IDs and business identifiers.
For example:
sagaId
orderId
paymentId
reservationId
customerId
Dashboards should show business progress, not only technical metrics.
It is useful to know that a queue has lag.
It is more useful to know that 142 orders are stuck waiting for payment confirmation.
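A business-level metric like that can be computed from saga records. The sketch below assumes each saga row carries its current state and the time it entered that state. The row layout, the 30-minute threshold, and the set of final states are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

# (sagaId, state, time the saga entered that state)
SagaRow = Tuple[str, str, datetime]


def stuck_by_state(rows: List[SagaRow], now: datetime,
                   threshold: timedelta) -> Dict[str, int]:
    """Count sagas that have sat in a non-final state longer than threshold."""
    final_states = {"Confirmed", "Cancelled"}
    counts: Counter = Counter()
    for _saga_id, state, entered_at in rows:
        if state not in final_states and now - entered_at > threshold:
            counts[state] += 1
    return dict(counts)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("saga_1", "WaitingForPaymentConfirmation", now - timedelta(minutes=45)),
    ("saga_2", "WaitingForPaymentConfirmation", now - timedelta(minutes=31)),
    ("saga_3", "WaitingForPaymentConfirmation", now - timedelta(minutes=2)),
    ("saga_4", "Confirmed", now - timedelta(hours=3)),
]
stuck = stuck_by_state(rows, now, threshold=timedelta(minutes=30))
# stuck == {"WaitingForPaymentConfirmation": 2}
```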
Testing sagas means testing more than the happy path.
The happy path is important:
Order created
Inventory reserved
Payment authorized
Order confirmed
But the failure paths are where the value is.
Test cases should include:
Inventory reservation fails
Payment authorization fails
Payment succeeds but confirmation event is delayed
Consumer receives the same event twice
Compensating action fails
External API times out
Message arrives out of order
Saga is resumed after a crash
A saga that only works when every service is healthy is not very useful.
The point of the pattern is to survive partial failure.
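As one example, here is how the "saga is resumed after a crash" case could be tested with Python's `unittest`. The `resume_saga` function is a deliberately tiny runner written for this test, and the `completed` list stands in for saga progress that a real system would persist.

```python
import unittest
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def resume_saga(steps: List[Step], completed: List[str]) -> None:
    """Run every step not yet recorded in `completed` (the persisted progress)."""
    for name, action in steps:
        if name in completed:
            continue
        action()
        completed.append(name)


class ResumeAfterCrashTest(unittest.TestCase):
    def test_completed_steps_are_not_repeated(self) -> None:
        calls: List[str] = []
        crash = {"armed": True}

        def authorize_payment() -> None:
            # Fail once to simulate a crash, then succeed on the next run.
            if crash["armed"]:
                crash["armed"] = False
                raise RuntimeError("simulated crash")
            calls.append("authorize payment")

        steps: List[Step] = [
            ("create order", lambda: calls.append("create order")),
            ("reserve inventory", lambda: calls.append("reserve inventory")),
            ("authorize payment", authorize_payment),
        ]
        completed: List[str] = []  # stand-in for persisted saga progress

        with self.assertRaises(RuntimeError):
            resume_saga(steps, completed)
        resume_saga(steps, completed)  # restart after the crash

        self.assertEqual(
            calls, ["create order", "reserve inventory", "authorize payment"]
        )
```

The assertion checks that the first two steps ran exactly once, even though the saga was started twice.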
If I had to explain the Saga pattern in an interview, I would say:
The Saga pattern is used to coordinate a long-running business process across multiple services without using one distributed transaction. Each service performs a local transaction and publishes the result. If a later step fails, the system runs compensating actions, like releasing reserved inventory, voiding a payment authorization, or cancelling an order.
The important part is that compensation is not a technical rollback. It is a business operation that corrects something that already happened.
There are two common styles. In choreography, services react to each other's events. This keeps things decoupled, but complex flows can become hard to understand. In orchestration, a central workflow or orchestrator coordinates the steps. I prefer orchestration for complex business-critical workflows because it makes the process, retries, timeouts, and compensations easier to see.
In production, I would make each step idempotent, track saga state clearly, handle retries and timeouts, and monitor stuck workflows. The hard part is not the happy path. The hard part is defining what happens when each step fails.
The Saga pattern exists because real business processes do not always fit inside one technical transaction.
Orders, payments, inventory, shipping, subscriptions, refunds, and onboarding flows often span multiple services and take time to complete.
A saga makes that process explicit.
It accepts that each service owns its own data. It accepts that the system is eventually consistent. It accepts that failures happen halfway through.
Then it asks the important question:
What should the business do next?
That is why I like the pattern.
It is not just a messaging pattern. It is a way to model business workflows in a distributed system.
This post is part of my Backend Architecture Notes series. In the next post, I will look at Saga choreography vs orchestration in more detail, and how to choose between them.