Saga is a design pattern for ensuring data consistency when implementing update transactions in microservices architectures. We are going to take a concrete scenario of creating a vessel registry record, aka “First Registry”, and see how the saga pattern can be applied to it.
The objective is to understand the moving parts and to think through the implementation, challenges and implications of using sagas. It will also make it easier to appreciate the utility of third-party frameworks such as MassTransit that simplify saga implementation.
Current Architecture and First Registry Business Scenario
Post-MVP Vessel E-Registry architecture rests on the following microservices:
Vessel Registry Service
Vessel Detail Service
Client (aka Contact) Service
Work Management Service
Document Management Service
List of Value Service
Additional services will be added in the future. For example, a service for managing vessel registry history.
The objective of the First Registry scenario is to take all information from a completed service request (work item), create a new vessel registry record and mark the work item as completed. For simplicity, we are not going to discuss issuance of the first registry certificate, or updates to history information, here.
Since vessel registry data is spread across 3 services - Client Service, Vessel Detail Service and Vessel Registry Service - we will need to create 3 records in their respective databases, and update 1 record in the Work Management Service database (the work item itself). We will assume that the services have to be updated in the following order:
Insert a new record into the client database through the Client Service;
Insert a new record into the vessel detail database through the Vessel Detail Service;
Insert a new record into the vessel registry database through the Vessel Registry Service, using the new client Id and vessel detail Id to connect the first two records to the vessel registry record;
Update the status of the work item to “done”.
Schemas of the services' databases evolve over time, so this assumption may or may not hold. However, it should not affect our analysis of the problem: regardless of the order of updates, the technical challenges we face remain the same.
Saga Essentials
When we have a monolith application that inserts records into the Contacts, Vessel Details and Vessel Registration tables and updates the Work Items table in its single database, it is simple to make sure that either all 4 tables are updated, or none at all. All we need is a database transaction. Back in “microservices reality”, where the 4 tables live in 4 separate databases behind 4 autonomous microservices, we still have to deliver the “all or nothing” behavior that is so easily achieved in a monolith.
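To make the contrast concrete, here is a minimal sketch of what the monolith version could look like, assuming a single EF Core DbContext that owns all four tables; the context, entity and property names are illustrative, not part of the actual system:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Minimal stand-ins for a hypothetical monolith schema.
public class Contact { public int Id { get; set; } public string Name { get; set; } = ""; }
public class VesselDetail { public int Id { get; set; } public string HullId { get; set; } = ""; }
public class VesselRegistration { public int Id { get; set; } public int ContactId { get; set; } public int VesselDetailId { get; set; } }
public class WorkItem { public int Id { get; set; } public string Status { get; set; } = ""; }

public class RegistryDbContext : DbContext
{
    public DbSet<Contact> Contacts => Set<Contact>();
    public DbSet<VesselDetail> VesselDetails => Set<VesselDetail>();
    public DbSet<VesselRegistration> VesselRegistrations => Set<VesselRegistration>();
    public DbSet<WorkItem> WorkItems => Set<WorkItem>();
    // connection configuration omitted for brevity
}

public class FirstRegistryMonolithService
{
    private readonly RegistryDbContext _db;
    public FirstRegistryMonolithService(RegistryDbContext db) => _db = db;

    public async Task CompleteFirstRegistryAsync(
        Contact contact, VesselDetail detail, VesselRegistration registration, int workItemId)
    {
        // One local transaction covers all four tables: either every change
        // below is committed, or none of them are.
        await using var transaction = await _db.Database.BeginTransactionAsync();

        _db.Contacts.Add(contact);
        _db.VesselDetails.Add(detail);
        _db.VesselRegistrations.Add(registration);

        var workItem = await _db.WorkItems.FindAsync(workItemId)
                       ?? throw new InvalidOperationException("Work item not found");
        workItem.Status = "Done";

        await _db.SaveChangesAsync();
        await transaction.CommitAsync(); // disposing without a commit rolls back
    }
}
```

The rest of this article is essentially about recreating this “all or nothing” guarantee once the four tables move into four separate databases.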
Sagas offer a compromise between the clear advantages of a microservices architecture and “the price” we pay for it: the infeasibility of true transactional behavior in a distributed system.
The compromise is that our transaction consistency will only be “eventual”. This assumption is what lets us continue the conversation about sagas; if “eventual consistency” is not good enough, then a saga is the wrong design pattern. So how do sagas achieve it?
In simple terms, the services exchange messages about their state. Sagas prescribe that when a record fails to be added in one service, the records already created in other services are “rolled back”. For example, if we have created records in the Contact and Vessel Detail tables, but failed to create a record in the Vessel Registration table for some reason, we must respond to the failure event and roll back the records in the first two tables; otherwise we would leave our data in an inconsistent state.
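As a rough illustration of what such a compensation step could look like, here is a hedged sketch of a handler reacting to a hypothetical RegistryCreationFailed event; the message contracts and the ICommandSender abstraction are made up for the example and are not actual service contracts:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical message contracts; the real ones would be owned by the services.
public record RegistryCreationFailed(Guid ContactId, Guid VesselDetailId, string Reason);
public record DeleteContactCommand(Guid ContactId);
public record DeleteVesselDetailCommand(Guid VesselDetailId);

// Thin abstraction over the message broker (assumed, not a real library type).
public interface ICommandSender
{
    Task SendAsync<TCommand>(TCommand command);
}

// When registry creation fails, the previously created records are "rolled back"
// by issuing explicit compensating commands, in reverse order of creation.
public class RegistryCreationFailedHandler
{
    private readonly ICommandSender _commands;

    public RegistryCreationFailedHandler(ICommandSender commands) => _commands = commands;

    public async Task HandleAsync(RegistryCreationFailed evt)
    {
        await _commands.SendAsync(new DeleteVesselDetailCommand(evt.VesselDetailId));
        await _commands.SendAsync(new DeleteContactCommand(evt.ContactId));
    }
}
```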
Commonly we talk about sagas as having 2 flavors: choreography and orchestration. See the microservices.io site for a detailed description of both.
Let’s list pros and cons of each:
Choreography Pros:
Simple model: each service subscribes to events emitted by other services.
Each service is “smarter” as it is aware of events from surrounding services, even though it does not care about the services themselves.
Choreography Cons:
Saga logic (the sequence of happy path calls and rollbacks) is spread across the participating services, making this logic hard to understand (it is not kept in one place).
As the number of saga participants grows, each existing service is obliged to subscribe to new participants' events, which increases “virtual coupling” between them via message formats. This conflicts with the single responsibility principle. For example, in the First Registry scenario, the Work Management Service would need to subscribe to events from the Vessel Registry Service to participate in the “choreography”, which is a violation of this principle.
It is challenging to pass parameters between the services, specifically in a scenario such as First Registry, where we need one set of parameters to insert the Contact, another set to insert the Vessel Detail, and a third one to insert the registry, while the saga participants are invoked sequentially.
Orchestration Pros:
It makes it simpler to understand the intended course of happy path and rollback steps, and therefore to develop and maintain such sagas.
It is easier to pass parameters, as they always flow from the orchestrator down to the saga participants.
Orchestration Cons:
There is a need for a new “orchestrator microservice”, which adds complexity to the system.
The orchestrator risks growing into a monolith holding all the coordination logic while the saga participants remain “dumb”, as they do little more than wrap data repositories.
An orchestrator can be seen as a single point of failure in a saga transaction.
We are going to pick the orchestration-based saga implementation here, as the first concern (the extra orchestrator service) is not a major one, and the second (the orchestrator growing into a monolith) can be addressed by limiting the saga orchestrator strictly to dispatcher-type logic and delegating all business rules to the saga participants. This keeps it lightweight and manageable even when we have many different sagas to orchestrate. Lastly, to address the third concern (single point of failure), we will look into architecture measures that ensure the saga context is not lost if the orchestrator crashes.
Saga Resiliency
The theoretical possibility of a crash of the orchestrator or any of the saga participants, and the need to recover from one, is one factor that makes saga implementation complicated. Let’s picture the orchestrator-based design supporting the First Registry transaction:
There are a few things we should emphasize about what’s shown on this diagram:
All communication between the microservices and the orchestrator (implemented inside the Vessel Registry Controller microservice) goes through a message broker - Azure Service Bus (ASB), to be more specific, in our case. Events are published to topics; commands are listened for on queues. Why use asynchronous messaging? The primary reason is reliability: ASB topics and queues guarantee that each message will be delivered “at least once” to its consumer, a guarantee HTTP calls do not provide. Our saga progresses asynchronously as commands are sent over queues and executed by the services, which then publish events to topics for the orchestrator to consume and use to decide which commands to issue next, and to which service.
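A minimal sketch of that interaction using the Azure.Messaging.ServiceBus SDK; the queue and topic names, message shapes and connection string are illustrative assumptions rather than the actual configuration:

```csharp
using System;
using System.Text.Json;
using Azure.Messaging.ServiceBus;

var client = new ServiceBusClient("<asb-connection-string>"); // placeholder

// The orchestrator sends a command to a service's command queue...
ServiceBusSender commandSender = client.CreateSender("contact-service-commands");
await commandSender.SendMessageAsync(new ServiceBusMessage(
    JsonSerializer.Serialize(new { ContactName = "ACME Shipping Ltd." }))
{
    Subject = "AddContactCommand",        // lets the consumer route the payload
    MessageId = Guid.NewGuid().ToString() // reused later for deduplication
});

// ...and the service reports the outcome by publishing an event to a topic
// that the orchestrator is subscribed to.
ServiceBusSender eventPublisher = client.CreateSender("vessel-registry-events");
await eventPublisher.SendMessageAsync(new ServiceBusMessage(
    JsonSerializer.Serialize(new { ContactId = Guid.NewGuid() }))
{
    Subject = "ContactAdded"
});
```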
Each service should be idempotent (the same command should produce the same result no matter how many times it is received). Let us take any microservice, for example the Contact Service, and imagine that it has received a message to add a contact record and has successfully completed the insertion. Because message queues may deliver the same message more than once under their “at least once” guarantee, the next time the service receives the same “Add Contact” message it will execute the insertion again and create a duplicate record. This is both a data corruption situation and a violation of the idempotence principle. One common way to counter this is to implement the Idempotent Consumer pattern: each time a message arrives, it is logged to a database table, so that when a duplicate arrives later the service can compare it to the one in the table and discard it, thereby achieving message deduplication and idempotence. The table storing messages will, however, need to be “groomed” regularly, as the number of messages in it grows over time. This makes us consider adding a background Deduplication Cleanup Job.
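Below is a hedged sketch of such an Idempotent Consumer check, assuming an EF Core-backed table of processed message ids keyed by the broker-assigned MessageId; all type and table names are illustrative:

```csharp
using System;
using System.ComponentModel.DataAnnotations;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// A row in the deduplication table, keyed by the broker-assigned MessageId.
public class ProcessedMessage
{
    [Key]
    public string MessageId { get; set; } = default!;
    public DateTime ProcessedAtUtc { get; set; }
}

public class Contact
{
    public Guid Id { get; set; }
    public string Name { get; set; } = default!;
}

public class ContactDbContext : DbContext
{
    public DbSet<Contact> Contacts => Set<Contact>();
    public DbSet<ProcessedMessage> ProcessedMessages => Set<ProcessedMessage>();
    // connection configuration omitted for brevity
}

public class AddContactHandler
{
    private readonly ContactDbContext _db;
    public AddContactHandler(ContactDbContext db) => _db = db;

    public async Task HandleAsync(string messageId, string contactName)
    {
        // Idempotent Consumer: a MessageId we have already recorded means
        // this is a redelivery, so the command is silently discarded.
        if (await _db.ProcessedMessages.AnyAsync(m => m.MessageId == messageId))
            return;

        _db.Contacts.Add(new Contact { Id = Guid.NewGuid(), Name = contactName });
        _db.ProcessedMessages.Add(new ProcessedMessage
        {
            MessageId = messageId,
            ProcessedAtUtc = DateTime.UtcNow
        });

        // The business record and the dedup record are saved in one transaction,
        // so a crash in between cannot leave the check half-done.
        await _db.SaveChangesAsync();
    }
}
```

The Deduplication Cleanup Job would then periodically delete ProcessedMessage rows older than some agreed retention window.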
Let’s consider another scenario: the Contact Service has received a message to add a contact and succeeded in inserting a new record, but crashed just before dispatching a “Contact Added” event. While we hope this happens rarely in practice, it still can, and it is a pretty bad situation: our orchestrator will never “hear back” from the Contact Service about the outcome of this step, even after the Contact Service gets restarted. In general, with the built-in high availability of modern platforms, we can assume that our services recover from crashes within minutes, perhaps even seconds. So the real issue is losing the “Contact Added” message we were about to send back to the orchestrator. Here the Transactional Outbox design pattern comes to the rescue. We create another table in each of our services' databases and name it MESSAGE_OUTBOX, or similar. When we update or insert a business data record, we insert a corresponding message into the MESSAGE_OUTBOX table as part of the same transaction. Then another background service job, let’s name it Message Relay Job, reads this table and sends the messages to the topic. This way, even if we crash, we will still be able to send the right message once we are back up.
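A sketch of the write side of the Transactional Outbox, reusing the illustrative ContactDbContext and Contact types from the previous sketch and assuming the context also exposes an OutboxMessages set; the Message Relay Job would then periodically query rows with SentAtUtc still null and publish them:

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;

// A row in the MESSAGE_OUTBOX table; the Message Relay Job publishes these.
public class OutboxMessage
{
    public Guid Id { get; set; }
    public string Type { get; set; } = default!;    // e.g. "ContactAdded"
    public string Payload { get; set; } = default!; // serialized event body
    public DateTime CreatedAtUtc { get; set; }
    public DateTime? SentAtUtc { get; set; }        // null until relayed to ASB
}

public class AddContactOutboxHandler
{
    private readonly ContactDbContext _db; // assumed to also expose OutboxMessages

    public AddContactOutboxHandler(ContactDbContext db) => _db = db;

    public async Task HandleAsync(string contactName)
    {
        var contact = new Contact { Id = Guid.NewGuid(), Name = contactName };
        _db.Contacts.Add(contact);

        // The outgoing "Contact Added" event is written to the same database,
        // in the same transaction, as the business record. If the service
        // crashes before publishing, the Message Relay Job sends it later.
        _db.OutboxMessages.Add(new OutboxMessage
        {
            Id = Guid.NewGuid(),
            Type = "ContactAdded",
            Payload = JsonSerializer.Serialize(new { contact.Id, contact.Name }),
            CreatedAtUtc = DateTime.UtcNow
        });

        await _db.SaveChangesAsync(); // one transaction, two tables
    }
}
```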
These three considerations apply to all saga participants regardless of the type of saga; in our case they apply to the orchestrator just as well as to the other microservices. Now we can state that we have made our saga resilient, but we had to introduce extra complexity, including 2 new background jobs. Not only that, but the time it takes to run through our saga has also grown significantly compared to the good old monolith transaction.
User Experience Implications
Let us now assess the impact on the user experience. If a message relay job in each service checks for messages every half a second, then the total time it takes for all 4 services to run their tasks may exceed 1/2 x 4 x 2 = 4 seconds, plus the ASB time. Clearly, this puts our First Registry scenario in the class of long-running API calls as far as our User Interface is concerned. To make sure that the user experience is not degraded, we have to resort to the asynchronous request-reply pattern.
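One hedged way to implement the API side of asynchronous request-reply is to return 202 Accepted with a status URL the UI can poll; the routes, types and in-memory status store below are illustrative only, and in reality the saga status would come from the orchestrator's persisted state:

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/first-registry")]
public class FirstRegistryController : ControllerBase
{
    // In a real service the saga status would live in the orchestrator's database;
    // an in-memory dictionary keeps the sketch self-contained.
    private static readonly ConcurrentDictionary<Guid, string> SagaStatus = new();

    [HttpPost]
    public IActionResult Start([FromBody] FirstRegistryRequest request)
    {
        var sagaId = Guid.NewGuid();
        SagaStatus[sagaId] = "Pending";

        // ...publish the command that kicks off the saga here...

        // 202 Accepted + a Location header the UI polls until the saga completes.
        return AcceptedAtAction(nameof(GetStatus), new { sagaId }, new { sagaId });
    }

    [HttpGet("{sagaId:guid}/status")]
    public IActionResult GetStatus(Guid sagaId) =>
        SagaStatus.TryGetValue(sagaId, out var status)
            ? Ok(new { sagaId, status })
            : NotFound();
}

public record FirstRegistryRequest(Guid WorkItemId);
```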
Here it is shown as it applies to our saga architecture:
A Closer Look at a Microservice
We saw how each microservice needs to have additional plumbing to participate in sagas. Schematically we can present each microservice as shown below:
Business logic is at its core, and adapters to the external world are at its edges: the database, ASB queues and topics, the Web API. Let us take the Vessel Detail Service as an example and magnify the picture above. We get a more involved structure, where only the key classes are shown:
On the diagram:
The database only shows new tables used to support transactional outbox and message deduplication.
“Traditional” business logic is colored in green. This includes our standard web controllers, repositories and models.
Types supporting commands are colored in purple (see the sketch after this list). Here we have:
AddVesselDetailCommand, representing the actual command message and its parameters;
CommandListenerJob, an implementation of the IHostedService interface - a background job listening for commands over a message queue;
MessageParser, responsible for converting messages into commands;
MessageDeduplicator, responsible for de-duplication of incoming commands; it uses CommandRepository, which is also used by CommandRepositoryCleanupJob, the job that maintains the COMMAND_REGISTRY table.
Types supporting messaging are colored in cyan. They include:
VesselDetailAdded, the event model, which is created by EventFactory and saved via OutboxRepository;
The already existing VesselDetailDBContext, which is colored cyan because it will need to update OutboxRepository and VesselRepository within a transaction;
Lastly, EventRelayJob, which works with OutboxRepository to fetch the events and send them to an ASB topic.
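As promised above, here is a rough sketch of how CommandListenerJob, MessageParser and MessageDeduplicator could fit together; the queue name, stub implementations and wiring are assumptions, not the actual Vessel Detail Service code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Microsoft.Extensions.Hosting;

// Simplified stand-ins for the diagram's classes; the real implementations
// would use CommandRepository and the COMMAND_REGISTRY table.
public class MessageDeduplicator
{
    public Task<bool> IsDuplicateAsync(string messageId) => Task.FromResult(false);
}

public class MessageParser
{
    public object Parse(ServiceBusReceivedMessage message) => message.Body.ToString();
}

// Background job (IHostedService via BackgroundService) listening for commands
// on the service's command queue.
public class CommandListenerJob : BackgroundService
{
    private readonly ServiceBusProcessor _processor;
    private readonly MessageParser _parser;
    private readonly MessageDeduplicator _deduplicator;

    public CommandListenerJob(ServiceBusClient client, MessageParser parser, MessageDeduplicator deduplicator)
    {
        _processor = client.CreateProcessor("vessel-detail-commands"); // illustrative queue name
        _parser = parser;
        _deduplicator = deduplicator;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        _processor.ProcessMessageAsync += async args =>
        {
            // Discard redelivered messages already recorded in COMMAND_REGISTRY.
            if (await _deduplicator.IsDuplicateAsync(args.Message.MessageId))
            {
                await args.CompleteMessageAsync(args.Message);
                return;
            }

            var command = _parser.Parse(args.Message); // e.g. AddVesselDetailCommand
            // ...dispatch the command to the business logic here...

            await args.CompleteMessageAsync(args.Message);
        };
        _processor.ProcessErrorAsync += _ => Task.CompletedTask; // log in a real service

        await _processor.StartProcessingAsync(stoppingToken);
    }
}
```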
As you can see, these are only the core types supporting the functionality. The actual implementation will have more classes, plus configuration. The good news is that this plumbing is the same for all microservices.
Vessel Registry Controller Microservice: Saga Orchestrator
It is similar to other microservices in terms of its structure, but has additional features to handle saga orchestrations. Before we look at class diagram for the service, let’s come up with a formal definition of the First Registry saga. A good way to do this is using a Finite State Machine diagram:
As shown on the diagram, we have several “happy path” states on the top row, and a few compensatory states below. There are a few points that we can make about this diagram:
This diagram is a way to state the requirements for the saga explicitly. It makes certain assumptions about the desired behavior of the system. For example, we assume that “Adding Registry” is the pivot state in the saga, meaning that once it succeeds, we are not rolling anything back. In other words, this diagram reflects an assumption that if “Adding Registry” succeeds and the subsequent “Updating Work Item” fails, we leave the work item in an inconsistent state, which will need to be corrected manually, but we leave all three vessel registry records (contact, vessel detail and vessel registration) consistent with each other. An alternative design of this diagram could require rolling back the entire saga when the “Updating Work Item” stage fails.
Despite all the measures for enhancing saga resiliency that we talked about earlier, sagas can still fail and lead to data inconsistencies. In the “eventual consistency” concept we started our analysis with, the term “eventual” is ambiguous: is it seconds, minutes or maybe days? If we establish thresholds for how quickly sagas must complete, then we have to account for their potential failures and the data corruption they may cause. The likelihood of such failures is low - it can be thought of as the product of the low probability of a step failing and the low probability of its rollback also failing - so it is not worth adding further rollback or retry layers and the complexity they bring. A more rational approach is to invest in being able to pinpoint quickly what specifically is failing, and there is hardly a better way to achieve this than well-placed diagnostic logging.
As the diagram above is an example of a classical finite state machine, the sagas and their states can be modeled using the State design pattern. The reason all states have a “pending” connotation is that this approach helps keep sagas isolated; put another way, it makes concurrent requests easier to handle.
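To make that concrete, here is a hedged sketch of the State pattern applied to the First Registry saga; the state and event names follow the diagram's naming pattern, while the classes and wiring are illustrative:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical event contracts published by the saga participants.
public record ContactAdded(Guid ContactId);
public record ContactAddFailed(string Reason);

public interface ISagaState
{
    Task HandleAsync(FirstRegistrySaga saga, object evt);
}

// A saga instance holds its current state; the orchestrator merely forwards
// incoming events, and the state object decides the transition and next command.
public class FirstRegistrySaga
{
    public Guid SagaId { get; } = Guid.NewGuid();
    public ISagaState State { get; set; } = new AddingContactState();

    public Task HandleAsync(object evt) => State.HandleAsync(this, evt);
}

// "Pending" happy-path state: waiting for the Contact Service to respond.
public class AddingContactState : ISagaState
{
    public Task HandleAsync(FirstRegistrySaga saga, object evt)
    {
        switch (evt)
        {
            case ContactAdded:
                saga.State = new AddingVesselDetailState();
                // ...send AddVesselDetailCommand to the Vessel Detail Service here...
                break;
            case ContactAddFailed:
                saga.State = new FailedState();
                break;
        }
        return Task.CompletedTask;
    }
}

public class AddingVesselDetailState : ISagaState
{
    public Task HandleAsync(FirstRegistrySaga saga, object evt) => Task.CompletedTask; // and so on
}

public class FailedState : ISagaState
{
    public Task HandleAsync(FirstRegistrySaga saga, object evt) => Task.CompletedTask;
}
```

Because each state only reacts to the events it expects, concurrent saga instances can be processed independently, which is exactly the isolation property mentioned above.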
Let us now inspect the structure of the orchestrator microservice:
To be continued…