Introduction
Saga is a design pattern for ensuring data consistency when implementing update transactions in microservices architectures. We are going to take a concrete scenario of creating a vessel registry record, also known as “First Registry”, and see how the saga pattern can be applied to it.
...
Let us now assess the impact of saga-based transactions on the user experience. If the message relay job in each service checks for messages every half second, then the total time that it takes for all 4 services to run their tasks may exceed 1/2 x 4 x 2 = 4 seconds, plus the ASB time. Clearly, this puts our First Registry scenario in the class of long-running API calls, as far as our User Interface is concerned. To make sure that the user experience is not degraded, we have to resort to the asynchronous request-reply pattern.
Here it is shown as it applies to our saga architecture:
...
The implications of such API behavior for the UI are significant: instead of letting users wait several seconds for a transaction to complete, we need to rethink the UI experience so that responses are returned quickly and users are visually notified once the tasks they submit are completed. User experience inspiration may be taken from the Azure Portal, where this style of UI behavior is standard.
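The request-reply flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not the service's actual code: the class `FirstRegistryApi`, the `/status/...` path, the status strings, and the plain list standing in for the ASB queue are all assumptions made for the example. Python is used only as a neutral sketch language.

```python
import uuid


class FirstRegistryApi:
    """Hypothetical stand-in for the web-facing edge of the orchestrator service."""

    def __init__(self):
        self.command_queue = []  # stands in for the message queue (assumption)
        self.statuses = {}       # correlation id -> last known saga status

    def create_first_registry(self, payload):
        # Accept the request, enqueue a command, and return immediately
        # with "202 Accepted" instead of waiting for the saga to finish.
        correlation_id = str(uuid.uuid4())
        self.statuses[correlation_id] = "Pending"
        self.command_queue.append(
            {"correlation_id": correlation_id, "payload": payload}
        )
        body = {
            "correlation_id": correlation_id,
            "status_url": f"/status/{correlation_id}",  # hypothetical route
        }
        return 202, body

    def get_status(self, correlation_id):
        # Status endpoint that the UI polls until the saga completes.
        status = self.statuses.get(correlation_id)
        if status is None:
            return 404, {}
        return 200, {"status": status}

    def complete(self, correlation_id, succeeded):
        # Would be called once the saga reaches a terminal state.
        self.statuses[correlation_id] = "Completed" if succeeded else "Failed"
```

The UI would call `create_first_registry`, keep the returned correlation Id, and poll `get_status` (or be pushed a notification) until the status leaves "Pending".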
A Closer Look at a Microservice
We saw how each microservice needs to have additional plumbing to participate in sagas. Schematically we can present each microservice as shown below:
...
Business logic is at its core, and adapters to the external world are at its edges: database, ASB queues and topics, Web API. Let us take Vessel Detail Service as an example, and magnify the above picture. We will get a more involved structure, where only key classes are shown:
...
Vessel Registry Controller Microservice: Saga Orchestrator
This microservice is similar to other microservices in terms of its structure, but has additional features to handle saga orchestrations. Before we look at the class diagram for the service, let’s come up with a formal definition of the First Registry saga. A good way to do this is using a Finite State Machine diagram:
...
As shown on the diagram, we have several “happy path” states on the top row, and a few compensatory states below. There are a few points that we can make about this diagram:
This diagram is a way to present the requirements for the saga explicitly. It makes certain assumptions about the desired behavior of the system. For example, we assume that the “Adding Registry” state is a pivot state in the saga, meaning that if it succeeds, then we are not rolling anything back. In other words, this diagram reflects an assumption that if “Adding Registry” results in success, and the subsequent “Updating Work Item” fails, we leave the work item in an inconsistent state, which will need to be manually corrected, but we are leaving all three vessel registry records (contact, vessel detail and vessel registration) in a consistent state with each other. An alternative design of this diagram could require rolling back the entire saga from the point when the “Updating Work Item” stage fails.
Despite all the measures for enhancing saga resiliency that we talked about earlier, sagas can still fail and lead to data inconsistencies. In the “eventual consistency” concept that we started our analysis from, the term “eventual” is ambiguous. Is it seconds, minutes or maybe days? If we establish thresholds for how quickly sagas must complete, then we have to account for their potential failures and the data corruption they may cause. The likelihood of such failures is low, as it can be thought of as the product of a low probability of failure and a low probability of rollback failure. For that reason it is not worthwhile to add further rollback or retry layers, which would only add complexity. A more rational approach is to invest in being able to pinpoint quickly what specifically is failing, and then fix it; there is hardly a better strategy for this than well-placed diagnostic logging.
As the diagram above is an example of a classical finite state machine, sagas and their states can be modeled using the State design pattern. All states carry a “pending” connotation because this approach helps make sagas isolatable. Put another way, it makes concurrent requests easier to handle.
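To make the pivot-state assumption concrete, the state machine can be written down as a transition table. The sketch below is illustrative only: the state names "Adding Registry", "Updating Work Item" and "Undoing Add Vessel Details" come from the text, while the remaining states, the two outcome labels, and the terminal state "Needs Manual Correction" are assumptions, since the diagram itself is not reproduced here.

```python
# Hypothetical transition table for the sequential First Registry saga.
# Keys are (current state, outcome of the pending step); values are next states.
TRANSITIONS = {
    ("Adding Client", "success"): "Adding Vessel Detail",
    ("Adding Client", "failure"): "Failed",
    ("Adding Vessel Detail", "success"): "Adding Registry",
    ("Adding Vessel Detail", "failure"): "Undoing Add Client",
    # "Adding Registry" is the pivot: once it succeeds, nothing is rolled back.
    ("Adding Registry", "success"): "Updating Work Item",
    ("Adding Registry", "failure"): "Undoing Add Vessel Details",
    ("Updating Work Item", "success"): "Completed",
    # Past the pivot: the work item is left for manual correction (assumed name).
    ("Updating Work Item", "failure"): "Needs Manual Correction",
    # Compensation path; failures of compensations are deliberately not modeled.
    ("Undoing Add Vessel Details", "success"): "Undoing Add Client",
    ("Undoing Add Client", "success"): "Failed",
}


def next_state(state, outcome):
    """Look up the next state, rejecting transitions the table does not define."""
    key = (state, outcome)
    if key not in TRANSITIONS:
        raise ValueError(f"no transition from {state!r} on {outcome!r}")
    return TRANSITIONS[key]


def walk(outcomes, start="Adding Client"):
    """Return the list of states visited for a sequence of step outcomes."""
    path = [start]
    for outcome in outcomes:
        path.append(next_state(path[-1], outcome))
    return path
```

For example, `walk(["success", "success", "failure", "success", "success"])` follows the happy path up to "Adding Registry" and then the compensation path down to "Failed". A failed compensation raises an error here, which matches the point above that such cases are meant to be caught by diagnostic logging and not by further rollback layers.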
Here is a variation of the previous diagram, which assumes that an orchestrator can issue some commands in parallel, such as “Add Client” and “Add Vessel Detail”.
...
It contains fewer states, looks simpler, and takes less time to complete, but this simplification needs to be balanced against adding complexity to the Orchestrator, which would now need to receive events back from two services - Contact Service and Vessel Detail Service in order to make a decision on which next transition to make.
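The extra decision logic that the parallel variant pushes into the Orchestrator can be sketched as a "join" state that waits for both replies. Everything named here is an assumption for illustration: the class `ParallelAddState`, the source keys, the state "Undoing Parallel Adds" and the command "Rollback Client" are hypothetical; only "Rollback Vessel Details" and "Adding Registry" appear in the text.

```python
class ParallelAddState:
    """Hypothetical join state: waits for replies from both parallel commands."""

    EXPECTED = frozenset({"contact", "vessel_detail"})
    ROLLBACK = {
        "contact": "Rollback Client",                  # assumed command name
        "vessel_detail": "Rollback Vessel Details",
    }

    def __init__(self):
        self.results = {}  # source service -> True (success) / False (failure)

    def handle(self, source, succeeded):
        """Record one reply; return None while waiting, else (next_state, commands)."""
        if source not in self.EXPECTED:
            raise ValueError(f"unexpected event source {source!r}")
        self.results[source] = succeeded
        if set(self.results) != self.EXPECTED:
            return None  # the other service has not answered yet
        if all(self.results.values()):
            return "Adding Registry", []
        # At least one step failed: compensate only the steps that succeeded.
        commands = [
            self.ROLLBACK[name]
            for name in sorted(self.results)
            if self.results[name]
        ]
        return "Undoing Parallel Adds", commands
```

Note that the state must itself be persisted between the two events, which is part of the added complexity mentioned above.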
One note about resiliency before we look at the class diagram: in addition to the measures common to all data services, we also need a way to persist the states of our saga. Since our orchestrator is a single point of failure, we cannot let it crash and leave a running saga in an undetermined state. Assuming that high availability measures will be taken for the orchestrator at the platform level, we still need to ensure its data integrity, so that it can pick up where it left off at any time.
Let us now inspect the structure of the orchestrator microservice:
...
The classes on the diagram could be broken down as follows:
Classes colored green are responsible for handling web requests:
WebController is the ingress point for the saga. It implements the CreateFirstRegistry() method and exposes a status endpoint, GetStatus(correlationId).
The role of OrchestratorProxy is to validate a command and then send it to the orchestrator's message queue. As shown on the sequence diagram above, this allows the service to respond to the UI caller quickly. From this point on, the "saga movement" occurs only asynchronously.
Classes colored blue are responsible for consuming events:
The orchestrator must process events received from each of the microservices participating in sagas. This is one differentiator from a standard microservice, where consumption of events is strictly optional. This is the focus of EventListenerJob, which is shown on the diagram. Other classes supporting message deduplication and the transactional outbox pattern are omitted. Once an event is received and parsed, the job invokes Orchestrator to load the corresponding saga and respond to the event. This is the mechanism by which a saga transitions through its states.
Classes colored purple are responsible for dispatching and managing commands:
Here we have 2 background jobs: CommandListenerJob, which listens for incoming commands like the one enqueued by OrchestratorProxy, and CommandRelayJob, which dispatches commands to saga participants. Classes supporting repository access, command parsing and creation are not shown.
Classes colored orange represent core orchestrator components. One thing we should emphasize is that the orchestrator is designed to handle multiple sagas for various business scenarios. Each “saga type” can be thought of as a set of classes representing states in a finite state machine diagram, like the one we have discussed earlier.
Orchestrator acts as a controller. It drives SagaFactory, which creates new sagas given their names, allowing us to distinguish between each of the "saga types". We would use SagaFactory when initiating a new saga, for instance.
The Saga type, along with SagaState and its derivatives, represents the State design pattern. A created Saga is an instance of a "saga type", and it also stores the input data passed by the initiating asynchronous command, such as "Create first registry", in its InputParameters field. Each concrete instance of SagaState "knows" how to handle incoming events of success or failure, which commands to dispatch, and which concrete instance of SagaState to set next on its parent Saga instance. In essence, the finite state machine diagram shown earlier is implemented by the concrete instances of the SagaState type.
SagaRepository is responsible for persisting and retrieving saga instances, along with their current state and original input data, from durable storage (database).
Database:
Since deduplication is needed for both incoming commands and events, we combine their registry into one table named INBOX_REGISTRY.
In order to reliably dispatch commands to consumers we use the OUTBOX_REGISTRY table.
SAGA_REGISTRY table is where saga instances are persisted.
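The interplay of the core orchestrator classes described above (Saga, SagaState, SagaFactory) can be sketched with the State design pattern as follows. This is a simplified illustration under stated assumptions, not the actual class design: the event "Vessel Registry Added" and the command "Update Work Item" are hypothetical names, the saga starts directly in "Adding Registry" to keep the example short, and an in-memory list stands in for the command repository.

```python
class SagaState:
    """Base state: by default rejects any event it does not know how to handle."""

    name = "?"

    def handle(self, saga, event):
        raise ValueError(f"{self.name!r} cannot handle {event!r}")


class UpdatingWorkItem(SagaState):
    name = "Updating Work Item"  # transitions omitted in this sketch


class UndoingAddVesselDetails(SagaState):
    name = "Undoing Add Vessel Details"  # transitions omitted in this sketch


class AddingRegistry(SagaState):
    name = "Adding Registry"

    def handle(self, saga, event):
        if event == "Vessel Registry Rejected":
            saga.dispatch("Rollback Vessel Details")
            saga.state = UndoingAddVesselDetails()
        elif event == "Vessel Registry Added":      # hypothetical event name
            saga.dispatch("Update Work Item")       # hypothetical command name
            saga.state = UpdatingWorkItem()
        else:
            super().handle(saga, event)


class Saga:
    def __init__(self, correlation_id, input_parameters, initial_state):
        self.correlation_id = correlation_id
        self.input_parameters = input_parameters
        self.state = initial_state
        self.outbox = []  # stands in for the command repository (assumption)

    def dispatch(self, command_name):
        # "Dispatching" only records the command; a relay job would send it.
        self.outbox.append((self.correlation_id, command_name))

    def handle(self, event):
        # The saga delegates event handling to its current state object.
        self.state.handle(self, event)


class SagaFactory:
    """Creates sagas by "saga type" name."""

    def __init__(self):
        self._initial_states = {"Create First Registry": AddingRegistry}

    def create(self, saga_name, correlation_id, input_parameters):
        initial_state = self._initial_states[saga_name]()
        return Saga(correlation_id, input_parameters, initial_state)
```

Adding a new saga type then amounts to adding a set of state classes and registering its initial state with the factory, without touching the Orchestrator itself.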
As we mentioned above, we move through sagas as the Orchestrator consumes and handles events that it receives from participating microservices. Let us picture one instance of such a transition:
...
On the diagram:
The Orchestrator receives an event, which was already deduplicated and parsed, and asks SagaRepository to find an existing saga instance by correlation Id. SagaRepository finds and constructs a specific type of Saga, let’s say it is “Create First Registry”, and loads it with its current state and original input parameters that it has received when it was initiated.
The Orchestrator asks the Saga instance to handle the event, which delegates this to ConcreteSagaState1. This state is hardwired to execute a ConcreteCommand instance and change the parent saga's state to ConcreteSagaState2. An example of such an interaction could look as follows: the concrete state the saga is in could be "Adding Registry", and the event passed to it could be "Vessel Registry Rejected". The "Adding Registry" state would dispatch a command named "Rollback Vessel Details" to the Vessel Detail microservice and change its parent saga's state to "Undoing Add Vessel Details".
“Dispatching a command” means saving it to the repository. What happens next is that both SagaRepository and CommandRepository are updated within the same transaction controlled by DbContext. This step guarantees that the saga state and the outgoing command are consistent with each other. The diagram ends here, but what happens next is that the CommandRelayJob reads the command from the OUTBOX_REGISTRY table and writes it to the message queue of the target microservice.
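The single-transaction step and the subsequent relay can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for the real database and DbContext; the table names come from the text, while the column names, the function names and the list standing in for the message queue are assumptions.

```python
import sqlite3


def open_db():
    """Create an in-memory database with hypothetical column layouts."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE INBOX_REGISTRY (message_id TEXT PRIMARY KEY);
        CREATE TABLE SAGA_REGISTRY (
            correlation_id TEXT PRIMARY KEY,
            state TEXT NOT NULL
        );
        CREATE TABLE OUTBOX_REGISTRY (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            correlation_id TEXT NOT NULL,
            command TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return db


def handle_event(db, message_id, correlation_id, new_state, command):
    """Apply one saga transition atomically; return False for a duplicate event."""
    try:
        with db:  # commits on success, rolls back everything on an exception
            # Deduplication: a repeated message id violates the primary key.
            db.execute(
                "INSERT INTO INBOX_REGISTRY (message_id) VALUES (?)",
                (message_id,),
            )
            db.execute(
                "UPDATE SAGA_REGISTRY SET state = ? WHERE correlation_id = ?",
                (new_state, correlation_id),
            )
            db.execute(
                "INSERT INTO OUTBOX_REGISTRY (correlation_id, command) VALUES (?, ?)",
                (correlation_id, command),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def relay_commands(db, queue):
    """Relay job: push unsent outbox rows to the queue, then mark them sent."""
    rows = db.execute(
        "SELECT id, correlation_id, command FROM OUTBOX_REGISTRY "
        "WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, correlation_id, command in rows:
        queue.append((correlation_id, command))
        with db:
            db.execute("UPDATE OUTBOX_REGISTRY SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the relay marks a row as sent only after the message has been handed to the queue, a crash between the two steps results in the command being sent again. This at-least-once behavior is why the receiving services need their own deduplication.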
To recap where we are: the Orchestrator is a finite state machine, which knows how to listen to events, read and persist sagas in their interim state, and dispatch commands to participating microservices. It should not grow into a monolith as long as all it does is play the role of event and command dispatcher. What will change over time is the number of sagas and the composition of their states.
A Note on Isolation
Sagas are notoriously prone to isolation issues. Of the four properties in the popular ACID (Atomic, Consistent, Isolated, Durable) acronym, which is commonly used to describe database transactions, the hardest one for sagas to achieve is the “I”: isolation. Again, we can “get by” here because the requirements for sagas are less stringent than those for database transactions. The specific measures we may need to take to improve the isolation qualities of our sagas are as follows:
Use correlation Ids everywhere: in messages and when accessing saga states.
Use a “pending” connotation when naming saga states, as this allows making decisions about queueing parallel calls while a saga is in such a state.
We may consider adding a CorrelationId column to each of our tables storing business entities. This will make it easier to support idempotence and isolation.
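One possible way to act on the "pending" states is to queue requests that target a business entity while another saga for that entity is still in flight. The sketch below is only one interpretation of the measures above: the class `SagaGate`, the entity keys and the set of terminal states are assumptions made for the example.

```python
TERMINAL_STATES = {"Completed", "Failed"}  # assumed names of non-pending states


class SagaGate:
    """Hypothetical guard allowing one in-flight saga per business entity."""

    def __init__(self):
        self.active = {}   # entity key -> correlation id of the running saga
        self.waiting = {}  # entity key -> correlation ids queued behind it

    def request(self, entity_key, correlation_id):
        """Start the saga if the entity is free, otherwise queue the request."""
        if entity_key in self.active:
            self.waiting.setdefault(entity_key, []).append(correlation_id)
            return "queued"
        self.active[entity_key] = correlation_id
        return "started"

    def on_state_change(self, entity_key, correlation_id, state):
        """Release the entity on a terminal state; return the next saga to start."""
        is_owner = self.active.get(entity_key) == correlation_id
        if state not in TERMINAL_STATES or not is_owner:
            return None  # saga is still pending, or is not the active one
        del self.active[entity_key]
        queue = self.waiting.get(entity_key, [])
        if not queue:
            return None
        next_id = queue.pop(0)
        self.active[entity_key] = next_id
        return next_id
```

In a real system this bookkeeping would have to live in durable storage alongside the saga state; the in-memory dictionaries are used here only to show the decision logic.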
Conclusions and Next Steps
We have taken an in-depth look at saga composition and the implications that using sagas brings to the overall application. Using the concrete example of the First Registry business scenario, we have shown how a saga-based system can work, which core components it needs to have and why. This should be sufficient to begin planning the work and start implementing sagas in our code.
At the same time, we can raise questions related to sagas and to the particular implementation approach described here, which need further analysis:
A “monolith” version of the First Registry transaction executed against a single database would take a fraction of a second to complete, while the same transaction under the proposed design will take seconds to complete. Is there a way to safely “speed up” our sagas without impacting their robustness?
We have admitted a low but non-zero likelihood of data corruption resulting from failed sagas. There is no such concern with monolith databases. What is going to be our strategy and countermeasures?
The sheer number of classes, messages, queues, topics and background jobs involved in implementing sagas increases system complexity greatly, and bears a risk of making the system brittle. What can be done to simplify it?
The next logical step is looking at platforms such as MassTransit. MassTransit promises to abstract away the burden of reliable messaging between the services, as well as implementing saga orchestrations as state machines. This has a significant potential for reducing the amount of plumbing we would need to build ourselves.
The hope is that this discussion has helped us understand more deeply the underlying challenges presented by sagas, and will empower us to use the features that MassTransit offers in a pragmatic and rational way.