Diagnostics in Microservices and their Consumers

The portion of the Internal Vessel E-Registry application responsible for handling service requests is now in production. The next big step is adding the portion that handles vessel registry data, which will be built on multiple microservices. The existing approach to diagnostics within the application has been sufficient so far, but it will become a bottleneck to future evolution.

The following is a list of recommended measures aimed at re-envisioning and refactoring the diagnostics layer of the application.

1. Processing Errors

As shown in the diagram below, the human process could look as follows:

  1. User works with Vessel E-Registry application

  2. Application experiences an error

  3. User creates a support request supplying the correlation Id and the timestamp.

  4. The developer receives a production issue assignment with a correlation Id and a timestamp.

  5. Using diagnostic logs the developer finds all necessary details about the reported error.

  6. The issue is resolved.

There are two conditions for establishing this process:

  1. The application must output the correlation Id and timestamp with any exception it experiences

  2. End users should be trained to pass the correlation Id and timestamp along to support services

We will now focus on what is needed to meet the first condition, and to make step #5 in the diagram effective.

2. Correlation Ids Everywhere

The purpose of a correlation Id is to uniquely identify an instance of interaction within the system from its beginning to its end. An example of an interaction is a user clicking a button on a page, loading a page, or applying a filter to a table of results. Not all interactions are initiated by end users; some are triggered when a timer elapses. Still, it is safe to say that within Vessel E-Registry, and amongst all services it depends on, the vast majority of interactions are user-triggered, even when these interactions result in asynchronous message exchanges.

The value of correlation Ids lies in their utility for matching all events logged by each component of the application to a specific interaction. This value diminishes quickly when correlation Ids are not passed around between components of the system. Examples of this are:

  1. Not capturing correlation Ids on UI events

  2. Not passing correlation Ids from UI to API calls

  3. Not passing correlation Ids with asynchronous messages

  4. Not logging each call to a database with correlation Ids

While it is good for an API to generate a new correlation Id when none was supplied with the call, it is better to refactor the clients of the API to supply correlation Ids in the first place. The same applies to asynchronously dispatched messages or events. The key point, therefore, is to establish a “correlation Id culture”, or a standard, where virtually all inter-component communication passes correlation Ids around.
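One way to bake this culture into clients is a DelegatingHandler registered on every HttpClient, so individual call sites never have to remember the header. The handler, the accessor interface, and the header name below are illustrative assumptions, not existing Vessel E-Registry code:

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over "the correlation Id of the current interaction".
public interface ICorrelationIdAccessor
{
    string CorrelationId { get; }
}

// Attaches the current correlation Id to every outgoing API call, so the callee
// never has to invent a new one. The header name is an assumption.
public class CorrelationIdHandler : DelegatingHandler
{
    private readonly ICorrelationIdAccessor _accessor;

    public CorrelationIdHandler(ICorrelationIdAccessor accessor) => _accessor = accessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reuse the current interaction's Id rather than letting the callee generate one.
        request.Headers.TryAddWithoutValidation("X-Correlation-Id", _accessor.CorrelationId);
        return base.SendAsync(request, cancellationToken);
    }
}
```

Registered once via AddHttpMessageHandler<CorrelationIdHandler>() during HttpClient setup, this keeps the “correlation Id culture” out of individual call sites.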

3. Creating and Passing Correlation Ids in API Calls and Asynchronous Messages

A correlation Id is a GUID, so it is easy to generate a new one using the .NET Guid.NewGuid() method. Once generated, correlation Ids are passed alongside API calls; currently, Vessel E-Registry uses HTTP headers for this. Asynchronous messages should standardize on a mandatory CorrelationId field.
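A sketch of creating an Id and passing it with both an HTTP call and a Service Bus message. The header name, URL, and payload are illustrative assumptions; CorrelationId is a real field on ServiceBusMessage in Azure.Messaging.ServiceBus:

```csharp
using System;
using System.Net.Http;
using Azure.Messaging.ServiceBus;

var correlationId = Guid.NewGuid().ToString();

// HTTP: pass the Id as a header on the outgoing request (header name is an assumption).
var request = new HttpRequestMessage(HttpMethod.Get, "https://registry.example/api/vessels");
request.Headers.TryAddWithoutValidation("X-Correlation-Id", correlationId);

// Asynchronous message: use the built-in CorrelationId field.
var message = new ServiceBusMessage("{ \"vesselId\": 42 }")
{
    CorrelationId = correlationId
};
```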

4. Logging Events with Correlation Ids

It is important to log events with the correlation Id, at a minimum, when:

  1. A correlation Id has just been created

  2. An API call, a database call, a call to Redis or Cosmos DB, etc. is about to be made

  3. Any of the above calls has completed (successfully or unsuccessfully)

  4. A message is about to be dispatched to a service bus

  5. A message has just been received from Azure Service Bus

  6. A configuration value is accessed

  7. An error has occurred

Reason: logging at these points allows reconstructing the historical sequence of interactions between microservices and provides an opportunity to trace the values of parameters being passed around.

In addition to this list, it is useful to log informational messages whenever application logic reaches a meaningful point in executing a business transaction. This provides a clearer picture of the flow of events.
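To illustrate points 2 and 3 of the list above (the repository and variable names here are hypothetical):

```csharp
// Log before and after an outbound database call; the correlation Id is already
// on the Serilog LogContext, so it is attached to both events automatically.
_logger.LogInformation("Querying registry database for vessel {VesselId}", vesselId);
var vessel = await _vesselRepository.GetVesselAsync(vesselId);
_logger.LogInformation("Registry database returned vessel {VesselId}", vesselId);
```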

Vessel E-Registry has been using Serilog as its logging platform, and API services use a middleware component to extract the correlation Ids passed in headers from clients and push them as a property to the log context:

using (LogContext.PushProperty("CorrelationId", correlationId))
{
    await _next.Invoke(context);
}

When a service is listening to a message over Azure Service Bus, it can use the same approach to create a CorrelationId property in Serilog. Here is an example of how one could do it:

// The following handlers are registered with the ASB processor.
private Task ProcessErrorAsync(ProcessErrorEventArgs arg)
{
    // When there is no way to get a correlation Id, create a new one.
    using (LogContext.PushProperty("CorrelationId", Guid.NewGuid()))
    {
        _logger.LogError(arg.Exception, "Failed to receive message");
    }
    return Task.CompletedTask;
}

private async Task ProcessMessageAsync(ProcessMessageEventArgs args)
{
    // Extract the correlation Id from the message.
    var correlationId = GetCorrelationId(args);
    using (LogContext.PushProperty("CorrelationId", correlationId))
    {
        _logger.LogInformation("Received message");
        await ProcessReceivedMessage(args);
    }
}

Using LogContext.PushProperty() temporarily overrides any value that may already exist for the same property within the scope of the using block. Pretty convenient.

5. Coding Recommendations

  1. Prefer ILogger<T> instances provided via dependency injection over static types like LogAdapter. This makes the code easier to maintain and allows changing logger behavior through configuration.

  2. Avoid catching exceptions; in particular, avoid catching the generic Exception type. When working with unmanaged resources, use try-finally or using blocks instead of try-catch where possible. Reason: we can rely on exceptions propagating up the call stack and deal with them in one or a few central locations, as opposed to everywhere. There are three exceptions to this rule:

    1. The UI logic, as it does need to display an error message to the user and so has to use try-catch blocks.

    2. Circuit breaker or retry policies in services or in the UI. To implement such logic, one has to catch exceptions thrown by the services being invoked, then try again. Once the calling component “gives up”, though, it should throw an ApplicationException with the necessary context details.

    3. Sagas may need to trigger rollback logic in response to exceptions. Therefore exceptions may be part of their normal state machine flow, depending on the specifics of the implementation.

  3. Create catch-all exception handlers. For Web APIs these could be implemented as middleware components; for async message processing they should sit at the root of the hosting components, for example those implementing the IHostedService interface. Reason: concentrating “catch” logic in just a few places makes the code cleaner and easier to maintain.

  4. Throw exceptions. Validate input parameters and throw ArgumentException when assumptions about them are broken. Similarly, throw ApplicationException when business assumptions are broken. Reason: it is much easier to troubleshoot an application that explicitly complains about broken assumptions than one that unexpectedly throws the likes of NullReferenceException, which should be viewed as a bug in the code.

  5. Log informational messages at the important points described earlier, and liberally elsewhere. The optimum intensity of informational messages is subjective; the purpose is to make it sufficiently clear what is going on with the application to a person who is troubleshooting it and is not familiar with, or has no access to, its source code.
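A catch-all Web API middleware along the lines of recommendations #1 and #3 could look like this sketch (the class name and header name are illustrative assumptions):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// One central place that catches anything escaping the request pipeline,
// logs it, and returns the correlation Id so the user can report it.
public class CatchAllExceptionMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<CatchAllExceptionMiddleware> _logger; // DI-provided, per recommendation #1

    public CatchAllExceptionMiddleware(RequestDelegate next,
        ILogger<CatchAllExceptionMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception ex)
        {
            // Header name is an assumption; fall back to a fresh Id if none was passed.
            var correlationId = context.Request.Headers["X-Correlation-Id"].ToString();
            if (string.IsNullOrEmpty(correlationId))
                correlationId = Guid.NewGuid().ToString();

            _logger.LogError(ex, "Unhandled exception for {Path}", context.Request.Path);

            context.Response.StatusCode = StatusCodes.Status500InternalServerError;
            await context.Response.WriteAsJsonAsync(new
            {
                correlationId,
                timestamp = DateTimeOffset.UtcNow
            });
        }
    }
}
```

This is also what makes the support process in section 1 work: the error response carries exactly the correlation Id and timestamp the user is asked to quote.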

6. Where to Log Messages?

It is not only sufficient but also preferable to log application messages to the console, in line with 12-factor app best practices. Reasons:

  1. Logging configuration, and the application itself, are simplified.

  2. It is the responsibility of the platform, not the application, to collect messages from stdout (the console) and deliver them to a log provider. The Azure Web Apps service fulfils this responsibility, being able to send log stream data to a Log Analytics workspace via Application Insights.

As applied to the Vessel E-Registry application, other logging sinks, including blob storage and Azure Table storage, become redundant, especially given the power of querying messages logged to Log Analytics workspaces.

Log Analytics workspaces are built on top of Microsoft’s performant distributed database engine, which was recently also marketed as a standalone product under the name Azure Data Explorer. This is where the recommended diagnostic logging strategy really shines: with correlation Ids supplied alongside the diagnostic messages, it becomes easy to:

a) combine these messages in one central repository, and

b) query them effectively.

The screenshot below illustrates querying logs in a Log Analytics workspace via the system table named AppServiceConsoleLogs, which is populated by the log stream from the running E-Registry site in the ACC environment. The query uses Kusto Query Language (KQL) syntax.

7. A Note on Privacy and Security

While it is helpful to know the actual values of the parameters passed when an error occurred, this usually conflicts with security and privacy requirements. For example, it may be convenient to know the UPN or Email address of the user whose call errored, but once this information is in the logs it is very difficult to remove, and storing it there presents a security risk. Therefore it is general practice to avoid storing Personally Identifiable Information (PII) in logs. Replacing UPN values or Emails with Azure AD object Ids, for instance, avoids exposing PII.

While developers should always be mindful about which data they log to the console, it is sometimes hard to avoid leaking Email addresses or similar pieces of information. One example where this may happen is when an Email address is used as part of a RESTful URL of a service. One countermeasure is a Serilog enricher that applies regular expressions to well-known PII token patterns and masks them from logs automatically. We should consider using one.
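A hedged sketch of such an enricher, covering only an Email pattern and only scalar string properties (the class name and regex are illustrative, not an existing package):

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using Serilog.Core;
using Serilog.Events;

// Replaces Email-like substrings in scalar string properties before the event
// reaches any sink. A production version would cover more PII token patterns.
public class PiiMaskingEnricher : ILogEventEnricher
{
    private static readonly Regex Email =
        new Regex(@"[\w.+-]+@[\w-]+\.[\w.]+", RegexOptions.Compiled);

    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        // ToList() takes a snapshot, since we modify properties while iterating.
        foreach (var property in logEvent.Properties.ToList())
        {
            if (property.Value is ScalarValue { Value: string text } && Email.IsMatch(text))
            {
                logEvent.AddOrUpdateProperty(
                    propertyFactory.CreateProperty(property.Key, Email.Replace(text, "***")));
            }
        }
    }
}
```

It would be wired up with .Enrich.With(new PiiMaskingEnricher()) in the logger configuration.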

8. Audit Logging

Diagnostic logging could act as a form of audit logging, since it is usually a superset of what audit logging needs. There are two problems to address, though:

  1. One may need to distinguish audit log events from regular diagnostic log events, for example to be able to report on them effectively.

  2. A bigger issue is that audit logs often contain PII, and as shown above it is a security concern to keep PII in diagnostic logs. Diagnostic log storage has one set of trust and data retention policies, while audit log storage may well have another. This leads us in the direction of storing audit information separately from diagnostic information.

To address these concerns we can take the following two measures:

  1. Distinguish audit log events from diagnostic log events, for example by pushing a special property to log context: LogContext.PushProperty("LogType", "audit").

  2. Leverage Serilog’s support for conditional logging to sinks. For example, in the Program.cs where we wire up Serilog, we can write the following or similar:

Log.Logger = new LoggerConfiguration()
    .ReadFrom.Configuration(config)
    .Enrich.FromLogContext()
    .Enrich.WithProperty("Application", Assembly.GetCallingAssembly().GetName().Name)
    // Conditionally write audit events to the Azure Service Bus sink
    .WriteTo.Conditional(
        logEvent => logEvent.Properties.TryGetValue("LogType", out var type) &&
                    type is ScalarValue { Value: "audit" },
        sinkCfg => sinkCfg.AzureServiceBus(connectionString, queueName))
    // Write everything else (the diagnostic events) to the console sink
    .WriteTo.Conditional(
        logEvent => !(logEvent.Properties.TryGetValue("LogType", out var type) &&
                      type is ScalarValue { Value: "audit" }),
        sinkCfg => sinkCfg.Console())
    .CreateLogger();