MiddleWay - Comment choisir son broker de messages ?

Anyone who has ever built a distributed system must have considered the issue of communication between the components of the system. A message broker can typically be used to provide certain communication features or properties, such as temporal decoupling or reliability. This especially applies to data exchange between applications, where the distributed system consists of a collection of systems that exchange messages with each other.
Next comes the tricky matter of choosing a broker. Ideally, this choice should be based on the expected communication properties. Different views and different ways of understanding a problem can lead to a great deal of discussion about this choice. Moreover, with human organizations being what they are, sometimes less factual considerations come into play:

the corporate choice of a particular broker as a catch-all solution (sometimes for the following reason);
preconceptions about a particular mechanism or a particular capability of a broker.

NOTE 1: An inappropriate choice doesn’t mean that “it won’t work”; it just means you will introduce some biases and/or require compensation mechanisms that might be clumsy and of uncertain effectiveness on the one hand, and expensive to implement on the other hand.

NOTE 2: This could easily lead to the conclusion that several types of brokers are required over the whole perimeter of the distributed system. It is even possible that several types of brokers could also be required for a given communication (message flow).
In my experience, the two biases just described are especially relevant when the choice is between a log-based broker and a queue-based or subscription-based broker (as defined by the Gartner classification, see the note below). This is the main subject of this article.

Reminder: broker types

Let’s start with a quick review of the terminology related to types of brokers.

Log-oriented : this is essentially a write-only log. Consumers do not “delete” a message after reading it, but simply browse the log based on an offset, which they advance as they read the log. This offset can be managed on the consumer side or the broker side. In the world of Azure, this is the equivalent of Event Hubs (with or without a Kafka surface).
Queue-oriented : the broker creates queues for each consumer and uses a routing system to direct messages toward the consumers’ queues. When the consumer has dealt with a message, it is deleted from the queue. In Azure, this is equivalent to Service Bus.
Subscription-oriented : a consumer (subscriber) specifies both a filter to select the messages or events of interest and an endpoint where they expect to receive the messages. On execution, the broker will then determine which endpoint(s) to invoke for each message based on the rules that were specified. The big difference between this and a queue-based broker is that, in this case, the broker actively pushes the message toward the subscriber’s endpoint. In Azure, this is equivalent to Event Grid.

Reminder: message types

Let’s add a brief note on the different message categories.

Document : a snapshot of the status of an object, such as a customer file or a stock image.
Command : as its name suggests, this is an order issued that reflects a particular expectation of the sender with regard to the receiver(s).
Event : an observed fact, something that has happened. In other words: observation of a change of state.

At first sight, this distinction does not seem very useful if stated in this way. These shortcuts are mnemonic devices that encompass several considerations, which are described below. Finally, you could compare it to Tortuga Island: you don’t know how to go there unless you have already been there. These shortcuts are only effective if you know what’s behind them.

We will gradually get to the bottom of this in the following sections.

The intention / the expectation

One of the first interesting questions to ask is: does the sender system have a specific intention when it sends one message in particular? Is it expecting something to happen at the other end of the line? Or does it need something to happen so it can continue doing its job?

If not, this system can be considered a data supplier, which informs the rest of the world of something only it knew about at a time T. This could be:

Data updates (snapshots of business objects for which it is responsible, i.e., documents). In this case, the document updates are probably valid for a limited time, but apart from that consideration, it is hard to choose a broker based on this criterion alone.
Or things that have happened (i.e.: events). If the sender system has the sole intention of recording these events, then the use of a log-based broker is usually a good option for this requirement but, as in the preceding point, it is difficult to choose a broker without more criteria.

If not – in other words, if the sender has an expectation for the future – then writing in a log is probably not the most efficient solution. Instead, the “producer” will issue a command and in fact will then behave like a client consuming a service. This command might need:

to make sure it will be obeyed only once and/or that it is only valid for a certain period of time – in this case, a queue-based broker is a good candidate to provide this level of functionality;
or to be carefully routed, using filters, only to certain subscribers (service providers, in fact). Furthermore, the result of this command might need to be correlated to the original command/query (typical request-reply pattern). In this case too, a queue-based broker is especially suitable.

Here I have focused on the producer’s expectations, but the same type of question can and should be asked from the consumer’s point of view. In the next paragraph, we will see that expectations can differ widely.

The relationship between systems

In the previous paragraph, I indirectly introduced the problem of relationships between systems via the concept of expectation or intention (implication: the producer’s intention with regard to the consumer(s)).

But this problem goes much further than just finding out what the producer expects. The consumer can also have expectations that do not necessarily align perfectly with what the producer wants. This comes as no surprise, since all of this is well documented and much discussed in the field of Domain Driven Design (DDD which, incidentally, is incredibly rich, informative, and efficient on many levels). Here I am going to take a rough but effective shortcut. For the purposes of our discussions, a message broker will materialize (or rather, drive) a border between two contexts (more concretely, and as a first approximation, between two systems).

Except that, in reality, it’s a bit like the border between two countries: a border has two sides, and the formalities for going from one country (context) to the other depend on the agreements (relationships) between the countries.
Let’s take an example: a warehouse management system provides an almost real-time flow of stock updates. In return for these stock updates, it expects nothing in particular from any consumer. It just says ‘to whom it may concern’ that a particular item has left a certain warehouse in a stated quantity (the same for incoming stock). Whether or not another system is listening to these updates makes no difference to it. No matter what, the stock movement took place, regardless of whether or not anybody is interested.

On the other side of the border, there are systems or partners (such as retail outlets or shopping websites) that need updates, but only for a specific subset of the stock, and possibly at a lower frequency, to supply their own internal cache (for example, once every night). We could also imagine another consumer being interested in all the updates, in order to perform predictive calculations with a view to adjusting the activity of a production line.
In that scenario, the producer clearly cannot accommodate the conflicting demands of the different consumers. You could take it one step further and say that those demands are not the producer’s problem.

Concretely, what effect does this have on the driving of borders? Realistically, relying exclusively on a log-based broker, which could be an appropriate solution from the warehouse management system’s standpoint (see next chapter), is too restrictive because it does not allow an efficient response to the demands of all consumers. You are then faced with the choice of making the consumers carry some of the features (e.g., filtering the flow to take only what interests them) or combining two types of brokers to provide a more complete service to a greater number of consumers.

To spice up the discussion, you can add the question of responsibility: who does what in all of this? This does mean people: which team is responsible for implementing this or that feature. Every development is a socio-technical construction, meaning that beyond the ‘code’ itself, certain areas of responsibility are involved. This subject is far from trivial, and it is crucial for the smooth running of the organization. Among other things, it’s the difference between tactical integration and strategic integration, but that’s a subject for another article.

Confusion between internal needs and external expectations

To take it further, a DDD-type context can need a broker to drive its own internal needs/mechanisms. In the example above, the warehouse management system might consist of several micro-services that produce (and consume) each of the events. This context might make it necessary to publish these events in a log-based broker.

This creates a strong temptation to kill two birds with one stone by simply exposing this broker to outside consumers. And why not – but be careful not to confuse internal needs with the needs of external consumers. If you do, you might shift the load of certain features onto the consumer when they could have been offered on the producer side and therefore shared.

For example: in our scenario, if consumers only need some subsets of the events that are published, then each one will have to read the whole log and do its own filtering. This is quite inefficient for one consumer, and when multiplied over several consumers it becomes a significant waste of resources. In this case, it might be more beneficial to use the functionality of a queue-based or subscription-based broker to drive the context borders.

NOTE : it is technically possible to perform ‘routing’ based on the partition in a log-based broker. But be careful: it’s a form of coupling! This solution might be acceptable within the boundaries of one context, but far less acceptable for a third-party context.

Flow value

Up to now, we have talked about events, but without addressing the distinction between discrete events and event flows. Let’s illustrate the difference using an example from the field of physics: position vs. velocity.
Each position measurement has its own value that is useful for certain applications (e.g., monitoring presence in an area), but the sequence of positions could provide information that is needed for another purpose. You can obtain the velocity by extrapolating from the sequence of positions: this effectively lets you monitor the velocity based on the position measurements.

In other words, the flow of positions has its own meaning, that you can use in addition to each of the separate positions. Another way to look at it is that you are going up one level of abstraction from the basic data (like the derivative in mathematics).
In this type of situation, a log-based broker is strongly recommended.

NOTE : although this is far beyond the scope of this discussion, be sure to bear in mind the limitations of real-time flow analysis (if real-time is needed), because the only effective analysis dimensions (indexes) are time and the partition key. Also be careful of the analysis depth, which should be restricted.

Value of the sequence

A related point to consider is the importance of the sequence. For additional clarity, think about managing a non-nominal case in a flow of messages: what should you do if you ‘miss’ a message or event (for any reason)? Or when the order is changed?
And, most importantly: what is the real impact on the consumer? What is the functional significance?

Take the example of warehouse management: an integration error on one of the stock events (or an inversion of events) can cause stock calculation errors, leading to the replenishment of a particular product. This results in a loss of efficiency over the whole processing chain and it wastes company resources.

In a case like this, it can be risky to choose a solely subscription-oriented broker. A log-oriented broker is clearly more advisable. Because the concept of order is important, however, partitioning should be done very carefully.

Furthermore, how should the non-nominal case be processed? In other words: what should you do about the message that couldn’t be processed? There are four possible options.

Halting the consumption process : this is the safest and most conservative mode with respect to the sequence value. It generally requires a manual action to deal with the error case first (whatever the process) and then to restart automatic processing. This option becomes unrealistic when the message volume is high or processing times are tight. It can, however, be better than taking the risk of corrupting the consumer.
Deadlettering with a view to possible resubmission : technically feasible, but restrictive. You can’t simply resubmit the message in the flow, because that would amount to corrupting the sequence. The consumer-side message reintegration mechanism must therefore be robust.
Simple deadlettering (“skip”) : this is for toxic messages, which can occur when the sender evolves and starts to publish messages that do not (yet) fit a planned use case of the consumer.
Replay from a point in the sequence : this option is described in greater detail below.

Myth of the “replay” by a system from an offset

These problems of managing non-nominal cases give rise to a point that is often made and often aims to tip the scales in favor of a log-based broker: that messages can be replayed from a certain point in time. This is obviously a tempting argument, but be careful not to confuse a gadget with a requirement.

The question to ask is therefore: for which known use case is this actually needed?

Be careful not to confuse it with a backup/restoration mechanism: a log MUST NOT contain an infinite depth of events. It’s well known in the database world that the transaction log must not be allowed to keep inflating indefinitely. Full backups must be performed regularly to empty the logs and keep a snapshot of the database status. This means that a log does not replace a backup system.
Be careful not to confuse this with a data recovery mechanism : if the only real need is to add a delay between publication and consumption (e.g., staggered start-up), there are other solutions, such as adjusting the TTL on a queue-based broker. Furthermore, it is generally not very efficient to base your nominal operation on the constraints of an exceptional use case such as system installation or data recovery.
For a case of retry-on-error, you can actually imagine a “rewind” / “replay” mechanism. It’s then vital to allow for this in the design and operations. In addition:
- Wouldn’t it be sufficient to use deadlettering + resubmission on a queue-based or subscription-based system? In other words: is it important to replay the whole flow from the error (i.e., it’s the flow that has a value, with its sequence of events)?
- The consumption processes must be idempotent; otherwise the cure could be worse than the disease.
- Be careful of the impact on operations: what constraints are placed on the consumers concerned? For example, prior restoration of the consumer’s state? Service interruption during the replay? etc.
- On implementation, beware of the ‘rewind’ mechanism of the message flow consumption (i.e., how to return to an earlier offset); this can be performed more or less easily depending on the broker and the consumer. Using the example of an Azure Function: with Azure Event Hubs, you have access to offsets in a storage. With Kafka on the surface of Event Hubs, you don’t have direct access to the offset; you must use the Kafka APIs. In any case, the process must be planned, in addition to the tools.
Conclusion

This overview of the main issues illustrates the importance of integration analysis in the architecture of distributed systems. Note also that I have barely scratched the surface of some fundamental subjects such as the organizational impact of the integration subjects.
By way of conclusion, I offer a rough shortcut below.
- If all the consumers are controlled (i.e., within a context, or simply as an input to that context, to receive messages), then the situation is favorable – if the needs justify it – for the implementation of a log-oriented broker. Concretely, this scenario will occur within an area of responsibility; in other words, when the systems involved in the exchange are under the remit of the same team, or at an inbound border, i.e., to receive messages from other contexts (fan-in pattern).
- On the other hand, this choice would be more restrictive for exposing outbound borders (i.e., sending messages to consumers from other contexts), or would probably require to add complimentary tools (like Kafka Streams or Stream Analytics) to the technological mix.
Nevertheless, this is just a very personal and limited shortcut, that is only of any use to someone who has already followed the entire path (cf. Tortuga 🙂 ).

How to choose your message broker