An Introduction to Message Queue
Just the other day, a friend of mine asked me to explain Google Pubsub to him. Being a good friend, I explained it to him in a few sentences and he seemed to get it at the time. However, a while later, after he tried it for himself, he asked me a few follow-up questions that I couldn't answer in a few sentences like before. So we started a voice call and I went on to launch a rather lengthy explanation, complete with analogies and the like.
He still sounded somewhat unconvinced.
Maybe I lost him somewhere along the way with my crappy analogies?
Anyway, this back-and-forth made it dawn on me that the concept of a message queue may not come naturally even to a seasoned engineer like my friend. I myself understood Google Pubsub relatively quickly because I had prior encounters with Kafka, after all. So it is very reasonable for people - including experienced engineers - not to grasp the concept on the first try.
So here I am. Writing an article so I can just point people here in case I get asked again in the future. And who knows, maybe my friend will be convinced this time?
Before we continue, I want to note that this article uses Google Pubsub's terminology since it is the one I used most recently, but I am pretty sure similar abstractions exist in other message queues like Kafka or RabbitMQ. Also, each message queue has a different implementation, so it is possible that the way it works differs from what is outlined in this article. Nevertheless, I believe the general concepts are the same and it is useful to understand them.
With that out of the way, let's start!
What is a message queue?
In short, a message queue is a queue for messages.
.....I know, my high school teacher would probably give zero points to that answer if the question somehow ever appeared on a test. But unfortunately it is hard to explain without diving into the various facets that make it work (which I will do, but in later sections).
For now, all you need to know about a message queue is:
- It can receive messages.
- It can store messages.
- It can emit messages.
Why use a message queue?
Imagine you have two services, service A and service B.
Then you want service A to be able to send messages to service B. The typical and straightforward way to accomplish this is by using a REST API. Let's use POST for this example. So something like this:
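For illustration, here is a minimal sketch of that direct call in Python, assuming a hypothetical /messages endpoint on service B and using the requests library:

```python
# Service A sending a message directly to service B over HTTP.
# The endpoint and payload are made up for illustration.
import requests

message = {"event": "order_created", "order_id": 42}

# If service B is down or overloaded, this call fails and the message is gone
# unless service A implements its own retry or persistence logic.
response = requests.post("http://service-b.internal/messages", json=message, timeout=5)
response.raise_for_status()
```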
Now, what are the potential issues with this approach?
There are several that I can think of:
1. If service B is unavailable, then the POST request by service A will fail and the message will disappear into the void.
2. It is possible that service A has a higher throughput than service B. Let's say service A can process 1000 requests per second while service B can only process 100 requests per second. So on peak traffic, service A can send up to 1000 POST requests to service B while service B can only process 100 of them. In this case, other than the 100 messages that manage to go through, the rest of the messages will be lost.
And this is only if service B does not crash from receiving way more messages than it can handle in the first place...
3. If in the future service A needs to send messages to services other than service B as well (say, service C and service D), we will need to modify service A to accommodate that. Depending on the circumstances, this can be time-consuming and prone to mistakes.
There is more than one solution for each of these potential issues. But since this article is about message queues, let's see how a message queue can help us tackle them.
How a message queue works
Instead of having service A send POST requests directly to service B, let's put a message queue (MQ) between them.
In this schema, instead of sending its messages directly to service B, service A will publish them to the message queue. Then, service B will receive the messages because it has subscribed to the message queue. Let's examine each of these steps in more detail, starting with the publishing flow.
Publish
When the publisher/producer (service A) sends a message to the message queue, the message queue will classify the message according to its topic.
Simply put, a topic is just a label that the message queue uses to categorize and group the messages it receives. A message queue can have a lot of topics, but a message can only have one topic. The available topics are usually defined by the people who set up the message queue.
How does the message queue know under which topic a message should be classified? Usually, the publisher tells the message queue which topic a message belongs to when it sends the message.
Once the message has been received by the message queue, it will store the message in a queue according to its topic. Generally, the message will stay in the queue until it has passed its retention period (say, 1 week or 1 month), after which it is deleted. During this retention period, the message is available to be emitted to the subscriber(s) of the topic in a first-in-first-out (FIFO) manner, hence a queue.
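As a rough sketch, publishing with the Google Pubsub Python client (google-cloud-pubsub) looks more or less like this; the project and topic names are placeholders:

```python
# A publisher (service A) sending a message to a Pubsub topic.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders-topic")

# The message body is just bytes; publishing to the topic's path is how the
# message queue knows under which topic to classify the message.
future = publisher.publish(topic_path, data=b'{"event": "order_created", "order_id": 42}')
print("Published message with ID:", future.result())
```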
By the way, it is also possible for one publisher to publish to multiple topics, and for multiple publishers to publish to the same topic. Therefore publishers and topics can be said to have an m:n relationship.
Another thing to note is that it is possible for the message queue itself to be unavailable (either because it is down or because of a network issue). In this situation, one way to preserve the messages so they are not lost is to use something like the transactional outbox pattern.
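For illustration, here is a minimal sketch of the transactional outbox idea, assuming a hypothetical "orders" workload and a SQLite database. The business row and the outgoing message are written in the same transaction, and a separate relay process (not shown) would later read the unsent outbox rows and publish them to the message queue:

```python
# Transactional outbox: the order and its outgoing message are committed
# together, so a message is never lost just because the queue was unreachable.
import json
import sqlite3

conn = sqlite3.connect("service_a.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "CREATE TABLE IF NOT EXISTS outbox "
    "(id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def create_order(order: dict) -> None:
    # One transaction: either both rows are written or neither is.
    with conn:
        conn.execute("INSERT INTO orders (payload) VALUES (?)", (json.dumps(order),))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders-topic", json.dumps(order)),
        )

create_order({"event": "order_created", "order_id": 42})
```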
Now that we have (very broadly) described the publishing flow, let's move on to the subscribing flow.
Subscribe
The subscriber/consumer (service B) is the one who will receive the messages emitted by the message queue for the topic it has a subscription to. True to its name, subscription here means that the subscriber is subscribing to the topic, not unlike how a person can have a subscription to a newspaper.
Similar to publishers, the relationship between subscribers and topics is also m:n. A subscriber can subscribe to multiple topics, while a topic can have multiple subscribers.
Once a message has been received by the subscriber, it can send an acknowledgment (ACK) reply back to the queue. This acknowledgment is a signal to the message queue that a particular message has been successfully delivered. The message queue will then use the acknowledgments it has received to keep track of which messages have been delivered for a subscription and which have not. This way, messages that have been delivered will not be sent again, and messages that failed to be delivered can be handled appropriately.
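As a rough sketch, a subscriber that receives and acknowledges messages with the Google Pubsub Python client looks more or less like this; the project and subscription names are placeholders:

```python
# A subscriber (service B) receiving messages and acknowledging them.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "orders-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("Received:", message.data)
    # The ACK tells the message queue that this message was delivered
    # successfully for this subscription, so it will not be redelivered here.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block here and keep receiving messages
```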
Also note that subscriptions are independent of each other. This means an acknowledgment for a message will only affect that particular subscription on the message queue, and that very same message can still be sent for another subscription that has not given an acknowledgment back.
In the illustration below, service B is in the process of replying with an acknowledgment for message 1. At the same time, service F had previously acknowledged message 1, so it is currently processing message 2.
There are two models for message delivery: push and pull.
In the push model, every time a new message comes in, the message queue will immediately send it to the subscribers of the message's topic.
In the pull model, the subscriber is the one sending a request for new messages to the message queue. The message queue then responds with the messages that haven't been acknowledged yet by that subscriber.
A bit of a catch I experienced when using the pull model in Google Pubsub: it turns out Google Pubsub doesn't always return all unacknowledged messages in a single response. For example, if there are 5 unacknowledged messages, it is possible that only 3 of them are returned when the subscriber sends the first request, and the other 2 are returned only when the subscriber sends the second request. Therefore, if we want to keep latency low, the way to do it is to have the subscriber continuously send requests.
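For illustration, here is a minimal sketch of a single pull with the Google Pubsub Python client; the names are placeholders, and a real subscriber would run something like this in a loop precisely because one pull may return fewer messages than requested:

```python
# Pull model: the subscriber asks the message queue for messages, then ACKs them.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "orders-subscription")

# Ask for up to 5 messages; fewer may actually be returned.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 5})

ack_ids = []
for received in response.received_messages:
    print("Received:", received.message.data)
    ack_ids.append(received.ack_id)

# Acknowledge everything we managed to process in this batch.
if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})
```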
Messages can be lost
Like it or not, network failures will happen one way or another, and most likely the cause will be out of your control. It can be because your network provider accidentally set a wrong configuration, or a thunderstorm disrupting the satellite signal, or even a shark attacking an undersea internet cable. Well, if all of your components (publisher, message queue, subscriber) live in the same data center, the chance of network failure is somewhat minimized, but it can still occur. In any case, when a network failure happens, there is a possibility that one or more messages will be lost.
Moreover, even if the network is fine, it is possible that the receiver is having issues or is down completely, making it unable to receive and process messages properly. The end result is the same as a network failure: one or more messages might be lost.
To manage this problem, there are three approaches we can choose from for message transmission (be it from publisher to message queue or from message queue to subscriber):
- At-most-once
The message will be sent at most one time. In other words, the sender will try to send the message once and just forget about it, not caring whether the receiver managed to receive the message or not. This approach can be used when a bit of data loss is acceptable, for example if the messages are log data from a noncritical function.
- Exactly-once
The message will be sent exactly one time. Ideally this is what we want, but in reality the complexity of implementing it is high, making it unsuitable for most cases.
- At-least-once
The message will be sent at least one time. Basically, the sender will keep sending a message until it receives a response indicating that the message has been successfully received (between the message queue and the subscriber, this response is the ACK reply).
In practice, the approach taken in most cases is at-least-once. This approach requires the receiver to handle the messages appropriately. That is to say, the receiver has to process the first message it receives, but must not cause problems when it receives subsequent duplicates.
Broadly speaking, the two most straightforward ways to do it are:
- Give each message a unique ID and keep track of which IDs have been processed and which have not (using a database/cache/etc.). That way, if a message comes with an ID that the subscriber has seen before, the message can safely be rejected (see the sketch below).
- Make the process on the subscriber functionally idempotent. In other words, given the same input, the output of the process will always be the same, so it will not matter if the same message is processed more than once. However, this is more complex to build and might not be possible for every kind of workload.
There are two possible receivers: the message queue and the subscriber. The message queue usually already has a built-in mechanism to handle this, so we only need to implement it for the subscriber.
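Here is a minimal sketch of the first approach, deduplication by message ID. The "database" is just an in-memory set for brevity; a real subscriber would use durable storage shared across instances and restarts:

```python
# Deduplication by message ID under at-least-once delivery.
processed_ids = set()

def process(payload: bytes) -> None:
    print("Processing:", payload)  # the actual business logic goes here

def handle_message(message_id: str, payload: bytes) -> None:
    if message_id in processed_ids:
        return  # duplicate delivery: safe to skip
    process(payload)
    processed_ids.add(message_id)  # mark as processed only after success

handle_message("msg-1", b"hello")
handle_message("msg-1", b"hello")  # duplicate, ignored
```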
A little addendum about error handling
Let's say a subscriber received a message successfully, but then failed to process it due to some issue. Assuming the approach we chose is not at-most-once, should we set the subscriber to send an ACK reply to the message queue or not?
If the subscriber does not send an ACK reply, this means that the message will be sent again by the message queue. The consequences are:
- If the issue that made the subscriber fail in the first place is something temporary (like the database being down momentarily), then resending the message is good. The message can be processed properly once the issue is resolved.
- However, if the issue is something more lasting (like a bug in the code), then this can be bad. The message will be sent again and again, but the subscriber will keep failing to process it until the underlying issue is resolved or the message passes its retention period. And since every new message that encounters the issue will meet the same fate, these messages will accumulate over time and pollute the message queue. This will make the subscriber slower in processing new messages, since the new messages have to queue together with the messages that are stuck in limbo because of the issue.
On the other hand, if the subscriber sends an ACK reply even though it failed to process a message, the consequences are the opposite:
- If the issue that made the subscriber fail in the first place is something temporary, then the message will still not be sent again even when the subscriber recovers, so this can be bad.
- If the issue is more lasting, the message will not pollute the message queue and delay newer messages, since it will not be sent again, so this is good.
Ideally, the subscriber should be able to differentiate the type of issue it experienced and act accordingly. If the issue is temporary, then it should not send an ACK reply when it fails to process a message. But if the issue is lasting, then it should send an ACK reply even when a message fails to be processed. However, depending on what you are working with, this can be non-trivial to implement.
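As a rough sketch of this idea with the Google Pubsub Python client, where TransientError and PermanentError are hypothetical exception types standing in for whatever your workload actually raises:

```python
# ACK or NACK depending on the kind of failure.
from google.cloud import pubsub_v1

class TransientError(Exception):   # e.g. the database is briefly unavailable
    pass

class PermanentError(Exception):   # e.g. a malformed message or a bug in the code
    pass

def process(data: bytes) -> None:
    print("Processing:", data)     # the actual business logic goes here

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        process(message.data)
        message.ack()
    except TransientError:
        # Temporary issue: NACK (or simply don't ACK) so the message queue
        # redelivers the message later, once things have recovered.
        message.nack()
    except PermanentError:
        # Lasting issue: ACK so the message doesn't clog the queue. In practice
        # you would also log it or route it to a dead-letter topic.
        message.ack()

# This callback would be passed to subscriber.subscribe(...) as in the earlier sketch.
```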
Tying it all together
Remember the simple sketch you saw near the start of this article?
Who knew that just these 3 unassuming boxes actually need a lot of words to explain? If you are still reading up until this point, you can buy an ice cream to congratulate yourself :)
Since we have covered the mechanisms of a message queue at a high level, we can now explain how a message queue can solve the potential issues with a simple POST request mentioned earlier. In case you forgot, here are the potential issues:
1. If service B is unavailable, then the POST request by service A will fail and the message will disappear into the void.
2. It is possible that service A has a higher throughput than service B. Let's say service A can process 1000 requests per second while service B can only process 100 requests per second. So on peak traffic, service A can send up to 1000 POST requests to service B while service B can only process 100 of them. In this case, other than the 100 messages that manage to go through, the rest of the messages will be lost.
And this is only if service B does not crash from receiving way more messages than it can handle in the first place...
3. If in the future service A needs to send messages to services other than service B as well (say, service C and service D), we will need to modify service A to accommodate that. Depending on the circumstances, this can be time-consuming and prone to mistakes.
So how can a message queue solve them? Well, more or less like this:
1. If service B is unavailable, the message queue will store the messages until service B comes back online, so no message will be lost.
2. If service A (the publisher) has a higher throughput than service B (the subscriber), service B can use the pull model for message delivery so it can consume the messages at its own pace. Of course, if service A has a higher throughput than service B all the time, then a message queue will not help much, but in that case nothing can save you other than upgrading service B.
3. If in the future service A needs to send messages to other services like service C and service D, then those services can just connect to the message queue and subscribe to the relevant topic, no need to bother service A at all.
That's it. There is a lot more ground to explore with message queues, but because the title of this article is "An Introduction to Message Queue" and not something like "A Deep Dive into Message Queues", I will stop right here (and I am not qualified to write a deep dive anyway). I hope you find this useful and have a good day!