Chapter 15
Message queues
In the API introduction, we saw
that a computing system consists of dozens of software components that need to communicate with each other.
Their API represents the set of contracts and functionalities they publicly expose.
So far, we have talked about the inputs of a component—that is, what it exposes to be directly called.
Example: "Send me the HTTP GET request /offer/42
, and I'll return the details of offer #42."
In this chapter, we will see that a component can also produce outputs: it triggers communication to send
information. Developers use this for:
- Breaking down a process and triggering asynchronous tasks.
- Separating responsibilities between components to build independent and maintainable systems over the long term.
Push Notifications
Let's start with an example:
We have an e-commerce site, a customer has just made a payment, and now we need to initiate the entire order processing chain: fraud analysis, package shipping...
- Each step in this chain is developed and maintained by a different team of developers, in a dedicated software component.
- The chain is sequential: processes run in order, and one step may block the progress.
- All of this must happen asynchronously from the customer's navigation. Now that they’ve paid, they can leave the process, and we will contact them if needed.
The payment component has now finished its task and needs to hand over to the fraud detection component.
We can design this architecture in different ways. Fifteen years ago, we probably would have used
a shared data source or a file export. Something like: in the database, the payment updates the order status to
"payment validated," and the fraud detection component queries the database every two minutes to find new orders
with this status.
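To make this concrete, the polling approach might look like the rough sketch below. The database, table, column, and function names are made up for illustration; only the "check every two minutes" logic matters here.

```python
import sqlite3
import time

def run_fraud_check(order_id: int) -> None:
    # Hypothetical stand-in for the real fraud analysis.
    print(f"Checking order #{order_id} for fraud")

def poll_for_validated_orders(db_path: str = "shop.db") -> None:
    """Every two minutes, scan the shared database for newly paid orders."""
    conn = sqlite3.connect(db_path)  # hypothetical shared database
    while True:
        rows = conn.execute(
            "SELECT id FROM orders WHERE status = 'payment_validated'"
        ).fetchall()
        for (order_id,) in rows:
            run_fraud_check(order_id)
            # Mark the order so it is not picked up again on the next pass.
            conn.execute(
                "UPDATE orders SET status = 'fraud_check_done' WHERE id = ?",
                (order_id,),
            )
            conn.commit()
        time.sleep(120)  # "Are we there yet?" -- every two minutes
```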
This works, but it’s far from optimal:
- It’s like a child repeatedly asking "Are we there yet?".
- It slows down the process by up to two minutes per step. With twenty steps, that alone can add up to forty minutes of pure waiting before delivery even starts.
- The database creates a tight dependency between the two components, which is bad! We'll revisit this in the microservices chapter.
A better approach would be for the payment component to push the information directly
to the fraud detection component.
These push notifications can be done with a simple HTTP request. The payment system calls the fraud detection API
directly:
*"Hey, fraud detection, check if the customer for order #42 is a fraudster."*
Simple and effective, that's how 90% of inter-component communications work!
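In code, that push is a single synchronous HTTP call. Here is a minimal sketch using the requests library; the endpoint URL and payload are assumptions, not a real API.

```python
import requests

def notify_fraud_detection(order_id: int) -> None:
    """Called by the payment component once the payment is confirmed."""
    response = requests.post(
        "https://fraud-detection.internal/checks",  # hypothetical internal endpoint
        json={"order_id": order_id},
        timeout=5,
    )
    # The payment component is blocked here until fraud detection answers.
    response.raise_for_status()

notify_fraud_detection(42)
```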
However, this approach has two major drawbacks:
- The payment component, which sits higher up in the process, now has a strong dependency on the fraud detection API. Once payment is complete, it shouldn't be responsible for knowing who takes over and how. To reverse this dependency, we can set up an observation system: the payment component exposes its internal events and, through its API, lets other components subscribe to events like "I have finished processing this order." For each event, it sends the same HTTP request to every observer without knowing who they are; it is their responsibility to process it.
- HTTP requests are synchronous, meaning the payment component pauses until fraud detection responds. In our case, since the payment system doesn't care about the next steps, it doesn't need a response from fraud detection. Ideally, the fraud detection system should store the notification and immediately acknowledge receipt without running the actual process. This would require fraud detection to internally manage two steps: storing, then detecting new entries to process.
Oops! This brings back the same problems we had 15 years ago with polling every two minutes.
This observation system and the two-step separation are commonly used when working with partners
where HTTP is the only available option.
Example: Apple pushes a notification to inform me that a customer has canceled their iOS subscription.
But better options exist for asynchronous communication between independent internal components.
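To give an idea of what such an observer looks like, here is a minimal sketch of a webhook receiver with Flask, using an in-memory list as a stand-in for real storage. The endpoint path and payload are assumptions: it stores the notification, acknowledges immediately, and leaves the actual processing to a later step.

```python
from flask import Flask, request

app = Flask(__name__)
pending_notifications = []  # stand-in for durable storage (database, queue...)

@app.post("/webhooks/payment-completed")  # hypothetical endpoint exposed to observers
def receive_payment_completed():
    # Step 1: store the notification and acknowledge right away,
    # so the sender is never blocked by our processing time.
    pending_notifications.append(request.get_json())
    return "", 202  # "Accepted": received, will be processed later

# Step 2 (not shown): a separate worker picks items from
# pending_notifications and runs the actual fraud analysis.

if __name__ == "__main__":
    app.run()
```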
Message Queues
Think of the baggage conveyor belt at an airport. One person places suitcases on the belt,
another picks them up when they arrive. Each person works independently, without knowing what’s happening on the
other side.
The airport manages multiple conveyor belts.
This is how message queues work:
- The payment component sends a message to a message broker, a system that manages queues. It asks the broker to place the message "order #42" in the "payment completed" queue.
- The broker (RabbitMQ is a well-known example) stores the messages until they are read or expire.
- The fraud detection component continuously listens to the "payment completed" queue, receiving and processing messages at its own pace.
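As a sketch, here is what the two sides could look like with RabbitMQ and the pika client. The queue name and payload are assumptions; the point is that producer and consumer never talk to each other directly.

```python
import pika

# --- Payment component: drop a message onto the conveyor belt ------------
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="payment_completed", durable=True)
channel.basic_publish(
    exchange="",                      # default exchange routes by queue name
    routing_key="payment_completed",
    body='{"order_id": 42}',
)
connection.close()

# --- Fraud detection component: pick messages up at its own pace ---------
def on_message(ch, method, properties, body):
    print(f"Fraud check for {body!r}")
    ch.basic_ack(delivery_tag=method.delivery_tag)  # the broker then deletes it

consumer_connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
consumer_channel = consumer_connection.channel()
consumer_channel.queue_declare(queue="payment_completed", durable=True)
consumer_channel.basic_consume(queue="payment_completed", on_message_callback=on_message)
consumer_channel.start_consuming()  # blocks and listens continuously
```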
Isn't this just another shared dependency, like the database we used 15 years ago?
Yes, both components depend on the broker and queue, but this is purely a technical dependency.
Using the database tied the components together through a business concept, making it fragile and subject to
change.
Message brokers also provide several technical benefits: speed, high-load handling, delivery guarantees...
However, they have one limitation: once a message is consumed, it is deleted, meaning there can be only one
consumer.
Let’s extend our example:
In addition to fraud detection, we also want a financial component to be notified when a payment is completed.
If we continued using a simple queue system, we’d need to duplicate the message into two queues.
This would take up more space and, more importantly, force the payment component to adapt based on its consumers.
Not ideal!
Message queues are great for breaking down processing into multiple asynchronous steps
when we’re sure there is only one consumer.
What we need in our case is more like broadcasting a "payment completed" event.
To achieve that, we need a more advanced system.
Publisher / Subscriber
This works like a radio broadcast, where a message is openly distributed to all listeners.
This ensures the producer component remains independent.
However, like a live radio broadcast, you have to listen in real-time, or you’ll miss the message.
This is useful for real-time systems where past data isn’t important, such as displaying exchange rates in different
currencies.
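For illustration, a pub/sub exchange could look like the following with Redis; the channel name and message are assumptions. Note that a subscriber only receives what is published while it is actually listening.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Publisher side: broadcast the current rate to whoever is tuned in.
r.publish("exchange_rates", "EUR/USD=1.08")

# Subscriber side: must already be listening, or the message above is lost.
pubsub = r.pubsub()
pubsub.subscribe("exchange_rates")
for message in pubsub.listen():
    if message["type"] == "message":
        print("Received:", message["data"])
```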
But in our case, missing a payment event would break the chain, and the customer would never receive their order.
We need a system that combines the radio-like model with message queues.
Event Streaming Platform
To sum up what we've just discussed, the tool we're looking for must allow:
- The payment service to produce a business event "payment completed" without worrying about who will consume it.
- The fraud detection and finance services to receive this event in real time and process it at their own pace. They must be able to catch up on events they missed in case of an outage, meaning the events must be stored.
In 2010, LinkedIn engineers faced the same challenges as us.
LinkedIn processes billions of business events every day: new comments, connection requests, subscriptions...
How do you update recommendations, the search engine, and analytics when a user views a profile?
How do you notify all followers when a new post is published?
These engineers reached the same conclusion as us: existing solutions were not good enough.
So they developed Kafka, an event streaming platform, a tailor-made tool that meets all these needs:
- A producer sends a message to a topic, a kind of queue.
- Consumers can subscribe to this topic and read messages one by one. Kafka allows multiple consumers to read the same messages at their own pace.
- Messages are physically stored in logs, and retention can be configured for several days. In case of an issue, a consumer can reconnect to the topic and pick up where it left off.
- Kafka is a distributed system: it can spread its workload across multiple machines and handle very high traffic loads.
This is exactly the mix between pub/sub and message queues that we were looking for!
Plus, Kafka became open-source in 2011, meaning we can use it in all our projects.
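Here is a minimal sketch of both sides using the kafka-python client; the broker address, topic name, and consumer group are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Payment component: publish the business event without knowing the consumers.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("payment_completed", b'{"order_id": 42}')
producer.flush()

# Fraud detection component: its own consumer group, its own reading position.
consumer = KafkaConsumer(
    "payment_completed",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",    # the finance service would use a different group_id
    auto_offset_reset="earliest",  # on first start, read from the oldest retained message
)
for record in consumer:
    print("Fraud check for", record.value)
```

Because each service reads with its own group_id, fraud detection and finance can consume the same events independently, and a service that was down simply resumes from its last committed position.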
Kafka has significantly popularized the concept of event streaming.
It has become an essential tool for developing independent and maintainable services.
It has revolutionized backend architecture design by introducing a new way of thinking for developers.
Example: once an event is sent, it cannot be updated or canceled. Everything is incremental: you need to produce a second event that either completes or invalidates the first.
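For instance, a cancellation does not rewrite history; it is expressed as a new event that references the first one. The event names and fields below are purely illustrative.

```python
# Immutable log: the first event is never modified...
payment_completed = {
    "type": "payment_completed",
    "order_id": 42,
    "amount": 99.90,
}

# ...a correction is a *new* event that completes or invalidates it.
payment_cancelled = {
    "type": "payment_cancelled",
    "order_id": 42,
    "reason": "customer_request",
    "cancels_event": "payment_completed",
}
```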
At the beginning of the chapter, you mentioned APIs—what does this have to do with Kafka?
Remember this sentence: "An API represents the set of contracts and functionalities that a system exposes publicly."
All messages, especially business events, fall under this definition.
Like any outgoing communication:
- A versioning strategy must be in place to ensure backward compatibility: existing fields should never be deleted (see the sketch after this list).
- It must be properly documented.
- It can be monetized.
- It should be designed as an intuitive product that meets a real need.
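To make the backward-compatibility rule concrete, here is a sketch of how an event payload might evolve; the field names are illustrative. New optional fields are added, existing ones are never removed or renamed.

```python
# Version 1 of the "payment_completed" event.
payment_completed_v1 = {
    "order_id": 42,
    "amount": 99.90,
}

# Version 2: a field is added, nothing is removed or renamed,
# so consumers written against version 1 keep working unchanged.
payment_completed_v2 = {
    "order_id": 42,
    "amount": 99.90,
    "currency": "EUR",  # new optional field with a sensible default
}
```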