Microservices are a popular and widespread architectural style for building non-trivial applications. They offer immense advantages, but also challenges and traps - some obvious, some of a more insidious nature.

This series describes and compares common patterns for dealing with data and dependencies in microservice architectures. Keep in mind that no approach is a silver bullet. As always, experience and context matter.

The series consists of four parts, each focusing on a different pattern.

The previous article looked at integrating microservices using change-data-capture and related approaches.

This final piece introduces event-driven architectures (EDA), a technique for integrating microservices and handling data in general. Keep in mind that entire books have been written on this topic, so we will only scratch the surface and try to convey the general ideas. You can find a set of resources that will help you dig deeper into these topics at the end of this post.

Event-driven architectures - in a nutshell

A coffee shop

Let’s use a motivating example to drive the discussion. This time we do not use a financial context but an example familiar to most engineers: the coffee shop. Consider the exchange illustrated by the next diagram.

The synchronous coffee shop

A customer orders a coffee (I). The cashier receives the order and asks the barista to prepare it (II). Now the barista brews the ordered coffee (III). After brewing, the barista returns the coffee to the cashier (IV). The cashier in turn hands it over to the customer (V). Only now can the cashier serve the next customer (VI), and the customer can finally sit down and enjoy the coffee (VII).

This sounds like a straightforward approach. But looking closer, the downsides become clear. Let’s examine two examples.

Many people enjoy their coffee in the morning - fetching a hot brew on the way to the office, for example. This in itself is not a problem. But picking up new coffee orders is fast, maybe a couple of seconds, while brewing coffee is slow, alas - it can take minutes. So, every day around 8 a.m. our coffee shop gets swamped with coffee orders. The barista struggles to keep up. We cannot meet our customers’ demands because of the hard coupling between ordering and brewing coffee. The disgruntled customers will probably go to another coffee shop in the future.

Or, what happens if the barista burns herself and cannot brew coffee for the time being? Well, since all orders wait in a single line, the customers have to wait until the issue is resolved. The cashier cannot pick up new orders until the barista starts working again.

The asynchronous coffee shop

Now, what happens if we do things differently, as illustrated by the next diagram?

Microservices!

Again, the customer orders some coffee (I). The cashier picks up the order and asks the barista to brew the ordered coffee (II). Instead of waiting for the beverage, the cashier asks the customer to take a seat (III). This allows the cashier to serve the next customer immediately (IV). In the meantime, the barista brews the ordered coffee (V). When finished, she asks the waiter to serve the beverage (VI) and starts working on the next order. Finally, the waiter serves the coffee to a happy customer (VII).

The important point is that at every step, no actor (customer, cashier, barista, waiter) waits longer than necessary. Each actor hands off work to another as soon as possible, which allows everyone to pick up new work (orders, beverages) sooner.

Event-driven microservices

Let’s see how this relates to microservices. Consider the following diagram.

Microservices

Following the advice given in the first article of this series, the two microservices, A and B, each use their own database. Each service offers an independent API and can run autonomously.

What happens if a caller orders a coffee by POST-ing to /orders on microservice A (I) - as in the following illustration? After processing this order, microservice A emits a Coffee Ordered event (II) by publishing it (III) to a queue (Q).

Publishing events

Queues are part of a middleware product used to store and forward events that senders publish. Clients interested in these events subscribe to the queues. The middleware forwards the events to the subscriber.

Example tools include Apache Kafka, RabbitMQ, and ActiveMQ.

In essence, this design informs any other interested service of the fact that another coffee was ordered.
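To make this publish/subscribe flow concrete, here is a minimal in-memory sketch in Python. The Queue class, the make_event helper, and all names are illustrative stand-ins for a real broker such as Kafka or RabbitMQ, not an actual middleware API.

```python
import uuid
from collections import defaultdict

class Queue:
    """A toy stand-in for the queue middleware (Q)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # event type -> handler functions
        self.log = []                         # every event ever published

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event):
        # Store the event and forward it to all interested subscribers.
        self.log.append(event)
        for handler in self.subscribers[event["eventType"]]:
            handler(event)

def make_event(event_type, payload):
    return {
        "eventId": str(uuid.uuid4()),  # unique identifier for the event
        "eventType": event_type,
        "payload": payload,
    }

queue = Queue()
received = []
queue.subscribe("Coffee Ordered", received.append)

# Microservice A publishes the fact that a coffee was ordered.
queue.publish(make_event("Coffee Ordered",
                         {"product": "Flat White", "servingSize": "Large"}))
```

Note that any number of services can subscribe to the same event type; the publisher neither knows nor cares who is listening.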

Note the details here. We are talking about an event that happened in the past - an unchangeable fact. The coffee was ordered, that’s that. Nothing can be done to change this. We can cancel the order, but that would not change the original fact; rather, a new Coffee Order Canceled event would be needed.

Events encapsulate state changes. In our case, the event might look like this:

{
  "eventId": "31da4a50-06a5-4dec-81b2-9390862bd8d5",
  "eventType": "Coffee Ordered",
  "payload": {
    "product": "Flat White",
    "servingSize": "Large",
    ...
  }
}

Implementation details aside, most approaches define events similarly:

  • eventId: a unique identifier for the event
  • eventType: the kind of event represented. The event type defines the semantics of the event. We expect different behavior for a Coffee Ordered event than for a User Onboarded event.
  • payload: the actual data that the event encapsulates. The concrete data depends on the eventType.

An event may have additional attributes, like the type of aggregate or business object it refers to. But since we only want to skim the topic of EDAs, we’ll not go into those aspects. Also, we do not go into topics like versioning, proper validation, schemas, and so on. Refer to the extended references below to learn more. Let’s agree that event design is at least as complex as API design - because that’s what it is, an API.

Going back to our example: microservice A has published the fact that a customer ordered a coffee. The next diagram shows how microservice B receives and reacts to this fact. B subscribes to events of the type Coffee Ordered (I). When A publishes such an event, B receives it and can react. In our case, B could start brewing the ordered coffee and store this information in its private database (II).

Async example

After B has finished brewing the coffee, it publishes a new event Coffee Brewed. The process is the same as for microservice A. The following illustration shows the queue, which now contains both events: the Coffee Ordered and the Coffee Brewed. Other services could use these events to trigger processes like serving the coffee or payment.

Async example continued
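The chain from Coffee Ordered to Coffee Brewed can be sketched as follows. The queue is just a Python list here, and all names are illustrative; a real service B would brew asynchronously and persist its state in its private database.

```python
queue = []  # toy stand-in for the middleware queue

def publish(event_type, payload):
    queue.append({"eventType": event_type, "payload": payload})

def on_coffee_ordered(event):
    order = event["payload"]
    # ... brew the coffee and store the order in B's private database ...
    # When done, B publishes the next fact.
    publish("Coffee Brewed", {"product": order["product"]})

# Microservice A publishes the order; B consumes it and reacts.
publish("Coffee Ordered", {"product": "Flat White"})
on_coffee_ordered(queue[0])
```

After both steps the queue holds the two facts in order, ready for other consumers such as serving or payment.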

Note that the services must remember which events they have already received and processed. This, too, is out of scope for this article but should be mentioned nevertheless; the whole area of at-least-once delivery is covered by many books and articles.

With this basic example in place, let’s turn to some of the implications.

Implications of Event-Driven-Architectures

To understand the system-level implications of adopting an EDA, let’s refer to the example above and see what an EDA means for our coffee shop. We’ll start with more or less obvious technical implications. Afterward, we’ll look at implications that impact the business side, too.

Keeping cracks from spreading

Cracks…

We have two services in our example: one for ordering a coffee (A) and one for brewing coffee (B). Suppose service A fails. We cannot submit new coffee orders. In a classic synchronous design, this would mean a standstill: no new orders, no coffee brewed.

But in our event-driven case, services A and B are independent. Service B can continue to brew coffee for all already-submitted orders, even if A fails.

This means that customers already expecting coffee will get coffee. A tremendous win! While we serve customers, service A can be fixed without impacting any other business area.

This approach improves the resilience of our system. We can survive partial failures and allow failing parts to heal and come back into service.

Elasticity

A shopping queue

If we decouple systems using events, we are also able to design elastic systems. Elasticity means a service scales more or less transparently with the amount of work.

Imagine a shopping mall. If many customers shop at the same time, then we need more cashiers. But if only a few people are shopping, we might get away with a single one.

The same is possible for our microservice.

Picking up a coffee order takes a couple of seconds, but preparing the coffee might take a minute or two. So, we can use a single instance of microservice A, which accepts the orders. As long as there are only a few orders in the queue, we can also rely on a single instance of microservice B, which is responsible for brewing coffee.

No load

But as the number of orders increases, we are able to scale microservice B, as illustrated by the following diagram.

Yes load

This is called horizontal scaling: we increase capacity by adding more workers rather than adding more CPU to a single worker (vertical scaling). See this article for more details.

Besides, the queue approach protects microservice B from being overwhelmed by the workload. Since the architecture allows microservice B to indicate whether it is ready for more work, we get throttling for free: service B decides if it is ready to pick up new orders or not.
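This throttling effect can be sketched like so: service B pulls orders only while it has free capacity, and the queue buffers the rest. The Barista class and the capacity numbers are made up for illustration.

```python
from collections import deque

orders = deque(f"order-{i}" for i in range(10))  # pending Coffee Ordered events

class Barista:
    """Toy model of microservice B with limited brewing capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_progress = []

    def ready(self):
        return len(self.in_progress) < self.capacity

    def pick_up(self, queue):
        # Pull work only while there is free capacity; the rest stays queued.
        while self.ready() and queue:
            self.in_progress.append(queue.popleft())

barista = Barista(capacity=2)
barista.pick_up(orders)
# Two orders are in progress; the other eight wait safely in the queue.
```

Scaling horizontally then simply means starting more Barista instances, each pulling from the same queue at its own pace.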

Being able to scale elastically must be designed into the services. We won’t go into further details here; refer to this document for more information.

Dealing with late-comers

Things are clear as long as our system landscape is up and running. If a coffee is ordered, we inform the world about this fact and things keep moving.

But what if a new service starts or an existing one re-starts? How would that service know which coffees were ordered? How would that service know about the state of the world?

One way could be to keep all events that were ever emitted. A new service could then read all events from start to finish, after which it would know the state of the world.
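A sketch of this replay idea, assuming the event shapes from above (the history and the orderId field are invented for the example):

```python
# Full event history, kept from the very first event onward.
history = [
    {"eventType": "Coffee Ordered", "payload": {"orderId": 1}},
    {"eventType": "Coffee Ordered", "payload": {"orderId": 2}},
    {"eventType": "Coffee Brewed", "payload": {"orderId": 1}},
]

def rebuild_state(events):
    """Replay all events to derive the current state of the world."""
    open_orders = set()
    for event in events:
        if event["eventType"] == "Coffee Ordered":
            open_orders.add(event["payload"]["orderId"])
        elif event["eventType"] == "Coffee Brewed":
            open_orders.discard(event["payload"]["orderId"])
    return open_orders

# After replaying, the newcomer knows that order 2 still awaits brewing.
```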

Another approach could be to allow newcomers to query the other systems.

“Order system, please tell me about all orders”. Or in tech-speak: GET /orders. Adopting a protocol like AtomPub may be a good idea in this case. Whichever approach we choose, the matter is more complicated than it might seem at first.

The newcomer cannot “just” consume the events.

It must remember (somehow) which events have already led to side effects - e.g., which orders were already served. And depending on the number of events, re-reading every single event might not be feasible.

The point is that we have to deal with this question right from the start. Retrofitting it into a system after the fact can be a daunting task.

Complexity and error handling

We already got an idea that EDAs, for all their advantages, introduce complexity and new error scenarios. In non-EDA systems, errors often surface as exceptions, which can be caught and handled.

Not so if we rely on events. We have a system without a complete call stack.

Suppose service A submitted a Coffee Ordered event, but service B never received it - maybe service B is broken, who knows. Service A has no direct way to check what happened with the order. We could monitor the queues and check for lagging events. We could set up a “Business Probe” which checks whether every order is picked up and served within 5 minutes. And so on.

The point is: this has to be designed into the services, increasing complexity and making debugging harder.

Let’s look at another scenario. What if service B received the same event twice?

Most messaging systems offer at-least-once delivery. In a nutshell, this means the messaging system guarantees the delivery of every message. But due to some complex details that I will gladly skim over here, most systems do not guarantee exactly-once delivery.

This means that our services must be able to deal with duplicates. One way is idempotency: processing the same request multiple times has the same effect as processing it once. We could receive the same order 10 times, but still only serve a single coffee.
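A minimal sketch of an idempotent handler, deduplicating on the eventId introduced above; the in-memory set stands in for state that would live in the service’s database.

```python
brewed = []             # side effects we actually performed
seen_event_ids = set()  # processed events; would be persisted in practice

def handle(event):
    # Ignore duplicates delivered by the at-least-once messaging system.
    if event["eventId"] in seen_event_ids:
        return
    seen_event_ids.add(event["eventId"])
    brewed.append(event["payload"]["product"])

event = {
    "eventId": "31da4a50-06a5-4dec-81b2-9390862bd8d5",
    "eventType": "Coffee Ordered",
    "payload": {"product": "Flat White"},
}
handle(event)
handle(event)  # delivered a second time - brewed only once
```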

Again, our systems must be designed to deal with this. Again increasing complexity.

However, one could argue that EDAs force us to deal with these cases explicitly. What do I mean by that? What does it mean if we call another service using, e.g., HTTP POST and do not get any response?

Some questions could be:

  • did the other service get our request, but fail to answer?
  • was our request processed or not?
  • do we have to resend the request?

But these cases are “hidden” in the code. They are not part of the architecture. Events allow us to make this hidden behavior explicit.

Behavior as a first-class citizen

Finally, let’s look at the impact on design and especially on non-technical people. If we rely on events and architect our systems using events, then we make the processes explicit and understandable.

You do not have to be an engineer to understand what a Coffee Ordered event means.

This allows us to discuss our system’s architecture with people who are not engineers. We pull the usually hidden dynamics of our systems to the surface. This is a huge advantage when discussing changes to processes or use cases.

“Whenever a coffee finishes brewing, a sound should play informing the staff”

We can see the events flying around.

We can extend our systems without major interruptions. We can add new consumers to the systems landscape that react to events and trigger new business processes. These are some of the reasons that methods like Event Storming and Event Modeling are so successful and popular.

Summary

An event-driven architecture is a very powerful way to design a system landscape. It allows us to design truly independent systems that are resilient and elastic.

Designing processes to fit this asynchronous approach is often very beneficial and - using a method like Event Modeling - not as hard as it might seem.

But we have to be aware of the consequences. New patterns emerge. Dealing with errors is different. And we did not even discuss eventual consistency… maybe I will share my thoughts on that at a later time.

Keep in mind that refactoring a landscape into an event-driven one is possible. And it need not be a big bang - we can go one system at a time.

End of the series - finally

This took longer than expected. But here we are. Four typical approaches for dealing with data and service dependencies. There are many more, for sure. But we have to start somewhere.

I started the series a while ago, based on an actual conversation with a client’s architect. There was so much confusion. Everything was either super-easy (“let’s just do replication”) or impossible (“we cannot use events, everything must be super-consistent everywhere”).

Turns out, that nuance matters. Context matters. Nothing is always a good solution.

Here is a TL;DR of the approaches for the lazy reader:

  • Sharing a database: easy to start with but can lead to complex and opaque inter-dependencies between teams and services.
  • Synchronous calls: easy to start with and familiar to most engineers. This can lead to a fragile web of services without any resilience or possibility for graceful degradation.
  • Replication: a good approach for refactoring a landscape into autonomous systems. Data governance may be a challenge, as may the volume of replicated data.
  • Event-driven architecture: a proven and flexible architecture for microservices. This can lead to resilient and elastic landscapes that capture business processes effectively. You must be willing to learn new patterns and rethink your design. Especially error handling requires some thought.

I cannot state a clear winner. As mentioned above, context matters. What works for one project might not work for another.

Further material and things to check out

More on event-sourcing, but also relevant for EDAs:

On Event-driven architectures:

On system resilience and design:

If you want to go down the rabbit hole of exactly-once-delivery:

And if you have time, I ranted about events and event-sourcing at Devoxx a while back. Watch it here at your own risk.

Please share your experiences and approaches. I am keen to learn about different ways to tackle these problems.