Microservices and the Database Collective

The Prime Directive

So, you've taken the plunge and converted your monolith into discrete, single-purpose microservices. You've established a set of boundaries and service ownership that extends to the databases, so each microservice has its own little database. You've even instituted the CQRS model to separate your write operations from your read operations, and only the service that owns the database can actually make changes to it; all other services are afforded read-only permissions through a carefully constructed query library that only exposes what needs to be exposed. So now, you can breathe a sigh of relief: the days of tightly coupled components are over. You can change Service A with abandon, knowing that you'll never break Service B in the process.

Unless you change the database schema for Service A. Because, you see, that carefully constructed query library that Service A owns is a critical dependency for Service B. Service B can happily query the Service A database without caring anything about the schema--until it changes. Then that query library explodes like a furious kitten and takes down your services with it.
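To make that failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is hypothetical: the `customers` table, the `get_customer_name` library function, and the migration that splits `name` into two columns are stand-ins for whatever your Service A actually owns.

```python
import sqlite3


def create_service_a_db() -> sqlite3.Connection:
    """Hypothetical Service A database with its original schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada Lovelace')")
    return conn


def get_customer_name(conn: sqlite3.Connection, customer_id: int):
    """Hypothetical query-library function that Service B calls.

    It hides the SQL from Service B, but the SQL still names Service A's columns.
    """
    row = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0] if row else None


def migrate_service_a(conn: sqlite3.Connection) -> None:
    """Service A splits `name` into `first_name` and `last_name`."""
    conn.execute(
        "CREATE TABLE customers_new "
        "(id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    # Read every row first, then copy it into the new shape.
    rows = conn.execute("SELECT id, name FROM customers").fetchall()
    for customer_id, name in rows:
        first, _, last = name.partition(" ")
        conn.execute(
            "INSERT INTO customers_new VALUES (?, ?, ?)", (customer_id, first, last)
        )
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_new RENAME TO customers")
```

Before `migrate_service_a` runs, `get_customer_name(conn, 1)` returns the name. Afterward, the same call raises `sqlite3.OperationalError` because the `name` column no longer exists, and Service B goes down without a single line of its own code having changed.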

No problem, you say. That's an easy fix! Just update the query library in Service B and redeploy!

Do you see the problem? If Service B must be updated because of a change to Service A, then you have just violated the microservices prime directive: changes made to one service must never interfere with another service.

We are Database. You will be assimilated. Resistance is futile.

All development shops have their version of The Database. This is where business-critical information is stored, on which not just one but multiple components of the software rely. You may have multiple databases already, but you know which one is The Database: it's the one that causes all systems to halt when something breaks or, worse, changes.

It's true that breaking up The Database into smaller, bounded databases relieves some of the pain, but no matter what you do, you're still going to have your mission-critical data somewhere, and it's very likely that almost every microservice in your architecture is going to need to query that data at some point in one way or another.

When Martin Fowler and Greg Young describe CQRS, they speak of the concept of separating the responsibility of write operations from read operations. The core benefit of CQRS for them is the ability to scale the write operations independently from the read operations. Thus, although they mention having separate data models and possibly separate databases for the read and write operations, it's not fundamental to the pattern to have the data stored separately.

When Udi Dahan describes CQRS, he does so in the specific context of service boundaries. He discusses the speed and convenience of having your data stored separately in exactly the format that your consuming service needs it to be in, eliminating the need to transform it.

The Data Neutral Zone

In Udi's blog post about CQRS, he is diplomatic about where you should put this read-only store. It could be another table in the master database; it could be an in-memory cache; it could be a document database; it could even be a stored procedure or view or code-based transformation. This last option is essentially what your query library is probably doing for you, which is great, but we've already identified that it is fragile if the master database schema changes. Thus, I will assert that a view, stored procedure, or code map is not a viable solution. By having the read-only data stored separately from the master schema, you can altogether avoid breaking the services that rely on it when you need to update the master schema.

You might be tempted to simply create versioned views or versioned stored procedures in the master database and call that the solution. And it very well might be a decent solution; after all, you're preserving the format of the read-only data for the services that are not yet updated, and you're providing the new format for anything that cares to know about it. One thing haunts me about this solution, however, and, in fact, it haunts me about the query library, as well: who owns that format? Which service does it belong to? Sure, the data belongs to the service that owns the database, but if that service never uses that query format, then does it really own it? Or does the consuming service own it? If the answer is the latter (which I think it is), then what business does that view or stored procedure have in the master database to begin with, let alone several versions of it? I argue that the consuming service should be responsible for determining its own query formats because it knows what it needs, and it shouldn't have to work around the constraints of the master database to get it.
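For reference, the versioned-view approach might look something like the sketch below, again using sqlite3. The view names `customers_v1` and `customers_v2` and the table layout are assumptions made up for illustration. Notice that both views live in the master database, even though neither of them exists for Service A's benefit.

```python
import sqlite3


def build_master_db() -> sqlite3.Connection:
    """Hypothetical master database after Service A's schema change."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'Lovelace')")

    # v1 preserves the old single-column shape for services not yet updated.
    conn.execute(
        "CREATE VIEW customers_v1 AS "
        "SELECT id, first_name || ' ' || last_name AS name FROM customers"
    )
    # v2 exposes the new shape for any service that cares to use it.
    conn.execute(
        "CREATE VIEW customers_v2 AS "
        "SELECT id, first_name, last_name FROM customers"
    )
    return conn
```

An older consumer keeps running `SELECT name FROM customers_v1`, while a newer one reads from `customers_v2`. It works, but every new consumer format adds another object to a database that the consumer does not own.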

You might still want to keep the query store in a central place because more than one service queries the data in the same way. At least, for now. That might change in the future; Service C might need a slightly different format. No big deal, just create a new version of the store. Now who owns it? It gets more complicated when you start thinking about what to name these query stores. If you simply name them based on the data or format, you'll quickly lose track of which services rely on which store, making them more and more difficult to maintain and later prune. And your highly tuned sense of service boundaries should be highly offended by the idea of naming a data store to reflect the service(s) that use it. And anyway, you should know by now that reuse is a fallacy, even when it comes to your database. There's really no such thing as a data neutral zone--every service is eventually going to need something different, and there will be conflicts.

There Are Four Databases

You've already agreed at the beginning of this journey that each service should have its own database that it is responsible for updating. The next logical step is to let each service have its own database that it is responsible for querying. It can already freely format and query the data that it owns. Now, let's give it the power to freely format and query the data that it borrows from the other services.

I would hope that you're already using an event-driven or message-based architecture with your microservices, so you're already familiar with the concept of eventual consistency and stale data. Indeed, if your commands are separate from your queries, and you query the data before a command is processed, you'll get stale data. You're used to it, your users are used to it (or, even better, protected from it), and your software is used to it. So your service won't mind if the data it queries from its own copy of the data is a little out of date. It will get there eventually. In fact, you have the power to decide whether it really even matters if the data is up-to-the-minute, up-to-the-hour, or otherwise. In some of Google's data-hungry applications, you don't get accurate and complete data for 24 hours. Yes, you'll have to write or implement some mechanism to copy the data over, but it might be easier than you think. You're probably already publishing events after commands are successfully written. Have your consuming services listen for those and save whatever data the events contain. Or use a scheduled job to copy the data over in bulk.
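Here is a rough sketch of what the event-listening option could look like, assuming a hypothetical `CustomerUpdated` event published by Service A and a consuming order service that only ever needs a shipping label. The event fields, the class name, and the table are all invented for the example, and the message broker is left out entirely: `handle` is simply whatever your subscriber calls when an event arrives.

```python
import sqlite3


class OrderServiceReadStore:
    """Read-only copy of customer data, owned by the consuming service."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        # The table is shaped for this service's queries, not like the master schema.
        self.conn.execute(
            "CREATE TABLE customer_labels "
            "(customer_id INTEGER PRIMARY KEY, shipping_label TEXT)"
        )

    def handle(self, event: dict) -> None:
        """Apply one published event to the local store; ignore unrelated events."""
        if event.get("type") != "CustomerUpdated":
            return
        label = f"{event['last_name'].upper()}, {event['first_name']}"
        # INSERT OR REPLACE makes the handler safe to run again for the same customer.
        self.conn.execute(
            "INSERT OR REPLACE INTO customer_labels (customer_id, shipping_label) "
            "VALUES (?, ?)",
            (event["customer_id"], label),
        )

    def shipping_label(self, customer_id: int):
        """Query the local copy; returns None if no event has arrived yet."""
        row = self.conn.execute(
            "SELECT shipping_label FROM customer_labels WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        return row[0] if row else None
```

The consuming service now depends on the shape of the event rather than on Service A's tables. Service A can rearrange its schema however it likes, as long as it keeps publishing events its listeners understand, and a missing row just means the data hasn't arrived yet.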

But, you say, we have exact copies of the same data in more than one place! I'll quote Udi here: "So?" In all likelihood, it won't be an exact copy, since you'll have taken the time to store it in exactly the format your consuming service needs. Whether this means it is flattened or aggregated, it probably looks a lot different from what you started with. And even if it is exactly the same, what does it matter? Storage is cheap, especially when compared to the alternative: the cost of your time unwinding the mess you made with your spaghetti services. And remember that you don't need to use a relational database to store this data; it can be a document database, an in-memory cache, or whatever works for you--as long as the consuming service retains ownership.

This method of creating data stores that belong to the services that query them nets you a few things. First, you've truly eliminated the dependency on The Database, and you'll never have to redeploy one service because you've changed the schema of another. Second, you've given control to the consuming service to transform and manipulate the data however it needs to without worrying about how it might affect other services, and your developers know exactly which service is using that format without combing through every repository to find a reference to it. Third, you've lightened the load on The Database. Less contention is always a good thing!

So, stand and proudly proclaim, "There are four databases!" Or five. Or six.