8 April 2021

Be aware: AWS Lambda Deployments Fail on Isolation

Microservice VS Monolith

In a time where everything needs to be a Microservice, it’s easy to forget the good old Monolith. Don’t get me wrong, I like a Microservice architecture and its benefits, but it surely also adds complexity to the equation. Worse, I have seen people addicted to splitting applications into smaller pieces for no reason, adding complexity without adding value.

Looking back at the Monolith, some things were definitely a lot easier, deployments for one. In the era of the Monolith, we had one fat artefact with a single version number, which made it easy to tell exactly what was running. In a Microservice architecture, however, things are quite different…

A word on Isolation

If you’re familiar with databases, you’ve most likely come across the term ACID (Atomicity, Consistency, Isolation, Durability). If you’ve never heard of it, I highly recommend reading the whole Wikipedia page.

From Wikipedia:

Isolation: Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.

What does ACID have to do with AWS Lambda?

While ACID is mainly used in the context of databases, its concepts can be applied elsewhere too, to AWS Lambda for example. Here, it’s the “Isolation” property during Lambda deployments that is especially interesting to look at.

At first glance, deploying a Serverless Lambda application seems pretty straightforward. It feels like deploying a single artefact in a single transaction. Don’t be fooled: while Lambdas are being deployed, there is a time window in which old and new versions can run side by side. If you’re unaware of this, a deployment can end up pretty messy.

Deploying a Monolith

Looking back at the Monolith, it’s easy to imagine a deployment as flipping one big switch, making a simple transition from one version to the next in a split second.

One big switch

Deploying a Serverless Application

Compare this with a Serverless application that consists of a fleet of Lambdas. Deploying it is like flipping a collection of little switches instead of a single one. And this is where the shoe pinches: CloudFormation only seems to have one hand, so it can’t flip all the switches at once.

Switchpanel

AWS Lambda deployments: no guarantee on Isolation

It’s easy to see for yourself how CloudFormation handles Isolation during Lambda deployments. If you update a CloudFormation stack that changes multiple Lambdas at once, you’ll see that each function is updated separately, often at slightly different times.
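To make the “collection of little switches” concrete, here is a minimal Python simulation. It is not how CloudFormation works internally; it only assumes that functions are updated one at a time, and the three function names are hypothetical. It records which code versions are live before the deployment and after each function is switched.

```python
from typing import Dict, List, Set


def rolling_update(functions: Dict[str, str], new_version: str) -> List[Set[str]]:
    """Flip one "switch" (function) at a time and return the set of
    code versions that are live before the first flip and after each one."""
    live = dict(functions)  # copy, so the caller's fleet is left untouched
    snapshots = [set(live.values())]
    for name in list(live):
        live[name] = new_version  # this function now runs the new code
        snapshots.append(set(live.values()))
    return snapshots


# Hypothetical fleet: every function runs v1 before the deployment starts.
fleet = {"create-order": "v1", "charge-payment": "v1", "send-email": "v1"}

for step, versions in enumerate(rolling_update(fleet, "v2")):
    print(step, sorted(versions))
```

Every snapshot between the first and the last contains both `v1` and `v2`. That is the window in which, say, an order created by new code can be picked up by old payment code.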

But even if you use other tools, or deploy only a single Lambda, there’s no guarantee of isolation!

From the AWS whitepaper “Serverless Architectures with AWS Lambda”:

Parallel version invocations – Updating an alias to point to a new version of a Lambda function happens asynchronously on the service side. There will be a short period of time that existing function containers containing the previous source code package will continue to be invoked alongside the new function version the alias has been updated to.


Take a minute to let this sink in…

So no, one fat Lambda won’t help you out either 😉
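The same effect inside a single function can be sketched with a toy model. It assumes that each execution environment keeps the code version it started with, that an alias update only affects environments started afterwards, and that requests are spread round-robin over warm environments. Real Lambda routing is more involved, and the class and method names below are hypothetical, not the Lambda API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionEnvironment:
    version: str  # code version loaded when this environment started


@dataclass
class ToyFunction:
    alias_target: str  # version the alias currently points to
    warm: List[ExecutionEnvironment] = field(default_factory=list)

    def update_alias(self, version: str) -> None:
        # Repointing the alias does not touch environments that are already warm.
        self.alias_target = version

    def start_environment(self) -> None:
        # New environments load whatever version the alias points to right now.
        self.warm.append(ExecutionEnvironment(self.alias_target))

    def invoke(self, request_number: int) -> str:
        # Simplified routing: round-robin over the warm environments.
        return self.warm[request_number % len(self.warm)].version


fn = ToyFunction(alias_target="1")
fn.start_environment()
fn.start_environment()
fn.update_alias("2")    # the "deployment"
fn.start_environment()  # traffic grows, a fresh environment starts on version 2
print([fn.invoke(i) for i in range(6)])
```

Even though the alias already points to version 2, most requests in this toy run are still served by version 1 code, because the old environments are still warm.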

Is this a problem?

Is this a big deal? It depends. If the application only handles a few requests a day, chances are small that this issue will ever cause you trouble. However, the more traffic the application processes, the higher the chance that it will hurt you. The number of Lambdas and the number of daily deployments obviously have an impact too.
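A back-of-the-envelope estimate shows why traffic and deployment frequency multiply. The sketch below assumes traffic is spread evenly over the day and that each deployment has a fixed mixed-version window; the 30-second window and the traffic figures are made up for illustration, not numbers AWS publishes.

```python
SECONDS_PER_DAY = 86_400


def requests_in_mixed_window(requests_per_day: float, window_seconds: float,
                             deployments_per_day: float) -> float:
    """Expected number of requests per day that arrive while old and new
    versions run side by side (assumes evenly spread traffic)."""
    requests_per_second = requests_per_day / SECONDS_PER_DAY
    return requests_per_second * window_seconds * deployments_per_day


# Hypothetical inputs: a quiet internal tool vs. a busy public API.
quiet = requests_in_mixed_window(requests_per_day=5, window_seconds=30, deployments_per_day=1)
busy = requests_in_mixed_window(requests_per_day=10_000_000, window_seconds=30, deployments_per_day=10)
print(f"quiet: {quiet:.4f} requests/day, busy: {busy:,.0f} requests/day")
```

Since each function in a stack is updated separately, a larger fleet of Lambdas will likely stretch `window_seconds` as well.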

Another downside of Isolation failures is that they are often hard to reproduce, and the problems they cause are hard to trace back and pinpoint.

Still, this shouldn’t scare you for the majority of code changes. But I would be very cautious with contract changes that ripple through multiple Lambdas. Take database schema changes, for example: isolation failures there can easily result in corrupted data or data loss.
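Here is what such a contract change can look like during the mixed-version window. In this hypothetical example (the field names are made up), a new producer version renames `amount` to `amount_cents`, while a consumer that has not been updated yet still expects the old field. A “tolerant reader” that accepts both shapes survives the transition.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def consumer_v1(record: Record) -> int:
    # Old consumer code: only knows the original field name.
    return record["amount"]


def consumer_tolerant(record: Record) -> int:
    # Backward-compatible consumer: prefer the new field, fall back to the old one.
    if "amount_cents" in record:
        return record["amount_cents"]
    return record["amount"]


old_record: Record = {"order_id": "o-1", "amount": 1050}        # written by producer v1
new_record: Record = {"order_id": "o-2", "amount_cents": 1050}  # written by producer v2

try:
    consumer_v1(new_record)
except KeyError as missing:
    print("old consumer breaks on the new shape, missing field:", missing)

print(consumer_tolerant(old_record), consumer_tolerant(new_record))
```

The old consumer fails loudly here; with a database write instead of a read, an old version could just as well silently drop or overwrite the new field.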

How to tackle this problem?

AWS suggests a remedy in the same paragraph on “Parallel version invocations”:

… It’s important that your application continues to operate as expected during this process. An artefact of this might be that any stack dependencies being decommissioned after a deployment (for example, database tables, a message queue, etc.) not be decommissioned until after you’ve observed all invocations targeting the new function version.

In plain language: design your Lambdas to be backward compatible and to support rolling updates. This also argues for small increments. I know, that’s easier said than done.
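The AWS advice about not decommissioning dependencies too early can be turned into an explicit check. A minimal sketch, assuming you can collect (from logs or metrics) which version actually executed each invocation; the record format, the function name and the quiet-period rule are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Hypothetical record: (time of the invocation, version that actually executed it)
Invocation = Tuple[datetime, str]


def safe_to_decommission(invocations: Iterable[Invocation], old_version: str,
                         new_version: str, now: datetime,
                         quiet_period: timedelta) -> bool:
    """Only allow removing resources the old version depends on when, during
    the whole quiet period, the new version did run and the old one did not."""
    cutoff = now - quiet_period
    recent = {version for timestamp, version in invocations if timestamp >= cutoff}
    return new_version in recent and old_version not in recent


now = datetime(2021, 4, 8, 12, 0)
log = [
    (now - timedelta(minutes=30), "1"),  # old code, before the deployment settled
    (now - timedelta(minutes=5), "2"),
    (now - timedelta(minutes=1), "2"),
]
print(safe_to_decommission(log, "1", "2", now, timedelta(minutes=15)))  # window excludes the v1 call
print(safe_to_decommission(log, "1", "2", now, timedelta(minutes=60)))  # window includes the v1 call
```

How long the quiet period should be is a judgment call; the longer it is, the more confident you can be that no warm environment is still running the old code.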

So always keep backward compatibility in mind when building Serverless applications with AWS Lambda, and challenge yourself on every big change to contracts and data structures!
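One common way to make such contract changes in small, backward-compatible increments is the expand/contract pattern (also known as parallel change). The sketch below uses a hypothetical rename of an `amount` field to `amount_cents`: first the producer writes both fields (expand), then consumers switch to the new field in a separate deployment, and only when the old consumer version is no longer invoked is the old field dropped (contract).

```python
from typing import Any, Dict

Record = Dict[str, Any]


def producer_expand(order_id: str, amount_cents: int) -> Record:
    # Step 1 (expand): write the new field AND keep the old one,
    # so consumers still running the previous version keep working.
    return {"order_id": order_id, "amount": amount_cents, "amount_cents": amount_cents}


def producer_contract(order_id: str, amount_cents: int) -> Record:
    # Step 3 (contract): drop the old field, only after every consumer reads
    # "amount_cents" and the old consumer version is no longer invoked.
    return {"order_id": order_id, "amount_cents": amount_cents}


def old_consumer(record: Record) -> int:
    return record["amount"]


def new_consumer(record: Record) -> int:
    # Step 2: deployed separately, after the expanded producer is live everywhere.
    return record["amount_cents"]


expanded = producer_expand("o-3", 1050)
print(old_consumer(expanded), new_consumer(expanded))  # both versions can read it
print(new_consumer(producer_contract("o-4", 1050)))
```

Each step is a small, separately deployed increment, so at any moment the versions that may run side by side still agree on a shared contract.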

Enjoy and until next time!