Many workloads nowadays involve multiple systems that operate concurrently: microservice fleets, workflow orchestrators, CI/CD pipelines. Sometimes it's important to coordinate these systems so that concurrent operations don't step on each other. One way to do that is by using distributed locks, which work across multiple systems.
Distributed locks used to require complex algorithms or complex-to-operate infrastructure, making them expensive both in cost and in upkeep. With the emergence of fully managed, serverless cloud systems, this has changed.
In this post I’ll look into a distributed locking algorithm based on Google Cloud. I’ll discuss several existing implementations and suggest algorithmic improvements in terms of performance and robustness.
Update: there is now a Ruby implementation of this algorithm!
Distributed locks are useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system modifies the state at a time.
A good example is Terraform. When you store the Terraform state in the cloud and run multiple Terraform instances concurrently, Terraform guarantees that only one instance can modify the infrastructure at a time. This is done through a distributed lock. In contrast to a regular (local system) lock, a distributed lock works across multiple systems, so even if you run two Terraform instances on two different machines, Terraform still protects you from concurrent modifications.
More generally, distributed locks are useful for ad-hoc system/cloud automation scripts and CI/CD pipelines. Sometimes you want your script or pipeline to perform non-trivial modifications that take many steps. It can easily happen that multiple instances of the script or pipeline are run. When that happens, you don’t want those multiple instances to perform the modification at the same time, because that can corrupt things. You can use a distributed lock to make concurrent runs safe.
Here's a concrete example involving a CI/CD pipeline. Fullstaq Ruby had an APT and YUM repository hosted on Bintray. A few months ago, Bintray announced that it would shut down in the near future, so we had to migrate to a different solution. We chose to self-host our APT and YUM repository on a cloud object store.
APT and YUM repositories consist of a bunch of .deb and .rpm packages, plus a bunch of metadata. Package updates are published through Fullstaq Ruby's CI/CD system, which directly modifies multiple files on the cloud object store. We want this publication process to be concurrency-safe, because if we commit too quickly then multiple CI/CD runs may occur at the same time. The easiest way to achieve this is with a distributed lock, so that only one CI/CD run may operate on the cloud object bucket at a time.
Distributed locks used to be hard to implement. In the past they required complicated consensus protocols such as Paxos or Raft, as well as the hassle of hosting yet another service. See Distributed lock manager.
More recently, people started implementing distributed locks on top of other distributed systems, such as transactional databases and Redis. This significantly reduced algorithmic complexity, but operational complexity remained significant. A big issue is that these systems aren't “serverless”: operating and maintaining a database or Redis instance is not cheap. It's not cheap in terms of effort, and it's not cheap in terms of costs: you pay for a database/Redis instance based on its uptime, not based on how many operations you perform.
Luckily, there are many cloud systems nowadays that not only provide the building blocks necessary for a distributed lock, but are also fully managed and serverless. Google Cloud Storage is a great system to build a distributed lock on: it's cheap, popular, highly available and maintenance-free. You only pay for the number of operations you perform on it.
One of the problems that distributed locking algorithms need to solve is communication between participants. Distributed systems may run in different networks that aren't directly connected.
Another problem is that of concurrency control, which is made difficult by communication lag. If two participants request ownership of a lock simultaneously, then we want both of them to agree on a single outcome, even though it takes time for each participant to hear from the other.
Finally, there is the problem of state consistency. When you write to a storage system, then next time you read from that system you want to read what you just wrote. This is called strong consistency. Some storage systems are eventually consistent, which means that it takes a while before you read what you just wrote. Storage systems that are eventually consistent are not suitable for implementing distributed locks.
This is why we leverage Google Cloud Storage as both a communication channel and as a “referee”. Everyone can connect to Cloud Storage, and access control is simple and well-understood. Cloud Storage is also strongly consistent and has concurrency control features. The latter allows Cloud Storage to make a single, final decision in case two participants want to take ownership of the lock simultaneously.
Every Cloud Storage object has two separate generation numbers: the generation, which changes every time the object's data is overwritten, and the metageneration, which changes every time the object's metadata is modified.
When you perform a modification operation, you can use the x-goog-if-generation-match/x-goog-if-metageneration-match headers in the Cloud Storage API to say: “only perform this operation if the generation/metageneration equals this value”. Cloud Storage guarantees that this effect is atomic and free of race conditions. These headers are called precondition headers.
The special value 0 for x-goog-if-generation-match means “only perform this operation if the object does not exist”.
This feature — the ability to specify preconditions to operations — is key to concurrency control.
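To make this concrete, here's a minimal sketch using the official Python client library for Cloud Storage. The if_generation_match keyword argument corresponds to the x-goog-if-generation-match header; the bucket and object names are made up.

```python
from google.api_core import exceptions
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("locks/my-lock")  # hypothetical names

try:
    # Only create the object if it doesn't exist yet
    # (equivalent to sending x-goog-if-generation-match: 0).
    blob.upload_from_string(b"", if_generation_match=0)
    print("Created; generation =", blob.generation)
except exceptions.PreconditionFailed:
    # HTTP 412: the object already exists, so the precondition failed.
    print("Object already exists")
```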
Several implementations of a distributed lock based on Google Cloud Storage already exist. A prominent one is gcslock by Marc Cohen, who works at Google. Gcslock leverages the x-goog-if-generation-match header, as described in the previous section. Its algorithm is simple, as we’ll discuss in the next section.
Most other implementations, such as gcs-mutex-lock and gcslock-ruby, use the gcslock algorithm with only minor adaptations.
I’ve been able to find one implementation that’s significantly different and more advanced: HashiCorp Vault’s leader election algorithm. Though it’s not functionally meant to be used as a lock, technically it boils down to a lock. We’ll discuss this algorithm in a later section.
The gcslock algorithm is as follows:
1. To take the lock: create the lock object, specifying the x-goog-if-generation-match: 0 header. If this fails with 412 Precondition Failed, then the lock is already owned by someone else; wait and retry (with exponential backoff) until creation succeeds.
2. To release the lock: delete the lock object.
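A minimal sketch of this algorithm in Python (using the same client library as above; the function names and backoff parameters are my own):

```python
import time

from google.api_core import exceptions
from google.cloud import storage

def take_lock(blob, max_backoff=64.0):
    """Spin until we manage to create the lock object."""
    backoff = 1.0
    while True:
        try:
            blob.upload_from_string(b"", if_generation_match=0)
            return  # We now own the lock.
        except exceptions.PreconditionFailed:
            # Someone else holds the lock; wait and try again.
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)

def release_lock(blob):
    blob.delete()

blob = storage.Client().bucket("my-bucket").blob("locks/my-lock")
take_lock(blob)
try:
    ...  # perform the protected operation
finally:
    release_lock(blob)
```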
This algorithm is very simple. It is also relatively high-latency, because Cloud Storage's response time is measured in tens to hundreds of milliseconds, and because the algorithm relies on retries with exponential backoff. Relatively high latency may or may not be a problem depending on your use case. It's probably fine for most batch operations, but it's probably unacceptable for applications that require pseudo-realtime responsiveness.
There are bigger issues though:

- There is no protection against stuck locks. If the process that owns the lock crashes, then the lock is never released, and someone has to clean it up manually before anyone else can take the lock again.
- There is no protection against unexpected behavior caused by network delays. For example, if a lock release (deletion) request is delayed and retried, it could end up deleting the lock after another process has already taken ownership of it.

This sort of network-delay-based problem is even documented in the Cloud Storage documentation as a potential risk.
One way to avoid stuck locks left behind by crashing processes, is by considering locks to be stale if they are “too old”. We can use the timestamps that Cloud Storage manages, which change every time an object is modified.
What should be considered “too old” really depends on the specific operation. So this should be a configurable parameter, which we call the time-to-live (TTL).
What’s more, the same TTL value should be agreed upon by all processes. Otherwise we’ll risk that a process thinks the lock is stuck even though the owner thinks it isn’t. One way to ensure that all processes agree on the same TTL is by configuring them with the same TTL value, but this approach is error-prone. A better way is to store the TTL value into the lock object.
Here’s the updated locking algorithm:
1. Try to create the lock object, storing the TTL in its metadata and specifying the x-goog-if-generation-match: 0 header. If this succeeds, we own the lock.
2. If creation fails with 412 Precondition Failed, fetch the lock object's metadata and check whether the lock is stale, i.e. whether its last update timestamp is older than the TTL stored in the object.
3. If the lock is not stale, wait a while and retry from step 1.
4. If the lock is stale, delete it, specifying the x-goog-if-generation-match: [generation] header, then retry from step 1. Specifying this header is important, because if someone else takes the lock concurrently (meaning the lock is no longer stale), then we don't want to delete that.
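Here's a sketch of what one attempt at taking the lock could look like with staleness handling added (Python again; the TTL is stored as custom metadata on the lock object, and the helper name is made up):

```python
from datetime import datetime, timedelta, timezone

from google.api_core import exceptions

def try_take_lock(blob, ttl=300):
    """One attempt at taking the lock. Returns True if we now own it."""
    try:
        blob.metadata = {"ttl": str(ttl)}  # store the TTL inside the lock object
        blob.upload_from_string(b"", if_generation_match=0)
        return True
    except exceptions.PreconditionFailed:
        pass  # The lock object already exists; check whether it's stale.

    try:
        blob.reload()  # fetch its generation, metadata and update timestamp
    except exceptions.NotFound:
        return False  # It was just released; let the caller retry.

    stored_ttl = int((blob.metadata or {}).get("ttl", ttl))
    age = datetime.now(timezone.utc) - blob.updated
    if age > timedelta(seconds=stored_ttl):
        try:
            # Stale: delete it, but only if nobody touched it in the meantime.
            blob.delete(if_generation_match=blob.generation)
        except (exceptions.PreconditionFailed, exceptions.NotFound):
            pass  # Someone else took over or released the lock concurrently.
    return False  # The caller should wait a bit and try again.
```

The caller would invoke this in a loop with backoff, just like in the simpler algorithm.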
What's a good value for the TTL? If your script calls Cloud Storage through the gsutil CLI, then you should be aware that gsutil takes a few seconds to start, so the TTL should be at least a few tens of seconds.

As a general rule, I'd say that a safe TTL should be in the order of minutes. It should be at least 1 minute. I think a good default is 5 minutes.
If an operation takes longer than the TTL, then another process could take ownership of the lock even though the original owner is still operating. Increasing the TTL addresses this issue somewhat, but this approach has drawbacks:
A better approach is to refresh the object’s update timestamp regularly as long as the operation is still in progress. Keep the TTL relatively short, so that if the process crashes then it won’t take too much time for others to detect the lock as stale.
We implement refreshing via a PATCH object API call. The exact data to patch doesn’t matter: we only care about the fact that Cloud Storage will change the update timestamp.
We call the time between refreshes the refresh interval. A proper value for the refresh interval depends on the TTL. It must be much shorter than the TTL, otherwise refreshing the lock is pointless. Its value should take into consideration that a refresh operation is subject to network delays, or even local CPU scheduling delays.
As a general rule, I recommend a refresh interval that’s at most 1/8th of the TTL. Given a default TTL of 5 minutes, I recommend a default refresh interval of ~37 seconds. This recommendation takes into consideration that refreshes can fail, which we’ll discuss in the next section.
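A background refresher could be sketched like this (Python threading; it reuses the blob object from the earlier sketches, and the metadata key is arbitrary since we only care about Cloud Storage bumping the update timestamp):

```python
import threading
import time

def refresh_once(blob):
    # Patch an arbitrary piece of custom metadata; the point is merely that
    # Cloud Storage updates the object's update timestamp as a side effect.
    blob.metadata = {**(blob.metadata or {}), "refreshed_at": str(time.time())}
    blob.patch()

def start_refresher(blob, refresh_interval, stop_event):
    def loop():
        # stop_event.wait() returns True once a release is requested.
        while not stop_event.wait(refresh_interval):
            refresh_once(blob)  # failure handling is discussed below
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```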
Refreshing the lock can fail. There are two failure categories: temporary failures caused by network or Cloud Storage problems, and failures that indicate that the lock was unexpectedly modified, deleted or taken over by someone else (more on detecting those later).

How should we respond to refresh failures?
Upon encountering network problems, there’s a chance that the failure is just temporary. So we should retry a couple of times. Only if retrying fails too many times consecutively do we abort the operation.
I think retrying 2 times (so 3 tries in total) is reasonable. In order to abort way before the TTL expires, the refresh interval must be shorter than 1/3rd of the TTL.
When we conclude that we should abort the operation, we declare that the lock is in an unhealthy state.
Aborting should happen in a manner that leaves the system in a consistent state. Furthermore, aborting takes time, so it should be initiated way before the TTL expires, and it’s also another reason why in the previous section I recommended a refresh interval of 1/8th of the TTL.
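Building on the refresher sketch above, the retry-then-abort policy could look like this (a sketch; on_unhealthy is a hypothetical callback that triggers the abort):

```python
MAX_CONSECUTIVE_FAILURES = 3  # 2 retries, i.e. 3 tries in total

def refresher_loop(blob, refresh_interval, stop_event, on_unhealthy):
    consecutive_failures = 0
    while not stop_event.wait(refresh_interval):
        try:
            refresh_once(blob)        # from the previous sketch
            consecutive_failures = 0  # a success resets the counter
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                # Too many consecutive failures: declare the lock unhealthy
                # and abort well before the TTL expires.
                on_unhealthy()
                return
```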
Aborting the operation could itself fail, for example because of network problems. This may leave the system in an inconsistent state. There are ways to deal with this issue, for example by designing the operation to be idempotent so that a later run can restore consistency, or by alerting a human operator to intervene.
The lock could be released, or its ownership could change, at any time, either because of a faulty process or because of an unexpected administrator operation. While such things shouldn't happen, it's still a good idea to be able to handle them somehow.
When these things happen, we also say that the lock is in an unhealthy state.
We make the following changes to the algorithm:
- When refreshing the lock, specify the x-goog-if-generation-match: <last known generation number> header. If refreshing fails with 412 Precondition Failed, then the lock was unexpectedly modified, deleted or taken over by someone else, so we declare the lock unhealthy.
- When releasing the lock, also specify the x-goog-if-generation-match: <last known generation number> header, so that we're sure we're releasing the lock we owned and not one that was taken over by another process. We can ignore any 412 Precondition Failed errors.
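In Python terms, the two changes could be sketched as follows (amending the earlier refresh and release helpers; LockUnhealthy is a made-up exception name, and the keyword arguments again stand in for the precondition headers):

```python
import time

from google.api_core import exceptions

class LockUnhealthy(Exception):
    pass

def refresh_once(blob):
    blob.metadata = {**(blob.metadata or {}), "refreshed_at": str(time.time())}
    try:
        # Only refresh the lock we think we own.
        blob.patch(if_generation_match=blob.generation)
    except exceptions.PreconditionFailed:
        raise LockUnhealthy("lock was modified or taken over by someone else")

def release_lock(blob):
    try:
        # Only delete the lock we owned, not one taken over by another process.
        blob.delete(if_generation_match=blob.generation)
    except (exceptions.PreconditionFailed, exceptions.NotFound):
        pass  # Taken over or already gone; nothing for us to delete.
```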
HashiCorp Vault is a secrets management system. Its high availability setup involves leader election. This is done by taking ownership of a distributed lock. The instance that succeeds in taking ownership is considered the leader.
The leader election algorithm is implemented in physical/gcs/gcs_ha.go and was originally written by Seth Vargo at Google. This algorithm was also discussed by Ahmet Alp Balkan at the Google Cloud blog.
Here are the similarities between Vault’s algorithm and what we’ve discussed so far:
Notable differences:
If Vault’s leader election system crashes non-fatally (e.g. it detected an unhealthy lock, aborted, then tried again later from the same Vault instance), and the lock hasn’t been taken over by another Vault instance at the same time, then Vault is able to retake the lock instantly.
In contrast, our approach so far requires waiting until the lock becomes stale per the TTL.
I think points 3, 4 and 6 are worth learning from.
HashiCorp Vault’s ability to retake the lock instantly after a non-fatal crash is worthy of further discussion. It’s a desirable feature, but what are the implications?
Upon closer inspection, we see that this feature works by assigning an identity to the lock object. This identity is a random string that’s generated during Vault startup. When Vault attempts to take a lock, it checks whether the object already exists and whether its identity equals the Vault instance’s own identity. If so, then Vault concludes that it’s safe to retake the lock immediately.
This identity string must be chosen with some care, because it affects the level of mutual exclusion. Vault generates a random identity string that's unique on a per-Vault-instance basis. This results in the lock being multi-process safe, but — perhaps counter-intuitively — not thread-safe!
We can make the lock object thread-safe by including the thread ID in the identity as well. The tradeoff is that an abandoned lock can only be quickly recovered by the same thread that abandoned it in the first place. All other threads still have to wait for the TTL timeout.
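For example, an identity that's unique per host, per process and per thread could be constructed like this (a sketch; the exact format is arbitrary):

```python
import os
import socket
import threading
import uuid

# Generated once per process, similar to what Vault does at startup.
_PROCESS_IDENTITY = "%s-%d-%s" % (socket.gethostname(), os.getpid(), uuid.uuid4().hex)

def current_identity():
    # Appending the thread ID makes the lock thread-safe as well: another
    # thread in the same process won't match this identity, so it can't
    # accidentally retake a lock that belongs to a different thread.
    return "%s-%d" % (_PROCESS_IDENTITY, threading.get_ident())
```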
In the next section we’ll put together everything we’ve discussed and learned so far.
Taking the lock

Parameters: the TTL, and the identity (a string that includes both a process-specific and a thread-specific component, as discussed above).

Steps:

1. Create the lock object, storing the TTL and our identity in its metadata, using the x-goog-if-generation-match: 0 header. If this succeeds, we own the lock: remember the object's metageneration and start refreshing.
2. If creation fails with 412 Precondition Failed, fetch the lock object's metadata, then:
   - If the object's identity equals our own identity, retake the lock immediately by updating it using the x-goog-if-metageneration-match: [metageneration] header.
   - If the lock is stale (its update timestamp is older than the TTL stored in the object), delete it using the x-goog-if-metageneration-match: [metageneration] header (ignoring 412 errors), then retry from step 1.
   - Otherwise, wait a while and retry from step 1.

Releasing the lock

Parameters: the last known metageneration.

Steps:

1. Stop refreshing the lock.
2. Delete the lock object using the x-goog-if-metageneration-match: [last known metageneration] header, so that we only delete a lock we own, not one that was taken over by another process. Ignore any 412 Precondition Failed errors.

Refreshing the lock

Parameters: the refresh interval, the maximum number of consecutive refresh failures, and the last known metageneration.

Every refresh_interval seconds (until a lock release is requested, or until an unhealthy state is detected):

1. Update the lock object's metadata (a PATCH call) using the x-goog-if-metageneration-match: [last known metageneration] header, and remember the new metageneration.
2. If this fails with 412 Precondition Failed (or the object no longer exists), then the lock was modified, deleted or taken over by someone else: declare it unhealthy.
3. If this fails because of network or server problems, count it as a consecutive failure. After 3 consecutive failed attempts, declare the lock unhealthy and abort the operation.
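To make these steps more concrete, here's a condensed sketch in Python of the lock object handling: identity-based instant retake, stale detection, and metageneration-based preconditions. Class, method and metadata key names are my own; error handling and the take/refresh loops around it are simplified (current_identity is the helper from the earlier sketch).

```python
from datetime import datetime, timedelta, timezone

from google.api_core import exceptions
from google.cloud import storage

class CloudStorageLock:
    def __init__(self, bucket_name, object_name, ttl=300, identity=None):
        self.blob = storage.Client().bucket(bucket_name).blob(object_name)
        self.ttl = ttl
        self.identity = identity or current_identity()

    def try_take(self):
        """One attempt at taking the lock. Returns True if we now own it."""
        try:
            self.blob.metadata = {"ttl": str(self.ttl), "identity": self.identity}
            self.blob.upload_from_string(b"", if_generation_match=0)
            return True
        except exceptions.PreconditionFailed:
            pass  # The lock object already exists.

        try:
            self.blob.reload()
        except exceptions.NotFound:
            return False  # Just released; the caller should retry.

        meta = self.blob.metadata or {}
        if meta.get("identity") == self.identity:
            # We abandoned this lock earlier (e.g. after a non-fatal crash):
            # retake it instantly under a metageneration precondition.
            return self._touch()

        age = datetime.now(timezone.utc) - self.blob.updated
        if age > timedelta(seconds=int(meta.get("ttl", self.ttl))):
            try:
                # Stale: delete it, unless someone touched it in the meantime.
                self.blob.delete(if_metageneration_match=self.blob.metageneration)
            except (exceptions.PreconditionFailed, exceptions.NotFound):
                pass
        return False  # The caller should wait a bit and retry.

    def refresh(self):
        """Returns False if the lock has become unhealthy."""
        return self._touch()

    def release(self):
        try:
            self.blob.delete(if_metageneration_match=self.blob.metageneration)
        except (exceptions.PreconditionFailed, exceptions.NotFound):
            pass  # Taken over or already gone; nothing for us to delete.

    def _touch(self):
        try:
            self.blob.metadata = {
                **(self.blob.metadata or {}),
                "refreshed_at": datetime.now(timezone.utc).isoformat(),
            }
            # A successful patch also updates our stored metageneration.
            self.blob.patch(if_metageneration_match=self.blob.metageneration)
            return True
        except exceptions.PreconditionFailed:
            return False  # Someone else modified or took over the lock.
```

Taking the lock is then a loop that calls try_take() with backoff until it succeeds, a refresher thread calls refresh() every refresh interval (declaring the lock unhealthy after too many consecutive failures), and release() is called when the operation finishes.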
Distributed locks are very useful for ad-hoc system/cloud automation scripts and CI/CD pipelines. More generally, they're useful in any situation in which multiple systems may operate on the same state concurrently. Concurrent modifications may corrupt the state, so one needs a mechanism to ensure that only one system modifies the state at a time.
Google Cloud Storage is a good system to build a distributed lock on, as long as you don’t care about latency that much. By leveraging Cloud Storage’s capabilities, we can build a robust distributed locking algorithm that’s not too complex. What’s more: it’s cheap to operate, cheap to maintain, and can be used from almost anywhere.
The distributed locking algorithm proposed by this article builds upon existing algorithms found in other systems, and makes locking more robust.
Eager to use this algorithm in your next system or pipeline? Check out the Ruby implementation. In the near future I also plan on releasing implementations in other languages.
Originally published on joyfulbikeshedding.com