Talos is a minimal Kubernetes OS that's quickly gaining popularity because of its ease of use and strong focus on security by default. It has already been deployed in production by a significant number of companies and has reduced cost, maintenance time and operational complexity for those that use it. In this article, we'll dive into the use cases of Talos and when you should or shouldn't use it.
"Talos Linux is Linux designed for Kubernetes - secure, immutable and minimal."
In other words, Talos is designed from the ground up to run Kubernetes. Stripping away unnecessary components streamlines the Kubernetes experience and simplifies maintenance and setup through straightforward API calls. It was first released as an alpha in 2018.
It’s an open-source project from Sidero Labs. They also offer Sidero Omni, a service with a graphical user interface for managing clusters and the machines within them. This allows completely hands-off provisioning of worker nodes.
It is already used for maintaining Kubernetes clusters at companies like Equinix and Nokia. It has enabled them to reduce operational costs (in money and time) and complexity in their environments.
Consider using Talos (or maybe a different immutable k8s OS) whenever you are managing Kubernetes clusters yourself, including the OS running the cluster. It makes it significantly easier to run and manage Kubernetes, especially if you are running on bare metal. It eliminates all host-level dependencies and the operational costs of maintaining a full operating system. Talos forces you to think about your hosts as cattle, never as pets. This might be annoying to start with, but in the end it forces you to engineer your applications in a more cloud-native way, which also results in more stable deployments.
Talos may not be a good fit if your applications cannot handle data loss and you don’t have a good backup strategy. By default, Talos wipes the EPHEMERAL partition during upgrades unless instructed to preserve it, which could lead to data loss if you're not careful.
Some users, like data scientists and machine learning engineers, prefer more control over the host so they can tune low-level settings and squeeze the most performance out of the hardware.
If your organization only allows the use of licensed or supported operating systems like Red Hat Enterprise Linux or SUSE Linux Enterprise Server, using a custom OS like Talos Linux can be a hard “no”.
It can look daunting to start with Talos in comparison to running a regular Linux distribution with a Kubernetes cluster on top. But because Talos removes a lot of the moving parts underneath Kubernetes, it also removes quite a bit of operational overhead. The immutable nature of Talos makes maintenance simpler, and reasoning about the state of your cluster becomes trivial because there can be no configuration drift.
The root filesystem on Talos is mounted read-only and all host-level packages like shells and SSH are removed. It runs entirely from an in-memory SquashFS without persisting any changes to the OS itself, which means that every reboot is a clean start.
OS upgrades are handled by an API call and will, by default, wipe any storage in the EPHEMERAL partition. Talos uses an A-B image scheme: the update is first installed separately from the running image, the node then reboots into the B image, and if that fails it rolls back to the A image.
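The A-B scheme described above can be illustrated with a small toy model. This is only a sketch of the idea, not Talos code: the `Node` class, the `upgrade` function and the `boots_ok` callback are hypothetical names invented for this example, and the version strings are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """Toy model of a node with two image slots (hypothetical, not Talos code)."""
    slots: dict[str, str]   # slot name -> installed OS version
    active: str = "A"       # slot the node is currently booted from

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"


def upgrade(node: Node, new_version: str, boots_ok: Callable[[str], bool]) -> bool:
    """Install new_version into the inactive slot and try to boot it.

    Returns True if the node ends up running new_version, False if it
    rolled back to the previously active slot.
    """
    previous = node.active
    target = node.inactive()

    # 1. Install next to the running image; the active slot is untouched.
    node.slots[target] = new_version

    # 2. Reboot into the freshly installed slot.
    node.active = target

    # 3. If the new image fails to boot, fall back to the old slot.
    if not boots_ok(new_version):
        node.active = previous
        return False
    return True


node = Node(slots={"A": "v1.0", "B": ""})
upgrade(node, "v1.1", boots_ok=lambda v: True)           # succeeds, node now runs slot B
upgrade(node, "v1.2-broken", boots_ok=lambda v: False)   # fails, node stays on slot B
print(node.active, node.slots[node.active])              # prints: B v1.1
```

The point of the sketch is that a failed upgrade never touches the image the node was last running successfully, so there is always a known-good slot to fall back to.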
It is still possible to use disks mounted on the nodes for storage. Depending on the workload, you could use distributed storage like Rook Ceph, distributed object storage like MinIO or native clustering in Postgres with Postgres-operator.
Or, if you are running in the cloud, you can use any of the CSI drivers offered by the cloud provider, or go fully cloud-native and make sure you don’t need any local storage.
If your workloads are fully based on Kubernetes, Talos makes it a lot easier to manage your cluster and its upgrades than having to manage a full OS plus upgrades to k8s itself. A mutable OS that can break when installing upgrades, without a rollback strategy, can leave your cluster in a broken state and cost you many hours of debugging. Fixing a broken Talos node should be as simple as removing it from the cluster and adding a new node, or issuing a talosctl rollback command to roll back the affected nodes.
Setting up a highly available k8s cluster with Talos is trivial, while K3s makes it a bit more complicated. Talos also forces you to create configuration files that describe your cluster beforehand, which, if stored correctly, allows you to easily modify and update your cluster.
K3s, on the other hand, can be installed with a single curl command, but that command becomes more complicated once you want to change options. If you don’t store the command somewhere, upgrades can become a nightmare, because you no longer know which options were used when the cluster was created.
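One way to avoid losing track of the install options is to keep them in version control and generate the command from them. The sketch below is just one possible approach: `K3S_INSTALL` and `render_install_command` are hypothetical names made up for this example, and the version and flags are example values rather than recommendations. It relies on the `INSTALL_K3S_VERSION` and `INSTALL_K3S_EXEC` environment variables read by the K3s install script.

```python
import shlex

# Hypothetical install settings, kept in version control next to the rest of
# your cluster configuration. The values are examples only.
K3S_INSTALL = {
    "version": "v1.30.0+k3s1",
    "server_args": ["server", "--cluster-init", "--disable", "traefik"],
}


def render_install_command(settings: dict) -> str:
    """Build the K3s install one-liner from the stored settings."""
    env = {
        "INSTALL_K3S_VERSION": settings["version"],
        "INSTALL_K3S_EXEC": " ".join(settings["server_args"]),
    }
    # Quote each value so arguments containing spaces survive the shell.
    env_part = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"curl -sfL https://get.k3s.io | {env_part} sh -"


print(render_install_command(K3S_INSTALL))
```

Because the settings live in a file, the exact options used for a cluster can be looked up (and reviewed) long after the original command has scrolled out of your shell history.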
If you have a mixed workload where some services run as systemd units and others as k8s deployments, and your applications are not ready to run in a more cloud-native way, you are probably better off running K3s on your existing nodes.
If you just want a quick and dirty way of starting a k8s cluster, K3s makes it trivial and allows you to start using it within a couple of minutes.
If your company has hard requirements on the operating system used, running K3s on top of that provided OS might be inevitable.
Besides Talos, there are several options available, including the cloud-provided Kubernetes offerings like GKE, EKS and AKS. These might be more attractive if you are already locked into a certain cloud provider, or if you want to externalize your operational overhead.
Other options include k3OS (now deprecated), Elemental (by Rancher) and Kairos. Elemental might be more attractive to engineers who prefer a graphical user interface, because you can use it with the Rancher Manager interface.
Kairos allows you to choose (or build) your own underlying operating system image, which gives you more manual control over certain parameters.
If you’re looking for a way to make it easier to manage your Kubernetes cluster(s), you should really consider Talos. Its immutable nature and API-driven management make it trivial to manage and reason about your cluster.
To quickly try it locally, take a look at the Talos Quickstart.