Running large numbers of containers to deploy an application requires a rethink of the role of the operating system.
Google’s Container-Optimised OS and AWS’s Bottlerocket take the traditional virtualisation paradigm and apply it to the operating system, with containers the virtual OS and a minimal Linux fulfilling the role of the hypervisor.
Various flavours of Linux optimised for containers have been around for a few years and have evolved ever smaller footprints as the management and user-land utilities moved to the cluster management layer or to containers.
These container-optimised operating systems are ideal when you need to run applications in Kubernetes with minimal setup and do not want to worry about security or updates, or want OS support from your cloud provider.
Container OSs solve several issues commonly encountered when running large container clusters, such as keeping up with OS vulnerabilities and patching potentially hundreds of instances, updating packages while dealing with potentially conflicting dependencies, degraded performance from a large dependency tree, and other OS headaches.
The job is challenging enough with a few racks of servers and nearly impossible without infrastructure support when managing thousands.
Bottlerocket is purpose-built for hosting containers in Amazon infrastructure. It runs natively in Amazon Elastic Kubernetes Service (EKS), AWS Fargate, and Amazon Elastic Container Service (ECS).
Bottlerocket is essentially a Linux 5.4 kernel with just enough added from the user-land utilities to run containers. Written primarily in Rust, Bottlerocket is optimised for running both Docker and Open Container Initiative (OCI) images. There’s nothing that limits Bottlerocket to EKS, Fargate, ECS, or even AWS. Bottlerocket is a self-contained container OS and will be familiar to anyone using Red Hat flavours of Linux.
Bottlerocket integrates with container orchestrators such as Amazon EKS to manage updates, and support for other orchestrators can be added by building variants of the operating system that include the necessary orchestration agents or custom components.
Bottlerocket’s approach to security is to minimise the attack surface to protect against outside attackers, minimise the impact that a vulnerability would have on the system, and provide inter-container isolation.
To isolate containers from one another, Bottlerocket uses control groups (cgroups) and kernel namespaces. eBPF (extended Berkeley Packet Filter) is used to further isolate containers and to verify container code that requires low-level system access.
The eBPF secure mode prohibits pointer arithmetic, traces I/O, and restricts the kernel functions the container has access to.
The attack surface is reduced by running all services in containers. While a container might be compromised, it’s less likely the entire system will be breached, due to container isolation. Updates are automatically applied when running the Amazon-supplied edition of Bottlerocket via a Kubernetes operator that comes installed with the OS.
The root filesystem is immutable: dm-verity keeps hashes of the root filesystem blocks and, together with a verified boot path, ensures that the system binaries haven’t been tampered with. The configuration is stateless and /etc/ is mounted on a RAM disk.
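The idea of verifying a filesystem by hashing its blocks can be sketched in a few lines of Python. This is a simplified illustration only, not dm-verity’s actual on-disk format or algorithm: the 4096-byte block size, the use of SHA-256, the pairwise tree layout, and all function names are assumptions made for the example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size, chosen for illustration


def block_hashes(image: bytes) -> List[bytes]:
    """Hash each fixed-size block of the image independently."""
    return [
        hashlib.sha256(image[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(image), BLOCK_SIZE)
    ]


def root_hash(hashes: List[bytes]) -> bytes:
    """Fold block hashes pairwise into one root hash (a simple hash tree)."""
    if not hashes:
        return hashlib.sha256(b"").digest()
    level = hashes
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last hash on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def verify(image: bytes, trusted_root: bytes) -> bool:
    """Recompute the root hash and compare it with the trusted value."""
    return root_hash(block_hashes(image)) == trusted_root


image = bytes(range(256)) * 64             # 16 KiB stand-in for a root filesystem
trusted = root_hash(block_hashes(image))   # would be recorded at image build time

tampered = bytearray(image)
tampered[5000] ^= 0xFF                     # flip one byte in the second block

print(verify(image, trusted))              # True
print(verify(bytes(tampered), trusted))    # False
```

Because every block feeds into the root hash, changing a single byte anywhere in the image changes the root, so only the root value needs to be protected by the boot path.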
When running on AWS, configuration is accomplished with the API and these settings are persisted across reboots, as they come from file templates within the AWS infrastructure. You can also configure network and storage using custom containers that implement the CNI and CSI specifications and deploy them along with other daemons via the Kubernetes controllers.
SELinux is enabled by default, with no way to disable it. Normally that might be a problem, but in the container OS use case relaxing this requirement isn’t necessary. The goal is to prevent modification of settings or containers by other OS components or containers. This security feature is a work in progress.
Bottlerocket open source
The Bottlerocket build system is based on Rust, which is fine considering there’s nothing to build except for support for Docker and Kubernetes. Rust just broke into the top 20 programming languages and seems to be gaining traction due to its C++-like syntax and automatic memory management. Rust is licensed under the MIT or Apache 2 license.
Amazon does a good job of leveraging GitHub as its development platform, making it easy for developers to get involved. The toolchain and code workflow will be familiar to any developer, and by design end users are encouraged to create variants of the OS.
Variants exist to support multiple orchestration agents. In order to keep the OS footprint as small as possible, each Bottlerocket variant runs on a specific orchestration plane. Amazon includes variants for Kubernetes and local development builds. You could, for example, create your own update operator or your own control container by changing the URL of the container.
Managing Bottlerocket instances
Bottlerocket isn’t intended to be managed with a shell. Indeed, there is little of the OS that requires management, and what is required is accomplished by the HTTP API, the command-line client (eksctl), or the web console.
To update you need to deploy an update container onto the instance. See the bottlerocket-update-operator (a Kubernetes operator) on GitHub. Bottlerocket accomplishes single-step updates using the “two partition pattern,” where the image has two bootable partitions on disk.
Once an update has been successfully written to the inactive partition, the priority bits in the GUID partition table of each partition are swapped and the roles of the “active” and “inactive” partitions are reversed. Upon reboot, the system is upgraded, or, in the event of an error, rolled back to the last known-good image.
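The two-partition flow described above can be modelled with a small Python sketch. It is a toy model under stated assumptions, not Bottlerocket’s implementation: the Partition and TwoPartitionDisk classes, the integer priority field, the boot_ok flag, and the version strings are all hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    version: str
    priority: int  # hypothetical stand-in for the GPT priority bits


class TwoPartitionDisk:
    def __init__(self, version: str) -> None:
        self.active = Partition("A", version, priority=2)
        self.inactive = Partition("B", version, priority=1)

    def stage_update(self, new_version: str) -> None:
        """Write the new image to the inactive partition, then swap priorities."""
        self.inactive.version = new_version
        self.active.priority, self.inactive.priority = (
            self.inactive.priority,
            self.active.priority,
        )

    def reboot(self, boot_ok: bool = True) -> str:
        """Boot the highest-priority partition; fall back to the other on failure."""
        first, second = sorted(
            (self.active, self.inactive), key=lambda p: p.priority, reverse=True
        )
        if not boot_ok:
            # Roll back: restore the known-good image as the preferred one.
            first.priority, second.priority = second.priority, first.priority
            first, second = second, first
        self.active, self.inactive = first, second
        return self.active.version


disk = TwoPartitionDisk("1.0.0")
disk.stage_update("1.1.0")
print(disk.reboot())               # 1.1.0

disk.stage_update("1.2.0")
print(disk.reboot(boot_ok=False))  # 1.1.0 (rolled back)
```

The running image is never modified in place, which is why a failed update leaves the previous image intact and bootable.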
There are no packages that can be installed, only containers, and updates are image based, as in NanoBSD and other embedded operating systems. The reason behind this decision was explained by Jeff Barr, AWS evangelist:
“Instead of a package update system, Bottlerocket uses a simple, image-based model that allows for a rapid and complete rollback if necessary. This removes opportunities for conflicts and breakage, and makes it easier for you to apply fleet-wide updates with confidence using orchestrators such as EKS.”
To access a Bottlerocket instance directly you run a “control” container, which is managed by a separate instance of containerd. This container runs the AWS SSM agent so you can execute remote commands or start a shell on one or more instances. The control container is enabled by default.
There is also an administrative container that runs on the internal control plane of the instance (i.e. on a separate containerd instance). Once enabled, this admin container runs an SSH server that allows you to log in as ec2-user using your Amazon-registered SSH key. While this is useful for debugging, it is not really suitable for making configuration changes due to the security policies of these instances.
Google Container-Optimised OS
Container-Optimised OS is a Google-maintained operating system based on the open source Chromium OS project. Like Bottlerocket, Container-Optimised OS is an image-based operating system, optimised for running Docker containers in Google Compute Engine VMs.
Container-Optimised OS addresses similar needs for updates, security, and easy management. It does not run outside of the Google Cloud Platform, though developers can run it on KVM for testing. Only Docker-based images are supported.
Elastic workload scaling is all the rage in devops, and one of the stated goals of Container-Optimised OS is rapid scaling. Boot-up of the minimal, image-based OS is fast, and configuration at scale is managed with a combination of cloud-init and Google’s Cloud SDK. This means that application services can be ramped up quickly in response to spikes in demand and workload changes.
Container-Optimised OS security
One of the most important rules of security is to reduce your attack surface. Container-Optimised OS does this by moving all services out of the OS user/system space and into containers. Therefore, the bare OS has the minimum number of packages installed to support container management, and containers manage their own dependencies.
The kernel also features security-related improvements, such as Integrity Measurement Architecture (IMA-measurement), IMA-audit, Kernel Page Table Isolation, and a few Linux Security Modules taken from Chromium OS. If applications require it, fine-grained security policies can be added via Seccomp and AppArmor.
The default settings for a Container-Optimised OS instance take a security-minded stance as well, which makes securing a large cluster easier. For example, having no accessible user accounts and a firewall setting that drops all connections except SSH reduces the attack surface.
Access to the instance is managed through Google’s IAM roles or by adding and removing SSH keys in the instance metadata. Password-based log-ins are not allowed. Two-factor authentication is an option.
Security is also implemented at the filesystem level. For example, Container-Optimised OS uses a read-only root filesystem that is verified by the kernel at boot, preventing any attacker from making permanent local changes. While this is good for security, it does make configuration a challenge.
To enable configuration, the OS is set up such that /etc/ is writeable, but ephemeral, so at each reboot the OS configuration is freshly rebuilt.
Container-Optimised OS leverages Google’s best practices and infrastructure to build and deploy images. The kernel and package source code for the operating system are built from Google-owned code repositories, and any bugs or vulnerabilities can be patched and rolled out via the auto-upgrading mechanism.
The auto-upgrading feature, enabled by default, keeps nodes in the cluster up-to-date with the cluster master version. This both improves security and reduces maintenance overhead. Google also provides vulnerability scanning, so if a vulnerability is detected in Container-Optimised OS, a patch is automatically rolled out when available.
Container-Optimised OS open source
As part of the Chromium OS project, Container-Optimised OS is open source, though there is no reason to build it yourself except for experimentation. Unlike Bottlerocket, Container-Optimised OS doesn’t envision a need for customers to build and deploy customised images on a cluster, and given the reliance on Google’s infrastructure, there’s no reason you’d want to.
Building Container-Optimised OS requires the Chromium toolchain and scripts, which are unique to Google. The resulting development images do allow user shell access and are primarily designed for Google engineers to build, test, and debug the system. The images can be run using KVM or imported into a Compute Engine instance.
Managing Container-Optimised OS instances
Google Container-Optimised OS does not include a package manager, but you can install additional tools using the CoreOS Toolbox, which launches a container to let you bring in your favourite debugging or admin tools.
In most situations a Container-Optimised OS instance will be run as part of a Kubernetes-managed cluster. For experimentation, you can define a single image and run it on a GCE instance using the Cloud Console or gcloud command-line tool and then SSH into it like any other GCE instance. Public container registries are supported in the base image, so you can get started right away with your favourite Docker images.
Google includes a few nice features to help with production deployments. One of those is the Node Problem Detector, used to monitor the health of Container-Optimised OS instances. Using Google Cloud Monitoring you can see capacity and error reports and visualise the health of the cluster using the Google Operations dashboard.
Time is synchronised with Linux’s systemd-timesyncd. It’s a bit unusual to use a package that synchronises with SNTP, especially if you have long-running instances that need fine-grained control of time, but you can always install the full version of NTPd in a container if you need it.
Upgrades are automatically applied in most scenarios, and there are three rolling release channels to choose from: dev, beta, and stable. These channels provide a window into the feature pipeline and allow for a rolling upgrade of the cluster.
Typically, a small percentage of your cluster will be on dev, a bit more on beta, and the majority on stable. This reduces the risk of a cluster-wide problem being encountered.
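A staggered channel layout like this can be planned with a short Python sketch. The 5/15/80 split, the node names, and the assign_channels helper are hypothetical; the text above gives no exact percentages, so treat these numbers purely as placeholders.

```python
from typing import Dict, List

# Hypothetical shares; adjust to your own risk tolerance.
CHANNEL_SHARES = {"dev": 0.05, "beta": 0.15, "stable": 0.80}


def assign_channels(
    nodes: List[str], shares: Dict[str, float]
) -> Dict[str, List[str]]:
    """Assign nodes to channels in order; the last channel takes the remainder."""
    assignment: Dict[str, List[str]] = {}
    channels = list(shares)
    start = 0
    for channel in channels[:-1]:
        count = round(len(nodes) * shares[channel])
        assignment[channel] = nodes[start:start + count]
        start += count
    assignment[channels[-1]] = nodes[start:]
    return assignment


nodes = [f"node-{i:03d}" for i in range(100)]
plan = assign_channels(nodes, CHANNEL_SHARES)
for channel, members in plan.items():
    print(channel, len(members))
```

Giving the remainder to the last channel guarantees that every node lands in exactly one channel even when the shares do not divide the node count evenly.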
Auto-updates take place using an active/passive root partition, where one partition is “live” and the other a backup. Image updates from the dev/beta/stable channels are downloaded to the passive partition and the boot manager selects the newest version at boot time.
Should an error be encountered, the system is booted from the old partition. Updates can be controlled manually from the command line, but most of the time auto-update is used.
Container OSs built for cloud
Container-optimised operating systems aren’t new. I previously reviewed CoreOS, RancherOS, Red Hat Atomic, and others. I think we’re at the end game of this line of OS development, where the OS is just a part of the whole cloud operating system, much like a shared library provides specific functionality to a host operating system.
The OS becomes part of the background infrastructure, which means developers can focus on their applications instead of how they’ll be run. Both Bottlerocket and Container-Optimised OS do this well. Both are ideally suited for the cloud they were developed for.
AWS’s Bottlerocket incorporates many of the best ideas from the predecessors, and adds support for multiple cloud environments and container orchestrators, as well as the ability to create variants if your use cases require it. Bottlerocket will be available in GA form sometime in 2020.
Google’s Container-Optimised OS is closer to the microVM end of the spectrum (like the Firecracker technology under AWS Fargate) than Bottlerocket. Like many Google technologies, Container-Optimised OS takes an opinionated stance on how things should be done, and this is often a good thing.
However, if you’re looking at a multi-cloud strategy, then Container-Optimised OS is a roadblock, not an advantage. Most people are looking at multi-cloud and avoiding vendor lock-in. If deploying to multiple clouds is in your future, Bottlerocket would be a better choice.