By Chris Monson and Eric Whyne
This post summarizes the problems associated with running user-submitted analytic containers in a cloud-based data analytics platform, along with potential mitigations. Although many tools exist for static code analysis and vulnerability analysis of containers from a trusted source, very little advice or tooling exists for safely executing precompiled analytic containers from untrusted or adversarial sources. Despite the inherent risks, packaging analytics as containers continues to be adopted by data-providing platforms and analytic execution environments in general, because of the dependency management and efficiency gains that come from packaging software this way. We are undertaking development efforts to address these gaps; contact us if you are interested in more details. In the meantime, here are some recommendations to make this practice safer using current methods and technologies.
Containers are in many ways insecure by default. A combination of factors must come together simultaneously to give a reasonable expectation of security, and these include:
Images and build parameters
The Linux kernel’s cgroups and namespaces, on which all container technology is built, are intended to allow nearly arbitrary constraints on what a process can see and do. They are, effectively, parameters that affect how a process is executed, and thus differ from a VM or simulator: there is no barrier between the process and the host; the host kernel merely dictates (to the extent possible) how the process interacts with and perceives its operating environment.
Because of this, achieving true security while running containers with unknown contents can be challenging. It is, however, something that people want to do in diverse circumstances, including cloud providers who must run arbitrary containers on shared systems, and thus there are some standard approaches for increasing safety. Assuming that changing the built image is not within the set of acceptable solutions, these fall under several broad categories:
Runtime best practices
We cover each of these in turn, highlighting the benefits and limitations of each approach.
Runtime Best Practices
By using widely acknowledged best practices when running a container, a number of potential exploit routes are closed off.
Restrict access to the docker socket
The docker daemon is responsible for starting processes, and thus is responsible for setting up cgroups and namespaces for the processes that it launches (when you “run a container”, the daemon is really setting up namespaces and cgroups while launching a process based on the settings in the container image file).
The daemon must run as root, but the socket file that clients use to communicate with it does not need to be owned by root. Because access to the docker daemon grants effective root privileges (after all, it can start arbitrary processes and is running as root), access to the socket that communicates with that daemon should be tightly restricted: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
This includes mounting the socket into containers. They should never have direct access to it.
There are several places online, even from reputable sources that ought to know better, where it is suggested that the docker socket be mounted into a container at runtime. This is, for the same reasons that clients should not run as root, a terrible idea. Any container that can access the docker socket is one privilege escalation bug away from having effective root access on the host, for the same reasons mentioned above. That is the best case scenario: if any misconfiguration of the socket has occurred, or the container’s host-level user has access rights, then that user is now effectively root.
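As a rough illustration of this point, a launcher could refuse any volume specification that exposes the daemon socket. This is a minimal sketch, not a complete policy check: the function name is hypothetical, the socket paths are the usual Linux defaults (an assumption about the host), and only the short `source:target[:options]` volume form is understood.

```python
# Hypothetical pre-flight check; names and paths are illustrative assumptions.
import posixpath

# Usual locations of the daemon socket on Linux hosts (assumption).
DOCKER_SOCKET_PATHS = {"/var/run/docker.sock", "/run/docker.sock"}


def exposes_docker_socket(volume_specs):
    """Return True if any short-form volume spec mounts the daemon socket
    itself or a host directory that contains it."""
    for spec in volume_specs:
        # The host side of "source:target[:options]" is the first field.
        source = posixpath.normpath(spec.split(":", 1)[0])
        prefix = source.rstrip("/") + "/"
        for sock in DOCKER_SOCKET_PATHS:
            if sock == source or sock.startswith(prefix):
                return True
    return False
```

A check like this only catches accidental or careless mounts. It does not replace restricting who can talk to the daemon in the first place.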
This is commonly justified for containers that need to start other containers as part of their job. If a process requires you to mount the socket within its container for it to work properly, that should be a red flag. The correct way to allow a container to speak to the docker daemon is via TLS. Using the Docker https service allows proper authentication and authorization to be done on a case-by-case basis without exposing effective root privileges to the container.
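One way to picture the TLS route: the docker client reads the `DOCKER_HOST`, `DOCKER_TLS_VERIFY`, and `DOCKER_CERT_PATH` environment variables, so a process that needs to reach the daemon can be handed a client certificate instead of the socket. The sketch below only builds that environment. The helper name, host name, and certificate directory are assumptions for illustration, and the directory is expected to hold `ca.pem`, `cert.pem`, and `key.pem`.

```python
# Sketch: environment for a docker client that talks to the daemon over TLS.
# The helper name and the example host and paths are hypothetical.

def docker_tls_env(host, cert_dir, port=2376):
    """Build the environment variables the docker client reads for TLS.

    cert_dir is expected to contain ca.pem, cert.pem and key.pem.
    Port 2376 is the conventional port for the TLS-protected daemon endpoint.
    """
    return {
        "DOCKER_HOST": f"tcp://{host}:{port}",
        "DOCKER_TLS_VERIFY": "1",
        "DOCKER_CERT_PATH": cert_dir,
    }
```

These values would be merged into the environment of the process that invokes the docker client. Authorization decisions can then be tied to the client certificate rather than to possession of the socket.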
Container processes should not run as root, but for a container provided by an external source this is not always possible to enforce. Thus, using the user namespace is recommended in all cases: it maps the in-container root user to an unprivileged user on the host machine.
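With Docker, user namespace remapping is a daemon-level setting: the `userns-remap` key in the daemon configuration file (commonly `/etc/docker/daemon.json`). The sketch below adds that key to an existing configuration dictionary. The helper name is hypothetical, and the value `"default"` asks Docker to create and use its own unprivileged remapping user.

```python
# Sketch: enable user namespace remapping in a daemon.json-style config.
# The helper name is hypothetical; "userns-remap" is the daemon config key.
import json


def with_userns_remap(daemon_config, mapping="default"):
    """Return a copy of the config with user namespace remapping enabled."""
    updated = dict(daemon_config)
    updated["userns-remap"] = mapping
    return updated


if __name__ == "__main__":
    # Print the fragment that would be written to the daemon config file.
    print(json.dumps(with_userns_remap({}), indent=2))
```

The daemon has to be restarted for the setting to take effect.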
Containers should always be run with memory and CPU limits. These are off by default, meaning that a runaway container, even if it cannot exploit the system to run what it wishes, can still trigger the OOM killer and potentially kill other processes, or starve them of CPU resources.
Setting these limits helps ensure that containers play nicely with critical system resources and are killed first when attempting to use too much memory (as opposed to important system processes getting killed first).
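A minimal sketch of what this looks like on the command line, assembled as an argument list. The helper name, the default values, and the image name in the usage note are assumptions; `--memory`, `--memory-swap`, `--cpus`, and `--pids-limit` are standard `docker run` options.

```python
# Sketch: a "docker run" argument list with resource limits applied.
# The helper name and default limits are illustrative assumptions.

def limited_run_args(image, memory="512m", cpus="1.0", pids_limit=256):
    """Build a docker run command that caps memory, CPU and process count."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,
        # Setting memory-swap equal to memory prevents the container from
        # using swap on top of its memory limit.
        "--memory-swap", memory,
        "--cpus", cpus,
        "--pids-limit", str(pids_limit),
        image,
    ]
```

For example, `limited_run_args("example/analytic:latest", memory="2g", cpus="2")` yields a command that could be passed to `subprocess.run`. Appropriate values depend on the workload and on how many containers share the host.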
For secure operation, we need not only to think about how to keep processes contained, but also how to keep sensitive credential information from being misused if it is found. Much of the runtime advice in this section has been devoted to configurations that help to prevent bad actions from being carried out. But, in the event that such an action succeeds, it is also important to limit the amount of damage that it can do, and the value of the secrets that it can exfiltrate.
Best practices dictate that any credentials given to a contained process that allow it to do its work (e.g. keys to access internal services such as a database) be valid for a limited time and with limited scope. This can best be accomplished by using a key management system such as Vault. These systems are responsible not only for storing keys to sensitive resources, but also for rotating those keys and providing ephemeral keys to processes that need them.
Ephemeral keys are an important component of security, because they severely restrict the access a bad process can exploit. If it is noticed that a process is acting badly, its keys can be easily revoked without disturbing the rest of the system, and the keys that it does have can only be used for a limited time.
This is one of the less well-known aspects of security, but is critically important. If the container needs keys, those keys should be temporary and easily revoked. There are sidecar processes that can be used to manage keys in a way that allows containers to use them without changes to the container itself (such as writing them to a shared volume, then issuing a “kill -HUP” to the process to get it to reload its keys).
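The sidecar pattern described here can be sketched in a few lines: write the fresh key to the shared volume atomically, then signal the consuming process to reload. Everything named below is hypothetical. In particular, fetching the key from the key management system is omitted, and signalling only works if the sidecar can see the target process (for example, through a shared PID namespace).

```python
# Sketch of a key-rotation sidecar step; names and paths are hypothetical.
import os
import signal
import tempfile


def install_key(key_path, key_bytes, reload_pid=None, mode=0o600):
    """Atomically replace the key file, then optionally request a reload."""
    directory = os.path.dirname(os.path.abspath(key_path))
    # Write to a temp file in the same directory so os.replace is atomic
    # and readers never observe a half-written key.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(key_bytes)
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, key_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if reload_pid is not None:
        # Equivalent of "kill -HUP <pid>": ask the process to reload its keys.
        os.kill(reload_pid, signal.SIGHUP)
```

The file mode would need to match whichever user the contained process runs as. The owner-only default here is just a conservative starting point.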
Rule-based execution controls, such as seccomp, SELinux, and AppArmor, are another important consideration when running containers. These rules are often omitted, but they are needed to constrain what a container can do on a system. They are quite flexible and sit at the top of a fairly steep learning curve, but they should not be neglected in a serious, security-conscious environment. Additionally, tools exist for reducing the configuration burden.
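To give a feel for what such rules look like, the sketch below generates a deny-by-default seccomp profile in the JSON format Docker accepts through `--security-opt seccomp=<file>`. The helper name is hypothetical, and any syscall list passed in is purely illustrative: a real workload needs a much longer allowlist, usually derived from Docker's default profile or from tracing the application.

```python
# Sketch: generate a deny-by-default seccomp profile in Docker's JSON format.
# The helper name is hypothetical; real allowlists are far longer.
import json


def allowlist_profile(syscall_names):
    """Return a profile that fails every syscall except the ones listed."""
    return {
        # Anything not explicitly allowed returns an error to the caller.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(set(syscall_names)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(allowlist_profile(["read", "write", "exit_group"]), indent=2))
```

The resulting file would be passed at launch with `docker run --security-opt seccomp=profile.json ...`.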
Running containers by themselves can work fine, but it is actually best to always use some form of orchestration, such as Docker Swarm or Kubernetes. These provide additional network configuration and easier management of processes than the relatively low-level docker command. Most containers need to speak to other containers over a network, and creating such a network with proper isolation by hand is not usually trivial. Swarm and Kubernetes both handle this for you and provide monitoring resources. Thus, a best practice is to always use orchestration of some kind, unless you are really only running one thing.
Other Runtime Considerations
It is usually best to use the --init flag when running a docker container for an external image. This helps ensure that container shutdown handles child process reaping correctly. Without knowing beforehand whether a container intends to fork and manage children, --init causes PID 1 to be a proper init process (defaults to tini) with appropriate signal handling. Signal handling is subtle and easy to get wrong, so many main processes that fork end up getting it wrong. Using an init process helps to at least get child process cleanup right when a container’s main process exits.
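To make the reaping duty concrete, here is a rough sketch of the core loop an init process performs: collecting the exit status of every child that has finished so that none is left behind as a zombie. This is an illustration only, not how tini is implemented. A real init also forwards signals and blocks efficiently instead of being polled.

```python
# Sketch of the child-reaping part of an init process (POSIX only).
import os


def reap_exited_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        # Exit code for normal exits; None if the child was killed by a signal.
        exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, exit_code))
    return reaped
```

An init process would typically run a loop like this whenever it receives SIGCHLD.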
Ultimately, the runtime best practices are a means of setting things up to make exploits less likely to succeed. These are only as strong as the underlying kernel and hardware, however (for example, a Spectre exploit can still work from within a container).
Because a containerized process runs on the host kernel with only the kernel’s own cgroup and namespace protections keeping the process contained, there is often a need for greater protection. VM nesting is one approach that can provide an extra layer of security.
There are two basic approaches that people are using, both described below.
To decrease the amount of possible damage a rogue process can cause when escaping a container, one might run docker containers inside of a virtual machine, such as an OpenStack or VMWare instance. This can keep sensitive files or credentials from being easily accessed, and can limit network access depending on the host VM’s configuration.
Virtual machines are much heavier to operate than containers, and they incur a performance penalty, so this is usually only done if the convenience and security outweigh performance or packing efficiency needs.
It is not uncommon to run a Kubernetes cluster over OpenStack, where all of its nodes are virtual. This can limit the extent of damage done by rogue processes, but those processes will still have access to everything in the cluster, so it is not necessarily considered to be a security measure to deploy in this way. It is convenient, but it does not produce extra security, per se.
Systems like KubeVirt and Virtlet allow virtual machines to be treated like containers. KubeVirt works with Kubernetes by extending what it can manage to include VMs, whereas Virtlet implements the Container Runtime Interface, so its VMs look like containers and can, for example, be part of a Deployment.
Both approaches allow for VM workloads to be deployed alongside container workloads in the same Kubernetes cluster. In some ways, this means that one does not need to choose between containers and VMs — use containers for things you trust and VMs for things you do not.
Monitoring is a must-have for any secure container deployment. It allows people and algorithms to notice unusual behavior over the network, with files, and with resource utilization. While monitoring is not strictly a security measure, blindness is definitely a security liability.
At a minimum, something akin to an ELK stack should be set up to stash and analyze logs from containers, and cluster-wide monitoring of basic resources (network, files, CPU, memory) should be running, with high-availability dashboards, at all times. Breaches will happen, and when they do they should not go unnoticed.
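As a very small example of resource monitoring, the sketch below flags containers whose CPU or memory use exceeds a threshold, given lines of `docker stats` output in JSON form (for instance from `docker stats --no-stream --format "{{json .}}"`). The function name and thresholds are assumptions, and the code assumes each line carries `Name`, `CPUPerc`, and `MemPerc` fields formatted as percentage strings. It is no substitute for a proper logging and metrics stack.

```python
# Sketch: flag containers using too much CPU or memory from docker stats
# JSON lines. Function name, thresholds and field layout are assumptions.
import json


def over_threshold(stats_lines, cpu_limit=80.0, mem_limit=80.0):
    """Return names of containers above either percentage threshold."""
    flagged = []
    for line in stats_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Values look like "12.34%"; strip the sign before converting.
        cpu = float(row["CPUPerc"].rstrip("%"))
        mem = float(row["MemPerc"].rstrip("%"))
        if cpu > cpu_limit or mem > mem_limit:
            flagged.append(row["Name"])
    return flagged
```

In practice the flagged names would feed an alerting system rather than be inspected by hand.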
Security scanning is similar to virus scanning: it is the process of looking at a container image’s contents (the static non-running contents) in search of signatures for known-bad processes. For example, a scanner could look for traces of Metasploit, a common component of penetration toolkits. There are other common components, and exploit code makes its way around the Internet quite rapidly once available, often with minimal changes.
Vulnerability scanning is something that large companies are providing to those who build their containers on their platforms (e.g. Google’s registry scanner). These are typically geared toward looking for vulnerable configurations, e.g. a recent one found in the immensely popular Alpine Linux image for Docker (it was set up to permit a man-in-the-middle attack on the package manager).
By knowing that the base images used to compose a container image are up to date with security patches, it becomes easier to trust that attacks will not succeed from outside of the container. Furthermore, containers are usually running in ecosystems with other containers nearby, and if those containers contain vulnerabilities (e.g. a data exfiltration bug on a web server), even applications that do not escape their own containers can potentially wreak havoc by scanning for vulnerabilities in containers within their reach.
Scanning for popular penetration-testing software as well as base image system vulnerabilities can provide a degree of confidence that the container ecosystem has a small attack surface that is not known to be easily exploitable.
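The simplest form of such a scan can be sketched as a walk over an unpacked image filesystem looking for file names associated with penetration-testing tools. The pattern list and function name below are illustrative assumptions. Production scanners rely on content signatures and vulnerability databases, and name matching alone is trivially evaded.

```python
# Sketch: look for suspicious file names in an unpacked image filesystem.
# The pattern list is a tiny illustrative assumption, not a real signature set.
import fnmatch
import os

SUSPICIOUS_NAME_PATTERNS = ["msfconsole", "msfvenom", "*meterpreter*"]


def scan_tree(root, patterns=SUSPICIOUS_NAME_PATTERNS):
    """Return sorted paths (relative to root) whose file name matches a pattern."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if any(fnmatch.fnmatchcase(lowered, p) for p in patterns):
                full_path = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full_path, root))
    return sorted(hits)
```

The directory passed as `root` is assumed to be the image contents extracted to disk beforehand.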
A recent approach to securing containers is to intercept all syscalls and only forward them to the host kernel if they are allowed. This is accomplished by creating a thin user-space kernel on which the container runs. It replaces the container’s runtime (a configuration in the docker service) with its own runtime that sets everything up. In the case of gVisor, for example, the runtime kernel forwards many syscalls (when allowed), but file and network access are sent to separate processes that employ more in-depth protection and filtering than a kernel has access to.
These approaches are still relatively nascent and can sometimes impose difficult-to-meet constraints (for example, gVisor only works on 64-bit binaries, and Nabla only works on images built for it, as of October 2018). The top contenders right now are Kata, gVisor, and Nabla.
They all work on similar principles, with high variability in actual implementation. Alteration of the runtime is a fundamental part of their operation.
The benefit of this approach is, again, that it can help to limit what is even possible for a containerized process. It will not stop every attack, but it is designed to limit the attack surface.
Of the available options above, Kata and gVisor are the most flexible, working seamlessly with Docker (and by extension, Kubernetes and other orchestration systems). Nabla is very, very constrained, as indicated in its documented limitations.
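For gVisor, the runtime swap amounts to two small pieces of configuration: registering its `runsc` binary under `runtimes` in the Docker daemon configuration, and selecting it per container with `--runtime`. The sketch below builds both. The helper names, the binary path, and the image name are assumptions for illustration.

```python
# Sketch: register an alternative runtime and select it for one container.
# Helper names, the binary path and the image name are hypothetical.

def register_runtime(daemon_config, name, binary_path):
    """Return a copy of a daemon.json-style config with a runtime added."""
    updated = dict(daemon_config)
    runtimes = dict(updated.get("runtimes", {}))
    runtimes[name] = {"path": binary_path}
    updated["runtimes"] = runtimes
    return updated


def run_args_with_runtime(image, runtime):
    """Build a docker run command that uses the named runtime."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image]
```

For example, `register_runtime({}, "runsc", "/usr/local/bin/runsc")` produces the configuration fragment, and `run_args_with_runtime("example/analytic:latest", "runsc")` the matching launch command. Containers started without the flag keep using the default runtime.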