Federation matters: Introducing the NIST Cloud Federation Reference Architecture

by Dr. Martial Michel, Dr. Rion Dooley

Data Machines Corp. (DMC) is a company that, among other things, designs cloud-based solutions for running evaluation frameworks to support research and commercial entities. On those frameworks, authorized users log onto the system using local credentials and are given access to a set of resources (compute, specialized hardware, etc.) and data (including access to sequestered datasets). This is done using on-premise cloud solutions such as OpenStack or Kubernetes. Currently, Kubernetes and OpenStack are not able to communicate roles and privileges with one another, and when DMC users need more resources than are locally available, bursting to public clouds must happen. To do so, identities and access tokens, as well as virtual private connections and data copying, need to be established to enable a local user to use a remote cloud. This cloud-bursting use case arises for many companies around the world and often requires a “translation proxy” to interpret data from cloud to cloud. While technical solutions already exist to mitigate this difficulty, there are no standardized means to provide such services to more users. Thus, it is the responsibility of each cloud provider to independently facilitate the sharing of identities and access to resources with the set of cloud providers it chooses to support. The ability to do this in a standardized way is the promise of Cloud Federation.

Fundamental use case

Cloud Federation can be summarized as the ability of two independent clouds to access and share resources. The typical use case follows:


Figure 1: Cloud Federation authorization flow.

  1. A user authenticates to the Identity Provider of Organization A (OA) to establish their identity. Often, this is achieved using Identity Token exchanges; those have validity in a given cloud.

  2. As Organization A and B have a pre-established Federation relationship, where policies and governance have been negotiated, the user is able to query the Federation Manager of Organization A (FMA) and obtain information about access to services in Organization B.

  3. FMA provides the user’s entitlements to FMB.

  4. FMB translates the user’s entitlements to corresponding entitlements in Organization B based on the established trust agreements between the two organizations, which include an Identity Token with validity in Cloud B for the User in Cloud A (often this mapping is designed to allow the User to exist in both clouds with “equivalent” identities).

  5. FMB returns the list of available services in Organization B to FMA.

  6. FMA returns the list of services from FMB to the User, along with the user’s authorization token for Cloud B.

  7. The User is then able to directly query the services in Organization B using their identity in Cloud B. Access to some resources is limited both by the agreements between Organizations A and B and by the access level of User A within Cloud A.

  8. The service in Organization B validates the identity and access of User A by checking with FMB, then completes the authenticated request, returning the response to the User.
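The mapping performed in Steps 3 through 6 can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the class, method, and role names are not defined by any federation standard. FMB translates the remote entitlements it trusts, mints a token valid in Cloud B, and lists the services those entitlements allow.

```python
# Hypothetical sketch of Steps 3-6 in Figure 1; the class, method,
# and role names are illustrative only and not defined by any
# federation standard.

class FederationManager:
    def __init__(self, org, entitlement_map, services):
        self.org = org
        self.entitlement_map = entitlement_map  # remote role -> local role
        self.services = services                # local role -> visible services

    def map_entitlements(self, remote_entitlements):
        # Step 4: translate the remote user's entitlements into local
        # ones, per the pre-negotiated trust agreement.
        return [self.entitlement_map[e]
                for e in remote_entitlements if e in self.entitlement_map]

    def issue_token(self, user, local_entitlements):
        # Step 4 (cont.): mint an identity token valid in this cloud.
        return {"sub": user, "org": self.org, "roles": local_entitlements}

    def list_services(self, local_entitlements):
        # Step 5: enumerate the services those entitlements allow.
        return sorted({s for e in local_entitlements
                       for s in self.services.get(e, [])})

# Organization B recognizes one of Organization A's roles.
fm_b = FederationManager(
    org="B",
    entitlement_map={"A:researcher": "B:guest-compute"},
    services={"B:guest-compute": ["batch-queue", "object-store"]},
)

# Steps 3-6: FMA forwards the user's entitlements; FMB maps them,
# issues a token, and returns the services visible to that identity.
roles_b = fm_b.map_entitlements(["A:researcher", "A:admin"])
token_b = fm_b.issue_token("alice", roles_b)
print(roles_b)                       # ['B:guest-compute']
print(fm_b.list_services(roles_b))   # ['batch-queue', 'object-store']
```

Note how the unrecognized "A:admin" entitlement is simply dropped: a federation only grants what the trust agreement covers, regardless of a user's standing in their home cloud.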

Architecture and Implementation

Figure 1 describes Cloud Federation at its fundamental level. In practice, its implementations may add caching, metadata tables, pre-established trust relationships, and back-channel messaging to increase performance and resiliency within a Federation. 

As such, Cloud Federation is a complex set of components and cloud computing services matched to business needs: the interactions of federation models are captured in a layered, three-plane representation of trust, security, and resource sharing and usage (see Figure 2 below). In the Trust Management Plane, Site A and Site B, intending to collaborate on joint business goals, decide to establish a federation by establishing a trust relationship and defining the structure and governance of the federation they wish to create. In the Federation Management Plane, each Federation Operator deploys a Federation Manager (FM), and secure communications are established between the FMs to exchange information concerning the management of federations. In the Federation Usage Plane, Users from Site A and Users from Site B are able to access authorized services provided by Site A or Site B; this shared environment is a Virtual Administrative Domain. Maintaining the consistency and management of the users, services, policies, and authorizations of the federated state is then an ongoing task, as they can change dynamically over the course of the Federation’s lifetime.


Figure 2: Cloud Federation three-plane architectural component view.


Federation governance describes how the pieces of the architecture of a federation operate, work together, and interact. All federations share a set of essential characteristics: shared resources, roles and attributes, resource discovery, membership, and members’ identity credentials and terminations.

As presented above, a cloud federation ecosystem is a specific configuration of semantics and governance, where the formality of the ecosystem depends on the needs of its participants. Becoming a member of a federation implies providing some resources, accepting a set of rules, and controlling membership for new federation operators and members.

The benefit of cloud federation for users is significant: they can get access to remote resources (data, compute, etc.) that are geographically bound and still perform their analysis without having to obtain a local copy of the dataset used. This is only possible if certain rights and privileges (and, in most cases, billing and accounting) are integrated within the core capabilities of the Federation model.

The NIST Cloud Federation Reference Architecture 

The NIST Cloud Federation Reference Architecture (currently a NIST SP 500-series document in its public comment phase) defines an actor- and role-based model with eleven components, which lend themselves to creating a suite of federation options from simple to complex.

  1. Administrative Domains (AD): essentially composed of an Identity Provider (IdP), a Cloud Service Provider (SP), and a User, an AD is an organization wherein a uniform set of discovery, access, and usage policies are enforced across a set of users and resources based on identity and authorization credentials that are meaningful within that organization.

  2. Regulatory Environments: legal regulations and laws that the actors in an AD must observe. A federation may need to reconcile all relevant regulatory environments.

  3. Identity Providers: the source of the identity credentials in an AD.

  4. Cloud Service Provider: responsible for making a cloud service available.

  5. Cloud Service Consumer: maintains a business relationship with, and uses services from, Cloud Service Providers.

  6. Federation Manager: provides the essential federation management functions such as Membership Management, Policy Management, Monitoring and Reporting, Accounting and Billing, and Portability and Interoperability.

  7. Federation Operator: deploys, configures, and maintains one or more FMs.

  8. Federation Auditor: can assess compliance for any type of policy associated with a federation.

  9. Federation Broker: enables new members to discover existing federations.

  10. Federation Carrier: provides connectivity and transport among federation members, or between federation consumers and federation providers.

  11. Security: can cover the areas of identity/authentication, authorization/policy, integrity, privacy/confidentiality, and non-repudiation.


The development of a standardized federation matters. The availability of standardized federation managers, interfaces, and modularized deployment are some of the tools needed to widen the availability of federation capabilities beyond the big science community, and to jump-start wider markets. Those concepts are present in the NIST reference architecture. Data Machines Corp. has been involved with this effort since 2017. 

Dr. Martial Michel, DMC’s Chief Scientific Officer, is one of the authors of the NIST SP 500 document, the co-chair of IEEE P2302 (Standard for Intercloud Interoperability and Federation), and a participant in conversations on this subject at international conferences (recently a panel at SuperComputing 2018).

For the last 15 years, Dr. Rion Dooley has worked towards standards-based multi-cloud and multi-institution federation. In his current role as Director of Platform Services and Solutions at Data Machines Corp., and within his affiliation with the Agave Platform project, he continues to work towards solutions that enable meaningful, unobtrusive, scientific collaborations across academic and commercial boundaries.

As a private cloud provider, DMC looks forward to the availability of cloud federation for the common interfaces that will allow us to further our integration of services to support research models as well as commercial designs. As cloud consumers, we look forward to leveraging the ability to seamlessly integrate within multiple clouds. We believe this will lead to the emergence of new usage models as users are empowered with the ability to make better choices.

The NIST Cloud Federation Reference Architecture (CFRA) welcomes public comments until September 20, 2019. For more details, please see https://www.nist.gov/itl/cloud.

This work is the first step toward the development of a cloud federation standard. This effort will continue over the next few years; the development of standards happens over a period of time with the collaboration of many talented contributors. Proof of concept models might be developed by interested partners to be able to spearhead novel technical offerings in this new economy of cloud capabilities.

New collaborators are welcome: it is the work and ideas of many that help shape a better technical solution. For more information about the NIST CFRA or to join the conversation on Cloud Federation, reach out to Dr. Craig Lee (craig.a.lee@aero.org), Dr. Robert Bohn (robert.bohn@nist.gov), Dr. Martial Michel (martialmichel@datamachines.io), or Dr. Rion Dooley (riondooley@datamachines.io).

Toward a Containerized Nvidia CUDA, TensorFlow and OpenCV

Data Machines Corp. (DMC) works in fields that encompass Machine Learning (ML) and Computer Vision (CV). In those fields, we aim to empower researchers to discover, innovate, and invent. To do so, we strive to provide the tools to support these efforts.

Containers are powerful abstractions on top of the Linux kernel that allow projects to be developed and tested with their required software and libraries pre-installed, while limiting modifications to the host system. By using interactive shells with mounted directories on a running container, including X11 display if needed, researchers are able to test algorithms and ideas.

To do so, a one-time system setup is needed. In this post, we will describe the steps needed to set up an Ubuntu 18.04 Linux system to run Nvidia Docker (v2), with the ultimate goal of using CUDA-optimized TensorFlow and OpenCV within a container.

Setting up the host system to run Nvidia Docker (v2)


You will need a system that can host an amd64 (AMD64 & Intel EM64T) Ubuntu Linux installation (at least a 2 GHz dual-core processor, a recommended 8 GB of system memory, and 50 GB of free hard drive space) with internet access and an Nvidia CUDA-enabled GPU card (see https://developer.nvidia.com/cuda-gpus for additional information).

Install Ubuntu Linux 18.04

The following instructions run on a setup based on Ubuntu Linux 18.04 LTS:

  • Download a 64-bit PC (AMD64) desktop image of 18.04 from http://releases.ubuntu.com/18.04/.

  • The official amd64 Ubuntu Installation Guide can be found at https://help.ubuntu.com/lts/installation-guide/amd64/index.html.

  • Make sure that you have a sudo enabled user with network access to download additional packages.

  • After installation, perform system updates as needed, including confirming that gcc and build tools are installed; you can do so using sudo apt-get -y update && sudo apt-get install -y build-essential from a command prompt.

Install CUDA 10.1 drivers & libraries

In order to use CUDA on the Linux box, an Nvidia card needs to be available and recognized by the kernel. Confirm that the Linux kernel sees the Nvidia card(s) using lspci | grep -i nvidia to enumerate the list of available GPUs.

The following command lines will download the CUDA repository package (Line 1), add the CUDA 10.1 repository to your Ubuntu Linux (Line 2), add Nvidia's public key to the authorized ones (Line 3), update the local package list (Line 4), and install the CUDA packages onto the Ubuntu Linux system (Line 5).

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.168-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804_10.1.168-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

At the time of writing this document, CUDA 10.1.168 was the latest package; up-to-date instructions are available from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork.

If you have trouble running those steps, additional installation instructions can be found at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions.

During this installation, the nouveau driver will be added to the modprobe (Kernel modules) blacklist file to prevent it from loading.

After installation of the CUDA 10.1 drivers and support libraries to the Linux system, add the following to the .bashrc file (preferably before the "interactive" section of the file, if it has one), then reboot.

## CUDA add on
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Rebooting will allow the Nvidia driver to be loaded as the default video driver for the Linux system. To confirm that the driver is properly loaded, use cat /proc/driver/nvidia/version. To confirm it is functional, call the Nvidia System Management Interface (SMI) using nvidia-smi; it will provide information on the GPU(s) available and their resources.

Installing a few additional third-party libraries is recommended before compilation of the code sample provided by Nvidia, to confirm compilation is functional and code can run.

sudo apt-get install g++ freeglut3-dev build-essential libx11-dev \
	libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev

The code sample can then be compiled in a temporary directory:

cd ~; mkdir -p Temp; cd Temp
cuda-install-samples-10.1.sh .
cd NVIDIA_CUDA-10.1_Samples

It is possible to run the compilation step (make) faster by using as many processors as available (if the system has enough memory to support it); simply run make -j$(nproc).

Once the build completes, enumerate the properties of the CUDA devices present in the system using the newly compiled ./bin/x86_64/linux/release/deviceQuery, and test the CUDA post-processing of an image rendered in OpenGL using ./3_Imaging/postProcessGL/postProcessGL. Additional samples are provided; more detail about them is available at https://docs.nvidia.com/cuda/cuda-samples/index.html.

If you need support for Nsight Eclipse (a full-featured IDE powered by the Eclipse platform that provides an all-in-one integrated environment to edit, build, debug, and profile CUDA-C applications), see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#install-nsight-plugins.

Post installation FAQ from Nvidia can be found at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#faq.

Docker Community Edition

Next, proceed with the installation of the stable Community Edition (CE) of Docker.

The following text will describe steps needed for the current installation; the full installation instructions from Docker can be found at https://docs.docker.com/install/linux/docker-ce/ubuntu/.

First, install tools needed for the CE install:

sudo apt-get install apt-transport-https ca-certificates \
	curl gnupg-agent software-properties-common

Add the Docker public key to the local trusted keys:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Add the Ubuntu Linux Stable Docker repository to the local repositories:

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
Update the repository information using sudo apt-get update, then install the core components for Docker CE, its command line interface (CLI), and the containerd runtime using sudo apt-get install docker-ce docker-ce-cli containerd.io.

By default, only root and members of the docker group may use the Docker command line, so your user would need sudo for every command. Note that the docker group grants privileges equivalent to the root user; for details on how this impacts security in your system, see "Docker Daemon Attack Surface" at https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface. Understanding this, add your user to the docker group using sudo usermod -aG docker $USER. The new group membership will not take effect until you log out (in some cases, reboot) and log back in. Before continuing, check that the docker group is listed as part of your groups using id.

Finally, confirm that Docker is operational: docker run --rm hello-world. The --rm flag will delete the container once it is done running (so you do not have unknown containers left when you run docker ps -a).

Nvidia Docker

Nvidia Docker is a runtime for the Docker daemon that abstracts the Nvidia GPU(s) available to the host OS using the Nvidia driver, such that a container's CUDA toolkit uses the host's Nvidia driver. For more details on the technology, see https://devblogs.nvidia.com/gpu-containers-runtime/.

The Github repository for Nvidia Docker (v2) is available at https://github.com/NVIDIA/nvidia-docker.

On our Ubuntu Linux, adding the runtime is done by adding the Nvidia public key to the trusted keys, and adding the Nvidia repository to the system list:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

After updating the package list, install the nvidia-docker2 runtime, and restart the Docker daemon to have it add the new runtime to its available list:

sudo apt-get update && sudo apt-get install -y nvidia-docker2 && sudo pkill -SIGHUP dockerd

Confirm that it is functional by running the same nvidia-smi command through the container interface:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

DMC's CUDA/TensorFlow/OpenCV

TensorFlow [https://www.tensorflow.org/] is an open source library to help users develop and train ML models using a comprehensive, flexible ecosystem of tools and libraries. TensorFlow is available as a GPU-optimized container image (running CUDA 9.0 on Ubuntu 16.04), used to create virtual environments that isolate a TensorFlow installation from the rest of the system while sharing the resources of the host machine.

OpenCV (Open Source Computer Vision Library) [https://opencv.org/] is an open source computer vision and machine learning software library, with more than 2500 optimized algorithms to support classic and state-of-the-art computer vision and machine learning algorithms. OpenCV can be built to support CUDA (for GPU support) and OpenMP (for shared-memory and high-level parallelism on multi-core CPUs) [https://www.openmp.org/].

DMC has built a cuda_tensorflow_opencv container image by compiling a CUDA/OpenMP-optimized OpenCV FROM the GPU-optimized TensorFlow container image. This image is designed to contain many needed tools to be used by data scientists and to support machine learning research, with frameworks such as NumPy [https://www.numpy.org/], pandas [https://pandas.pydata.org/], and Keras [https://keras.io/].

Because of its integration of many core tools and libraries useful to researchers, the image can be used as a FROM for further Docker images to build from. For example, it has successfully been used to build a GPU/OpenMP/OpenCV-optimized Darknet (Open Source Neural Networks in C) [https://pjreddie.com/darknet/] "You Only Look Once" (v3) [https://pjreddie.com/darknet/yolo/] processor using Python bindings [https://github.com/madhawav/YOLO3-4-Py].

The container image also has support for X11 display for interactive use, such that a user can call the provided runDocker.sh script from any location. That location will then be automatically mounted as /dmc and be accessible to the user within an interactive shell, to perform quick code prototyping without needing to set up a complex environment. For example, a picture found in the directory mounted as /dmc can be loaded with NumPy and OpenCV and displayed on the user's X11 display through the container.

DMC has made the Dockerfile and supporting files, including usage and build instructions, publicly available on its GitHub at https://github.com/datamachines/cuda_tensorflow_opencv.

Building this container image is a long process; it was originally intended to be auto-built using Docker Hub from the repository provided on GitHub, but the original try took over 4 hours on Docker Hub and was automatically canceled by the build system. On a 2.8 GHz quad-core 7th-gen i7, using an SSD and running 8 concurrent make jobs, it still takes 1 hour (per tag) to build. As such, we have made available on our Docker Hub the final builds for the different tags, which can be used directly as FROM for your Docker containers: https://hub.docker.com/r/datamachines/cuda_tensorflow_opencv.

USB Safety

by Kassandra Clauser

You have heard of the basic cybersecurity practices: keep long and complicated passwords that are unique to each site, do not click untrusted email attachments, and only enter sensitive information into trusted HTTPS (TLS-secured HTTP) websites. But there are other hazards that are easy to miss.

Ordinarily, the USB device, both the thumb drive (or “flash drive”) and the cable charger, is a handy tool. Students use the former to transfer documents from a home desktop to the library computer lab; almost everyone uses the latter to charge their phone. However, USB can also be a dangerous vector for malware, especially for victims who are unaware and unprepared. USB chargers can charge your phone, but they can also sap your data as fast as a vampire draining your blood; USB drives can store files, but they can also store viruses that hijack and ransom your data. Consider the recent Mar-a-Lago incident, described in Jack Morse’s Mashable post “Yes, officials plugged in the malware-laden USB seized at Mar-a-Lago”: after the Secret Service confiscated a USB thumb drive from an intruding Chinese woman and inserted it, the club’s computer was corrupted with data-gathering files, all because somebody decided to plug in the mysterious USB. Granted, Morse points out at the end of the article that this somebody was a computer analyst, so there was likely a “method to the madness”, as the saying goes (Morse). Still, this is an action that can cause serious damage to anyone unfamiliar with cybersecurity hygiene. In this post, we will discuss the four big rules of USB safety.

1.     “Rogue” Means “Regret”: Thumb drives and cables dropped on the ground could be genuinely lost by their owners. A found thumb drive might not have anything on it besides vacation photos, and there is always a chance that a cable truly fell out of someone’s backpack. But this is a best-case scenario. Chargers have also been observed to transfer malware from one device to another. The 2019 Verizon DBIR reports that 28% of the past year’s data breaches involved malware being installed on devices (Verizon). While not every USB device is guaranteed to be malware-infested and ready to kill your computer, it is still safer to leave random thumb drives and cables/chargers alone.

2.     Just Because It is being Resold, Does Not Mean It is Safe: Flash drives and cables are some of those items you can find at garage sales (or “lost-and-found” sales, which are hosted at universities when lost items have not been claimed after a certain amount of time). Usually, each only costs about a dollar, fifty cents if you are “lucky”. The trouble is, when you buy a USB thumb drive or cable at a yard sale or flea market, you are most likely buying it from a stranger whose intents and motives are completely unknown to you. In some cases, you might be purchasing a virus directly from the developer, while in others, you might be purchasing from another neglectful user who may or may not be aware of the malware being passed to you over the counter. In either case, these viruses can do anything from crashing your computer to phishing personal information like passwords and credit card numbers. Consider a resold USB like a resold, used water bottle: Has it been cleaned? Probably. Can you be absolutely sure? Definitely not.

3.     Handouts are NOT Always Free: Technology conferences are held all over the world; and, unfortunately, not all companies, inside or outside of the United States, are honest. Verizon reports that of all the breaches in the past year, 69% were caused by outsiders, and 23% involved nation-state or state-affiliated actors. The report also points out that the top two targets of security breaches were small business employees (43%) and the public sector (16%) (Verizon). Wherever there is business, there is always a chance of outsider espionage, and most of these actors wish to obtain information that you would not willingly give them. What better way for them to earn your trust and get what they want than through the seemingly friendly handout of USB devices? Along the same lines as Rule #2, you as the receiver have no way of knowing whether the device is safe to use; more importantly, you have no way of knowing if there is malware on the device until you have already plugged it in. This applies to more than the standard USB drive and charger. During President Donald Trump’s 2018 summit with North Korea in Singapore, guests were provided portable fans, easily plugged into a smartphone via a USB connection, in order to keep cool (Gibson). While the gesture seemed cordial on the outside, there was no way to tell whether these fans contained viruses made to phish the guests’ personal and/or political information. In the technology world, handouts are not just suspicious; they are potentially dangerous.

4.     Charging CAN be Hazardous: If you are using your own charger to charge your own device in a trusted and delegated charging area, you should be safe (as long as the cord is clean and virus-free, obtained from an authorized and trusted seller). However, on occasion, you may need to charge your phone whilst in a public place, like the airport. While you might be glad to get the extra juice so you can catch that Level 10 Pikachu while waiting for your flight, you might not be so glad to have your data drained from your phone to a malicious source. Charging a device in an unprotected and/or public space can lead to the phishing of personal information, a process known as “juice jacking” (a term coined by journalist Brian Krebs), with you being none the wiser as to who is on the other side (Wikipedia). Basically, the only way to be sure that your phone is safely charging without losing data is to use your own devices, including your own secure power source.

While not every USB thumb drive and charging cable is a source of Doomsday waiting to cause cyber-Armageddon on your virtual systems, taking chances on optimistic possibilities is not worth the risk. This is why taking preventative measures is important. Protect yourself by remembering the rules above. As an extra measure, consider purchasing technological accessories such as the SyncStop, a USB cable meant to charge your phone whilst preventing the drainage of data, to keep your information safe from criminal grasps.

Works Cited

Gibson, S. (Producer). (2018, July 10) STARTTLS Everywhere. [Security Now]. Retrieved from https://www.grc.com/sn/sn-671.pdf.

Juice Jacking. (2019, April 7). In Wikipedia. Retrieved May 29, 2019 from https://en.wikipedia.org/wiki/Juice_jacking.

Morse, J. (Contributor). (2019, April 8) “Yes, officials plugged in the malware-laden USB seized at Mar-a-Lago”. Retrieved from https://mashable.com/article/malware-usb-mar-a-lago-plugged-in/.

Summary of Findings. (2019). In Verizon. Retrieved May 30, 2019 from https://enterprise.verizon.com/resources/reports/dbir/2019/summary-of-findings/.

Calibration and Predictive Uncertainty

By Iliana Maifeld-Carucci

As the ability of computers to process large amounts of data has increased, machine learning has risen in usage and influence in order to gain insights from that data. In particular, neural networks (NNs) and, more recently, deep neural networks (DNNs) have achieved substantial advances in accuracy, and are being used in a variety of applications such as natural language processing and computer vision. As the applications of these machine learning techniques advance into our everyday lives, however, they make critical decisions that can seriously impact human lives. Unfortunately, impactful mistakes will undoubtedly increase as machine learning continues to infiltrate the technology people rely on. However, if there were a way to estimate when a machine learning system was unsure about its prediction, machine learning mistakes could be prevented or better mitigated. That is why research into quantifying how confident an algorithm is in its prediction is critical as these algorithms become ubiquitous in daily life.

Unfortunately, unlike statistical methods, most neural networks do not incorporate uncertainty estimation: the structure of the algorithms assumes a single point estimate of the parameters that generated the data distribution, and thus does not represent a distribution over those parameters. Furthermore, their predictions are often overconfident, which is equally, if not more, dangerous than when they are wrong. In recent years, however, a new wave of interest and research has emerged into quantifying various types of uncertainty in deep neural networks.

When estimating uncertainty in deep neural networks, two main types are distinguished. Aleatoric uncertainty deals with the noise inherent to the data, while epistemic uncertainty quantifies the variability in a particular model. Aleatoric uncertainty can be broken down further into homoscedastic and heteroscedastic statistical dispersions. Homoscedastic uncertainty describes instances where the uncertainty remains constant for different inputs; heteroscedastic uncertainty describes cases where different inputs yield different levels of uncertainty. Most research focuses only on aleatoric uncertainty and does not consider epistemic uncertainty, because the latter can often be reduced by gathering more data.
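The homoscedastic/heteroscedastic distinction can be made concrete with a small NumPy-only sketch; the data, the known mean function, and the crude two-bin variance estimate below are all synthetic, illustrative choices. When the noise truly varies with the input, a model that lets the variance depend on the input achieves a lower (better) Gaussian negative log likelihood than one with a single shared variance.

```python
# Hedged, NumPy-only illustration of homoscedastic vs heteroscedastic
# aleatoric uncertainty; the data, the known mean function, and the
# two-bin variance estimate are all synthetic, illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
# Noise whose standard deviation grows with the input.
y = x + rng.normal(0.0, 0.05 + 0.2 * x)

# Assume the mean function (y = x) is known, to isolate the noise model.
residuals = y - x

# Homoscedastic: one variance shared by every input.
var_homo = np.full_like(x, residuals.var())

# Heteroscedastic: variance allowed to depend on the input
# (crudely estimated over two halves of the input range).
var_hetero = np.where(x < 0.5,
                      residuals[x < 0.5].var(),
                      residuals[x >= 0.5].var())

def gaussian_nll(res, var):
    # Average Gaussian negative log likelihood of the residuals.
    return 0.5 * np.mean(np.log(2.0 * np.pi * var) + res ** 2 / var)

# Modeling the input-dependent noise yields a lower (better) NLL.
print(gaussian_nll(residuals, var_homo) > gaussian_nll(residuals, var_hetero))
```

In a real heteroscedastic network, the per-input variance would be a second output head trained with this same NLL as the loss, rather than a hand-binned estimate.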

One measure of uncertainty that is closely related to aleatoric uncertainty is calibration, which indicates how closely the probability associated with the predicted class label reflects its ground truth likelihood. Most classification models generate predictions along with summary metrics such as accuracy or AUC/ROC. Yet many people mistake these prediction scores for calibrated probabilities and stop evaluating their model beyond measures of accuracy. How can these probabilities be trusted to be accurate? This is where the importance of calibration enters the picture. Weather prediction offers a useful illustration of the difference between predicted likelihood and calibration. Let’s say you’re getting ready to leave your house and want to check the weather to decide whether it is worthwhile to bring a rain jacket. The forecast says the chance of rain that day is 20%. If it rained on 2 out of the 10 days the forecaster said there was a 20% chance of rain, then you would probably come to trust the forecast and not bring your jacket. If this reliability at predicting rain transferred to all other likelihood ranges, such as 90% or 40%, then the forecaster’s model would be well calibrated. Having a model that is not just accurate but also well calibrated can help increase trust in a machine learning model and help avoid costly mistakes.
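The forecaster analogy can be turned into a small calibration check: bin predictions by their stated probability and compare each bin's average confidence to the observed frequency, a bin-weighted summary often reported as the expected calibration error (ECE). The NumPy sketch below is a hedged illustration; the bin count and data are arbitrary choices, not a fixed standard.

```python
# Hedged sketch of a calibration check via probability binning
# (a reliability-diagram-style summary often reported as the
# expected calibration error, ECE). Bin count and data are
# illustrative choices, not a fixed standard.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=5):
    """Bin-weighted average gap between stated confidence and
    observed frequency."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to a probability bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

# The forecaster example: ten days at a stated 20% chance of rain,
# with rain on exactly two of them -> essentially perfectly calibrated.
print(round(expected_calibration_error([0.2] * 10,
                                       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]), 6))  # 0.0

# An overconfident forecaster: 90% stated, but rain only half the time.
print(round(expected_calibration_error([0.9] * 10, [1, 0] * 5), 6))  # 0.4
```

Note that accuracy alone would not distinguish these two forecasters; only the gap between confidence and frequency reveals the second one's overconfidence.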

Because calibration is valuable to the estimation of uncertainty, the choice of scoring rule (or outcome measure) is important. By assigning a score to a predictive distribution, a scoring rule rewards better-calibrated predictions over worse ones. A scoring rule is proper if its expected score is optimized when the predictive distribution matches the true distribution. Such a scoring rule can then be incorporated into a neural network by minimizing it as the loss function. The negative log likelihood, a standard measure of probabilistic quality also known as cross-entropy loss, is one proper scoring rule. For the binary classification problem, the Brier score, which measures the squared error between the predicted probability of a label and its true label, is another proper scoring rule. The Brier score decomposes into three other measures: reliability, minus resolution, plus uncertainty. Reliability is an overall measure of calibration that quantifies how close the predicted probabilities are to the true likelihoods; resolution assesses how much the observed outcome frequencies within subsets of predictions differ from the average outcome rate; and uncertainty measures the inherent variability of the outcomes themselves. The latter two, uncertainty minus resolution, can be aggregated into a measure known as ‘refinement’, which is associated with the area under the ROC curve and, when added to reliability, yields the Brier score.
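As a sketch of how these pieces fit together, the pure-Python snippet below computes the Brier score directly and via the Murphy decomposition (reliability minus resolution plus uncertainty). Grouping forecasts by their distinct predicted values makes the decomposition exact; the data here are illustrative, not from the research described in this post.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_decomposition(probs, outcomes):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.

    Forecasts are grouped by distinct predicted probability, which makes the
    decomposition exact for discrete-valued forecasts.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = {}
    for p, o in zip(probs, outcomes):
        groups.setdefault(p, []).append(o)
    reliability = sum(len(g) * (p - sum(g) / len(g)) ** 2
                      for p, g in groups.items()) / n
    resolution = sum(len(g) * (sum(g) / len(g) - base_rate) ** 2
                     for g in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

probs = [0.2] * 10 + [0.9] * 10
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0] + [1] * 9 + [0]
rel, res, unc = brier_decomposition(probs, outcomes)
print(round(brier_score(probs, outcomes), 6))  # 0.125
print(round(rel - res + unc, 6))               # 0.125 -- the terms recombine
```

Note that refinement, as described above, is simply `unc - res` here, and `rel + (unc - res)` reproduces the same Brier score.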

With the adoption of machine learning in a range of applications, it is no longer good enough for models to achieve high accuracy; they should also be well calibrated. By applying the concept of calibration through proper scoring rules such as the negative log likelihood and the Brier score, we have developed a method to incorporate predictive uncertainty into deep neural networks, particularly in scenarios where researchers do not have access to a model’s internals. If you are interested in further details about this research effort, please contact us.

Analytic Containers and Security

By Chris Monson and Eric Whyne

This post summarizes problems associated with running user-submitted analytic containers in a cloud-based platform for data analytics, as well as potential mitigations. Although many tools exist for static code analysis and vulnerability analysis of containers from a trusted source, very little advice or tooling exists for safely executing precompiled analytic containers from untrusted or adversarial sources. Despite the inherent risks, packaging analytics as containers continues to be a theme adopted by data-providing platforms and analytic execution environments because of the dependency management and efficiency gains of packaging software this way. We are undertaking development efforts to address these gaps; contact us if you are interested in more details. In the meantime, here are some recommendations to make this practice safer using current methods and technologies.

Containers are in many ways insecure by default. A combination of factors must come together simultaneously to give a reasonable expectation of security, including:

  • Images and build parameters

  • Runtime flags

  • Mounted resources

  • Kernel vulnerabilities

The Linux kernel’s cgroups and namespaces, on which all container technology is built, are intended to allow nearly arbitrary constraints on what a process can see and do. They are, effectively, parameters that affect how a process is executed, and thus differ from a VM or simulator: there is no barrier between the process and the host; the host kernel merely dictates (to the extent possible) how the process interacts with and perceives its operating environment.

Because of this, achieving true security while running containers with unknown contents can be challenging. It is, however, something that people want to do in diverse circumstances, including cloud providers who must run arbitrary containers on shared systems, and thus there are some standard approaches for increasing safety. Assuming that changing the built image is not within the set of acceptable solutions, these fall under several broad categories:

  • Runtime best practices

  • VM Nesting

  • Monitoring

  • Scanning

  • Kernel virtualization

We cover each of these in turn, highlighting the benefits and limitations of each approach.

Runtime Best Practices

Following widely acknowledged best practices when running a container closes off a number of potential exploit routes.

Restrict access to the docker socket

The docker daemon is responsible for starting processes, and thus is responsible for setting up cgroups and namespaces for the processes that it launches (when you “run a container”, the daemon is really setting up namespaces and cgroups while launching a process based on the settings in the container image file).

The daemon must run as root, but its socket file that the client uses to communicate with it does not need to be owned by root. Because access to the docker daemon grants effective root privileges (after all, it can start arbitrary processes and is running as root), access should be restricted to the socket that communicates with that daemon: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface

This includes mounting the socket into containers. They should never have direct access to it.

There are several places online, even from reputable sources that ought to know better, where it is suggested that the docker socket be mounted into a container at runtime. This is a terrible idea. Any container that can access the docker socket is one privilege-escalation bug away from effective root access on the host. And that is the best-case scenario: if the socket has been misconfigured, or the container’s host-level user already has access rights to it, then that user is effectively root.

This is commonly justified for containers that need to start other containers as part of their job. If a process requires you to mount the socket into its container for it to work properly, that should be a red flag. The correct way to allow a container to speak to the docker daemon is via TLS. Using Docker’s TLS-protected HTTP API allows proper authentication and authorization to be done on a case-by-case basis without exposing effective root privileges to the container.

Restrict Capabilities

Container processes should not be running as root; but in a container that is provided from external sources, this is not always possible to enforce. Thus, using the user namespace is recommended in all cases, which maps the in-container root user to an unprivileged user on the host machine.

Additionally, capabilities beyond the defaults should be added only sparingly. Some of the defaults are quite broad in scope, so limiting them further can be helpful as well.

Limit Resources

Containers should always be run with memory and CPU limits. These limits are off by default, meaning that a runaway container, even one unable to exploit the system to run what it wishes, can still trigger the OOM killer and potentially kill other processes, or starve them of CPU resources.

Setting these limits helps ensure that containers play nicely with critical system resources and are killed first when attempting to use too much memory (as opposed to important system processes getting killed first).


For secure operation, we need not only to think about how to keep processes contained, but also how to keep sensitive credential information from being misused if it is found. Much of the runtime advice in this section has been devoted to configurations that help to prevent bad actions from being carried out. But, in the event that such an action succeeds, it is also important to limit the amount of damage that it can do, and the value of the secrets that it can exfiltrate.

Best practices dictate that any credentials given to a contained process that allow it to do its work (e.g. keys to access internal services such as a database) be valid for a limited time and with limited scope. This can best be accomplished by using a key management system such as Vault. These systems are responsible not only for storing keys to sensitive resources, but also for rotating those keys and providing ephemeral keys to processes that need them.

Ephemeral keys are an important component of security, because they severely restrict the access a bad process can exploit. If it is noticed that a process is acting badly, its keys can be easily revoked without disturbing the rest of the system, and the keys that it does have can only be used for a limited time.

This is one of the less well-known aspects of security, but is critically important. If the container needs keys, those keys should be temporary and easily revoked. There are sidecar processes that can be used to manage keys in a way that allows containers to use them without changes to the container itself (such as writing them to a shared volume, then issuing a “kill -HUP” to the process to get it to reload its keys).

Rule-Based Execution

Rule-based execution frameworks, such as seccomp, SELinux, and AppArmor, are another important consideration when running containers. These rules are often omitted, but they are needed to constrain what a container can do on a system. They are quite flexible and sit atop a fairly steep learning curve, but they should not be neglected in a serious, security-conscious environment. Additionally, tools exist for reducing the configuration burden.

Use Orchestration

Running containers by themselves can work fine, but it is actually best to always use some form of orchestration, such as Docker Swarm or Kubernetes. These provide additional network configuration and easier management of processes than the relatively low-level docker command. Most containers need to speak to other containers over a network, and creating such a network with proper isolation by hand is not usually trivial. Swarm and Kubernetes both handle this for you and provide monitoring resources. Thus, a best practice is to always use orchestration of some kind, unless you are really only running one thing.

Other Runtime Considerations

It is usually best to use the --init flag when running a docker container from an external image. This helps ensure that container shutdown handles child-process reaping correctly. Without knowing beforehand whether a container intends to fork and manage children, --init causes PID 1 to be a proper init process (tini by default) with appropriate signal handling. Signal handling is subtle and easy to get wrong, and many forking main processes do get it wrong. Using an init process at least gets child-process cleanup right when a container’s main process exits.

Ultimately, the runtime best practices are a means of setting things up to make exploits less likely to succeed. These are only as strong as the underlying kernel and hardware, however (for example, a Spectre exploit can still work from within a container).

VM Nesting

Because a containerized process runs on the host kernel with only the kernel’s own cgroup and namespace protections keeping the process contained, there is often a need for greater protection. VM nesting is one approach that can provide an extra layer of security.

There are two basic approaches that people are using, both described below.


Containers Inside Virtual Machines

To decrease the amount of possible damage a rogue process can cause when escaping a container, one might run docker containers inside of a virtual machine, such as an OpenStack or VMWare instance. This can keep sensitive files or credentials from being easily accessed, and can limit network access depending on the host VM’s configuration.

Virtual machines are much heavier to operate than containers, and they incur a performance penalty, so this is usually only done if the convenience and security outweigh performance or packing efficiency needs.

It is not uncommon to run a Kubernetes cluster over OpenStack, where all of its nodes are virtual. This can limit the extent of damage done by rogue processes, but those processes will still have access to everything in the cluster, so it is not necessarily considered to be a security measure to deploy in this way. It is convenient, but it does not produce extra security, per se.


Virtual Machines as Containers

Systems like KubeVirt and Virtlet allow virtual machines to be treated like containers (KubeVirt extends the set of resources Kubernetes can manage to include VMs, whereas Virtlet implements the Container Runtime Interface, and thus actually looks like a container runtime, allowing its VMs to be part of a Deployment, for example).

Both approaches allow for VM workloads to be deployed alongside container workloads in the same Kubernetes cluster. In some ways, this means that one does not need to choose between containers and VMs — use containers for things you trust and VMs for things you do not.


Monitoring

Monitoring is a must-have for any secure container deployment. It allows people and algorithms to notice unusual behavior over the network, with files, and in resource utilization. While not strictly a security measure, blindness is definitely a security liability.

At a minimum, something akin to an ELK stack should be set up to stash and analyze logs from containers, and cluster-wide monitoring of basic resources (network, files, cpu, memory) should be running with high availability dashboards at all times. Breaches will happen, and when they do they should not go unnoticed.


Scanning

Security scanning is similar to virus scanning: it examines a container image’s static, non-running contents in search of signatures of known-bad software. For example, a scanner could look for traces of Metasploit, a common component of penetration-testing toolkits. There are other common components, and exploit code makes its way around the Internet quite rapidly once available, often with minimal changes.

Vulnerability scanning is something that large companies are providing to those who build their containers on their platforms (e.g. Google’s registry scanner). These are typically geared toward looking for vulnerable configurations, e.g. a recent one found in the immensely popular Alpine Linux image for Docker (it was set up to permit a man-in-the-middle attack on the package manager).

By knowing that the base images used to compose a container image are up to date with security patches, it becomes easier to trust that attacks will not succeed from outside of the container. Furthermore, containers are usually running in ecosystems with other containers nearby, and if those containers contain vulnerabilities (e.g. a data exfiltration bug on a web server), even applications that do not escape their own containers can potentially wreak havoc by scanning for vulnerabilities in containers within their reach.

Scanning for popular penetration-testing software as well as base image system vulnerabilities can provide a degree of confidence that the container ecosystem has a small attack surface that is not known to be easily exploitable.

Kernel Virtualization

A recent approach to securing containers is to intercept all syscalls and only forward them to the host kernel if they are allowed. This is accomplished by creating a thin user-space kernel on which the container runs. It replaces the container’s runtime (a configuration in the docker service) with its own runtime that sets everything up. In the case of gVisor, for example, the runtime kernel forwards many syscalls (when allowed), but file and network access are sent to separate processes that employ more in-depth protection and filtering than a kernel has access to.

These approaches are still relatively nascent and can sometimes impose difficult-to-meet constraints (for example, gVisor only supports 64-bit binaries, and Nabla only works on images built for it, as of October 2018). The top contenders right now are gVisor, Nabla, and Kata Containers.

They all work on similar principles, with high variability in actual implementation. Alteration of the runtime is a fundamental part of their operation.

The benefit of this approach is, again, that it can help to limit what is even possible for a containerized process. It will not stop every attack, but it is designed to limit the attack surface.

Of the available options above, Kata and gVisor are the most flexible, working seamlessly with Docker (and by extension, Kubernetes and other orchestration systems). Nabla is very, very constrained, as indicated in its documented limitations.

AUC vs Log Loss

By Nathan Danneman and Kassandra Clauser

Area under the receiver operating characteristic curve (AUC) is a reasonable metric for many binary classification tasks. Its primary positive feature is that it aggregates across different threshold values for binary prediction, separating the issue of threshold setting from predictive power. However, AUC has several drawbacks: it is insensitive to meaningful misorderings, it is a relative measure, and it does not incentivize well-calibrated probabilities. This brief describes these issues and motivates the use of log loss as an evaluation metric for binary classifiers when well-calibrated probabilities are important.

AUC functionally measures how well-ordered results are in accordance with true class membership. As such, small misorderings do not affect it strongly, though these may be important to an intuitive sense of the performance of an algorithm. This means that, for some definition of “important”, it is possible for AUC to completely mask important misorderings: there is no sense of “how far off” something is in terms of its AUC.

AUC is a relative measure of internal ordering rather than an absolute measure of the quality of a set of predictions. This hinders interpretability for downstream users. Furthermore, machine learning (ML) systems that use the output of other ML systems as features already suffer from performance drift; this is greatly mitigated by restricting the output range of those upstream systems and by using calibrated values. These two constraints make ML system chaining far more likely to produce valid results, even as the input systems are retrained. Proper calibration can be incentivized naturally by using a calibration-sensitive metric.

Log loss is another metric for evaluating the quality of classification algorithms. It captures the extent to which predicted probabilities diverge from class labels. As such, it is an absolute measure of quality that incentivizes well-calibrated probabilistic outputs. Well-calibrated probabilities are easier for human consumers to reason about and simpler for downstream ML applications to work with. Furthermore, like AUC, log loss is threshold-agnostic, allowing classifiers to be compared without picking a decision threshold.

Overall, in cases where an absolute measure is desired, log loss can be simply substituted for AUC. It is threshold-agnostic, simple to compute, and applicable to binary and multi-class classification problems.

The picture here shows a hypothetical distribution of predicted probabilities by class: an example of good separation but poor calibration. You get great AUC but crummy log loss.
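That scenario is easy to reproduce. The pure-Python sketch below (with illustrative data) scores a classifier whose positives all rank above its negatives but whose probabilities hover around 0.5: the ordering is perfect, so AUC is 1.0, yet the log loss is barely better than always predicting 0.5.

```python
import math

def auc(probs, labels):
    """AUC via the Mann-Whitney formulation: the probability that a random
    positive example is scored above a random negative example."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def log_loss(probs, labels):
    """Mean negative log likelihood of the true labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(probs)

labels = [0] * 5 + [1] * 5
probs  = [0.49] * 5 + [0.51] * 5   # perfectly ordered, barely separated
print(auc(probs, labels))                 # 1.0
print(round(log_loss(probs, labels), 3))  # 0.673, vs ~0.693 for always 0.5
```

Swapping the scores to, say, 0.05 and 0.95 leaves AUC unchanged at 1.0 but drops the log loss sharply, which is exactly the calibration signal AUC cannot see.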


CPU Vulnerabilities and Design Choices


By Sean Leahy

Over the past several days, four white papers from different security teams were released detailing major and previously unknown CPU vulnerabilities (links are provided below).

To quickly summarize: these vulnerabilities expose the memory of other running applications on the same physical system. This means that code running on a host without privileged access has free rein to view the data of any other process (including processes inside virtual machines). While this is a major vulnerability, it is not beyond the assumptions that Data Machines makes when planning for security in our architectures.

Host privilege escalation is always a risk that needs to be considered, and we plan for that in our larger security strategy. Toward this end, we have minimized scenarios in our systems where any type of host privilege escalation will compromise data in a manner that is problematic for our clients, partners, and their research. Data Machines uses perimeter access (via a VPN) to provide strong attribution. We log any suspicious user behavior on the systems and mitigate/report it if we see an attempted exploit or unplanned data movements. Our clients do not experience any limitations on system or data access in the interim.

We generally allow researchers to access computation resources in any manner they see fit, with two caveats:

  1. We always need to know who they are at all times (strong attribution, no shared accounts, etc).

  2. We do not let them create conditions that allow inbound connections from the Internet except for very tightly controlled situations (e.g. HTTPS with our hardened authentication in front of it, SSH, or SCP).

Data we hold for our clients is split into two basic groups: partner data and research data. Partner data is isolated on separate VLANs and is not on systems shared by the larger body of users. More specific constraints have been put in place on a case-by-case basis. At the other extreme, we can and have kept data in cold storage, offline in a GSA-approved safe in a secure facility, where it is only brought out for tightly controlled research sprints. In short, Data Machines works with each client to provide data handling and security practices that meet their specific needs, and our capabilities are broad enough to accommodate nearly any request.

Security is a critical requirement for each project at Data Machines and should be designed with as few assumptions as possible. The recent security exploits listed below demonstrate how bad assumptions can result in a single point of failure that may compromise system behavior. Data Machines strives to produce robust security controls and practices around systems in anticipation of novel methods of exploitation. By making good architecture and system design choices, the impact of individual security failures can be minimized even when they are as extreme as CPU hardware vulnerabilities.

Here are links to the vulnerabilities announced yesterday: