Calibration and Predictive Uncertainty

By Iliana Maifeld-Carucci


As computers have become able to process ever larger amounts of data, machine learning has grown in use and influence as a way to gain insight from that data. In particular, neural networks (NNs) and, more recently, deep neural networks (DNNs) have achieved substantial advances in accuracy and are used in applications such as natural language processing and computer vision. As these techniques move into everyday life, however, they make critical decisions that can seriously affect people. Impactful mistakes will undoubtedly become more common as machine learning spreads through the technology people rely on. If there were a way to estimate when a machine learning system is unsure about its prediction, some of those mistakes could be avoided or better mitigated. That is why research into quantifying how confident an algorithm is in its prediction becomes critical as these algorithms become ubiquitous in daily life.

Unfortunately, unlike many statistical methods, most neural networks do not incorporate uncertainty estimation: they are typically trained to find a single point estimate of their parameters rather than a distribution over parameters. Furthermore, their predictions are often overconfident, which can be as dangerous as, if not more dangerous than, simply being wrong. In recent years, however, a new wave of interest and research has emerged into quantifying various types of uncertainty in deep neural networks.

When estimating uncertainty in deep neural networks, two main types are usually distinguished. Aleatoric uncertainty captures the noise inherent in the data, while epistemic uncertainty captures uncertainty about the model itself, such as its parameters. Aleatoric uncertainty can be further divided into homoscedastic and heteroscedastic forms. Homoscedastic uncertainty stays constant across different inputs, whereas heteroscedastic uncertainty varies from input to input. Much of the research sets epistemic uncertainty aside, because it can often be reduced by collecting more data, and focuses only on aleatoric uncertainty.
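
The difference between constant and input-dependent noise can be illustrated with a small simulation. The following is a minimal sketch in Python with NumPy, not part of the original research; the function names, noise levels, and sample size are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_noise(n=20000, seed=0):
    """Draw synthetic noise of both kinds (all constants are illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    # Homoscedastic: the noise standard deviation is the same for every input.
    homo = rng.normal(0.0, 0.5, size=n)
    # Heteroscedastic: the noise standard deviation grows with the input x.
    hetero = rng.normal(0.0, 0.1 + 0.9 * x)
    return x, homo, hetero

def spread_by_half(x, noise):
    """Return the noise standard deviation for x < 0.5 and for x >= 0.5."""
    return float(noise[x < 0.5].std()), float(noise[x >= 0.5].std())

if __name__ == "__main__":
    x, homo, hetero = simulate_noise()
    print("homoscedastic   (low x, high x):", spread_by_half(x, homo))
    print("heteroscedastic (low x, high x):", spread_by_half(x, hetero))
```

In this sketch the homoscedastic noise has roughly the same spread in both halves of the input range, while the heteroscedastic noise is noticeably wider for larger inputs.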

One measure of uncertainty closely related to aleatoric uncertainty is calibration, which indicates how closely the probability associated with a predicted class label reflects its ground-truth likelihood. Most classification models generate predictions that are evaluated with summary metrics such as accuracy or the area under the ROC curve (AUC). Yet many people treat the scores behind these predictions as probabilities and stop evaluating their model beyond measures of accuracy. How can these probabilities be trusted to be accurate? This is where calibration enters the picture. Weather forecasting is a useful example for understanding the difference between a predicted likelihood and calibration. Let’s say you’re getting ready to leave your house and want to check the weather to decide whether it is worthwhile to bring a rain jacket. The forecast says the chance of rain that day is 20%. If it rained on 2 of the 10 days on which the forecaster gave a 20% chance of rain, you would probably come to trust the forecast and leave your jacket at home. If this reliability carried over to all other likelihood levels, such as 40% or 90%, then the forecaster’s model would be well calibrated. A model that is not just accurate but also well calibrated can increase trust in machine learning and help avoid costly mistakes.
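
The forecaster check described above can be sketched in a few lines of Python with NumPy: group forecasts into probability bins and compare the average forecast in each bin with how often the event actually happened. This is an illustrative sketch only; the function name `reliability_table`, the choice of ten equal-width bins, and the toy data are assumptions, not the method used in the research described here.

```python
import numpy as np

def reliability_table(probs, labels, n_bins=10):
    """For each non-empty bin, return (mean forecast, observed frequency, count)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each forecast to a bin using the interior edges;
    # the clip keeps every index, including that of p == 1.0, within range.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(probs[mask].mean()),
                         float(labels[mask].mean()),
                         int(mask.sum())))
    return rows

if __name__ == "__main__":
    # Ten days with a 20% rain forecast, and it rained on two of them.
    forecasts = [0.2] * 10
    rained = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    for mean_p, freq, count in reliability_table(forecasts, rained):
        print(f"forecast={mean_p:.2f} observed={freq:.2f} n={count}")
```

A well calibrated model produces rows in which the mean forecast and the observed frequency are close in every bin; large gaps indicate over- or under-confidence at that probability level.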

Because calibration is valuable to the estimation of uncertainty, the choice of scoring rule (or outcome measure) is important. A scoring rule assigns a score to a predictive distribution, rewarding better calibrated predictions more than worse ones. A scoring rule is proper if its expected value is optimized when the predicted distribution matches the true distribution, so that no other forecast can be expected to score better than the truth. Such a scoring rule can then be incorporated into a neural network by using it as the loss function to be minimized. The negative log likelihood, a standard measure of probabilistic quality also known as cross-entropy loss, is one proper scoring rule. In binary classification, the Brier score, which measures the mean squared error between the predicted probability of a label and its true label, is also a proper scoring rule. The Brier score is a single number that can be decomposed into three other measures: reliability, minus resolution, plus uncertainty. Reliability is an overall measure of calibration that quantifies how close the predicted probabilities are to the true likelihood; resolution assesses how much the observed outcome frequencies within groups of similar predicted probabilities differ from the overall average outcome; and uncertainty measures the inherent variability of the outcomes themselves, independent of the predictions. These last two measures, resolution and uncertainty, can also be combined (uncertainty minus resolution) into a measure known as ‘refinement’, which is associated with the area under the ROC curve and, when added to reliability, yields the Brier score.
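
As a rough illustration of these scoring rules for binary labels, the sketch below computes the negative log likelihood, the Brier score, and the standard three-term decomposition of the Brier score, in which the score equals reliability minus resolution plus uncertainty. It is written in Python with NumPy; the function names, the clipping constant `eps`, and the grouping by distinct forecast values are assumptions made for this example rather than details of the research described here.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probs are predicted P(label == 1)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(np.mean((p - y) ** 2))

def brier_decomposition(probs, labels):
    """Return (reliability, resolution, uncertainty).

    Forecasts are grouped by their distinct values, so that
    brier_score == reliability - resolution + uncertainty.
    """
    p = np.asarray(probs, dtype=float)
    y = np.asarray(labels, dtype=float)
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(p):
        mask = p == value
        weight = mask.mean()       # fraction of forecasts in this group
        freq = y[mask].mean()      # observed frequency in this group
        reliability += weight * (value - freq) ** 2
        resolution += weight * (freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return float(reliability), float(resolution), float(uncertainty)

if __name__ == "__main__":
    probs = [0.2] * 5 + [0.8] * 5
    labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
    rel, res, unc = brier_decomposition(probs, labels)
    print("NLL:", negative_log_likelihood(probs, labels))
    print("Brier:", brier_score(probs, labels))
    print("reliability - resolution + uncertainty:", rel - res + unc)
```

In the toy data the forecasts are perfectly calibrated (20% of the 0.2 forecasts and 80% of the 0.8 forecasts are positive), so the reliability term is zero and the Brier score is determined entirely by the refinement, uncertainty minus resolution. Grouping by exact forecast value keeps the identity exact; with continuous model outputs one would bin the forecasts instead, and the identity then holds only approximately.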

With machine learning being adopted in a range of applications, it is no longer enough for models to achieve high accuracy; they should also be well calibrated. By applying the concept of calibration through the negative log likelihood and the Brier score, we have developed a method to incorporate predictive uncertainty into deep neural networks, particularly in scenarios where researchers don’t have access to a model’s specifics. If you are interested in further details about this research effort, please contact us.