AUC vs Log Loss

By Nathan Danneman and Kassandra Clauser

Area under the receiver operating characteristic curve (AUC) is a reasonable metric for many binary classification tasks. Its primary positive feature is that it aggregates across different threshold values for binary prediction, separating the issue of threshold setting from that of predictive power. However, AUC has several drawbacks: it is insensitive to meaningful misorderings, it is a relative measure, and it does not incentivize well-calibrated probabilities. This brief describes these issues and motivates the use of log loss as an evaluation metric for binary classifiers when well-calibrated probabilities are important.

AUC functionally measures how well the predicted scores order examples according to true class membership. As such, small misorderings affect it only weakly, even when those misorderings matter to an intuitive sense of an algorithm's performance. AUC carries no notion of "how far off" a prediction is: a barely-wrong misordering and a confidently-wrong one can produce the same score, so for some definitions of "important", AUC can completely mask important errors.
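This insensitivity is easy to demonstrate. The sketch below (with made-up labels and scores) computes AUC directly as the fraction of correctly ordered positive/negative pairs, and compares two classifiers that make the same single misordering, one marginally and one very confidently. Their AUCs are identical; their log losses are not.

```python
import math

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    hits = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return hits / (len(pos) * len(neg))

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

labels = [0, 0, 1, 1]
slight = [0.10, 0.55, 0.50, 0.90]  # one negative barely outranks a positive
gross  = [0.10, 0.85, 0.50, 0.90]  # same misordering, far more confident

print(auc(labels, slight), auc(labels, gross))            # 0.75 and 0.75
print(log_loss(labels, slight), log_loss(labels, gross))  # ~0.43 vs ~0.70
```

Both classifiers misrank the same pair, so AUC cannot distinguish them; log loss penalizes the confidently wrong one roughly 65% more.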

AUC is a relative measure of internal ordering, rather than an absolute measure of the quality of a set of predictions. This hinders interpretability for downstream users. Furthermore, machine learning (ML) systems that use the output of other ML systems as features already suffer from performance drift; but this is greatly mitigated by restricting the output range of those upstream systems, and by using calibrated values. These two constraints make ML system chaining far more likely to produce valid results, even as the input systems are retraining. Proper calibration can be incentivized naturally by using a calibration-sensitive metric.

Log loss is another metric for evaluating the quality of classification algorithms. It captures the extent to which predicted probabilities diverge from class labels. As such, it is an absolute measure of quality, and it incentivizes generating well-calibrated probabilistic predictions. Such predictions are easier for human consumers to reason about and simpler for downstream ML applications to work with. Furthermore, like AUC, log loss is threshold-agnostic, so classifiers can be compared without having to pick a decision threshold.

Overall, in cases where an absolute measure is desired, log loss can be simply substituted for AUC. It is threshold-agnostic, simple to compute, and applicable to binary and multi-class classification problems.

The picture here shows a hypothetical distribution of predicted probabilities by class: an example of good class separation but timid, poorly calibrated probabilities. You get a great AUC but a crummy log loss.
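The scenario in the picture can be reproduced numerically. In this sketch (with hypothetical scores), both classifiers separate the classes perfectly, so both have an AUC of 1.0; but the timid one clusters its probabilities near 0.5, and log loss exposes the difference.

```python
import math

def log_loss(labels, probs):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

labels = [0] * 5 + [1] * 5
# Perfectly separated but timid: every positive outranks every negative
# (AUC = 1.0), yet all probabilities hug 0.5.
timid = [0.45] * 5 + [0.55] * 5
# Confident and well-calibrated: same ordering, so AUC is also 1.0.
confident = [0.05] * 5 + [0.95] * 5

print(log_loss(labels, timid))      # ~0.60, barely better than guessing 0.5
print(log_loss(labels, confident))  # ~0.05
```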


CPU Vulnerabilities and Design Choices


By Sean Leahy

Over the past several days, four white papers from different security teams were released detailing major, previously unknown CPU vulnerabilities. Links are below.

To quickly summarize: these vulnerabilities expose the memory of other applications running on the same physical system. This means that code running on a host without privileged access has free rein to view the data of any other process on that machine (including inside virtual machines). While this is a major vulnerability, it is not beyond the assumptions that Data Machines makes when planning for security in our architectures.

Host privilege escalation is always a risk that must be considered, and we plan for it in our larger security strategy. Toward this end, we've minimized scenarios in our systems where any type of host privilege escalation would compromise data in a manner that is problematic for our clients, partners, and their research. Data Machines uses perimeter access (via a VPN) to provide strong attribution. We log any suspicious user behavior on the systems and mitigate and report it if we see an attempted exploit or unplanned data movement. Our clients do not experience any limitations on system or data access in the interim.

We generally allow researchers to have access to computation resources in any manner they see fit, with two caveats:

  1. We need to know who they are at all times (strong attribution, no shared accounts, etc.)

  2. We don't let them create conditions that allow inbound connections from the Internet except for very tightly controlled situations (e.g. HTTPS with our hardened authentication in front of it, SSH, or SCP).

Data we hold for our clients is split into two broad groups: partner data and research data. Partner data is isolated on separate VLANs and is not on systems shared by the larger bodies of users. More specific constraints have been put in place on a case-by-case basis. At the other extreme, we can and have kept data in cold storage offline in a GSA-approved safe in a secure facility, where it is only brought out for tightly controlled research sprints. In short, Data Machines works with each client to provide data handling and security practices that meet their specific needs, and our capabilities are broad enough to accommodate nearly any request.

Security is a critical requirement for each project at Data Machines and should be designed with as few assumptions as possible. The recent security exploits listed below demonstrate how bad assumptions can result in a single point of failure that may compromise system behavior. Data Machines strives to produce robust security controls and practices around systems in anticipation of new and novel methods of exploitation. By making good architecture and system design choices, the impact of individual security failures can be minimized even when they are as extreme as CPU hardware vulnerabilities.

Here are links to the recently announced vulnerabilities: