Complexities at Different Stages of Data Science Projects

By Nathan Danneman

Data science is a complex field at the intersection of coding, math/stats, and computer engineering. That said, as the industry continues to advance, some areas of data science have become easier to navigate. AutoML-style methods for running hyperparameter selection experiments; wrappers around complex deep-learning frameworks; and efforts like DARPA’s D3M Program that seek to automate preprocessing and model selection: all of these tools make once complex or time-consuming data science tasks and pipelines much simpler.

Given all this progress on automating portions of the data science workload, we sometimes get asked, “What makes data science hard?” This oft-repeated question got us thinking: what characteristics make a data science problem really thorny? This post is a first answer, laying out the complicating factors that make individual data science projects more difficult. In future posts, we will go into substantially more depth, explaining each complicating factor and pointing to solution spaces for it. Here, we highlight a set of factors, any one of which can take a data science effort from easy to complicated.

Complications Related to Obtaining Data

The first potential complicating factor in any data science project is, of course, the data. Where is your data coming from? It might be curated for you, or you might need to parse and clean it yourself. You may need to scrape it, and if so, your scraper might need to operate in an adversarial setting. Perhaps your inputs are the outputs of an upstream machine learning or modeling-and-simulation process. If so, what do you know about that data-generating process, and do you need to propagate its error estimates? Data could arrive at your compute resources in batches or as a stream. Supervised cases, where data are labeled, are typically easier; however, the data might arrive partially or incorrectly labeled. Even storing the training data can be complex: the data could be small enough to store locally, or you might need distributed storage. All these factors can add to the complexity of the problem before you have even begun the core task of data science: inducing useful, high-performing models from data.

Complications When Modeling

The core of data science is estimating a model that is useful. Complexity rears its head immediately: “useful” depends on your context, and thus shapes which outcomes you track or optimize. Before beginning, you’ll need to decide which classes of models are reasonable, given your assumptions about the data-generating process. If no off-the-shelf models seem appropriate, you’ll need to consider writing your own estimator, along with a process for validating it. With a plausible estimator (or set of estimators) in hand, difficulties abound. Your data might be so small (in terms of number of observations) that inducing a model is difficult, so large that doing so is expensive, or so wide (in terms of features) that many classes of model do not make sense. If your model is so large that exhaustive hyperparameter tuning is not feasible, you’ll have to consider carefully which hyperparameter settings to try. Missing data, if present, can make your preferred estimator perform erratically; class imbalance can cause convergence issues and poor predictive accuracy. Unsupervised problems come with their own complexities, foremost among them that it may prove burdensome to find an accurate validation measure for any latent variables you discover. Parameter and predictive uncertainty may matter more or less to your application, and quantifying them can be complicated. Finally, for many applied efforts, models vary in their accuracy across different portions of the input domain, and you are likely to need a systematic way to measure this (one approach is sketched below).
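As a concrete illustration of that last point, here is a minimal sketch of slice-based evaluation. It assumes a fitted scikit-learn-style classifier and pandas data structures; the slice column is a hypothetical metadata field that the model was not trained on, not something from a specific project.

```python
# Minimal sketch: measure accuracy separately on segments of the input domain.
# Assumes `model` is a fitted scikit-learn-style classifier, `X_test` is a
# pandas DataFrame, and `slice_col` is a metadata column (e.g., a hypothetical
# "region") that was not used as a training feature.
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_slice(model, X_test: pd.DataFrame, y_test: pd.Series,
                      slice_col: str) -> pd.Series:
    """Return accuracy computed separately for each value of `slice_col`."""
    scores = {}
    for value, idx in X_test.groupby(slice_col).groups.items():
        X_slice = X_test.loc[idx].drop(columns=[slice_col])
        y_slice = y_test.loc[idx]
        scores[value] = accuracy_score(y_slice, model.predict(X_slice))
    return pd.Series(scores, name="accuracy")

# Example usage (assuming `model`, `X_test`, `y_test` already exist):
# print(accuracy_by_slice(model, X_test, y_test, slice_col="region"))
```

A table like this makes it easy to spot regions of the input space where the model underperforms, even when aggregate accuracy looks acceptable.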

Complications During Deployment

If you make your way through the above minefields, you have potentially induced a useful model. All done, right? Well, almost. We still need to talk about how you intend to deploy that model. You might push batches of data through it, or you might have a streaming application. If streaming, your data rate could be either steady or bursty. It can be hard to determine whether inbound data is stationary (one simple check is sketched below). And if you plan to update your model online, you will need to guard against possible poisoning attacks. After all this, each row of your test data might be complete; if not, handling missing data at inference time introduces additional conundrums.
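To make the stationarity point concrete, here is a rough sketch of one common sanity check: comparing each inbound feature’s distribution against a reference sample from the training data with a two-sample Kolmogorov-Smirnov test. The significance threshold and synthetic data are illustrative assumptions; a real monitoring setup would also need to handle multiple testing and bursty batch sizes.

```python
# Rough sketch of a per-feature drift check: compare a reference (training)
# sample against a recent batch of inbound data with a two-sample
# Kolmogorov-Smirnov test. Thresholds and data are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def flag_drifting_features(reference: np.ndarray, inbound: np.ndarray,
                           feature_names, alpha: float = 0.01):
    """Return (name, statistic, p-value) for features whose inbound
    distribution differs from the reference at significance level `alpha`."""
    drifting = []
    for j, name in enumerate(feature_names):
        stat, p_value = ks_2samp(reference[:, j], inbound[:, j])
        if p_value < alpha:
            drifting.append((name, stat, p_value))
    return drifting

# Example with synthetic data: the second feature is shifted upward.
rng = np.random.default_rng(0)
ref = rng.normal(size=(5000, 3))
new = np.column_stack([rng.normal(size=2000),
                       rng.normal(loc=0.5, size=2000),
                       rng.normal(size=2000)])
print(flag_drifting_features(ref, new, ["f0", "f1", "f2"]))
```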

Partial Solutions and Combinatorial Complexity

Many of the hiccups referenced above have well-understood, canned solutions. Intelligent upsampling (e.g., via SMOTE) can help with class imbalance (a short sketch follows). Small data is the purview of statistics generally, and Bayesian statistics specifically. Missing data can be ignored if it is likely to be missing completely at random (MCAR), or missingness itself can be accounted for during modeling. There is a well-worn literature on causal inference, and burgeoning attempts to apply deep learning to it. Model-monitoring solutions are becoming more advanced, though they still often require heavy engineering lifts.
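For readers who have not used it, here is a brief sketch of SMOTE via the imbalanced-learn library on a synthetic, imbalanced dataset; the parameters shown are illustrative rather than recommendations.

```python
# Brief sketch of SMOTE-based upsampling for class imbalance, using the
# imbalanced-learn library on a synthetic dataset with a 95/5 class split.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE generates synthetic minority-class examples by interpolating between
# existing minority points and their nearest neighbors.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_resampled))
```

Note that resampling of this kind belongs on training folds only; held-out evaluation data should keep its original class distribution.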

Unfortunately, many of the challenges highlighted here can co-occur, making solutions to the individual components untenable. For instance, substantial missing data at inference time will likely make model-monitoring tools fail; noisy or erroneous labels wreak havoc on solutions for class imbalance; and wide data makes assessing stationarity substantially more difficult. In this series of posts, we will address these hiccups more specifically and brainstorm possible solutions to make data science as a whole easier to understand and execute.
