Evaluating models for a needle in a haystack: Applications in predictive maintenance
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.
|Talk Title||Evaluating models for a needle in a haystack: Applications in predictive maintenance|
|Conference||Strata + Hadoop World|
|Conf Tag||Make Data Work|
|Location||New York, New York|
|Date||September 27-29, 2016|
Predictive maintenance is about anticipating a failure and taking preemptive action. With the recent advances in accessible machine learning and cloud storage, there is tremendous opportunity to use the full range of data coming from factories, buildings, machines, and sensors, not only to monitor the health of equipment but also to predict when it is likely to malfunction or fail. However, as simple as it sounds in principle, in reality the data required to make a prediction far enough in advance to act on it is hard to come by. The data that is collected is often incomplete or simply insufficient, making it unsuitable for modeling.
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Ideally, the data should have hundreds or even thousands of failures. However, unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Even when failures have been recorded, the ratio of failure to nonfailure data is highly skewed. Modeling for failure thus often falls under the classic problem of modeling with imbalanced data, where only a small fraction of the data constitutes failure. Standard methods for feature selection and feature construction do not work well for imbalanced data. Moreover, the metrics used to evaluate the model can be misleading: a model that never predicts a failure can still score a very high accuracy.
Danielle Dean and Shaheen Gauher discuss the best ways to build and evaluate models, offering examples that reference sample code in open source R as well as Microsoft R Server, which allows the computations to be done on big data. Danielle and Shaheen explain why a clear understanding of business requirements and of the tolerance for false negatives and false positives is necessary.
For example, for some businesses, failing to predict a malfunction can be extremely detrimental (e.g., aircraft engine failure) or exorbitantly expensive (e.g., production shutdown in a factory), while for others, falsely predicting a failure when there is none leads to a significant loss of time and resources. In the language of statistics, the cost attached to each kind of error is called the misclassification cost. Danielle and Shaheen conclude by illustrating how to deal with imbalanced data through two predictive maintenance case studies.
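The point about misleading metrics and misclassification cost can be illustrated with a small sketch. It is not the speakers' sample code (which the abstract says is in R); it is a minimal Python illustration under stated assumptions. The data is synthetic (10 failures among 1,000 records), both "models" are hard-coded prediction lists, and the costs `COST_MISSED_FAILURE` and `COST_FALSE_ALARM` are hypothetical values chosen only to show the calculation.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # failures that were predicted
    fp: int  # false alarms (predicted failure, none occurred)
    fn: int  # failures that were missed
    tn: int  # healthy records correctly left alone


def confusion(y_true, y_pred):
    """Count outcomes, treating label 1 as 'failure' (the rare class)."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
    )


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def recall(c):
    # Share of real failures that the model caught.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def precision(c):
    # Share of predicted failures that were real.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def total_cost(c, cost_fn, cost_fp):
    # Misclassification cost: each error type carries its own price.
    return c.fn * cost_fn + c.fp * cost_fp


# Hypothetical costs, for illustration only.
COST_MISSED_FAILURE = 500
COST_FALSE_ALARM = 10

# Synthetic, imbalanced labels: 10 failures followed by 990 healthy records.
y_true = [1] * 10 + [0] * 990

# Model A never predicts a failure.
never_fail = [0] * 1000
# Model B catches 8 of the 10 failures but raises 30 false alarms.
alarm_model = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960

for name, y_pred in [("never_fail", never_fail), ("alarm_model", alarm_model)]:
    c = confusion(y_true, y_pred)
    cost = total_cost(c, COST_MISSED_FAILURE, COST_FALSE_ALARM)
    print(f"{name}: accuracy={accuracy(c):.3f} recall={recall(c):.3f} "
          f"precision={precision(c):.3f} cost={cost}")
```

With these made-up numbers, the model that never predicts a failure reaches 99% accuracy but catches no failures and costs 5,000, while the noisier model scores a lower 96.8% accuracy, catches 80% of failures, and costs 1,300. Reversing the two cost values would favor a more conservative model, which is why the tolerance for each error type has to come from the business.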