December 1, 2019

486 words 3 mins read

Multinode restricted Boltzmann machines for big data

Nikolay Manchev offers an overview of the restricted Boltzmann machine, a type of neural network with a wide range of applications, and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale.

Talk Title Multinode restricted Boltzmann machines for big data
Speakers Nikolay Manchev (IBM)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date May 23-25, 2017
URL Talk Page
Slides Talk Slides

In the age of big data, there has been unprecedented growth in the amount of data available for analysis, but handling unstructured and semistructured data is a challenging task that prompts many organizations to discard a substantial amount of data. Artificial neural networks (ANNs) have been used successfully to impose structure on unstructured data by means of unsupervised feature extraction and nonlinear pattern detection. Restricted Boltzmann machines (RBMs), for example, have been shown to have a wide range of applications in this context: they can be used as generative models for dimensionality reduction, classification, collaborative filtering, extraction of semantic document representations, and more. RBMs also serve as building blocks for the multilayer learning architecture of deep belief networks.

Training RBMs on a big dataset, however, is problematic. When operating with millions or billions of parameters, the parameter estimation process for a conventional, nonparallelized RBM can take weeks. In addition, the constraint of fitting the model on a single machine further limits scalability.

Numerous attempts have been made to overcome these limitations, most of them involving computation on GPUs. Studies have shown that this approach can reduce the training time for an RBM-based deep belief network from several weeks to a single day. However, GPU-based training presents its own challenges. GPUs impose a limit on the amount of memory available for the computation, which in turn limits model size. Stacking multiple GPUs together is inefficient because of communication overhead and increased cost, and there are further limitations arising from memory transfer times and thread synchronization.

Nikolay Manchev explores an implementation of a CPU-based, parallelized version of the restricted Boltzmann machine created as a collaboration between IBM and City University London. The research team built a custom RBM implementation that runs on top of Apache SystemML, a declarative large-scale machine-learning platform, and carried out a number of tests on various datasets, using RBMs as feature extractors and feeding the outputs to different classification algorithms (support vector machines, decision trees, multinomial logistic regression, etc.). Nikolay offers an overview of the research and the current state of this stochastic ANN model in the context of big data, as well as future plans. Along the way, he also discusses how SystemML alleviates certain big data challenges (e.g., using cost-based optimization for distributed matrix operations) and why the team chose it as the foundation for its machine-learning problem.
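To make the training bottleneck concrete, here is a minimal NumPy sketch of one-step contrastive divergence (CD-1), the standard procedure for estimating RBM parameters. It is illustrative only, not the IBM/SystemML implementation from the talk; the layer sizes, learning rate, and toy data are placeholder assumptions.

```python
# Minimal CD-1 training sketch for a binary-unit RBM (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 784, 128      # e.g., flattened 28x28 images (assumed sizes)
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)           # visible-unit biases
b_h = np.zeros(n_hidden)            # hidden-unit biases
lr = 0.1                            # placeholder learning rate

def cd1_update(v0):
    """One CD-1 step on a mini-batch v0 of shape (batch, n_visible)."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and samples given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: data statistics minus model statistics.
    batch = v0.shape[0]
    W   += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

# Toy run on random binary data, just to show the call pattern.
X = (rng.random((64, n_visible)) < 0.5).astype(float)
cd1_update(X)
features = sigmoid(X @ W + b_h)     # hidden activations serve as extracted features
```

Every update involves dense matrix products over all weights, which is why a nonparallelized implementation becomes impractical at the scale of millions or billions of parameters.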
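For the feature-extraction pipeline itself, here is a hedged sketch using scikit-learn's off-the-shelf BernoulliRBM in place of the team's custom SystemML implementation, with multinomial logistic regression as the downstream classifier; the dataset and hyperparameters are placeholders chosen for illustration.

```python
# RBM-as-feature-extractor pipeline, sketched with scikit-learn stand-ins.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0                          # scale pixel values into [0, 1] for binary units

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.06, n_iter=10,
                         random_state=0)),       # unsupervised feature extractor
    ("clf", LogisticRegression(max_iter=1000)),  # downstream classifier
])
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```

Swapping the final stage for a support vector machine or a decision tree mirrors the kind of classifier comparison the talk describes, with the RBM's hidden activations acting as the shared learned representation.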
