November 7, 2019

365 words 2 mins read

Compressed linear algebra in Apache SystemML

Many iterative machine-learning algorithms can operate efficiently only when a large matrix of training data fits in main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Talk Title Compressed linear algebra in Apache SystemML
Speakers Frederick Reiss (IBM), Arvind Surve (IBM)
Conference Strata + Hadoop World
Conf Tag Big Data Expo
Location San Jose, California
Date March 14-16, 2017
URL Talk Page
Slides Talk Slides

Many iterative machine-learning algorithms can operate efficiently only when a large matrix of training data fits in main memory. Running these algorithms over big data therefore requires large numbers of machines with large amounts of RAM, which can quickly become very expensive. Compressing the matrices with general-purpose algorithms like gzip doesn’t improve performance, because decompression speed is on par with the speed of reading data from disk. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Compressed linear algebra uses actionable compression to represent matrices of training data. Unlike general-purpose compression, actionable compression allows operations to proceed directly over the compressed data. Frederick and Arvind show that it is possible to implement critical linear algebra operations in the compressed domain, delivering performance that matches, and in some cases greatly exceeds, that of conventional numerical libraries operating over uncompressed data.

Frederick and Arvind then describe an end-to-end implementation of compressed linear algebra in Apache SystemML, a language and system for implementing scalable machine-learning algorithms on Apache Spark and Hadoop MapReduce. Incorporating compressed linear algebra into SystemML’s runtime and optimizer can achieve performance improvements of more than 25x with no changes to the algorithm code. Frederick and Arvind start with a brief description of Apache SystemML before using instructive examples in SystemML’s R-like domain-specific language to describe the problem of fitting large training sets into main memory. They conclude with detailed, end-to-end performance results involving key machine-learning algorithms and reference datasets.
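To make the idea of actionable compression concrete, here is a minimal Python/NumPy sketch, not SystemML’s actual implementation, of the kind of scheme the talk describes: each column is dictionary-encoded as its distinct values plus offset lists of the rows where each value occurs, and a matrix-vector product is computed directly on that encoding. The names compress_column and compressed_matvec are illustrative, not SystemML APIs.

```python
import numpy as np

def compress_column(col):
    """Dictionary-encode one column: the distinct values plus, for each
    value, the array of row offsets where it occurs (offset-list encoding)."""
    values, inverse = np.unique(col, return_inverse=True)
    offsets = [np.where(inverse == k)[0] for k in range(len(values))]
    return values, offsets

def compressed_matvec(compressed_cols, v, n_rows):
    """Compute y = X @ v directly on the compressed columns.
    Each distinct value in column j contributes value * v[j] to every row
    where it occurs, so work scales with distinct values, not with rows."""
    y = np.zeros(n_rows)
    for j, (values, offsets) in enumerate(compressed_cols):
        for val, rows in zip(values, offsets):
            y[rows] += val * v[j]
    return y

# Training matrices are often low-cardinality (e.g., categorical or
# dummy-coded features), which is what makes this encoding compact.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 4)).astype(float)
v = rng.standard_normal(4)

compressed = [compress_column(X[:, j]) for j in range(X.shape[1])]
y = compressed_matvec(compressed, v, X.shape[0])
assert np.allclose(y, X @ v)  # matches the uncompressed product
```

Because iterative algorithms repeat such matrix-vector products over the same training matrix many times, the one-time cost of compressing is amortized across iterations, which is why operating in the compressed domain can match or beat uncompressed numerical libraries.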
