A recommendation system for wide transactions
Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving the results using Elasticsearch in the cloud.
| Talk Title | A recommendation system for wide transactions |
|---|---|
| Speakers | Bargava Subramanian (Binaize), Harjindersingh Mistry (Ola) |
| Conference | Strata + Hadoop World |
| Conf Tag | Make Data Work |
| Location | Singapore |
| Date | December 6-8, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Many applications we use today are powered by the cloud and mobile, and one of the critical components driving engagement on cloud platforms is the recommendation engine. Recommendation systems are becoming pervasive, but as both the user base and the number of products offered on a platform scale, we are hit with two distinct sets of challenges: one in engineering and one in machine learning.

Bargava and Harjinder define wide data as data in which the number of items in a transaction basket is greater than 1,000. Examples of big and wide data include the financial instruments traded by a portfolio manager in a day, the products shipped from a warehouse, and the software components in a cloud platform.

The standard approaches to wide data have been market basket analysis (frequent pattern mining), collaborative filtering (matrix factorization), and deep learning; a sketch of the frequent-pattern baseline appears below.

Apache Spark lends itself nicely to building a data science pipeline, from ingestion through data processing to machine learning. But as the data becomes wider, model-training performance takes a hit. Bargava and Harjinder explain how they used the alternating least squares (ALS) algorithm in Spark to generate frequent itemsets and how they served the results from Elasticsearch; sketches of both steps follow the baseline below. The new approach was faster and scaled well for big and wide data.
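For context on the baseline the speakers compare against, here is a minimal sketch of market basket analysis with Spark MLlib's FPGrowth. The item names, the minimum support of 0.5, and the partition count are illustrative assumptions, not the speakers' settings; with 1,000+ items per basket this approach becomes expensive, which motivates the ALS alternative.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FrequentPatternBaseline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fp-baseline").getOrCreate()

    // Each transaction is a basket of item IDs; with wide data a basket
    // can hold more than 1,000 items. These tiny baskets are placeholders.
    val transactions = spark.sparkContext.parallelize(Seq(
      Array("bond-a", "bond-b", "swap-c"),
      Array("bond-a", "swap-c"),
      Array("bond-b", "swap-c", "future-d")
    ))

    // Illustrative thresholds; tuning them is workload dependent.
    val model = new FPGrowth()
      .setMinSupport(0.5)
      .setNumPartitions(4)
      .run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
    }
  }
}
```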
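The talk does not publish its code, but the approach it describes, training ALS on basket-item pairs so that item factors stand in for frequent-itemset generation, can be sketched with Spark ML's ALS. The column names, rank, regularization, and iteration count below are assumptions for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object WideAlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-als").getOrCreate()
    import spark.implicits._

    // (basketId, itemId, count): each row records that an item
    // appeared in a basket. Placeholder data.
    val pairs = Seq(
      (1, 101, 1.0f), (1, 102, 1.0f),
      (2, 101, 2.0f), (2, 103, 1.0f)
    ).toDF("basketId", "itemId", "count")

    // Implicit-preference ALS: transaction counts are signals of
    // co-occurrence, not explicit ratings. Hyperparameters are illustrative.
    val als = new ALS()
      .setUserCol("basketId")
      .setItemCol("itemId")
      .setRatingCol("count")
      .setImplicitPrefs(true)
      .setRank(50)
      .setRegParam(0.1)
      .setMaxIter(10)

    val model = als.fit(pairs)

    // Low-dimensional item factors: items with nearby vectors behave like
    // items that co-occur frequently, which is how matrix factorization
    // substitutes for explicit frequent-pattern mining here.
    model.itemFactors.show(truncate = false)
  }
}
```

Because ALS factors every basket and item into a fixed-rank latent space, its cost grows with the number of rows and the rank rather than with the combinatorics of itemsets, which is consistent with the speakers' claim that it scales better as baskets get wider.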
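For the serving side, one plausible way to get the learned factors into Elasticsearch from Spark is the elasticsearch-hadoop connector's `saveToEs`. The cluster address, the Parquet path, and the `recommendations/itemFactors` index name are hypothetical; the talk only states that Elasticsearch serves the results in the cloud.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-hadoop (es-spark) connector

object ServeFactors {
  def main(args: Array[String]): Unit = {
    // es.nodes points at a hypothetical Elasticsearch cluster.
    val spark = SparkSession.builder
      .appName("serve-factors")
      .config("es.nodes", "elasticsearch.internal:9200")
      .getOrCreate()

    // Hypothetical location where the ALS item factors were persisted.
    val itemFactors = spark.read.parquet("/models/als/itemFactors")

    // Index each item's factor vector; at request time, candidate items
    // can be scored against a basket's vector to produce recommendations.
    itemFactors.saveToEs("recommendations/itemFactors")
  }
}
```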