January 3, 2020

342 words 2 mins read

A recommendation system for wide transactions

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving them using Elasticsearch in the cloud.

Talk Title A recommendation system for wide transactions
Speakers Bargava Subramanian (Binaize), Harjindersingh Mistry (Ola)
Conference Strata + Hadoop World
Conf Tag Make Data Work
Location Singapore
Date December 6-8, 2016
URL Talk Page
Slides Talk Slides

Many of the applications we use today are powered by the cloud and mobile. One of the critical components driving engagement on cloud platforms is the recommendation engine. Recommendation systems are becoming pervasive, but as both the user base and the product catalog of a platform grow, two distinct sets of challenges emerge: one in engineering and one in machine learning.

Bargava and Harjinder define wide data as data in which a transaction basket contains more than 1,000 items. Examples of big and wide data include the financial instruments traded by a portfolio manager in a day, the products shipped from a warehouse, and the software components in a cloud platform. Standard approaches to wide data have been market basket analysis (frequent pattern mining), collaborative filtering (matrix factorization), and deep learning.

Apache Spark lends itself nicely to building a data science pipeline, from ingestion through data processing to machine learning. But as the data becomes wider, model-training performance takes a hit. Bargava and Harjinder explain how they used the alternating least squares (ALS) algorithm in Spark to generate frequent itemsets. The new approach was faster and scaled well for big and wide data.
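The talk itself doesn't include code, but a minimal PySpark sketch of that idea might look like the following, assuming an input table of (basket_id, item_id) pairs with integer IDs; the column names, input path, and hyperparameters are illustrative assumptions, not the speakers' actual pipeline:

```python
# Sketch: factorize the basket-item matrix with ALS (implicit feedback),
# then use the learned item factors to find items that frequently occur
# together. Assumes integer basket_id/item_id columns, as ALS requires.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("wide-transaction-recs").getOrCreate()

# Hypothetical input: one row per (basket_id, item_id) occurrence.
transactions = spark.read.parquet("/data/transactions")

ratings = (transactions
           .groupBy("basket_id", "item_id")
           .count()  # occurrence count serves as implicit feedback strength
           .withColumnRenamed("count", "strength"))

als = ALS(userCol="basket_id", itemCol="item_id", ratingCol="strength",
          implicitPrefs=True,          # transactions, not explicit ratings
          rank=50, regParam=0.1, maxIter=10,
          coldStartStrategy="drop")
model = als.fit(ratings)

# model.itemFactors is a DataFrame of (id, features); nearest neighbors
# in this latent space approximate frequently co-occurring itemsets.
item_factors = model.itemFactors
```

From there, each item's latent vector could be indexed into Elasticsearch as a document, so that co-occurring items are retrieved with a low-latency query at serving time, in the spirit of the cloud-based serving layer the speakers describe.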
