A recommendation system for wide transactions
Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving the results using Elasticsearch in the cloud.
| Talk Title | A recommendation system for wide transactions |
|---|---|
| Speakers | Bargava Subramanian (Binaize), Harjindersingh Mistry (Ola) |
| Conference | Strata + Hadoop World |
| Conf Tag | Make Data Work |
| Location | Singapore |
| Date | December 6-8, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Many applications we use today are powered by the cloud and mobile, and one of the critical components driving engagement on cloud platforms is the recommendation engine. Recommendation systems are becoming pervasive, but as both the user base and the number of products offered on a platform scale, we are hit with two distinct sets of challenges: one in engineering and one in machine learning.

Bargava and Harjinder define wide data as data in which the number of items in a transaction basket is greater than 1,000. Examples of big and wide data include the financial instruments traded by a portfolio manager in a day, the products shipped from a warehouse, and the software components in a cloud platform.

The standard approaches to wide data have been market basket analysis (frequent pattern mining), collaborative filtering (matrix factorization), and deep learning; a sketch of the frequent-pattern baseline appears below.

Apache Spark lends itself nicely to building a data science pipeline, from ingestion through data processing to machine learning. But as the data becomes wider, model-training performance takes a hit. Bargava and Harjinder explain how they used the alternating least squares (ALS) algorithm in Spark to generate frequent itemsets and how they served the results from Elasticsearch; sketches of both steps follow the baseline below. The new approach was faster and scaled well for big and wide data.
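For context on the baseline the speakers compare against, here is a minimal sketch of market basket analysis with Spark MLlib's FPGrowth. The item names, the minimum support of 0.5, and the partition count are illustrative assumptions, not the speakers' settings; with 1,000+ items per basket this approach becomes expensive, which motivates the ALS alternative.

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FrequentPatternBaseline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fp-baseline").getOrCreate()

    // Each transaction is a basket of item IDs; with wide data a basket
    // can hold more than 1,000 items. These tiny baskets are placeholders.
    val transactions = spark.sparkContext.parallelize(Seq(
      Array("bond-a", "bond-b", "swap-c"),
      Array("bond-a", "swap-c"),
      Array("bond-b", "swap-c", "future-d")
    ))

    // Illustrative thresholds; tuning them is workload dependent.
    val model = new FPGrowth()
      .setMinSupport(0.5)
      .setNumPartitions(4)
      .run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
    }
  }
}
```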
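The talk does not publish its code, but the approach it describes, training ALS on basket-item pairs so that item factors stand in for frequent-itemset generation, can be sketched with Spark ML's ALS. The column names, rank, regularization, and iteration count below are assumptions for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object WideAlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-als").getOrCreate()
    import spark.implicits._

    // (basketId, itemId, count): each row records that an item
    // appeared in a basket. Placeholder data.
    val pairs = Seq(
      (1, 101, 1.0f), (1, 102, 1.0f),
      (2, 101, 2.0f), (2, 103, 1.0f)
    ).toDF("basketId", "itemId", "count")

    // Implicit-preference ALS: transaction counts are signals of
    // co-occurrence, not explicit ratings. Hyperparameters are illustrative.
    val als = new ALS()
      .setUserCol("basketId")
      .setItemCol("itemId")
      .setRatingCol("count")
      .setImplicitPrefs(true)
      .setRank(50)
      .setRegParam(0.1)
      .setMaxIter(10)

    val model = als.fit(pairs)

    // Low-dimensional item factors: items with nearby vectors behave like
    // items that co-occur frequently, which is how matrix factorization
    // substitutes for explicit frequent-pattern mining here.
    model.itemFactors.show(truncate = false)
  }
}
```

Because ALS factors every basket and item into a fixed-rank latent space, its cost grows with the number of rows and the rank rather than with the combinatorics of itemsets, which is consistent with the speakers' claim that it scales better as baskets get wider.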
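For the serving side, one plausible way to get the learned factors into Elasticsearch from Spark is the elasticsearch-hadoop connector's `saveToEs`. The cluster address, the Parquet path, and the `recommendations/itemFactors` index name are hypothetical; the talk only states that Elasticsearch serves the results in the cloud.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-hadoop (es-spark) connector

object ServeFactors {
  def main(args: Array[String]): Unit = {
    // es.nodes points at a hypothetical Elasticsearch cluster.
    val spark = SparkSession.builder
      .appName("serve-factors")
      .config("es.nodes", "elasticsearch.internal:9200")
      .getOrCreate()

    // Hypothetical location where the ALS item factors were persisted.
    val itemFactors = spark.read.parquet("/models/als/itemFactors")

    // Index each item's factor vector; at request time, candidate items
    // can be scored against a basket's vector to produce recommendations.
    itemFactors.saveToEs("recommendations/itemFactors")
  }
}
```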