Meta-data science: When all the world's data scientists are just not enough
What if you had to build more models than there are data scientists in the world, a feat enterprise companies serving hundreds of thousands of businesses often have to pull off? Leah McGuire offers an overview of Salesforce's general-purpose machine-learning platform, which automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models.
Talk Title | Meta-data science: When all the world's data scientists are just not enough |
Speakers | Leah McGuire (Salesforce) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | May 23-25, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Due to privacy concerns and the nature of SaaS businesses, platforms like CRM systems often have to provide intelligent, data-driven features built from many unique, per-customer machine-learned models. For Salesforce, this means building hundreds of thousands of models, each tuned to a distinct customer, for any given data-driven application. Leah McGuire offers an overview of Einstein, Salesforce's homegrown Spark ML-based machine-learning platform. Einstein's automated feature engineering yields much faster modeling turnaround and higher accuracy than general-purpose modeling libraries such as scikit-learn; its automatic hyperparameter optimization, feature selection, and model selection produce a strong model for each specific customer; its modular workflows and transformations complement systems like Spark ML and KeystoneML; and its scale enables training thousands of models per day.
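To make the idea of automated, per-customer model selection concrete, here is a minimal sketch in Python using scikit-learn (the library the description mentions only as a comparison point). This is an illustrative assumption of what such automation might look like, not Salesforce's actual Einstein code; the customer names and the candidate model grid are invented for the example.

```python
# Hedged sketch: automated per-customer model and hyperparameter selection.
# This illustrates the general technique the talk describes; it is NOT
# Salesforce's implementation, which is built on Spark ML at far larger scale.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def fit_best_model(X, y):
    """Search a small model/hyperparameter space; return the best estimator."""
    candidates = [
        (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
        (RandomForestClassifier(random_state=0), {"n_estimators": [10, 50]}),
    ]
    best_score, best_model = -1.0, None
    for estimator, grid in candidates:
        # Cross-validated grid search = automatic hyperparameter optimization.
        search = GridSearchCV(estimator, grid, cv=3)
        search.fit(X, y)
        if search.best_score_ > best_score:
            best_score, best_model = search.best_score_, search.best_estimator_
    return best_model, best_score


# Each "customer" has its own data distribution, so each gets its own
# independently tuned model (customer names here are hypothetical).
customers = {
    name: make_classification(n_samples=200, n_features=10, random_state=seed)
    for name, seed in [("acme", 0), ("globex", 1)]
}
models = {name: fit_best_model(X, y) for name, (X, y) in customers.items()}
for name, (model, score) in models.items():
    print(name, type(model).__name__, round(score, 2))
```

At Salesforce's scale the same loop would run over hundreds of thousands of tenants on a distributed engine such as Spark, but the core pattern, searching a model and hyperparameter space per customer and keeping the cross-validated winner, is the same.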