Atom smashing using machine learning at CERN
Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. This helps to replicate the datasets that are most heavily accessed, which improves the efficiency of physics analysis in CMS. Analyzing this data leads to useful information about the physical processes.
| Talk Title | Atom smashing using machine learning at CERN |
| --- | --- |
| Speakers | Siddha Ganju (NVIDIA) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 29-31, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
At CERN's CMS experiment, reproducibility requires that any physics process can be simulated and re-analyzed at different times. Some datasets are more popular than others and need to be made easily accessible: users access the data through replicas stored at designated sites, but creating numerous replicas of every dataset is not feasible, so predicting which datasets will become popular is necessary.

Siddha explains how CERN framed this as a binary classification problem: each dataset is labeled popular (1 / TRUE) or unpopular (0 / FALSE). She illustrates the approach with toy data, since the actual data cannot be disclosed; a sketch of such a setup appears below.

With the problem framed, CERN still had to decide which machine-learning algorithm suited it best. Three algorithms were employed: naive Bayes, stochastic gradient descent, and random forest. These models were combined into an ensemble, and each was evaluated on its true positive, true negative, false positive, and false negative counts to see which performed best (see the second sketch below). Siddha details how this process improves data analysis, enabling parallel, real-time processing of the abundant distributed data in CMS.
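As a rough sketch of how such a toy framing might look (the feature names and thresholds here are illustrative assumptions, not CMS's actual features, which are not public):

```python
import numpy as np
import pandas as pd

# Toy stand-in for CMS dataset-access records (the real data is not disclosed).
# Each row describes one dataset; the label marks whether it became popular.
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    "num_accesses": rng.poisson(50, n),     # hypothetical: times the dataset was read
    "num_users": rng.poisson(10, n),        # hypothetical: distinct users accessing it
    "cpu_hours": rng.gamma(2.0, 100.0, n),  # hypothetical: CPU time spent on it
})
# Binary label: popular (1 / TRUE) vs. unpopular (0 / FALSE),
# thresholded on usage for this toy example.
toy["popular"] = ((toy["num_accesses"] > 55) & (toy["num_users"] > 10)).astype(int)

X, y = toy[["num_accesses", "num_users", "cpu_hours"]], toy["popular"]
```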
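The talk names the three algorithms and an ensemble compared on true/false positive and negative counts; a minimal sketch of that comparison, assuming the toy features above and using scikit-learn (the talk does not specify the library):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The three algorithms mentioned in the talk.
models = {
    "naive_bayes": GaussianNB(),
    "sgd": SGDClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare each model's confusion-matrix counts: TN, FP, FN, TP.
for name, model in models.items():
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn}")

# Majority-vote ensemble over the three classifiers.
ensemble = VotingClassifier(estimators=list(models.items()), voting="hard")
ensemble.fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, ensemble.predict(X_test)).ravel()
print(f"ensemble: TP={tp} TN={tn} FP={fp} FN={fn}")
```

Hard (majority) voting is used here because it only requires class predictions from each member, which all three classifiers provide; whether CERN's ensemble voted this way is not stated in the talk summary.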