January 9, 2020

A hands-on data science crash course for modeling and predicting the behavior of (large) distributed systems

Data science is a hot topic. Bart De Vylder and Pieter Buteneers offer a practical introduction that goes beyond the hype, exploring data analysis, visualization, and machine learning techniques using Python for modeling the behavior of distributed systems. You'll leave with a solid starting point to implement data science techniques in your infrastructure or domain of interest.

Talk Title: A hands-on data science crash course for modeling and predicting the behavior of (large) distributed systems
Speakers: Bart De Vylder (CoScale), Pieter Buteneers (CoScale)
Conference: O’Reilly Velocity Conference
Conf Tag: Build resilient systems at scale
Location: New York, New York
Date: October 2-4, 2017
URL: Talk Page
Slides: Talk Slides
Video:

Data science is a hot topic. However, the sheer number of available software libraries, languages, and platforms is often overwhelming for those who want to get started in the field. Bart De Vylder and Pieter Buteneers offer a practical introduction that goes beyond the hype, exploring data analysis and modeling techniques applied to the behavior of distributed systems.

Using hosted IPython notebooks and a real-world dataset of monitoring data from a nontrivial distributed application, consisting of both stateful and stateless services communicating over a message bus, Bart and Pieter walk you through the Python scientific ecosystem (NumPy, SciPy, and scikit-learn) and demonstrate data visualization techniques that help in interpreting the data and the models built from it.

Bart and Pieter then discuss data clustering techniques, such as those that automatically discover which servers or containers are running in a load-balanced fashion, and show you how to apply correlation analysis and dimensionality reduction techniques. Modern monitoring systems easily capture tens of thousands of metrics, but many of these metrics are highly correlated and don’t convey much extra information. Applying dimensionality reduction techniques to automatically discover these correlations helps in understanding and visualizing the data and is a step in preparing and modeling it.

Bart and Pieter also outline supervised machine learning techniques to model data and touch on the important concepts of overfitting and cross-validation, considering the advantages and disadvantages of both simple linear techniques and more advanced ones. They then explain how to put these models into action and make predictions, discussing techniques for performing what-if analysis related to capacity planning (e.g., which resource will be the next bottleneck if the number of web requests keeps increasing?) and robustness (e.g., what is the impact on a service’s SLA if a node drops out?).
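The abstract's point that many monitoring metrics are highly correlated can be illustrated with a small sketch. This is not the speakers' notebook code: the metric names, the synthetic series, the 0.95 threshold, and the helper functions `pearson` and `correlated_groups` are all assumptions made for illustration. It uses only the Python standard library to stay self-contained; in the notebook setting the talk describes, NumPy's `corrcoef` or scikit-learn's `PCA` would be the more natural tools.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_groups(metrics, threshold=0.95):
    """Greedily group metrics by correlation with each group's first member.

    A deliberately simple heuristic (hypothetical, not from the talk):
    a metric joins the first group whose representative it correlates
    with at |r| >= threshold, otherwise it starts a new group.
    """
    groups = []
    for name, series in metrics.items():
        for group in groups:
            representative = metrics[group[0]]
            if abs(pearson(series, representative)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Synthetic monitoring data (made up): three metrics driven by request
# load, plus one disk metric that follows its own slow trend.
rng = random.Random(0)
requests = [100 + 50 * math.sin(t / 10) + rng.gauss(0, 2) for t in range(200)]
metrics = {
    "web.requests_per_s": requests,
    "web.cpu_percent": [0.4 * r + rng.gauss(0, 1) for r in requests],
    "bus.messages_per_s": [3 * r + rng.gauss(0, 5) for r in requests],
    "db.disk_free_gb": [500 - 0.1 * t + rng.gauss(0, 0.5) for t in range(200)],
}

groups = correlated_groups(metrics)
for group in groups:
    print(group)
```

In this synthetic setup the three load-driven metrics end up in one group and the disk metric in another, so a single representative per group carries most of the information. Dimensionality reduction techniques such as PCA generalize this idea by finding combinations of metrics rather than picking one representative.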
