January 25, 2020

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.


Talk Title	Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)
Speakers	Mathew Lodge (Anaconda)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018
URL	Talk Page
Slides	Talk Slides
Video

Big data architectures like Hadoop and Spark solve the distributed database problem well, but they take it as an article of faith that moving compute closer to data is important for performance. They also assume your code is written in Java or another JVM-based language like Scala. Three things work against that design. First, data science, predictive analytics, and ML don’t happen in JVM-based languages; they happen in Python, R, and, to a lesser extent, C/C++. Second, today’s data center networks offer 1,000 times the bandwidth at a lower total cost than in 2005, when Hadoop was first conceived, so data locality matters much less. Finally, all the major players, including AWS, Microsoft, Google, IBM, Red Hat, and Docker, are lined up behind Kubernetes. Containers and Kubernetes make great language-agnostic distributed computing clusters: it’s just as easy to deploy Python as it is Java. Mathew Lodge shows you how. This session is sponsored by Anaconda.

Pangeo: Big data climate science in the cloud

January 6, 2020

Climate science is being flooded with petabytes of data, overwhelming traditional modes of data analysis. The Pangeo project is building a platform to take big data climate science into the cloud using SciPy and large-scale interactive computing tools. Join Ryan Abernathey and Yuvi Panda to find out what the Pangeo team is building and why and learn how to use it.

The SMACK stack on Mesosphere DC/OS using cloud infrastructure

December 24, 2019

John Dohoney and Kaitlin Carter walk you through deploying the SMACK stack on DC/OS. This architecture lets you build modern streaming applications that combine NoSQL storage in Cassandra, message streaming with Apache Kafka, and streaming analytics with Apache Spark, implemented with Akka streaming and asynchronous Java libraries, all running on Apache Mesos under DC/OS.

Distributed training of deep learning models

December 10, 2019

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Deploying Hyperledger Fabric with Kubernetes/Helm

November 8, 2019

Deploying Hyperledger Fabric to production on Kubernetes is not a solved problem. AID:Tech present their work on designing and open-sourcing Helm charts. Rather than developing a monolithic Helm chart, A …

What's the Hadoop-la about Kubernetes?

January 16, 2020

Kubernetes (K8s), the open source container orchestration system for modern big data workloads, is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.