December 28, 2019

Pachyderm: Unlock the Power of Kubernetes for Big Data


Talk Title	Pachyderm: Unlock the Power of Kubernetes for Big Data
Speakers	Joey Zwicker
Conference	KubeCon + CloudNativeCon North America
Location	Seattle, WA, United States
Date	Nov 7-9, 2016
URL	Talk Page
Slides	Talk Slides

Pachyderm is an open source big data analytics platform deployed entirely on Kubernetes. Pachyderm leverages the Kubernetes Jobs API to process massive data workloads and build streaming pipelines. Its hallmark feature is version-controlled data, including branches, commits, and diffs for petabyte-scale data sets. In this talk we’ll demonstrate how Kubernetes and Pachyderm empower data science teams to collaborate on a shared, unified data infrastructure. Everything runs on Kubernetes, from streaming data ingestion and machine learning pipelines to automatic service deployment using rolling updates. We’ll discuss how Pachyderm couldn’t exist without a large swath of advanced Kubernetes primitives, and we’ll include a demo in which we stream data through the system and watch Kubernetes automatically schedule analytics containers and parallelize the data processing. The demo is inspired directly by how production users are managing data in Pachyderm today.

container, api, streaming, k8s, data set, data science, data analytics, analytics, infrastructure, open source, big data, machine learning, pipeline, kubernetes

Stream analytics in the enterprise: A look at Intel's internal IoT implementation

November 17, 2019

Moty Fania shares Intel's IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform's architecture.

Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing

December 12, 2019

Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.

Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies

December 10, 2019

David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

What's next for BDAS (the Berkeley Data Analytics Stack)?

October 18, 2019

Michael Franklin offers an overview of the Berkeley Data Analytics Stack, outlines the current directions it's taking, and settles once and for all how BDAS should be pronounced.

Ten Lessons From Telemetry

December 27, 2019

Streaming telemetry data enables network operators to bring network monitoring out of the 20th century and into the rich space of big data analytics. Having spent …