2000 Nodes and Beyond: How We Scaled Kubernetes to 60,000-Container Clusters and Where We're Going Next
January 2, 2020
"Kubernetes supports 2000-node clusters": that statement was part of the Kubernetes 1.3 release announcement. That's great, but what exactly does it mean? During this talk I will explain what work …
Apache Spark ML and MLlib tuning and optimization: A case study on boosting the performance of ALS by 60x
January 2, 2020
Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com's production environment.
Best practices with Kudu: An end-to-end user case from the automobile industry
January 2, 2020
Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.
Deploying a scalable JupyterHub environment for running Jupyter notebooks
January 1, 2020
Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.
R you ready for the cloud? Using R for operationalizing an enterprise-grade data science solution on Azure
December 30, 2019
R has long been criticized for its limitations on scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources of manifold computing capability for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an E2E enterprise-grade pipeline for big data analytics, all within R.
An architecture for merging fast data and enterprise applications: The SMACK stack
December 28, 2019
Big data architectures and enterprise/microservice architectures are slowly converging. Big data is transitioning to "fast data," emphasizing streaming over batch processing, while data processing is growing ubiquitous. Dean Wampler explores the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) and explains how it addresses the needs of both fast data and the enterprise.
POSIX for the data center
December 26, 2019
The container orchestration wars are upon us. A dozen container orchestrators vie to be the kernel of the modern data center. But can the warring parties come together on a standard interface for modern cluster operations? Karl Isenberg explores what these parties have in common and outlines what a common interface might look like for operating these distributed operating systems.
Building a powerful data tier from open source datastores
December 19, 2019
In the past few years, there has been a proliferation of production-ready open source databases, giving developers and operators more choices than ever. Joseph Lynch explores how Yelp has combined complementary data stores to provide a powerful data tier for its developers. Along the way, Joseph shares lessons learned about deployment, configuration, and monitoring from a production environment.
A practitioner's guide to securing your Hadoop cluster
December 16, 2019
Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
Authorization in the cloud: Enforcing access control across compute engines
December 16, 2019
Li Li and Hao Hao elaborate on the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and encourage attendees to adopt Apache Sentry and RecordService to protect sensitive data on the multitenant cloud across the Hadoop ecosystem.
Breaking Spark: The top five mistakes to avoid when using Apache Spark in production
December 15, 2019
Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian focuses on five common issues observed in a cluster environment set up with Apache Spark (Core, Streaming, and SQL) to help you improve the usability and supportability of Apache Spark and avoid such issues in future deployments.