December 15, 2019

How to scale a distributed system

It seems like everyone is building a distributed system. However, there's no common body of knowledge about how these systems should be built and scaled, beyond what is squirreled away in various academic papers. Henry Robinson shares lessons learned from over eight years spent building distributed systems and outlines a framework for thinking about distributed scaling challenges.


Talk Title	How to scale a distributed system
Speakers	Henry Robinson (Cloudera)
Conference	O’Reilly Velocity Conference
Conf Tag	Build Resilient Distributed Systems
Location	San Jose, California
Date	June 20-22, 2017
URL	Talk Page
Slides	Talk Slides
Video

Despite the continuing high industrial demand for building new distributed systems, there are few institutionalized, commonly applicable techniques and design approaches like those found in other engineering disciplines. Practitioners are left to learn the same lessons over and over again, either through hard-won experience or by stumbling across a relevant paragraph in an academic paper. Henry Robinson shares practical lessons learned from more than eight years spent building distributed systems using the Hadoop ecosystem (including Apache ZooKeeper, Apache Flume, Apache Impala, and more), focusing on the thorny question of how to scale a distributed system. Henry outlines a framework for thinking about the problems of scale (in many dimensions) and effectively navigating the phase transitions between 10-, 100-, and 1,000-node deployments.

Architecting a next-generation data platform

December 5, 2019

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Modern Big Data Pipelines over Kubernetes [I]

December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions which don't necessarily benefit from the Hadoop deployme …

How to secure Apache Spark?

December 2, 2019

Security has been a large and growing aspect of distributed systems, specifically in the big data ecosystem, but it's an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of setting up security, and an awareness of the problems to look out for, can help an organization move forward in the right way.

Kubernetes Ingress Controller with Apache Traffic Server [I]

November 29, 2019

Today, the Oath Media Brands and Products container platform is serving critical application workloads like Yahoo Sports and Yahoo Finance at a large scale using Kubernetes as the orchestration framew …

The state of Spark in the cloud

November 29, 2019

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as a baseline.

Distinguish pop music from heavy metal using Apache Spark MLlib

November 25, 2019

Taras Matyashovsky explains how to use Apache Spark MLlib to build a supervised learning NLP pipeline to distinguish pop music from heavy metal, and have fun in the process.