October 21, 2019

Scalable schema management for Hadoop and Spark applications

Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark application more reliable and efficient.


Talk Title	Scalable schema management for Hadoop and Spark applications
Speakers	Kelvin Chu (Uber), Evan Richards (Uber)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016
URL	Talk Page
Slides	Talk Slides

Schema plays a key role in the Hadoop architecture at Uber. Uber has a complex environment of many data sources (key-value stores, Kafka, relational DBs) and many data producer/consumer combinations. Kelvin Chu and Evan Richards discuss Uber’s internal systems and tools for schema creation, inference, validation, evolution, and migration, covering motivations and results. Kelvin and Evan share their experience implementing and optimizing Uber’s data producer clients in four languages—Python, Node.js, Java, and Go—and explain how they leverage Spark to do efficient schema inference, data migration, and scalable computation.
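To make "schema inference" and "validation" concrete, here is a minimal plain-Python sketch. It is not Uber's tooling, which the abstract does not describe in detail. The function names `infer_schema` and `validate` and the field names in the usage note are hypothetical, chosen only for illustration. The idea is simply to collect the set of value types seen per field across sample records, then check new records against that set.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical helper: a schema here is just field name -> set of type names seen.
Schema = Dict[str, Set[str]]


def infer_schema(records: Iterable[Dict[str, Any]]) -> Schema:
    """Collect, for every field, the names of all value types observed."""
    schema: Schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def validate(record: Dict[str, Any], schema: Schema) -> List[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    errors: List[str] = []
    for field, value in record.items():
        if field not in schema:
            errors.append(f"unknown field: {field}")
        elif type(value).__name__ not in schema[field]:
            errors.append(f"bad type for {field}: {type(value).__name__}")
    return errors
```

For example, inferring from `[{"trip_id": 1, "city": "SF"}, {"trip_id": 2, "fare": 12.5}]` yields one type set per field, and validating `{"trip_id": "3"}` against that result reports a type mismatch for `trip_id`. A distributed version would compute the per-field type sets per partition and merge them, which is the kind of work the abstract says is handed to Spark.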

kafka management spark hadoop uber python scalable

Python scalability: A convenient truth

October 21, 2019

Despite Python's popularity throughout the data engineering and data science workflow, the principles behind its performance and scaling behavior are less well understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Real-time Hadoop: What an ideal messaging system should bring to Hadoop

October 21, 2019

Application messaging isn't new: solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can modern distributed messaging systems like Kafka be considered replacements for these legacy systems, or are they purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks

October 20, 2019

Celtra provides a platform for customers like Porsche and Fox to create, track, and analyze digital display advertising. Celtra's platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Grega Kešpret outlines Celtra's data-pipeline challenges and explains how it solved them by combining Snowflake's cloud data warehouse with Spark.

Transforming Telefónica

October 19, 2019

Increasing competition and technological change are pushing the telco industry toward a new model of analytics. Telefónica has been at the forefront of this change, driving its business transformation into a digital telco. John Belchamber and Arturo Canales tell the story of that transformation and detail the pitfalls and challenges faced by teams looking to follow a similar journey.

Scale your code with Scala.js

October 14, 2019

Web apps are complex and composed of many technologies. It can be difficult to scale large server and client codebases at the same time. Scala is an expressive, performant language that can now run in your browser as well as on the JVM. Paul Draper explains how Scala's presence on the two most ubiquitous runtimes greatly assists web developers.