Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores, and you'll take a deep dive into the architecture to see how it all works.


Talk Title	Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework
Speakers	Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a plug-in-based platform and library designed and built from the ground up by Uber, which will eventually support ingesting data from any source and dispersing it to any sink using Apache Spark. “Marmaray” refers to a tunnel in Turkey that connects Europe and Asia by rail. In the same way, Marmaray was envisioned within Uber as a pipeline connecting raw data from a variety of sources to Hadoop/Hive and connecting both raw and derived datasets from Hive to a variety of sinks, depending on SLA, latency, and other customer requirements. The team also added a framework around the core library to support fully self-serve onboarding, lowering the barrier to entry onto the platform, as well as automated integration with Uber’s workflow management system, which orchestrates and executes ingestion and dispersal jobs on a regular, specified cadence.

Many data users (e.g., Uber Eats and Uber’s machine learning platform, Michelangelo) use Hadoop in concert with other tools to build and train their machine learning models, ultimately producing derived datasets of immense additional value that drive Uber’s business toward greater efficiency and profitability. To maximize the usefulness of these derived datasets, the need arose to disperse this data to online data stores, often with much lower latency semantics than what existed in the Hadoop ecosystem, in order to serve live traffic. Marmaray was envisioned and designed to fulfill this need and to complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online data store.

Along the same lines, Uber’s business needs necessitated the ingestion of raw data from a variety of data sources into its Hadoop data lake, which required running and maintaining multiple data pipelines in production. This proved to be cumbersome over time, as the size of the data increased proportionally with Uber’s business growth. The Hadoop platform team at Uber envisioned and designed Marmaray to define a common set of abstractions and to provide a framework that unifies the ingestion pipelines into one that will prove to be much more maintainable and resource efficient as Uber’s business continues to mature.

You’ll learn how the Marmaray team designed and built a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned both from developing the core library and from setting up an on-demand self-service workflow, and how the team leveraged Apache Spark to ensure the platform can scale to handle Uber’s growing data needs. Danny, Omkar, and Eric also explain how the common ingestion framework helped Uber meet GDPR requirements. Uber plans to open-source the framework in 2018.
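
As a rough illustration of what "a common set of abstractions" for both ingestion and dispersal could look like, here is a minimal sketch in plain Python. It is not Marmaray's actual API, and it does not use Apache Spark. The names `Source`, `Sink`, `InMemorySource`, `InMemorySink`, and `run_job` are all hypothetical and exist only to show how one job runner can serve both directions once every system is hidden behind the same two plug-in interfaces.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical common record format shared by every plug-in.
Record = Dict[str, object]


class Source(ABC):
    """Hypothetical plug-in interface: reads records from some system."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Sink(ABC):
    """Hypothetical plug-in interface: writes records to some system."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class InMemorySource(Source):
    """Stand-in for a real source such as a message queue or a Hive table."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def read(self) -> Iterator[Record]:
        return iter(self._rows)


class InMemorySink(Sink):
    """Stand-in for a real sink such as a data lake or an online store."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of records written


def run_job(
    source: Source,
    sink: Sink,
    transform: Callable[[Record], Record] = lambda r: r,
) -> int:
    """Move every record from source to sink, applying an optional transform.

    The runner only sees the two interfaces, so the same code path
    handles both ingestion and dispersal.
    """
    return sink.write(transform(r) for r in source.read())
```

In this sketch, "ingestion" would be `run_job(raw_source, lake_sink)` and "dispersal" would be `run_job(InMemorySource(lake_sink.rows), online_sink)`. Supporting a new system then means writing one new `Source` or `Sink` subclass rather than a new pipeline.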
