The evolution of metadata: LinkedIn's story

Imagine scaling metadata for an organization with 10,000 employees and more than 1 million data assets, an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn's metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars earned along the way.


Talk Title	The evolution of metadata: LinkedIn's story
Speakers	Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 24-26, 2019
URL	Talk Page
Slides	Talk Slides

LinkedIn's metadata evolution began with a series of fundamental questions: what metadata is, what data constructs it applies to, when it should be collected, when and how it should be stored, what you can do with it, and how you can scale it to a million data constructs, thousands of people, and hundreds of teams. The journey started with a small team trying to improve the searchability of Hadoop data. Over the years, this system has grown into the central data hub where all of LinkedIn's data assets (more than a million across online, streaming, and batch systems) have a home. The system is deployed at global scale and powers data productivity for all engineers and data enthusiasts, while serving as critical infrastructure for data privacy by default in LinkedIn's data systems.

Shirshanka Das and Mars Lan examine different strategies for modeling metadata, storing metadata, and scaling the acquisition and refinement of metadata for thousands of metadata authors and producing systems. They weigh the pros and cons of each strategy and the scenarios in which they think organizations should deploy it. The strategies they explore include generic types versus specific types, crawling versus publish/subscribe, a single source of truth versus multiple federated sources of truth, automated classification of data, lineage propagation, and more. They also outline the axes on which these strategies have been tested at scale: the sheer number of entities, the richness of metadata, the connectivity between entities, the velocity of evolution of the metadata model, and the efficiency of serving metadata for simple and complex queries.

You'll see the metadata system LinkedIn has innovated on over the years. It allows for rich extensible types, supports different types of data entities, provides efficient storage and retrieval of metadata in both site-serving and graph-analytic use cases, and scales well to support distributed development models. They'll outline the relationship of this metadata system to other well-known systems like the Hive metastore, the Kafka schema registry, Apache Atlas, and Cloudera Navigator.

While the storage abstractions and metadata models are key to a scalable system, the overall ecosystem is hard to understand without an intuitive interface and UX for this metadata. Shirshanka and Mars detail the design challenges faced in making metadata insightful for data producers and consumers, and the strategies that have worked.
