January 25, 2020

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

Comcast's streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.

Talk Title Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types
Speakers Barbara Eckman (Comcast)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018

Comcast’s streaming data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. At last year’s Strata New York, speakers from Comcast explained how the company extended Apache Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of new entities and new lineage links via asynchronous Kafka messaging.

Comcast recently integrated on-prem data sources, including Hadoop-based traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Barbara Eckman details how Comcast met that challenge, offering an overview of the federated architecture, in which Atlas provides SQL-like, free-text, and graph search across select metadata from a wide variety of on-prem and public cloud data. Lightweight, custom connectors and bridges identify metadata and lineage changes in the underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to the underlying sources may be used for special-purpose metadata mining.

Comcast provides end-to-end lineage for both batch and streaming processes, identifying, for example, the on-prem relational and Hive precursors of objects in the public cloud data lake. Barbara outlines how Comcast extends its data governance practices to include not only Avro but also relational and JSON schemas. A data maturity model represents the schema type, the richness of its documentation, and the level of operational support available for the data source. A heterogeneous schema registry still provides Avro schemas for SerDe but extends features such as schema evolution to other schema types. Comcast then captures semantic mappings of heterogeneous schemas to a set of common, company-wide data models.
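
The lineage question mentioned in the abstract (which on-prem relational and Hive tables are precursors of an object in the cloud data lake?) can be illustrated with a small sketch. This is not Comcast's implementation and does not call the Atlas API. It only mirrors the general idea that Atlas models lineage through process entities linking input datasets to output datasets. The Entity and Process classes, the upstream function, and all dataset and job names are hypothetical and chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A dataset tracked in the catalog (hypothetical model, not an Atlas type)."""
    name: str
    kind: str       # e.g. "rdbms_table", "hive_table", "kafka_topic", "s3_object"
    location: str   # "on_prem" or "cloud"


@dataclass(frozen=True)
class Process:
    """A batch or streaming job that reads inputs and writes outputs."""
    name: str
    inputs: tuple
    outputs: tuple


def upstream(target, processes):
    """Return every entity that feeds `target`, directly or transitively."""
    # Index processes by the entities they produce.
    producers = {}
    for proc in processes:
        for out in proc.outputs:
            producers.setdefault(out, []).append(proc)

    # Breadth-first walk backwards along the lineage links.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for proc in producers.get(current, []):
            for src in proc.inputs:
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
    return seen


# Hypothetical example: a relational table is exported to Hive on-prem,
# streamed to a topic in the cloud, and landed in the data lake.
orders_db = Entity("orders", "rdbms_table", "on_prem")
orders_hive = Entity("orders_daily", "hive_table", "on_prem")
orders_topic = Entity("orders_events", "kafka_topic", "cloud")
orders_lake = Entity("orders_parquet", "s3_object", "cloud")

processes = [
    Process("nightly_export", (orders_db,), (orders_hive,)),
    Process("hive_to_stream", (orders_hive,), (orders_topic,)),
    Process("stream_to_lake", (orders_topic,), (orders_lake,)),
]

# Keep only the on-prem precursors of the data lake object.
on_prem_precursors = sorted(
    e.name for e in upstream(orders_lake, processes) if e.location == "on_prem"
)
print(on_prem_precursors)
```

In a real deployment the catalog would answer this query from its own graph store. The sketch only shows why recording each batch or streaming job as a process with explicit inputs and outputs is enough to recover end-to-end lineage across on-prem and cloud systems.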
