January 10, 2020

Herding elephants: Seamless data access in a multicluster clouds

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.

Talk Title Herding elephants: Seamless data access in a multicluster clouds
Speakers Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date April 30-May 2, 2019
URL Talk Page
Slides Talk Slides

Expedia Group is in the process of migrating its Hadoop infrastructure from a single organization-wide on-premises cluster to a large number of smaller in-cloud clusters. It has also moved from a centralized operating model, where one team was responsible for the Hadoop platform, to a distributed approach, where infrastructure is owned and operated by the group’s different brands: Hotels.com, Expedia.com, HomeAway.com, etc. This segmentation of data platforms has given the company greater agility, more elastic resources, and lower costs. However, it has also fragmented the architecture, creating cloud-based data silos that impede the ability to explore, discover, and share data across the organization.

Pradeep Bhadani and Elliot West describe these technical challenges and the solutions that were developed to provide users with a unified virtual view of the company’s many data lakes. They then offer an overview of Apiary, an open source project that provides a standardized pattern for deploying and operating data lakes that support federated dataset sharing across accounts, regions, and clouds; a “bring your own tool” culture, supporting a broad range of data processing platforms in the Hadoop ecosystem; replication of datasets for disaster recovery; and data access security.