January 21, 2020

Introducing Iceberg: Tables designed for object stores

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.
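
To make the idea of atomic file changes concrete, here is a minimal Python sketch. It is not Iceberg's API or implementation: `ToyTable`, `Snapshot`, and `commit` are hypothetical names invented for illustration, and an in-process lock stands in for whatever atomic operation a real catalog would provide. The sketch assumes a table is an explicit, immutable list of files, so a reader holding one snapshot never sees a half-applied change.

```python
from dataclasses import dataclass
from threading import Lock
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Snapshot:
    """Immutable set of data files that make up one version of the table."""
    snapshot_id: int
    files: FrozenSet[str]


class ToyTable:
    """Hypothetical table that tracks files explicitly instead of listing a directory."""

    def __init__(self) -> None:
        # Assumption: the lock models an atomic pointer swap in a metadata store.
        self._lock = Lock()
        self._current = Snapshot(0, frozenset())

    def current(self) -> Snapshot:
        return self._current

    def commit(self, base: Snapshot,
               add: Iterable[str] = (), remove: Iterable[str] = ()) -> Snapshot:
        """Publish a new snapshot in one step, or fail if another writer got there first."""
        new_files = (base.files - frozenset(remove)) | frozenset(add)
        with self._lock:
            if self._current.snapshot_id != base.snapshot_id:
                raise RuntimeError("concurrent commit detected; retry from the latest snapshot")
            self._current = Snapshot(base.snapshot_id + 1, new_files)
            return self._current


table = ToyTable()
s1 = table.commit(table.current(), add=["a.parquet", "b.orc"])
reader_view = table.current()  # a reader pins this version
s2 = table.commit(s1, add=["c.avro"], remove=["a.parquet"])  # replace a file in one commit

print(sorted(reader_view.files))  # ['a.parquet', 'b.orc']
print(sorted(s2.files))           # ['b.orc', 'c.avro']
```

The file names mix Parquet, ORC, and Avro only to show that this kind of bookkeeping does not depend on the data file format.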


Talk Title	Introducing Iceberg: Tables designed for object stores
Speakers	Owen O’Malley (Cloudera), Ryan Blue (Netflix)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3, which, like other object stores, doesn’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, an Apache-licensed open source project that defines a new table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. It specifies a portable table format and standardizes many important features.

Architecting data platforms for cybersecurity

December 12, 2019

Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security.

Fast analytics on fast data: Kudu as a storage layer for banking applications

December 9, 2019

Olaf Hein explains how a large German bank relies on a Kudu-based data platform to speed up business processes. Olaf highlights key data access patterns and the system architecture and shares best practices and lessons learned using Kudu in development and operations.

Smart agriculture: Blending IoT sensor data with visual analytics

November 21, 2019

Mike Prorock offers an overview of mesur.io, a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. Mesur.io enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.

What's new in Hadoop 3.0

November 19, 2019

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Modernizing operational architecture with big data: Creating and implementing a modern data strategy

January 19, 2020

The use of data throughout Cerner had taxed the company's legacy operational data store, data warehouse, and enterprise reporting pipeline to the point where it would no longer scale to meet needs. Jennifer Lim explains how Cerner modernized its corporate data platform with the use of a hybrid cloud architecture.

Panel: Open Networking Driving Data Center and Cloud Innovation

January 14, 2020

Open networking encourages innovation through collaboration. Open specifications and the reference code also allow vendors to leverage external networking competencies to formulate and commercialize n …