February 11, 2020


Parquet modular encryption: Confidentiality and integrity of sensitive column data


The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism's specification has recently been approved by the Apache Parquet project management committee (PMC). Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.

Talk Title: Parquet modular encryption: Confidentiality and integrity of sensitive column data
Speakers: Gidon Gershinsky (IBM)
Conference: Strata Data Conference
Conf Tag: Make Data Work
Location: New York, New York
Date: September 24-26, 2019
URL: Talk Page
Slides: Talk Slides
Video:

Apache Parquet is a popular columnar format, leveraged in many analytic frameworks for efficient storage and processing of big data. In many real-life use cases, parts of the data are highly sensitive and must be protected. The Parquet community is working on a column encryption mechanism that ensures the confidentiality and integrity of sensitive Parquet data and enables access control for table columns. The modular design of the mechanism preserves the existing projection, predicate pushdown, encoding, and compression capabilities of Parquet, which are required for analytic workload acceleration. Many leading companies in the big data and cloud domains are taking part in the community work on this technology, and the Parquet modular encryption specification has recently been completed and formally approved by the Apache Parquet project management committee (PMC).

Gidon Gershinsky explains the basics of the columnar encryption technology, its usage model, and an initial integration with analytic frameworks (e.g., Apache Spark). He details two use cases: one related to connected cars (location, speed, and other sensitive data) and another to healthcare data processing (medical sensor records managed by the increasingly popular HL7 Fast Healthcare Interoperability Resources, or FHIR, standard). He also explores the performance implications of applying modular encryption in analytic workloads.
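To give a feel for the usage model described above, the sketch below shows how per-column encryption might be driven from Spark using the properties-based key tools that accompany the parquet-mr implementation. This is not material from the talk itself: the mock in-memory KMS, the sample master keys, the column names (a hypothetical connected-car dataset with sensitive "location" and "speed" columns), and the output path are all illustrative assumptions; a real deployment would plug in an actual KMS client class.

```scala
import org.apache.spark.sql.SparkSession

object ParquetEncryptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-modular-encryption-sketch")
      .master("local[*]")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Activate the properties-driven crypto factory shipped with parquet-mr.
    hadoopConf.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

    // Illustrative only: an in-memory mock KMS with explicit master keys.
    // In production, point the client class at a real key management service.
    hadoopConf.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    hadoopConf.set("parquet.encryption.key.list",
      "keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==")

    import spark.implicits._
    // Hypothetical connected-car records; "location" and "speed" are treated as sensitive.
    val trips = Seq(("car-1", "52.52,13.40", 87), ("car-2", "48.85,2.35", 54))
      .toDF("vehicle_id", "location", "speed")

    trips.write
      // Encrypt the sensitive columns with keyA; other columns remain plaintext.
      .option("parquet.encryption.column.keys", "keyA:location,speed")
      // Encrypt the file footer with keyB, protecting file integrity.
      .option("parquet.encryption.footer.key", "keyB")
      .parquet("/tmp/trips.parquet.encrypted")

    // Readers holding the keys decrypt transparently; projection and predicate
    // pushdown still work because each column is encrypted as a separate module.
    spark.read.parquet("/tmp/trips.parquet.encrypted")
      .select("vehicle_id", "speed")
      .filter($"speed" > 60)
      .show()

    spark.stop()
  }
}
```

The key point of the design is visible in the write options: encryption is applied per column and per module (pages, footer), so columns not selected by a query are never decrypted, and Parquet's encoding, compression, and pushdown optimizations are preserved.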
