Parquet modular encryption: Confidentiality and integrity of sensitive column data
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Several companies are contributing, and the Apache Parquet project management committee (PMC) recently approved the specification. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and two use cases.
Talk Title | Parquet modular encryption: Confidentiality and integrity of sensitive column data |
Speakers | Gidon Gershinsky (IBM) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 24-26, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Apache Parquet is a popular columnar format, used in many analytic frameworks for efficient storage and processing of big data. In many real-life use cases, parts of the data are highly sensitive and must be protected. The Parquet community is working on a column encryption mechanism that protects the confidentiality and integrity of sensitive Parquet data and enables access control for table columns. The modular design of the mechanism preserves the existing projection, predicate pushdown, encoding, and compression capabilities of Parquet, which are required for analytic workload acceleration.

Many leading companies in the big data and cloud domains are taking part in the community work on this technology. The specification of Parquet modular encryption was recently completed and formally approved by the Apache Parquet project management committee (PMC).

Gidon Gershinsky explains the basics of the columnar encryption technology, its usage model, and an initial integration with analytic frameworks (e.g., Apache Spark). He details two use cases: one related to connected cars (location, speed, and other sensitive data) and another to healthcare data processing (medical sensor records managed under the increasingly popular HL7 Fast Healthcare Interoperability Resources (FHIR) standard). He also explores the performance implications of applying modular encryption in analytic workloads.
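To make the modular idea concrete, here is a minimal, standard-library-only Python sketch. It is not the Parquet file format or any Parquet API. The function names (`write_table`, `read_columns`, and so on), the in-memory "table" layout, and the car columns are hypothetical illustrations. Parquet modular encryption is specified around AES-GCM ciphers; the sketch substitutes a toy HMAC-based stream cipher with an authentication tag only so that it runs without third-party packages, and it must not be used to protect real data. It shows that each sensitive column is encrypted as its own module under its own key, so a reader can still project only the columns it needs, and that tampering with an encrypted column is detected.

```python
import hashlib
import hmac
import json
import os


def _subkey(key, label):
    # Derive separate encryption and MAC keys from one column key.
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(key, nonce, length):
    # Toy keystream (HMAC-SHA256 in counter mode). Assumption: a stand-in
    # for the AES-based ciphers that Parquet modular encryption uses.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(4, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def _tag(key, name, nonce, ciphertext):
    # The tag covers the column name, so a module cannot be moved
    # to a different column without detection.
    message = name.encode("utf-8") + nonce + ciphertext
    return hmac.new(_subkey(key, b"mac"), message, hashlib.sha256).digest()


def encrypt_column(key, name, values):
    plaintext = json.dumps(values).encode("utf-8")
    nonce = os.urandom(12)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    ciphertext = _xor(plaintext, stream)
    return {"nonce": nonce, "ciphertext": ciphertext,
            "tag": _tag(key, name, nonce, ciphertext)}


def decrypt_column(key, name, module):
    expected = _tag(key, name, module["nonce"], module["ciphertext"])
    if not hmac.compare_digest(expected, module["tag"]):
        raise ValueError("integrity check failed for column %r" % name)
    stream = _keystream(_subkey(key, b"enc"), module["nonce"],
                        len(module["ciphertext"]))
    return json.loads(_xor(module["ciphertext"], stream).decode("utf-8"))


def write_table(columns, column_keys):
    # Columns listed in column_keys are encrypted; the rest stay plain.
    table = {}
    for name, values in columns.items():
        if name in column_keys:
            table[name] = encrypt_column(column_keys[name], name, values)
        else:
            table[name] = {"plain": list(values)}
    return table


def read_columns(table, wanted, reader_keys):
    # Projection: only the requested columns are touched or decrypted.
    result = {}
    for name in wanted:
        module = table[name]
        if "plain" in module:
            result[name] = module["plain"]
        elif name in reader_keys:
            result[name] = decrypt_column(reader_keys[name], name, module)
        else:
            raise PermissionError("no key for encrypted column %r" % name)
    return result


if __name__ == "__main__":
    keys = {"location": os.urandom(32), "speed": os.urandom(32)}
    table = write_table(
        {"car_id": ["c1", "c2"],
         "location": ["40.7,-74.0", "40.8,-73.9"],
         "speed": ["52", "61"]},
        column_keys=keys,
    )
    # A reader that holds only the speed key can still project two columns.
    print(read_columns(table, ["car_id", "speed"], {"speed": keys["speed"]}))
```

In this sketch, asking for `location` without its key raises `PermissionError`, and altering any byte of an encrypted column makes `decrypt_column` raise `ValueError`. That is the access-control and integrity behavior the abstract describes, shown here at toy scale.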