SETL: An efficient and predictable way to do Spark ETL
Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques, involving an innovative use of Spark processing and exploiting features of Hadoop file formats, that not only make these jobs much more efficient but also let them work well with fixed amounts of resources.
Talk Title | SETL: An efficient and predictable way to do Spark ETL |
Speakers | Thiruvalluvan M G (Aqfer) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
A common and important workload for Hadoop is a set of Spark jobs that process log data from different sources such as web servers, beacons, internet-of-things devices, or other devices. These jobs produce clean, granular data in a fixed output schema for further processing and flag the input records that could not be processed, whether because they are malformed or for other reasons. They also produce simple process and business aggregates, such as the total number of records processed, overall or broken down by country. These jobs run periodically, typically once to several times an hour.

The challenge with these jobs is that they require a significant amount of resources, and the resource requirements vary with input size, which in turn varies with the time of day or day of the week. As a result, you are forced to provision for the peak input size. Thiruvalluvan M G shares a set of techniques, involving an innovative use of Spark processing and exploiting features of Hadoop file formats, that not only make these jobs much more efficient but also let them work well with fixed amounts of resources.

The first technique generates the clean, granular data as a side effect of Spark transformations rather than by saving Spark DataFrames or RDDs in the traditional way, skipping repeated scans of the granular data and avoiding its serialization and deserialization entirely. The second employs efficient file merge operations to avoid output fragmentation, leading to predictable resource requirements for the jobs. (All the standard file formats, including CSV, JSON, Avro, ORC, and Parquet, lend themselves to efficient file merge operations.) The third uses low-level APIs to write the output file formats, bypassing intermediate layers that are designed for generality rather than efficiency. Thiru demonstrates how employing these techniques can achieve a more than 83% reduction in container memory and a 60% reduction in CPU utilization.
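To illustrate the first technique, the sketch below writes each partition's cleaned records and rejects directly to per-partition files inside a mapPartitions call and returns only small per-country counts to Spark, so the granular records are never shuffled, cached, or re-serialized. This is a minimal sketch under assumed details, not the speaker's code: the Cleaned record type, the parseLine logic, and the HDFS paths are hypothetical placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Hypothetical cleaned-record type; a real job would parse actual log fields.
case class Cleaned(ts: Long, country: String, payload: String)

object SideEffectEtlSketch {
  // Placeholder parser: Right for a cleaned record, Left for a reject.
  def parseLine(line: String): Either[String, Cleaned] =
    if (line.nonEmpty) Right(Cleaned(0L, "US", line)) else Left(line)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("setl-side-effect-sketch").getOrCreate()
    val raw = spark.sparkContext.textFile("hdfs:///logs/input/")  // illustrative input path

    // Each partition writes its own granular output and reject files as a side effect
    // and returns only lightweight (country, count) aggregates to Spark.
    // Note: a production job would also have to handle task retries, e.g. by
    // writing to temporary paths and committing them at the end.
    val aggregates = raw.mapPartitions { lines =>
      val part = TaskContext.getPartitionId()
      val fs = FileSystem.get(new Configuration())
      val clean = fs.create(new Path(f"hdfs:///logs/clean/part-$part%05d.csv"))      // illustrative layout
      val rejects = fs.create(new Path(f"hdfs:///logs/rejects/part-$part%05d.txt"))
      val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
      try {
        lines.foreach { line =>
          parseLine(line) match {
            case Right(rec) =>
              clean.write(s"${rec.ts},${rec.country},${rec.payload}\n".getBytes("UTF-8"))
              counts(rec.country) += 1
            case Left(bad) =>
              rejects.write((bad + "\n").getBytes("UTF-8"))
          }
        }
      } finally {
        clean.close()
        rejects.close()
      }
      counts.iterator
    }.reduceByKey(_ + _)

    aggregates.collect().foreach { case (country, n) => println(s"$country: $n") }
    spark.stop()
  }
}
```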
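The second technique, merging many small output files into larger ones, is cheap in a format like Avro because whole data blocks can be copied between container files without decoding individual records. The sketch below shows one possible way to do this with Avro's public DataFileWriter.appendAllFrom API; the assumption that all inputs share a single schema, and the use of local File inputs, are mine rather than the speaker's.

```scala
import java.io.File
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}

object AvroMergeSketch {
  // Merge small Avro files into one; assumes every input shares the same schema.
  def mergeAvro(inputs: Seq[File], output: File): Unit = {
    require(inputs.nonEmpty, "need at least one input file")
    val first = new DataFileReader[GenericRecord](inputs.head, new GenericDatumReader[GenericRecord]())
    val schema = first.getSchema
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    // If the inputs are compressed, set a matching codec on the writer
    // (e.g. writer.setCodec(CodecFactory.deflateCodec(6))) so blocks can be copied verbatim.
    writer.create(schema, output)
    try {
      // appendAllFrom copies whole data blocks; with recompress = false and matching
      // codecs the bytes are copied as-is, and records are never deserialized.
      writer.appendAllFrom(first, false)
      first.close()
      inputs.tail.foreach { in =>
        val reader = new DataFileReader[GenericRecord](in, new GenericDatumReader[GenericRecord]())
        writer.appendAllFrom(reader, false)
        reader.close()
      }
    } finally {
      writer.close()
    }
  }
}
```

CSV and JSON outputs can be merged by plain concatenation, which is part of why all of the formats listed above lend themselves to this fragmentation-avoidance step.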
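For the third technique, one way to bypass the general-purpose DataFrame/Row writing path is to build records with the output format's own low-level API and stream them straight to HDFS. The sketch below does this with Avro's GenericRecord and DataFileWriter; the schema, the output path, and the field values are illustrative only and are not taken from the talk.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DirectAvroWriteSketch {
  def main(args: Array[String]): Unit = {
    // Fixed output schema, built once with Avro's SchemaBuilder.
    val schema: Schema = SchemaBuilder.record("CleanLog").fields()
      .requiredLong("ts")
      .requiredString("country")
      .requiredString("payload")
      .endRecord()

    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("hdfs:///logs/clean/part-00000.avro"))  // illustrative path

    // Write records straight through Avro's container-file API,
    // skipping DataFrame/Row conversion layers entirely.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, out)

    // One GenericRecord per cleaned input record; values would come from the parser.
    val rec = new GenericData.Record(schema)
    rec.put("ts", 1506384000000L)
    rec.put("country", "US")
    rec.put("payload", "example")
    writer.append(rec)

    writer.close()  // flushes and closes the underlying HDFS stream
  }
}
```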