December 30, 2019

408 words 2 mins read

SETL: An efficient and predictable way to do Spark ETL

Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques, involving an innovative use of Spark processing and exploiting features of Hadoop file formats, that not only make these jobs much more efficient but also work well with fixed amounts of resources.

Talk Title SETL: An efficient and predictable way to do Spark ETL
Speakers Thiruvalluvan M G (Aqfer)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 26-28, 2017
URL Talk Page
Slides Talk Slides

A common and important workload for Hadoop is a set of Spark jobs that process log data from sources such as web servers, beacons, and internet of things (IoT) devices. These jobs produce clean, granular data in a fixed output schema for further processing and flag the input records that could not be processed because they are malformed or invalid for other reasons. They also produce simple process and business aggregates, such as the total number of records processed or a country-wise breakdown. These jobs run periodically, typically once to several times an hour. The challenge is that they require a significant amount of resources, and the resource requirements vary with input size, which in turn varies with the time of day and the day of the week. As a result, you are forced to provision for the peak input size.

Thiruvalluvan M G shares a set of techniques, built on an innovative use of Spark processing and on features of Hadoop file formats, that not only make these jobs much more efficient but also let them run within a fixed amount of resources. The first technique generates the clean, granular data as a side effect of Spark transformations rather than by saving Spark DataFrames or RDDs in the traditional way, avoiding multiple scans of the granular data and eliminating its serialization and deserialization entirely. The second employs efficient file merge operations to avoid output fragmentation, leading to predictable resource requirements for the jobs; all of the standard file formats, including CSV, JSON, Avro, ORC, and Parquet, lend themselves to efficient merging. The third writes the output file formats through low-level APIs, bypassing intermediate layers that are designed for generality rather than efficiency.

Thiru demonstrates how employing these techniques can achieve more than an 83% reduction in container memory and a 60% reduction in CPU utilization.
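To make the first technique concrete, here is a minimal sketch in plain Apache Spark (Scala) of producing granular output as a side effect of a transformation, so that only lightweight aggregates flow back through Spark. This is not the speaker's actual code: the input path, output layout, record fields, and parser are hypothetical placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object SideEffectEtlSketch {

  // Clean record in the fixed output schema (fields are hypothetical).
  final case class CleanRecord(timestamp: String, country: String, url: String)

  // Toy parser for "timestamp,country,url" lines; a real log format would differ.
  def parse(line: String): Option[CleanRecord] =
    line.split(",", -1) match {
      case Array(ts, country, url) if ts.nonEmpty => Some(CleanRecord(ts, country, url))
      case _                                      => None
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("setl-side-effect-sketch").getOrCreate()
    val rawLogs = spark.sparkContext.textFile("hdfs:///logs/raw/current/*") // hypothetical input

    // One pass over the input: each partition parses its lines, writes the clean
    // granular records straight to its own output file, and hands back only tiny
    // (goodCount, badCount) tuples. The granular records are never serialized into
    // another RDD/DataFrame or scanned a second time.
    val (good, bad) = rawLogs.mapPartitionsWithIndex { (partitionId, lines) =>
      val outPath = new Path(f"hdfs:///logs/clean/current/part-$partitionId%05d.csv") // hypothetical
      val fs = outPath.getFileSystem(new Configuration())
      // A production job would write to a temporary location and commit on success,
      // so that task retries and speculative execution stay safe.
      val out = fs.create(outPath, true)

      var goodCount = 0L
      var badCount = 0L
      try {
        lines.foreach { line =>
          parse(line) match {
            case Some(rec) =>
              out.write(s"${rec.timestamp},${rec.country},${rec.url}\n".getBytes("UTF-8"))
              goodCount += 1
            case None =>
              badCount += 1 // malformed record: counted here; could also go to a reject file
          }
        }
      } finally out.close()

      Iterator((goodCount, badCount))
    }.reduce { case ((g1, b1), (g2, b2)) => (g1 + g2, b1 + b2) }

    println(s"clean records: $good, malformed: $bad")
    spark.stop()
  }
}
```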
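The second technique relies on file formats that can be merged block by block. As an illustration, the sketch below merges small Avro part files with Avro's block-level append, which copies compressed data blocks without decoding individual records; the same idea applies to the other formats the talk mentions. Again, this is an assumption-laden illustration rather than the talk's implementation, and the files and schema handling are hypothetical.

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}

object AvroMergeSketch {

  /** Merge many small Avro part files (all with the same schema) into one file. */
  def mergeAvroFiles(parts: Seq[File], merged: File, schema: Schema): Unit = {
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, merged)
    try {
      parts.foreach { part =>
        val reader = new DataFileReader[GenericRecord](part, new GenericDatumReader[GenericRecord]())
        try {
          // appendAllFrom copies whole Avro data blocks from the source file; with
          // recompress = false and matching codecs the blocks are not even
          // decompressed, so the merge is essentially sequential I/O.
          writer.appendAllFrom(reader, false)
        } finally reader.close()
      }
    } finally writer.close()
  }
}
```

A driver-side step or a lightweight follow-up job can use a merge like this to roll many small per-task files into a few large ones, which is what keeps output fragmentation, and therefore downstream resource needs, predictable.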
