December 31, 2019

High-performance enterprise data processing with Spark

Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.


Talk Title	High-performance enterprise data processing with Spark
Speakers	Vickye Jain (ZS Associates), Raghav Sharma (ZS Associates)
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	Singapore
Date	December 6-8, 2016

Enterprises are increasingly comfortable moving traditional workloads to Spark. Despite its popularity, however, Spark remains an esoteric technology within enterprises, and many organizations for which technology is not a core competence are wary of building internally managed applications on Spark, in part owing to the lack of a steady talent pool and a fear of budget overruns. As a result, enterprises with matrix organizations, complex funding channels, and business demands constantly struggle to support advanced technology platforms. Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. Vickye and Raghav had to negotiate a number of conflicting objectives along the way. They also offer an overview of the architecture itself, which consists of several elastic clusters, external orchestrators providing full visibility into jobs, a combination of job servers and traditional Spark applications, and deep integration of technical experts with domain experts for rapid development.

performance spark cluster

Authorization in the cloud: Enforcing access control across compute engines

December 16, 2019

Li Li and Hao Hao elaborate on the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and encourage attendees to adopt Apache Sentry and RecordService to protect sensitive data in the multitenant cloud across the Hadoop ecosystem.

Tuning Spark machine-learning workloads

December 8, 2019

Spark's efficiency and speed can help reduce the TCO of existing clusters, because its performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of tuning an alternating least squares-based matrix factorization workload, improving runtimes by a factor of 2.22.

Apache Eagle: Secure Hadoop in real time

November 21, 2019

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.

R you ready for the cloud? Using R for operationalizing an enterprise-grade data science solution on Azure

December 30, 2019

R has long been criticized for its limitations on scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources of manifold computing capability for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an end-to-end, enterprise-grade pipeline for big data analytics, all within R.

An architecture for merging fast data and enterprise applications: The SMACK stack

December 28, 2019

Big data architectures and enterprise/microservice architectures are slowly converging. Big data is transitioning to "fast data," emphasizing streaming over batch processing, while data processing is growing ubiquitous. Dean Wampler explores the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) and explains how it addresses the needs of both fast data and the enterprise.

Beyond Hadoop at Yahoo: Interactive analytics with Druid

December 16, 2019

Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.