December 13, 2019

Scaling HBase for big data (sponsored by Salesforce)

Although HBase is considered a highly scalable distributed solution, the schema design of its tables or the way a client uses the cluster can affect how well it scales. Ranjeeth Karthik Selvan Kathiresan and Gurpreet Multani outline the most important things to consider when scaling your HBase cluster to accommodate high-volume and high-velocity data.


Talk Title	Scaling HBase for big data (sponsored by Salesforce)
Speakers	Ranjeeth Karthik Selvan Kathiresan (Salesforce), Gurpreet Multani (Salesforce.com)
Conference	O’Reilly Velocity Conference
Conf Tag	Build Resilient Distributed Systems
Location	San Jose, California
Date	June 20-22, 2017

Apache HBase, a NoSQL data store, is primarily used to hold large amounts of data that can later be accessed for both batch and real-time processing. Although HBase is considered a highly scalable distributed solution, the schema design of its tables or the way a client uses the cluster can affect how well it scales. Ranjeeth Karthik Selvan Kathiresan and Gurpreet Multani outline the most important things to consider when scaling your HBase cluster to accommodate high-volume and high-velocity data. They also explain how Salesforce resolved its HBase scalability issues and share Salesforce’s journey in turning a low-performing HBase cluster into one that is highly scalable and performs better. This session is sponsored by Salesforce.

Big data computations: Comparing Apache HAWQ, Druid, and GPU databases

December 5, 2019

The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, one of the open source GPU databases. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes.

Building deep learning-powered big data

December 4, 2019

Radhika Rangarajan explains how Intel works with its users to build deep learning-powered big data analytics applications (object detection, image recognition, NLP, etc.) using BigDL.

Hadoop and object stores: Can we do it better?

December 3, 2019

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.

Building a Secure, Multi-Protocol and Multi-Tenant Cluster for Internet-Facing Services [A]

December 10, 2019

Exposing internal HTTP-based services to the Internet is a well supported and documented feature of Kubernetes. What's less well understood is how to do it for thousands of services running on behalf …

Accelerate analytics and AI innovations with Intel (sponsored by Intel)

December 5, 2019

Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.

Architecting a next-generation data platform

December 5, 2019

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.