Why is my Hadoop job slow?
Hadoop is used to run large-scale jobs over hundreds of machines. Given that complexity, it's no wonder that slower-than-expected jobs remain a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe approaches and tools that can help solve this difficult problem.
| Talk Title | Why is my Hadoop job slow? |
| --- | --- |
| Speakers | Bikas Saha (Hortonworks Inc) |
| Conference | Strata + Hadoop World |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | June 1-3, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Hadoop is used to run large-scale jobs that are subdivided into many tasks executed across multiple machines. There are complex dependencies between these tasks, and at scale there can be thousands of tasks running over thousands of machines, which makes it difficult to reason about their performance. Add pipelines that chain jobs into a logical business workflow as another layer of complexity, and it's no wonder that slower-than-expected jobs remain a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe some methodical approaches and present new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
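One concrete way to start making sense of per-task performance, in the spirit the abstract describes, is to pull task-level timings from the MapReduce JobHistory Server REST API and look for stragglers. Below is a minimal sketch (not from the talk); the history-server address and job ID are hypothetical placeholders to adjust for your cluster.

```python
#!/usr/bin/env python3
"""Sketch: rank the slowest tasks of a finished MapReduce job using the
JobHistory Server REST API (GET /ws/v1/history/mapreduce/jobs/{jobid}/tasks)."""
import json
import urllib.request

HISTORY_SERVER = "http://historyserver.example.com:19888"  # assumed address; 19888 is the default port
JOB_ID = "job_1464777600000_0042"                          # hypothetical job ID

url = f"{HISTORY_SERVER}/ws/v1/history/mapreduce/jobs/{JOB_ID}/tasks"
with urllib.request.urlopen(url) as resp:
    tasks = json.load(resp)["tasks"]["task"]

# Sort by elapsed time (milliseconds) so stragglers and data skew surface first.
tasks.sort(key=lambda t: t["elapsedTime"], reverse=True)
for t in tasks[:10]:
    print(f'{t["id"]}  type={t["type"]}  '
          f'elapsed={t["elapsedTime"] / 1000:.1f}s  state={t["state"]}')
```

If the slowest tasks take far longer than the median, the problem is usually skewed input splits or a hot node rather than the job as a whole, which narrows the search considerably.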