December 11, 2019

Lessons learned running Hadoop and Spark in Docker

Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.


Talk Title	Lessons learned running Hadoop and Spark in Docker
Speakers	Thomas Phelan
Conference	Strata + Hadoop World
Conf Tag	Make Data Work
Location	New York, New York
Date	September 27-29, 2016

Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance.

BlueData is “all in” on Docker containers, with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.

BlueData’s Thomas Phelan demonstrates how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Because very large volumes of data are involved, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker. He also shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.

Rethinking security from the ground up with a microservices mindset

November 6, 2019

Recent high-profile data breaches have made it clear that traditional security based on n-tier application partitioning is broken. As we move into the container era, there is a huge opportunity to revolutionize security by rendering developer intent directly into the network fabric. Andrew Randall presents an open source approach to this problem, leveraging proven IP networking and Linux concepts.

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Apache Eagle: Secure Hadoop in real time

November 21, 2019

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.

Petascale genomics

November 17, 2019

The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools developed to handle large-scale Internet data (like Hadoop) help scientists effectively manage this new scale of data and also enable addressing a host of questions that were previously out of reach.

Stream analytics in the enterprise: A look at Intel's internal IoT implementation

November 17, 2019

Moty Fania shares Intel's IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform's architecture.

Deployment and orchestration at scale with Docker Swarm

November 13, 2019

Jérôme Petazzoni and AJ Bowen demonstrate building an app from development to production with Docker. Jérôme and AJ run a sample app on a single node with Compose and add scaling and load balancing. They then provision a Swarm cluster with Docker Machine and implement multihost communication with overlay networking. The result will be a highly available, scalable deployment for the application.