January 18, 2020

Running multidisciplinary big data workloads in the cloud

Attend this tutorial to learn how to run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for such pipelines. Along the way, you'll see how to share metadata across workloads in a big data PaaS.


Talk Title	Running multidisciplinary big data workloads in the cloud
Speakers	Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long-running. One challenge is keeping the data context consistent across them. Sudhanshu Arora, Stefan Salandy, Suraj Acharya, Brandon Freeman, Jason Wang, and Shravan Pabba demonstrate how to manage the shared data experience so that it stays consistent across all workloads. You’ll learn how to run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for such pipelines. Along the way, you’ll see how to share metadata across workloads in a big data PaaS. You’ll use the Cloudera Altus PaaS offering, powered by Cloudera Altus SDX, to run various big data workloads.

data engineering data analytics data science analytics database big data cloud paas pipeline

Architecting an edge-to-cloud data pipeline to unify multiple data sources and processing engines (sponsored by NetApp)

November 29, 2019

Santosh Rao explores the architecture of a data pipeline that runs from edge to core to cloud across various data sources and processing engines, and explains how to build a solution architecture that lets businesses differentiate themselves competitively by unifying data insights in compelling yet efficient ways.

Panel: Open Networking Driving Data Center and Cloud Innovation

January 14, 2020

Open networking encourages innovation through collaboration. Open specifications and the reference code also allow vendors to leverage external networking competencies to formulate and commercialize n …

Building an Enterprise/Cloud Analytics Platform with Jupyter Enterprise Gateway

January 9, 2020

Data science and analytics departments are now commonplace in enterprises determined to maximize their operations. While Jupyter Notebooks have significantly decreased the cost of admission into this space, enterprises are finding that data science at scale is difficult within the current framework. Jupyter Enterprise Gateway is designed to address these scalability issues for the enterprise.

Pangeo: Big data climate science in the cloud

January 6, 2020

Climate science is being flooded with petabytes of data, overwhelming traditional modes of data analysis. The Pangeo project is building a platform to take big data climate science into the cloud using SciPy and large-scale interactive computing tools. Join Ryan Abernathey and Yuvi Panda to find out what the Pangeo team is building and why and learn how to use it.

StreamDM: Advanced data science with Spark Streaming

December 5, 2019

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, an open source real-time analytics library built on top of Spark Streaming and developed at Huawei's Noah's Ark Lab and Télécom ParisTech.

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense

December 5, 2019

Lee Blum offers an overview of Verint's large-scale cyber-defense system, built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records. The talk covers the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system's overall results.