December 27, 2019

Cloud architectures for data science

Data is available from an incredible number of sources in an endless number of formats. Data science deals with the extraction of valuable insights from this jumble in the form of attractive visualizations. Walking you through several examples using practical tools and tricks, Margriet Groenendijk presents a typical workflow that offers a basic introduction to data science.


Talk Title	Cloud architectures for data science
Speakers	Margriet Groenendijk
Conference	O’Reilly Software Architecture Conference
Conf Tag	Engineering the Future of Software
Location	San Francisco, California
Date	November 14-16, 2016
URL	Talk Page
Slides	Talk Slides
Video

Data science is currently a hot topic, but what is it? Definitions and opinions vary. Data science covers the complete workflow, from defining a question, finding the most suitable data source, and identifying the right tools to presenting the best possible answer in a clear, engaging manner.

Using weather data, geographical data, and UN country statistical data (all open datasets that are publicly available for download), Margriet Groenendijk walks you through an example of a typical workflow: defining the question, finding the data, exploring the data and finding the best tools for the analysis, cleaning and storing the data, and visualizing and summarizing the cleaned data. This work is often done iteratively, with each iteration informed by a growing understanding of the data through munging and crunching.

Margriet concludes by highlighting some of the latest tools and tricks available to data scientists. More data is now easily accessible through REST APIs, making it even simpler to store and analyze (big) data in the cloud using tools such as Spark, Python notebooks, or Scala notebooks. These developments make collaboration easier by allowing data scientists to share their data and analyses.
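The cleaning and summarizing steps of such a workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the talk: the inline CSV, the column names `country` and `temp_c`, and the helpers `clean_rows` and `mean_by_country` are all hypothetical, and the values are invented sample numbers rather than real weather data.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample standing in for a downloaded open dataset.
# The values are invented for illustration only.
RAW_CSV = """country,temp_c
NL,10.5
NL,
UK,9.0
UK,11.0
NL,n/a
"""


def clean_rows(text):
    """Parse CSV text and keep only rows with a numeric temperature."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            rows.append((row["country"], float(row["temp_c"])))
        except (TypeError, ValueError):
            # Skip missing or non-numeric readings (the "cleaning" step).
            continue
    return rows


def mean_by_country(rows):
    """Group cleaned readings by country and average them."""
    grouped = defaultdict(list)
    for country, temp in rows:
        grouped[country].append(temp)
    return {country: mean(temps) for country, temps in grouped.items()}


cleaned = clean_rows(RAW_CSV)
summary = mean_by_country(cleaned)
print(summary)
```

In a real analysis the text would more likely come from a REST API response or a downloaded file, and the summary would feed a visualization step. The same clean-then-aggregate shape carries over to larger tools such as Spark.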

Petascale genomics

November 17, 2019

The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools developed to handle large-scale Internet data (like Hadoop) help scientists effectively manage this new scale of data and also enable addressing a host of questions that were previously out of reach.

Sightseeing, venues, and friends: Predictive analytics with Spark ML and Cassandra

November 17, 2019

Which venues have similar visiting patterns? How can we detect when a user is on vacation? Can we predict which venues will be favorited by users by examining their friends' preferences? Natalino Busa explains how these predictive analytics tasks can be accomplished by using Spark SQL, Spark ML, and just a few lines of Scala code.

Building a scalable data science platform with R

October 27, 2019

Hadoop is famously scalable, as is cloud computing. R, the thriving and extensible open source data science software... not so much. Mario Inchiosa and Roni Burd outline how to seamlessly combine Hadoop, cloud computing, and R to create a scalable data science platform that lets you explore, transform, model, and score data at any scale from the comfort of your favorite R environment.

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intel's IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Python scalability: A convenient truth

October 21, 2019

Despite Python's popularity throughout the data engineering and data science workflow, the principles behind its performance and scaling behavior are less well understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.

TensorFlow: Large-scale analytics and distributed machine learning with TensorFlow, BigQuery, and Dataflow (Apache Beam)

October 20, 2019

Kazunori Sato and Amy Unruh explore how you can use TensorFlow to drive large-scale distributed machine learning against your analytic data sitting in Google BigQuery, with data preprocessing driven by Dataflow (now Apache Beam). Kazunori and Amy dive into practical examples of how these technologies can work together to enable a powerful workflow for distributed machine learning.