January 14, 2020

Apache Spark and machine learning on microservices

Hadoop-based data platforms that power ETL jobs and machine learning pipelines are great examples of monolithic architectures that could be redesigned with microservices. Stepan Pushkarev walks you through building and deploying data processing, reporting services, training, and prediction pipelines as decoupled microservices connected with the rest of the enterprise architecture.


Talk Title	Apache Spark and machine learning on microservices
Speakers	Stepan Pushkarev (hydrosphere.io)
Conference	O’Reilly Software Architecture Conference
Conf Tag	Engineering the Future of Software
Location	London, United Kingdom
Date	October 16-18, 2017

Data scientists usually find it challenging to create a clean REST API; likewise, web developers find it almost impossible to understand machine learning internals. Big data engineers tend to use clunky Hadoop distributions with dozens of tightly coupled tools and then carry that design forward, developing data processing scripts that communicate through unmanageable state and shared flags. Hydrosphere.io helps data scientists and big data engineers plug into the modern reactive and microservices architectures that traditional web and enterprise teams have already adopted.
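To make the idea of a prediction pipeline exposed as a decoupled microservice concrete, here is a minimal sketch in Python. It is not taken from the talk or from Hydrosphere.io: the `/predict` route, the `features` field, the fixed linear weights, and every function name are hypothetical, chosen only to illustrate keeping model logic separate from the HTTP layer that the rest of the architecture calls.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: fixed linear weights.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features):
    """Score one feature vector with the toy linear model."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_predict(body):
    """Map a raw JSON request body to an (HTTP status, response dict) pair."""
    try:
        payload = json.loads(body)
        score = predict(payload["features"])
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing field, or a badly shaped vector.
        return 400, {"error": str(exc)}
    return 200, {"score": score}


class PredictionHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer; all model logic stays in handle_predict."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="127.0.0.1", port=8080):
    """Run the service until interrupted (host and port are assumptions)."""
    HTTPServer((host, port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the service locally, and a client could then POST a body such as `{"features": [2.0, 4.0]}` to `/predict`. Because the scoring function sits behind a plain function boundary, the toy model could be swapped for one trained elsewhere (for example by a Spark job) without changing the HTTP contract.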

Paint the landscape and secure your data center with Apache Spot

November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture that applies unsupervised machine learning at cloud scale for anomaly detection.

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML

December 30, 2019

Brooke Wenig introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.

The columnar roadmap: Apache Parquet and Apache Arrow

December 29, 2019

Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.

Accelerate analytics and AI innovations with Intel (sponsored by Intel)

December 5, 2019

Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.

Modern Big Data Pipelines over Kubernetes [I]

December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions which don't necessarily benefit from the Hadoop deployme …

The state of Spark in the cloud

November 29, 2019

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as the baseline.