December 30, 2019


Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML




Talk Title	Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML
Speakers	Brooke Wenig (Databricks)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 26-28, 2017

Brooke Wenig introduces you to Apache Spark 2.0 core concepts with a focus on Spark’s machine learning library, using text mining on real-world data as the primary end-to-end use case. Join in to explore and wrangle data using Spark’s Dataset and DataFrame abstractions. You’ll use the Spark ML API to build an ML pipeline that transforms free text into useful features via Spark ML’s Transformer abstraction (e.g., one-hot encoding and term frequency counting). You’ll also learn about model selection, training/fitting, and validation/inspection, as well as parameter tuning with grid search. The class will consist of approximately 50% hands-on programming labs in Scala and 50% lecture and discussion.
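
The kind of pipeline described above might look roughly like the following in Scala. This is a minimal sketch, not material from the talk: the toy sentences, labels, column names, and the choice of logistic regression as the final stage are all assumptions made for illustration. It chains two Transformers (`Tokenizer` and `HashingTF`, the latter producing term frequency vectors) with an estimator, then tunes two parameters by grid search with cross-validation.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object TextPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TextPipelineSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data (assumption): free text plus a binary label.
    val training = Seq(
      ("spark makes cluster computing simple", 1.0),
      ("spark ml builds pipelines from stages", 1.0),
      ("dataframes in spark have named columns", 1.0),
      ("spark streams data at scale", 1.0),
      ("spark sql runs queries on dataframes", 1.0),
      ("spark datasets are typed", 1.0),
      ("the weather is nice today", 0.0),
      ("my cat sleeps all afternoon", 0.0),
      ("dinner was pasta with tomatoes", 0.0),
      ("the train was late again", 0.0),
      ("we walked along the river", 0.0),
      ("the garden needs more rain", 0.0)
    ).toDF("text", "label")

    // Transformer stages: split text into words, then hash words into
    // a fixed-size term frequency vector.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

    // Estimator stage (assumed for illustration): logistic regression.
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Grid search over feature dimensionality and regularization strength.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(256, 1024))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    // Cross-validation fits the whole pipeline once per fold and grid point.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2)

    val model = cv.fit(training)
    model.transform(training).select("text", "prediction").show(false)

    spark.stop()
  }
}
```

Because the tuning wraps the entire pipeline rather than only the final model, feature-extraction parameters such as `numFeatures` are searched alongside model parameters. With a dataset this small the fold metrics are not meaningful; the sketch only shows how the pieces fit together.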



Paint the landscape and secure your data center with Apache Spot


November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), a next-generation cybersecurity analytics architecture that applies unsupervised machine learning at cloud scale for anomaly detection.

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)


October 31, 2019

Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.

Humans in the loop: Jupyter notebooks as a frontend for AI pipelines at scale


December 22, 2019

Paco Nathan reviews use cases where Jupyter provides a frontend to AI as the means for keeping humans in the loop. This process enhances the feedback loop between people and machines, and the end result is that a smaller group of people can handle a wider range of responsibilities for building and maintaining a complex system of automation.

Modern Big Data Pipelines over Kubernetes [I]


December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions which don't necessarily benefit from the Hadoop deployme …

Real-time machine learning with Redis, Apache Spark, TensorFlow, and more


November 30, 2019

Kamran Yousaf explains how to substantially accelerate and radically simplify common practices in machine learning, such as running a trained model in production, to meet real-time expectations, using Redis modules that natively store and execute common models generated by Spark ML and TensorFlow algorithms.

Unified stateful big data processing in Apache Beam (incubating)


November 29, 2019

Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.