ETL 2.0: It's not just for data engineers anymore
March 3, 2020
Robin Moffatt explores the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin leads a hands-on demonstration of these concepts in practice, with commentary on the design choices made.
Machine learning over real-time streaming data with TensorFlow
February 23, 2020
In many applications where data is generated continuously, combining machine learning with streaming data is imperative to discover useful information in real time. Yong Tang explores TensorFlow I/O, which can be used to easily build a data pipeline with TensorFlow and streaming frameworks such as Apache Kafka, AWS Kinesis, or Google Cloud Pub/Sub.
Deep learning with Horovod and Spark using GPUs and Docker containers
February 20, 2020
Today, organizations understand the need to keep pace with new technologies when it comes to performing data science with machine learning and deep learning, but these new technologies come with their own challenges. Thomas Phelan demonstrates the deployment of TensorFlow, Horovod, and Spark using the NVIDIA CUDA stack on Docker containers in a secure multitenant environment.
Apache Hadoop 3.x state of the union and upgrade guidance
February 16, 2020
Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, and more. They also walk you through upgrade guidance from 2.x to 3.x.
Deep learning on Apache Spark at CERN's Large Hadron Collider with Analytics Zoo
February 15, 2020
Sajan Govindan outlines CERN's research on deep learning in high energy physics experiments as an alternative to customized rule-based methods with an example of topology classification to improve real-time event selection at the Large Hadron Collider. CERN uses deep learning pipelines on Apache Spark using BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.
Improving Spark by taking advantage of disaggregated architecture
February 12, 2020
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today's data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.
Now you see me; now you compute: Building event-driven architectures with Apache Kafka
February 11, 2020
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.
Parquet modular encryption: Confidentiality and integrity of sensitive column data
February 11, 2020
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.
Protecting the healthcare enterprise from PHI breaches using streaming and NLP
February 11, 2020
Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none of which are more critical than the protection of protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.
Scalable anomaly detection with Spark and SOS
February 10, 2020
Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.
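To make the intuition behind SOS concrete, here is a minimal NumPy sketch of the algorithm: each point gets a Gaussian affinity to every other point, with a per-point bandwidth chosen by binary search to hit a target perplexity (the effective number of neighbors); a point's outlier probability is then the chance that no other point "binds" to it. This is an illustrative single-machine sketch, not Jeroen's Spark implementation, and the parameter defaults are assumptions.

```python
import numpy as np

def sos(X, perplexity=4.5):
    """Stochastic Outlier Selection: outlier probability per point."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise squared Euclidean dissimilarities
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.zeros((n, n))
    for i in range(n):
        # Binary-search the bandwidth (beta) so point i's affinity
        # distribution has the requested perplexity
        lo, hi = 1e-10, 1e10
        for _ in range(100):
            beta = (lo + hi) / 2.0
            a = np.exp(-D[i] * beta)
            a[i] = 0.0  # no self-affinity
            s = a.sum()
            if s <= 0.0:          # bandwidth too narrow: shrink beta
                hi = beta
                continue
            # Shannon entropy of the affinity distribution
            h = np.log(s) + beta * (D[i] * a).sum() / s
            if np.exp(h) > perplexity:
                lo = beta         # too many effective neighbors
            else:
                hi = beta
        A[i] = a
    # Binding probabilities: row-normalized affinities
    B = A / A.sum(axis=1, keepdims=True)
    # Outlier probability of j: chance that no other point binds to j
    return np.prod(1.0 - B, axis=0)
```

Run on a tight cluster plus one distant point, the distant point receives almost no binding probability from the others, so its outlier probability approaches 1.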
The evolution of metadata: LinkedIn's story
February 9, 2020
Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn's metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.
Using Spark for crunching astronomical data on the LSST scale
February 8, 2020
The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of image and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data.
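Cross-matching means pairing each source in one catalog with its nearest counterpart in another, within some angular radius. The sketch below shows only the matching criterion with a naive O(n·m) loop; a system like AXS avoids this cost at survey scale by partitioning the sky, and none of the function names here come from AXS itself.

```python
import math

def angular_sep(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def cross_match(cat_a, cat_b, radius_deg):
    """For each (ra, dec) source in cat_a, find the nearest source in
    cat_b within radius_deg. Returns (index_a, index_b, separation)."""
    matches = []
    for i, (ra1, dec1) in enumerate(cat_a):
        best, best_d = None, radius_deg
        for j, (ra2, dec2) in enumerate(cat_b):
            d = angular_sep(ra1, dec1, ra2, dec2)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            matches.append((i, best, best_d))
    return matches
```

For two catalogs of millions of sources the nested loop is hopeless, which is exactly why distributed, sky-partitioned approaches on Spark are needed.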
Introducing Kubeflow (with special guests TensorFlow and Apache Spark)
February 4, 2020
Modeling is easy; productizing models, less so. Distributed training? Forget about it. Say hello to Kubeflow with Holden Karau: a system that makes it easy for data scientists to containerize their models to train and serve on Kubernetes.
Building machine learning inference pipelines at scale
January 31, 2020
Real-life ML workloads require more than training and predicting: data often needs to be preprocessed and postprocessed. Developers and data scientists have to train and deploy a sequence of algorithms that collaborate in delivering predictions from raw data. Julien Simon outlines how to build machine learning inference pipelines using open source libraries and how to scale them on AWS.
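The core idea of an inference pipeline is that preprocessing, prediction, and postprocessing deploy as one unit, each stage handing its output to the next. Here is a minimal pure-Python sketch of that pattern; all names and stages are illustrative, not any AWS or library API.

```python
class InferencePipeline:
    """Chain of stages applied in order to each incoming payload."""

    def __init__(self, *stages):
        self.stages = stages

    def predict(self, payload):
        for stage in self.stages:
            payload = stage(payload)
        return payload

# Illustrative stages for a toy regression on raw records
def preprocess(record):
    # Raw string fields -> numeric feature vector
    return [float(record["x1"]), float(record["x2"])]

def model(features):
    # Stand-in for a trained model's scoring function
    return 2.0 * features[0] + 3.0 * features[1]

def postprocess(score):
    # Numeric score -> API-friendly response
    return {"prediction": round(score, 2)}

pipeline = InferencePipeline(preprocess, model, postprocess)
```

Because the whole chain sits behind one `predict` call, the sequence can be scaled as a single deployable unit rather than three services that must be versioned and scaled in lockstep.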
End-to-end ML streaming with Kubeflow, Kafka, and Redis at scale
January 30, 2020
As ML models become ubiquitous, model serving and pipelining are more important than ever. Comcast runs hundreds of models at scale with Kubernetes and Kubeflow. Together with other popular open source streaming platforms such as Apache Kafka and Redis, Comcast invokes models billions of times per day while maintaining high availability guarantees and quick deployments. Join Nick Pinckernell to learn how.
How China's search company Baidu adopted InnerSource
January 29, 2020
Open source has been very popular in China in recent years, but InnerSource is still new. Baidu, the Chinese search engine company, began to adopt InnerSource two years ago. Tan Zhongyi leads this project, and he details how this happened and the challenges the company faced and overcame.
Unifying analytics and AI on big data for faster insights at scale
January 22, 2020
Ziya Ma walks you through Intel's scalable data insights strategy and related big data analytics and AI technologies such as Analytics Zoo, an end-to-end analytics and AI pipeline for developing full solutions with Apache Spark on Intel Xeon and Intel Optane DC Persistent Memory at scale. She highlights customers' use cases and collaboration with industry leaders throughout.
Monitor disk space and other ways to keep Apache Kafka happy
January 19, 2020
After five years of helping hundreds of customers use Apache Kafka, Gwen Shapira has seen it all. She provides an overview of the most common ways Apache Kafka users manage to cause downtime and lose data, and how to avoid them.
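Running a log directory out of disk is one of the classic ways to take a Kafka broker down, so alerting well before a volume fills up is cheap insurance. A minimal sketch of such a check, using only the standard library; the threshold is an illustrative default, not a Kafka configuration setting.

```python
import shutil

def check_log_dirs(log_dirs, min_free_fraction=0.15):
    """Return (path, free_fraction) for every log directory whose free
    space has fallen below min_free_fraction of the volume's capacity.
    Point log_dirs at the broker's configured log.dirs paths."""
    alerts = []
    for d in log_dirs:
        usage = shutil.disk_usage(d)
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            alerts.append((d, free_fraction))
    return alerts
```

In practice a check like this would run on a schedule and feed an alerting system, so operators hear about a filling volume long before the broker does.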
Deep learning with TensorFlow and Spark using GPUs and Docker containers
January 12, 2020
Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment.
Migrating Apache Oozie workflows to Apache Airflow
January 8, 2020
Apache Oozie and Apache Airflow are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.