Evolution of a modern cloud-based data lake
March 3, 2020
Building a data lake is hard. You have to centralize all of the company's data in one place, make it easily accessible, and get governance right. And, last but not least, the price has to stay reasonable. Together, these requirements add up to quite a challenge. But never fear. Viacheslav Inozemtsev outlines the experience of building Zalando's data lake.
A novel solution for a data augmentation and bias problem in NLP using TensorFlow
February 24, 2020
Join KC Tung to discover a way to use TensorFlow to solve a natural language processing (NLP) model bias problem with data augmentation for an enterprise customer (one of the largest airlines in the world). KC leveraged hidden gems in tf.data and the new API to easily find a novel use for text generation and found it surprisingly improved his NLP model.
Effective sampling methods within TensorFlow input functions
February 24, 2020
Many real-world machine learning applications require generative or reductive sampling of data. Laxmi Prajapat and William Fletcher demonstrate sampling techniques applied to training and testing data directly inside the input function using the tf.data API.
Generative malware outbreak detection
February 23, 2020
Practical defense systems require precise detection during malware outbreaks with only a handful of available samples. Sean Park demonstrates how to detect in-the-wild malware samples with a single training sample of a kind, with the help of TensorFlow's flexible architecture in implementing a novel variable-length generative adversarial autoencoder.
Anomaly detection using deep learning to measure the quality of large datasets
February 22, 2020
Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales or marketing purposes. No matter the algorithm and the techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla examines some techniques used to evaluate the quality of data and the means to detect the anomalies in the data.
Audience projection of target consumers over multiple domains: A NER and Bayesian approach
February 21, 2020
AI-powered market research is performed by indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases). These approaches are more scalable, authentic, and suitable for real-time consumer insights. Gianmario Spacagna proposes a novel algorithm of audience projection able to provide consumer insights over multiple domains.
Architecting a data analytics service both in the public cloud and in the on-premise private cloud: ETL, BI, and machine learning (sponsored by SK Holdings)
February 16, 2020
Jungwook Seo walks you through AccuInsight+, a cloud data analytics platform that SK Holdings announced in January 2019. It offers eight data analytics services on CloudZ, one of the biggest cloud service providers in Korea.
Learning with limited labeled data
February 12, 2020
Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. But this could be avoided if machines could learn with a few labeled examples. Shioulin Sam explores and demonstrates an algorithmic solution that relies on collaboration between human and machine to label smartly, and she outlines product possibilities.
Sketching data and other magic tricks
February 10, 2020
Go hands-on with Sophie Watson and William Benton to examine data structures that let you answer interesting queries about massive datasets in fixed amounts of space and constant time. This seems like magic, but they'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.
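As an illustration of the kind of structure this abstract describes, here is a minimal sketch of a count-min sketch, one well-known fixed-space summary for approximate frequency counts. The abstract does not say which structures the session covers, so the choice of technique, the class name `CountMinSketch`, and the default `width` and `depth` are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed depth x width table.

    Illustrative only: names and default sizes are assumptions, not taken
    from the session described above.
    """

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        # Memory use is fixed up front and does not grow with the data.
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One column per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        # Touches exactly `depth` counters, regardless of how much data was seen.
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Hash collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["spark", "kafka", "spark", "flink", "spark"]:
        sketch.add(word)
    print(sketch.estimate("spark"))  # never below the true count of 3
```

The estimate can overcount when items collide but never undercounts, which is the trade-off that buys the fixed memory footprint.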
Working with time series: Denoising and imputation frameworks to improve data density
February 8, 2020
The application of smoothing and imputation strategies is common practice in predictive modeling and time series analysis. With a technique-agnostic approach, Anjali Samani provides qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.
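To make the two operations named in this abstract concrete, the sketch below fills missing values by linear interpolation and then smooths with a centered moving average. The session itself is technique-agnostic and the abstract names no specific method, so both techniques and the function names `interpolate_missing` and `moving_average` are assumptions chosen for illustration.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known points.

    Leading and trailing gaps copy the nearest observed value.
    Hypothetical helper for illustration only.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Weight by position between the two surrounding observations.
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


if __name__ == "__main__":
    series = [1.0, None, 3.0, None, None, 6.0]
    dense = interpolate_missing(series)
    print(dense)
    print(moving_average(dense, window=3))
```

Imputing before smoothing keeps the averaging window from having to handle gaps. Whether that order suits a given series is a modeling decision this sketch does not settle.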
Your easy move to serverless computing and radically simplified data processing
February 7, 2020
Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the push-to-the-cloud experience, which dramatically simplifies serverless for big data processing frameworks.
Long-term real-time network traffic flow prediction using LSTM recurrent neural network
February 4, 2020
Real-time traffic volume prediction is vital in proactive network management, and many forecasting models have been proposed to address this. However, most are unable to fully use the information in traffic data to generate efficient and accurate traffic predictions for a longer term. Wei Cai explores predicting multistep, real-time traffic volume using many-to-one LSTM and many-to-many LSTM.
Putting cutting-edge modern NLP into practice
February 3, 2020
AllenNLP is a PyTorch-based library designed to make it easy to do high-quality research in natural language processing (NLP). Joel Grus explains what modern neural NLP looks like; you'll get your hands dirty training some models, writing some code, and learning how you can apply these techniques to your own datasets and problems.
Scaling AI at Cerebras
February 3, 2020
Long training times are the single biggest factor slowing down innovation in deep learning. Today's common approach of scaling large workloads out over many small processors is inefficient and requires extensive model tuning. Urs Köster explains why, with increasing model and dataset sizes, new ideas are needed to reduce training times.
Building machine learning inference pipelines at scale
January 31, 2020
Real-life ML workloads require more than training and predicting: data often needs to be preprocessed and postprocessed. Developers and data scientists have to train and deploy a sequence of algorithms that collaborate in delivering predictions from raw data. Julien Simon outlines how to build machine learning inference pipelines using open source libraries and how to scale them on AWS.
January 27, 2020
Machine learning (ML) drove massive growth at consumer internet companies over the last decade, enabled by open software, datasets, and AI research. For many problems, ML will produce better, faster, and more repeatable decisions at scale. Unfortunately, building and maintaining these systems is difficult and expensive. Pete Skomoroch explores what you need to produce better ML results.
Removing unfair bias in machine learning using open source (sponsored by IBM)
January 25, 2020
ML models are increasingly used to make decisions that impact lives. Ana Echeverri and Trisha Mahoney walk you through AI Fairness 360, an open source Python toolkit developed by IBM researchers. It provides metrics to check for unwanted bias in datasets and machine learning models, along with state-of-the-art algorithms to mitigate such bias.
Herding elephants: Seamless data access in multicluster clouds
January 10, 2020
Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.
Learning "learning to rank"
January 9, 2020
Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement learning-to-rank algorithms and provides you with the information you need to implement your own successful ranking system.
Mastering data with Spark and machine learning
January 8, 2020
Enterprise data on customers, vendors, and products is often siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360-degree views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Sonal Goyal offers an overview of a modern master data application using Spark, Cassandra, ML, and Elastic.
Reinforcement learning: A gentle introduction and an industrial application
January 6, 2020
Reinforcement learning (RL) autonomously learns complex tasks such as walking, beating the world champion in Go, or flying a helicopter. No big datasets with the right answers are needed: the algorithms learn by experimenting. Christian Hidber shows how and why RL works and demonstrates how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries.