A novel solution for a data augmentation and bias problem in NLP using TensorFlow
Join KC Tung to discover a way to use TensorFlow to solve a natural language processing (NLP) model bias problem with data augmentation for an enterprise customer (one of the largest airlines in the world). KC leveraged hidden gems in tf.data and the new API, finding a novel use for text generation that surprisingly improved his NLP model.
| Talk Title | A novel solution for a data augmentation and bias problem in NLP using TensorFlow |
|---|---|
| Speakers | KC Tung (Microsoft) |
| Conference | O’Reilly TensorFlow World |
| Location | Santa Clara, California |
| Date | October 28–31, 2019 |
The TensorFlow ecosystem contains many valuable assets, one of which is the highly acclaimed TensorFlow high-level API. It’s critical for a fast, lightweight approach to reducing lead time in deep learning model development and hypothesis testing, and it now makes it possible to quickly develop a novel deep learning solution to an important practical need: data bias and augmentation in NLP. Solving this problem would have a far-reaching impact on model bias, offensive-language detection, language personalization, and classification.

KC Tung details his work to satisfy a need of an enterprise customer (one of the largest airlines in the world) for a model that can accurately review, classify, and store texts from aircraft maintenance logs to comply with FAA regulations on aviation safety. The customer’s data is imbalanced and biased toward certain categories, and training machine learning models with imbalanced data inevitably leads to model bias, which makes text generation a novel and important approach to data augmentation.

In NLP, many current approaches to augmenting minority data are unsupervised and limited to synonym swap, insertion, deletion, or oversampling. These generalized approaches often force a trade-off between precision and recall, and they rarely work well in practice, since enterprise data is almost always domain specific. A better framework is needed: one that can generate a new corpus by learning from any domain-specific, underrepresented text.

KC presents a novel deep learning framework built with TensorFlow to quickly achieve this goal. A benchmark model is trained on the balanced dataset; from this dataset, one class is undersampled to serve as the underrepresented, minority-class text. A gated recurrent unit (GRU) model then learns to generate more of that underrepresented text, which helps train a long short-term memory (LSTM) model that classifies text. The result on holdout data shows that the model trained with generated text is surprisingly effective.
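The talk abstract doesn’t include code, but the text-generation half of the pipeline described above can be sketched with the tf.keras high-level API. This is a minimal, illustrative sketch: the vocabulary size, layer widths, and sequence length are assumptions, not the customer’s actual model.

```python
import tensorflow as tf

VOCAB_SIZE = 128   # assumed: small character-level vocabulary
EMBED_DIM = 64     # assumed embedding width
GRU_UNITS = 256    # assumed GRU width

def build_generator(vocab_size=VOCAB_SIZE):
    """A GRU text generator: embed token ids, run a GRU over the
    sequence, and predict next-token logits at every position."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, EMBED_DIM),
        tf.keras.layers.GRU(GRU_UNITS, return_sequences=True),
        tf.keras.layers.Dense(vocab_size),  # logits over the vocabulary
    ])

generator = build_generator()
generator.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# One batch of token ids: 2 sequences of length 10 (zeros as dummy input).
batch = tf.zeros((2, 10), dtype=tf.int32)
logits = generator(batch)
print(logits.shape)  # (2, 10, 128): next-token logits per position
```

Trained on the undersampled minority-class text, such a model can be sampled token by token to synthesize additional minority-class examples for the LSTM classifier’s training set.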
Classification accuracy, precision, and recall for each class are all on par with the benchmark model, with no compromise in precision or recall. In short, this demonstrates a successful TensorFlow adoption for the enterprise customer: quickly leveraging the TensorFlow high-level API to build a novel production-grade solution for deployment, demonstrating the effectiveness of a novel data-augmentation framework, identifying a “killer app,” or new core value, for text generation, and offering best practices and guidance for navigating machine learning model bias and its business impact.

KC also details how to containerize the TensorFlow application and serve it in a Kubernetes cluster in the cloud, all with open source Python libraries. The TensorFlow high-level API proves indispensable for a fast, high-quality deep learning model development experience. Most importantly, this TensorFlow model may be deployed as a container in the cloud, on-premises, or at the edge, providing great flexibility to meet varied solution architecture and business needs.
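The containerized-deployment path typically starts from a SavedModel export, which serving containers such as the stock `tensorflow/serving` image can load directly inside a Kubernetes pod. A hedged sketch under assumed names and shapes (the LSTM architecture, category count, and export path below are illustrative, not the production model):

```python
import tensorflow as tf

# Illustrative stand-in for the trained LSTM log classifier:
# 1000-token vocabulary and 4 maintenance-log categories are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Call the model once so its weights are built before export.
_ = model(tf.zeros((1, 20), dtype=tf.int32))

# TensorFlow Serving expects a numeric version subdirectory.
export_dir = "/tmp/maintenance_log_classifier/1"
tf.saved_model.save(model, export_dir)
print("exported to", export_dir)
```

A container built from `tensorflow/serving` with this directory mounted can then serve the model over REST or gRPC, and the same image runs in the cloud, on-premises, or at the edge.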