Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.
Talk Title | Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale |
Speakers | Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | April 30-May 2, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
More people find jobs on Indeed than anywhere else. With two hundred million unique visitors a month, Indeed has accumulated hundreds of millions of jobs and résumés and trillions of data points of activity. Much of this data is entered by users. Because users express the same or similar facts in different ways, Indeed needs to standardize these fields. The traditional solution is a human-curated list of replacement rules. But with datasets as large and diverse as Indeed’s, the better solution is to use the data to normalize itself.

Spark NLP, John Snow Labs’ NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language processing capabilities with high performance, accuracy, and scalability. Its algorithms include rule-based, machine learning, and deep learning models, and it provides advanced NLP functionality such as named-entity recognition, fact extraction, spell checking, sentiment analysis, and assertion status detection. These algorithms are combined in NLP pipelines to automate the multiple steps needed to normalize natural language text, from spelling correction to stemming to using corpus statistics to identify preferred forms.

Alexis Yelton and Alex Thomas explain how to combine Spark NLP with Apache Spark’s built-in algorithms to create standardized semistructured text directly from résumés and job descriptions. These standardized strings can then be used to improve résumé or job search engines or to feed machine learning models used for everything from predicting apply rates to recommending jobs to job seekers. Join in to explore the technical challenges, the algorithms, and how you can use them in your next text-processing project.
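
The idea of letting the data normalize itself, as described above, can be illustrated with a small sketch. This is not Indeed's implementation and does not use the Spark NLP API. It is plain Python, written under the assumption that "using corpus statistics to identify preferred forms" means grouping user-entered strings by a normalized key and picking the most frequent surface form in each group. The function names (`normalization_key`, `build_preferred_forms`, `standardize`), the crude plural-stripping rule standing in for stemming, and the sample job titles are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalization_key(text):
    """Collapse superficial variation: case, punctuation, and spacing."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Crude stand-in for a real stemmer (assumption): strip one trailing "s"
    # from longer tokens, but leave words ending in "ss" alone.
    stems = [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]
    return " ".join(stems)


def build_preferred_forms(values):
    """Map each normalization key to its most frequent surface form."""
    counts = defaultdict(Counter)
    for value in values:
        surface = " ".join(value.split())  # collapse runs of whitespace
        counts[normalization_key(surface)][surface] += 1
    preferred = {}
    for key, forms in counts.items():
        # Highest count wins; ties are broken alphabetically for determinism.
        best = sorted(forms.items(), key=lambda item: (-item[1], item[0]))[0]
        preferred[key] = best[0]
    return preferred


def standardize(value, preferred):
    """Replace a raw string with the corpus-preferred form, if one is known."""
    cleaned = " ".join(value.split())
    return preferred.get(normalization_key(cleaned), cleaned)


# Hypothetical user-entered job titles (invented for illustration).
job_titles = [
    "Software Engineer",
    "software engineer",
    "Software Engineer",
    "Software  Engineers",
    "Registered Nurse",
    "registered nurse.",
    "Registered Nurse",
]

preferred = build_preferred_forms(job_titles)
print(standardize("SOFTWARE ENGINEERS", preferred))  # Software Engineer
print(standardize("Data Scientist", preferred))      # Data Scientist (unseen, left as-is)
```

At Indeed's scale the same three steps (clean, key, count) would run as distributed transformations and aggregations over a Spark DataFrame rather than over an in-memory list, with the cleaning and stemming done by pipeline stages instead of the regular expression used here.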