January 6, 2020

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.

Talk Title Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale
Speakers Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date April 30-May 2, 2019

More people find jobs on Indeed than anywhere else. With two hundred million unique visitors a month, Indeed has accumulated hundreds of millions of jobs and résumés and trillions of data points of activity. Much of this data is entered by users. Because users express the same or similar facts in different ways, Indeed needs to standardize these fields. The traditional solution is a human-curated list of replacement rules. But with datasets as large and diverse as Indeed’s, the better solution is to use the data to normalize itself.

Spark NLP, John Snow Labs’ NLP library for Apache Spark, is an open source library that natively extends Spark ML to provide natural language processing capabilities with high performance, accuracy, and scalability. Its algorithms include rule-based, machine learning, and deep learning models, and it provides advanced NLP functionality such as named-entity recognition, fact extraction, spell checking, sentiment analysis, and assertion status detection. These algorithms are combined in NLP pipelines that automate the multiple steps necessary to normalize natural language text, from spelling correction to stemming to using corpus statistics to identify preferred forms.

Alexis Yelton and Alex Thomas explain how to combine Spark NLP with Apache Spark’s built-in algorithms to create standardized semistructured text directly from résumés and job descriptions. These standardized strings can then be used to improve résumé or job search engines or to feed into machine learning models used for everything from predicting apply rates to recommending jobs to job seekers. Join in to explore the technical challenges, the algorithms, and how you can use them in your next text-processing project.
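
The idea of using corpus statistics to identify preferred forms can be sketched in a few lines. The example below is a plain-Python illustration only; it is not Spark NLP code and not Indeed's actual pipeline. The function names (`normalization_key`, `preferred_forms`, `standardize`), the toy plural-stripping rule standing in for real stemming, and the sample job titles are all assumptions made for this sketch.

```python
import re
from collections import Counter, defaultdict


def normalization_key(text: str) -> str:
    """Hypothetical grouping key: lowercase, drop punctuation, crude 'stemming'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Toy stand-in for a real stemmer: drop a trailing "s" from longer tokens.
    stemmed = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return " ".join(stemmed)


def preferred_forms(values):
    """Map each grouping key to the surface form seen most often in the corpus."""
    counts = defaultdict(Counter)
    for value in values:
        counts[normalization_key(value)][value.strip()] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def standardize(value, preferred):
    """Replace a raw string with its preferred form, or keep it if the key is unseen."""
    return preferred.get(normalization_key(value), value.strip())


# Made-up sample data: several spellings of the same two job titles.
titles = [
    "Software Engineer", "software engineer", "Software Engineers",
    "Software Engineer", "Registered Nurse", "registered nurse.",
    "Registered Nurse",
]
preferred = preferred_forms(titles)
print(standardize("software engineers", preferred))
```

No replacement rules are written by hand here: the most frequent spelling within each group becomes the standard form, which is one simple reading of letting the data normalize itself. At Spark scale the same grouping and counting would presumably be expressed as DataFrame aggregations, with the key produced by an NLP pipeline instead of a regular expression.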
