Deep learning for domain-specific entity extraction from unstructured text
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.
| Talk Title | Deep learning for domain-specific entity extraction from unstructured text |
|------------|-----------------------------------------------------------------------------|
| Speakers | Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft) |
| Conference | Strata Data Conference |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 6-8, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Biomedical named entity recognition is a critical step for complex biomedical NLP tasks such as understanding the interactions between different entity types, for example the drug-disease relationship or the gene-protein relationship. Feature generation for such tasks is often complex and time consuming; neural networks, by contrast, can obviate the need for manual feature engineering and use the original data as input. Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained with the word2vec learning algorithm on a Spark cluster using millions of Medline PubMed abstracts and are then used as features to train an LSTM recurrent neural network for entity extraction, using Keras with TensorFlow or CNTK on a GPU-enabled Azure Data Science Virtual Machine (DSVM). Results show that training a domain-specific word embedding model boosts performance compared to embeddings trained on generic data such as Google News.
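To make the first stage concrete, the idea behind word2vec is that words sharing contexts end up with nearby vectors. The following is a toy, dependency-light sketch of skip-gram training in plain numpy; the four-sentence corpus, the 8-dimensional embeddings, and the full-softmax update are all illustrative assumptions, not the talk's actual Spark MLlib setup over millions of PubMed abstracts (real word2vec implementations use negative sampling or hierarchical softmax at scale).

```python
import numpy as np

# Toy corpus standing in for PubMed abstracts (assumption for illustration).
corpus = [
    "aspirin treats headache",
    "ibuprofen treats headache",
    "aspirin inhibits cox enzyme",
    "ibuprofen inhibits cox enzyme",
]

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension (toy values)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input (center-word) embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context-word) embeddings

def pairs(window=2):
    """Yield (center, context) index pairs within a sliding window."""
    for sent in tokens:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    yield idx[w], idx[sent[j]]

lr = 0.05
for epoch in range(200):
    for c, o in pairs():
        # Full-softmax skip-gram gradient step (fine at toy scale only).
        scores = W_out @ W_in[c]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        grad = probs.copy()
        grad[o] -= 1.0  # d(loss)/d(scores) for cross-entropy on context word o
        W_out -= lr * np.outer(grad, W_in[c])
        W_in[c] -= lr * (W_out.T @ grad)

def similarity(a, b):
    """Cosine similarity between the learned input embeddings of two words."""
    u, v = W_in[idx[a]], W_in[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Drugs that share contexts (treats/inhibits patterns) should drift together.
print(round(similarity("aspirin", "ibuprofen"), 3))
```

The same shared-context effect, applied to millions of abstracts, is what makes domain-specific embeddings capture biomedical regularities that generic Google News vectors miss.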
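For the second stage, the extraction model consumes those embedding vectors token by token and emits a tag per token. The talk builds this in Keras with TensorFlow or CNTK; as a framework-free illustration of the LSTM recurrence itself, here is a minimal forward pass in numpy. All dimensions, the random "sentence", and the three-tag scheme are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMTagger:
    """Minimal LSTM forward pass for per-token tagging (toy sketch, not
    the talk's Keras model)."""

    def __init__(self, d_in, d_hid, n_tags, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked matrix covering the input, forget, cell, output gates.
        self.W = rng.normal(scale=0.1, size=(4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.W_tag = rng.normal(scale=0.1, size=(n_tags, d_hid))
        self.d_hid = d_hid

    def forward(self, embeddings):
        """embeddings: (seq_len, d_in) array of per-token word vectors.
        Returns (seq_len, n_tags) unnormalized tag scores."""
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        scores = []
        for x in embeddings:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
            c = f * c + i * np.tanh(g)   # gated cell-state update
            h = o * np.tanh(c)           # new hidden state
            scores.append(self.W_tag @ h)
        return np.stack(scores)

# Tag a hypothetical 5-token sentence of 16-dim embeddings into 3 tags
# (e.g. a B/I/O scheme for entity spans).
rng = np.random.default_rng(1)
sent = rng.normal(size=(5, 16))
tagger = LSTMTagger(d_in=16, d_hid=32, n_tags=3)
out = tagger.forward(sent)
print(out.shape)  # one score vector per token
```

In the talk's pipeline, the embedding layer would be initialized from the word2vec vectors trained on PubMed, and the tag scores would feed a softmax trained against labeled entity spans.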