PyTextRank: Graph algorithms for enhanced natural language processing
Paco Nathan demonstrates how to use PyTextRankan open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learningto move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.
Talk Title | PyTextRank: Graph algorithms for enhanced natural language processing |
Speakers | Paco Nathan (derwen.ai) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
PyTextRank is a Python open source implementation of TextRank, a graph algorithm for NLP based on the Mihalcea 2004 paper. The package is intended to complement other machine learning approaches, specifically deep learning used in custom search and recommendations, by generating enhanced feature vectors from raw texts. PyTextRank builds on builds on spaCy, datasketch, NetworkX, and other popular Python libraries. Results include full parse from raw texts, vectors of ranked keyphrases, and adjustable autosummarization. PyTextRank is used in production at scale by O’Reilly Media and is available on PyPi and GitHub. Previous generations of NLP used shortcuts such as stemming, bag of words, and n-grams, which tend to degrade results. In contrast, PyTextRank uses lemmatization, named entity resolution, hypernyms, and graph-based semantic analysis. Advances in popular Python libraries for statistical parsing, graph analytics, probabilistic data structures, as well the availability of multicore processors with large memory spaces, make possible more effective approaches to NLP which do not require clusters. Resulting keyphrase vectors are significantly more useful than simple keyword extraction, especially for vector embedding. Moreover, this approach allows import of an ontology to help refine results. In other words, inference extends the parsing capabilities into natural language understanding. Paco Nathan illustrates PyTextRank use cases in media and learning to enable semisupervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.