December 30, 2019

PyTextRank: Graph algorithms for enhanced natural language processing

Paco Nathan demonstrates how to use PyTextRank, an open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries, to prepare raw text for AI applications in media and learning, and to move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.

Talk Title: PyTextRank: Graph algorithms for enhanced natural language processing
Speakers: Paco Nathan (derwen.ai)
Conference: Strata Data Conference
Conf Tag: Make Data Work
Location: New York, New York
Date: September 26-28, 2017
URL: Talk Page
Slides: Talk Slides

PyTextRank is an open source Python implementation of TextRank, a graph algorithm for NLP based on the Mihalcea 2004 paper. The package is intended to complement other machine learning approaches, specifically deep learning used in custom search and recommendations, by generating enhanced feature vectors from raw texts. PyTextRank builds on spaCy, datasketch, NetworkX, and other popular Python libraries. Results include a full parse of the raw texts, vectors of ranked keyphrases, and adjustable autosummarization. PyTextRank is used in production at scale by O’Reilly Media and is available on PyPI and GitHub.

Previous generations of NLP used shortcuts such as stemming, bag of words, and n-grams, which tend to degrade results. In contrast, PyTextRank uses lemmatization, named entity resolution, hypernyms, and graph-based semantic analysis. Advances in popular Python libraries for statistical parsing, graph analytics, and probabilistic data structures, as well as the availability of multicore processors with large memory spaces, make possible more effective approaches to NLP that do not require clusters. The resulting keyphrase vectors are significantly more useful than simple keyword extraction, especially for vector embedding. Moreover, this approach allows an ontology to be imported to help refine results. In other words, inference extends the parsing capabilities into natural language understanding.

Paco Nathan illustrates PyTextRank use cases in media and learning: enabling semisupervised word sense disambiguation, moving from natural language parsing to natural language understanding, and implementing AI-based video search and approximation algorithms for content recommendation based on semantic similarity.
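To make the graph idea concrete, here is a minimal sketch of TextRank-style keyword ranking: build a co-occurrence graph over the words of a text, then score the nodes with PageRank. It is not PyTextRank's implementation. It uses only the Python standard library, works on raw lowercased tokens rather than a spaCy parse with lemmas, and substitutes a hand-written power iteration for NetworkX. The function names, the stopword list, the window size, and the damping factor are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny stopword list, chosen only for this illustration (an assumption).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to",
             "is", "are", "that", "with", "as", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_graph(tokens, window=3):
    """Link each token to the next `window - 1` tokens (undirected edges)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Unweighted PageRank by power iteration over an undirected graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor shares its current score equally among its edges.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def rank_keywords(text, top_n=5):
    """Return the `top_n` highest-scoring words with their scores."""
    scores = pagerank(build_graph(tokenize(text)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    sample = ("PyTextRank builds a graph from parsed text and ranks "
              "keyphrases in the graph to summarize the text")
    for word, score in rank_keywords(sample):
        print(f"{word}\t{score:.4f}")
```

Words that co-occur with many other words accumulate score from their neighbors, which is why a graph ranking can surface central terms without a training corpus. A production pipeline would rank lemmas and noun phrases from a statistical parser instead of raw tokens.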
