PyTextRank: Graph algorithms for enhanced natural language processing

Paco Nathan demonstrates how to use PyTextRank (an open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learning) to move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.


Talk Title	PyTextRank: Graph algorithms for enhanced natural language processing
Speakers	Paco Nathan (derwen.ai)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 26-28, 2017

PyTextRank is an open source Python implementation of TextRank, a graph algorithm for NLP based on the Mihalcea 2004 paper. The package is intended to complement other machine learning approaches, specifically deep learning used in custom search and recommendations, by generating enhanced feature vectors from raw texts. PyTextRank builds on spaCy, datasketch, NetworkX, and other popular Python libraries. Results include a full parse from raw texts, vectors of ranked keyphrases, and adjustable autosummarization. PyTextRank is used in production at scale by O’Reilly Media and is available on PyPI and GitHub.

Previous generations of NLP used shortcuts such as stemming, bag of words, and n-grams, which tend to degrade results. In contrast, PyTextRank uses lemmatization, named entity resolution, hypernyms, and graph-based semantic analysis. Advances in popular Python libraries for statistical parsing, graph analytics, and probabilistic data structures, as well as the availability of multicore processors with large memory spaces, make possible more effective approaches to NLP that do not require clusters.

The resulting keyphrase vectors are significantly more useful than simple keyword extraction, especially for vector embedding. Moreover, this approach allows an ontology to be imported to help refine results. In other words, inference extends the parsing capabilities into natural language understanding.

Paco Nathan illustrates PyTextRank use cases in media and learning to enable semisupervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.
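To make the graph-ranking idea behind TextRank concrete, here is a minimal sketch in plain Python. It is not PyTextRank's API or implementation: it assumes a crude regex tokenizer and a tiny hand-picked stopword list in place of spaCy's parsing and lemmatization, and a hand-written PageRank loop in place of NetworkX. The function names (`tokenize`, `build_graph`, `pagerank`, `top_keywords`) and the `window`, `damping`, and `iterations` defaults are illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list, not a real linguistic resource.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words longer than two letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def build_graph(tokens, window=3):
    """Link each token to the (window - 1) tokens that follow it."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Run a fixed number of PageRank iterations on an undirected graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def top_keywords(text, k=5):
    """Return the k highest-ranked words, breaking ties alphabetically."""
    graph = build_graph(tokenize(text))
    if not graph:
        return []
    ranks = pagerank(graph)
    return sorted(ranks, key=lambda w: (-ranks[w], w))[:k]


if __name__ == "__main__":
    sample = "graph parsing graph ranking graph summary graph keyphrase"
    print(top_keywords(sample, k=3))
```

Words that co-occur with many other words accumulate rank, which is why a well-connected term rises to the top. A fuller pipeline of the kind the abstract describes would rank lemmatized noun phrases rather than raw words.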
