Document vectors in the wild: Building a content recommendation system for Reuters.com
James Dreiss discusses the challenges of building a content recommendation system for Reuters.com, one of the largest news sites in the world. Distinctive features of the system include a scrolling newsfeed and the use of document vectors to represent content semantically.
Talk Title | Document vectors in the wild: Building a content recommendation system for Reuters.com |
Speakers | James Dreiss (Reuters) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 11-13, 2018 |
URL | Talk Page |
Slides | Talk Slides |
In the summer of 2017, Reuters.com embarked on an ambitious redesign of its article pages, specifically a scroll design in which the article a user requests is immediately followed by related (or possibly unrelated) articles. The initial launch of the scroll model made recommendations based on content alone, independent of user behavior. Given the advantages of word and document embedding models and the particularities of Reuters.com content, the system was designed to use document vectors to determine article similarity.

Because document vectors are learned without supervision, they need some supervised learning assistance before they can be used in a production system. James Dreiss discusses the development of the supervised topic filtering model that sits on top of the document vector model, as well as additional filtering strategies. Measuring the performance of word and document vectors is notoriously difficult, but some heuristics have been developed. James offers a brief overview of measuring word and document vector performance and explains how he ultimately tackled the problem.

James also details how he tested a pet theory that users would want diversity in content, especially given the wall-to-wall coverage of certain subjects, such as Donald Trump, and shares the results of serving users both similar and dissimilar content. He concludes by covering the cookie-based personalization system that was later implemented for content recommendation on article scrolls, including test results comparing the two systems.
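
To make the idea of a topic filter sitting on top of document-vector similarity concrete, here is a minimal sketch in Python. It is not the Reuters.com implementation: the three-dimensional vectors, the article IDs, the topic labels, and the `recommend` function with its parameters are all hypothetical, and cosine similarity is assumed as the similarity measure. In a real system the vectors would come from a trained document embedding model and the topic labels from a supervised classifier.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(current_id, vectors, topics, k=2,
              same_topic_only=True, most_similar=True):
    """Rank other articles by similarity to the current one.

    same_topic_only: keep only candidates whose (hypothetical) topic
        label matches the current article, standing in for a
        supervised topic filter.
    most_similar: if False, rank the least similar articles first,
        as a crude way to serve more diverse content.
    """
    scored = []
    for doc_id, vec in vectors.items():
        if doc_id == current_id:
            continue
        if same_topic_only and topics[doc_id] != topics[current_id]:
            continue
        scored.append((cosine(vectors[current_id], vec), doc_id))
    scored.sort(reverse=most_similar)
    return [doc_id for _, doc_id in scored[:k]]


# Toy data, invented for illustration only.
vectors = {
    "fed-rates": [0.9, 0.1, 0.0],
    "ecb-rates": [0.8, 0.2, 0.1],
    "oil-prices": [0.5, 0.5, 0.1],
    "cup-final": [0.0, 0.1, 0.9],
}
topics = {
    "fed-rates": "markets",
    "ecb-rates": "markets",
    "oil-prices": "markets",
    "cup-final": "sports",
}

similar = recommend("fed-rates", vectors, topics)
diverse = recommend("fed-rates", vectors, topics, k=1,
                    same_topic_only=False, most_similar=False)
print(similar)
print(diverse)
```

With the defaults, the sketch returns the nearest same-topic articles; switching off the filter and reversing the ranking returns the least similar article instead, which is one simple way to express the similar-versus-dissimilar comparison described above.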