January 26, 2020

Applying petabyte-scale analytics and machine learning to billions of news reading sessions

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over two million sessions per minute across thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.

Talk Title Applying petabyte-scale analytics and machine learning to billions of news reading sessions
Speakers Andrew Montalenti (Parse.ly)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018
URL Talk Page
Slides Talk Slides

Parse.ly runs a real-time web and content analytics platform that serves 350+ enterprise clients, 30,000+ site operators, and thousands of high-traffic sites. This platform is used to understand audience, content, and attention at a granular level, but the aggregate data exhaust from these integrations also provides a front-row seat to what the internet is looking at today. Andrew Montalenti explains how consumer attention in the web era really works (e.g., to what degree Facebook and Google dominate consumer web attention versus more niche platforms).

Andrew also showcases how Parse.ly recently applied modern natural language processing and machine learning techniques to better understand its evolving dataset of more than a million unique pieces of content per day, including how the company classified all web pages into a structured content taxonomy and automatically extracted relevant topics and entities.

Alongside these network data findings on news trends, social networks, search engines, and device usage patterns, Andrew digs into the technology running under the hood, particularly multicloud setups (in the hundreds) with Elasticsearch, Cassandra, Kafka, Storm, and Spark, and discusses open source projects the company has built and released, such as PyKafka and streamparse. He also talks about Parse.ly’s recent adoption of serverless cloud tooling, which makes machine learning easier.

Andrew concludes by explaining how Parse.ly’s web-wide trend data has been used so far, such as for content strategy inside major newsrooms and for predicting offline consumer behavior (e.g., which movies would win at the box office based on the web attention those movies received in the weeks prior).
