Applying petabyte-scale analytics and machine learning to billions of news reading sessions
What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over two million sessions per minute across thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.
|Andrew Montalenti (Parse.ly)
|Strata Data Conference
|Make Data Work
|New York, New York
|September 11-13, 2018
Parse.ly runs a real-time web and content analytics platform that serves 350+ enterprise clients, 30,000+ site operators, and thousands of high-traffic sites. Customers use the platform to understand audience, content, and attention at a granular level, and the aggregate data exhaust from these integrations provides a front-row seat to what the internet is looking at today. Andrew Montalenti explains how consumer attention in the web era really works (e.g., to what degree Facebook and Google dominate consumer web attention versus more niche platforms).
Andrew also showcases how Parse.ly recently applied modern natural language processing and machine learning techniques to better understand its evolving dataset of more than a million unique pieces of content per day, including how the company classified all web pages into a structured content taxonomy and automatically extracted relevant topics and entities.
Alongside these network data findings on news trends, social networks, search engines, and device usage patterns, Andrew digs into the technology running under the hood, particularly multicloud setups (in the hundreds) with Elasticsearch, Cassandra, Kafka, Storm, and Spark, and discusses open source projects the company has built and released, such as PyKafka and streamparse. He also covers Parse.ly’s recent adoption of serverless cloud tooling, which makes machine learning easier.
Andrew concludes by explaining how Parse.ly’s web-wide trend data has been used so far, both for content strategy inside major newsrooms and for predicting offline consumer behavior (e.g., which movies would win at the box office based on the web attention those movies received in the weeks prior).