Applying petabyte-scale analytics and machine learning to billions of news reading sessions

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over two million sessions per minute across thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.


Talk Title	Applying petabyte-scale analytics and machine learning to billions of news reading sessions
Speakers	Andrew Montalenti (Parse.ly)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018
URL	Talk Page
Slides	Talk Slides

Parse.ly runs a real-time web and content analytics platform that serves 350+ enterprise clients, 30,000+ site operators, and thousands of high-traffic sites. Clients use the platform to understand audience, content, and attention at a granular level, but the aggregate data exhaust from these integrations provides a front-row seat to what the internet is looking at today. Andrew Montalenti explains how consumer attention in the web era really works (e.g., to what degree Facebook and Google dominate consumer web attention versus more niche platforms).

Andrew also showcases how Parse.ly recently applied modern natural language processing and machine learning techniques to better understand its evolving dataset of more than a million unique pieces of content per day, including how the company classified all web pages into a structured content taxonomy and automatically extracted relevant topics and entities.

Alongside these network data findings on news trends, social networks, search engines, and device usage patterns, Andrew digs into the technology running under the hood, particularly multicloud setups (in the hundreds) with Elasticsearch, Cassandra, Kafka, Storm, and Spark, and discusses open source projects the company has built and released, such as PyKafka and streamparse. He also talks about Parse.ly’s recent adoption of serverless cloud tooling, which makes machine learning easier.

Andrew concludes by explaining how Parse.ly’s web-wide trend data has been used so far, such as for content strategy inside major newsrooms and for predicting offline consumer behavior (e.g., which movies would win at the box office based on the web attention those movies received in the weeks prior).

The SMACK stack on Mesosphere DC/OS using cloud infrastructure

Pangeo: Big data climate science in the cloud

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

Panel: Open Networking Driving Data Center and Cloud Innovation

Distributed TensorFlow on Hops