January 4, 2020

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data for more than 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.

Talk Title: The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit
Speakers: Marcel Ruiz Forns (Wikimedia Foundation)
Conference: Strata Data Conference
Conf Tag: Making Data Work
Location: London, United Kingdom
Date: April 30-May 2, 2019

Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify the wiki anonymously, without fear of giving away personal information or revealing their editing or browsing history. The Wikimedia Foundation (WMF), the nonprofit behind Wikipedia’s software and infrastructure, has strict privacy and data retention policies that were developed with the Wikimedia community at large. Practically all data containing personal identifiers or user activity must be deleted or anonymized within 90 days of its collection.

However, the organization and the community are eager to use big data to better understand the ecosystem and improve it. As of this writing, developer teams are sending more than 2,000 custom events per second to the analytics pipeline and constantly feeding 200+ datasets. That is in addition to the 10 billion web request logs that are ingested daily into the Hadoop cluster and used to populate several important tools, like WMF’s analytics API. The long-term existence of this data is key to the foundation’s analysts and researchers.

Is it possible to retain value from these datasets when they are governed by such strict privacy policies? How can you ensure that new datasets follow the policies without bureaucratic bottlenecks? What advantages does sanitizing data have beyond compliance?

Marcel Ruiz Forns covers the challenge of maintaining the value of data while significantly reducing the risk of user identification and privacy loss, and how WMF’s analytics team approaches it. Discover whether your company would benefit from the vegan data diet.
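
To make the 90-day rule concrete, here is a minimal Python sketch of one way such a retention pass could work. It is not WMF's actual pipeline, which the abstract does not describe. The event fields, the `KEEP_FIELDS` allowlist, and the `sanitize` helper are all hypothetical names chosen for illustration; only the 90-day limit comes from the text.

```python
from datetime import datetime, timedelta, timezone

# Retention limit stated in the abstract.
RETENTION = timedelta(days=90)

# Hypothetical allowlist: fields assumed to be safe to keep long term.
# Anything not listed here is dropped once an event is too old.
KEEP_FIELDS = {"event_type", "wiki", "timestamp"}


def sanitize(event: dict, now: datetime) -> dict:
    """Return the event unchanged while it is within the retention window;
    otherwise return a copy that contains only allowlisted fields."""
    age = now - event["timestamp"]
    if age <= RETENTION:
        return event
    return {key: value for key, value in event.items() if key in KEEP_FIELDS}


if __name__ == "__main__":
    now = datetime(2019, 5, 1, tzinfo=timezone.utc)
    old_event = {
        "event_type": "edit",
        "wiki": "enwiki",
        "timestamp": now - timedelta(days=91),
        "ip": "203.0.113.7",  # documentation-range address, not real data
        "user_agent": "ExampleBrowser/1.0",
    }
    print(sorted(sanitize(old_event, now)))
```

An allowlist is used in this sketch rather than a blocklist because it fails closed: a field added to a dataset later is dropped after 90 days unless someone explicitly marks it as safe. Whether that matches the approach presented in the talk is an assumption.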
