The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit
Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.
Talk Title | The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit |
Speakers | Marcel Ruiz Forns (Wikimedia Foundation) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | April 30-May 2, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify the wiki in anonymity, without fear of giving away personal information or editorship or browsing history. The Wikimedia Foundation (WMF), the nonprofit behind Wikipedia’s software and infrastructure, has strict privacy and data retention policies that were developed with the Wikimedia community at large. Practically all data containing personal identifiers or user activity must be deleted or anonymized 90 days, at most, after its collection. However, the organization and the community are eager to use big data to better understand the ecosystem and improve it. As of this writing, developer teams are sending more than 2,000 custom events per second to the analytics pipeline and constantly feeding 200+ datasets. That is in addition to the 10 billion (US) web request logs that are ingested daily into the Hadoop cluster and are used to populate several important tools, like WMF’s analytics API. The long-term existence of this data is key to the foundation’s analysts and researchers. Is it possible to retain value from these datasets when they are controlled by such strict privacy policies? How can you ensure that new datasets follow the policies without bureaucratic bottlenecks? What advantages does sanitizing data have beyond compliance? Marcel Ruiz Forns covers the challenge of maintaining the value of data while significantly reducing the risk of user identification and privacy loss, and how WMF’s analytics team approaches it. Discover whether your company would benefit from the vegan data diet.