November 9, 2019

Large-scale GitHub insights

GitHub hosts tens of millions of people collaborating on more than 20 million repositories, an unprecedented treasure trove of data for software engineering researchers, companies, and project teams alike. Jeff McAffer, Georgios Gousios, and Kevin Lewis explore tools and techniques for sifting through terabytes of content, present key insights they discovered, and explain how you can follow suit.

Talk Title Large-scale GitHub insights
Speakers Jeff McAffer (Microsoft), Georgios Gousios (TU Delft), Kevin Lewis (Microsoft)
Conference O’Reilly Open Source Convention
Conf Tag
Location Austin, Texas
Date May 16-19, 2016
URL Talk Page
Slides Talk Slides
Video

GitHub hosts tens of millions of people collaborating on more than 20 million repositories, an unprecedented treasure trove of data for software engineering researchers, companies, and project teams alike. Researchers are interested in developer behavior and code evolution: branching, collaboration, bug/fix rates, software quality, and distributed software development. Companies want to see how projects (theirs and others) are doing and to discover industry trends. Project teams want to understand their health, the uptake of their offerings, API usage, and more.

Jeff McAffer, Georgios Gousios, and Kevin Lewis explore tools and techniques for sifting through terabytes of content, present key insights they discovered, and explain how you can follow suit. They offer an overview of the architecture of GHTorrent/DataLake, an infrastructure for tracking the activity of all (20 million) public GitHub repos and their thousands (and thousands) of events per hour. (Using this infrastructure, GitHub has analyzed the behavior of Microsoft and other repositories.)

They also present real insights in areas ranging from contribution handling with pull requests and issues to API usage, tool adoption, and notions of project health that are applicable to researchers, developers, community members/managers, product teams, and executive sponsors. They conclude by outlining the open source stack you can use to get insights of your own.
