Playing with GitHub data and researching OSS – current state of the art

Report on the current state of the art in researching open source software and teams on GitHub

This article presents the people who research teams and open source software (OSS) created on GitHub, and the software that uses GitHub-driven data to reveal less obvious characteristics of the code repositories hosted there.

– 4th March 2014

  1. Software implementations, tools and practical establishments

“The Open Source Report Card” is a web portal and also an open source project, developed on GitHub and licensed under the MIT License. At the top of the page the authors display a warning: “Dear recruiters, GitHub is not your C.V. and that these stats only provide a biased and one-sided view.” The website is also powered by a FusionAds template. In the centre there is a text box in which you can enter a valid GitHub user login, which leads to an individual report card. A user is analysed in the following categories: languages, schedule, organization membership, and recent activity. The login of the most similar user is shown, together with a list of the 5 users most similar in activity.

GitHub visualizer is a website that creates a visualization of the work done in a chosen GitHub repository.

Another example of a report card is a portal that shows a short summary of repositories and languages used. It gives no information about hours of activity or similarity to other GitHub users.

Another portal is a huge project concerning open source. Its authors say they index 663,168 open source projects. The website offers views of people, projects, and organizations through rankings or a search engine. It collects data not only from GitHub but also from other repository providers. The portal encourages users to create “FLOSS resumes”. It basically works by “claiming” a contribution, which ties the right person to his or her project.

Octoboard is a GitHub activity dashboard. It is based on GitHub Archive: each day, it scans the new GitHub event archives and computes a few statistics, with a 15-day history. You can see some general data on the main page, or use the menu for more information about languages and history. Octoboard is an open source project built for the GitHub Data Challenge by Denis Roussel.

  1. Persons, companies, and research teams

In the beginning, there was a list of “Most active GitHub users” created by Paul Miller. For some time it was updated regularly, but the list recently received negative comments, and there is now a better user list, created by Mikołaj Pawlikowski, which has been positively received.

Donnie Berkholz is an analyst at RedMonk and their resident postdoc. He recently published an article on his company blog, “GitHub language trends and the fragmenting landscape”. It shows programming language trends on GitHub in terms of activity and, separately, issues. He also analyses new users and new repositories. He has also written about GitHub’s popularity in two posts: “GitHub will hit 5 million users within a year” and “BAM! GitHub prediction nailed: 4M users in August, 5M in December”.

  1. GitHub employees and their work

The GitHub Developer Program was announced on 6th March 2014. They state: “By joining the Developer Program, you’ll receive ongoing notifications about changes to our API. You’ll be eligible to receive early access on select feature releases, and can request a development license for GitHub Enterprise. You can also submit your work for consideration on the integrations page.” I believe the main advantages of joining the program are: official recognition of our work, access to the newest API before its public release, and the possibility of a free plan for private repositories. Team members get the “developer” badge on their profile page. Moreover, it gives a licence to use the GitHub and Octocat logos legally. An example of a developer badge owner is Rafał Chmiel (update 02.05.14: it seems he resigned from the program? The badge is no longer visible), who created github-cheat-sheet (a witty list of lesser-known and useful tricks for GitHub).

In a blog post, a team of programmers at GitHub explains how they built a statistics engine (for web traffic).

A duo of Brian Doll (GitHub) and Ilya Grigorik (Google) gave a talk titled “Analyzing Millions of GitHub Commits – what makes developers happy, angry, and everything in between?”

  1. Data mining, querying, and more

GitHub HTML resources (e.g. using the Python package “Beautiful Soup” or Selenium). All data is visible in a browser, but some of it is drawn on demand. Many elements are loaded through AJAX, so it is not always easy to get the full HTML data, and one needs to simulate browser behaviour. Still, we use this mechanism, for example to create a dataset of dialogs.
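To give a feel for this kind of HTML mining, here is a minimal sketch. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup or Selenium, and it parses a hand-written sample string instead of a live page. The markup, the `repo-link` class name, and the helper names `LinkCollector` and `extract_links` are all assumptions made for illustration; real GitHub pages are structured differently and change over time.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute values may be None (e.g. <a class>), hence the "or".
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes and attributes.get("href"):
            self.links.append(attributes["href"])


def extract_links(html, css_class):
    """Return all hrefs of <a> tags carrying css_class, in document order."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links


# Hypothetical markup, NOT copied from a real GitHub page.
SAMPLE_HTML = """
<ul>
  <li><a class="repo-link" href="/octocat/Hello-World">Hello-World</a></li>
  <li><a class="repo-link js-nav" href="/octocat/Spoon-Knife">Spoon-Knife</a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML, "repo-link"))
```

Content that is only filled in by AJAX after the page loads will not appear in the raw HTML, which is why a browser-driving tool such as Selenium is needed for those parts.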

GitHub API

Self-explanatory: a programming interface for asking GitHub for details. It is limited by quotas. We use it extensively in our projects and find it reliable. The quota is no longer a problem for us, since we switch between accounts during script execution.
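As an illustration of working within the quota, the sketch below shows a different and more conservative approach than the account switching mentioned above: it reads the rate-limit headers of a response and computes how long a script should pause. The header names `X-RateLimit-Remaining` and `X-RateLimit-Reset` are the ones GitHub sends with API responses; the function name and the plain-dict input are assumptions made for this example.

```python
import time


def seconds_until_allowed(headers, now=None):
    """Return how many seconds to wait before the next API request.

    `headers` is a plain dict of response headers (keys are matched
    case-sensitively here; real HTTP clients usually offer
    case-insensitive lookups). `now` is a Unix timestamp and defaults
    to the current time; it is a parameter so the function is testable.
    """
    now = time.time() if now is None else now
    # Missing headers are treated as "not limited" in this sketch.
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the reset time (never a negative wait).
    return max(0.0, float(reset_at - now))


if __name__ == "__main__":
    sample = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
    print(seconds_until_allowed(sample, now=940))
```

A script would call `time.sleep()` with the returned value before issuing its next request.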

GitHub Archive

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. We downloaded the whole GitHub Archive and loaded the JSON documents into a MongoDB database on our servers.
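The first step of such an import can be sketched as follows. This is not our actual import script, only a minimal illustration, and it assumes that an archive file is gzip-compressed with one JSON event per line (older archive files may be laid out differently). The sample events and the helper names `iter_events` and `make_sample_archive` are made up for the example.

```python
import gzip
import io
import json


def iter_events(fileobj):
    """Yield one dict per event from a gzip-compressed stream of JSON lines."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def make_sample_archive():
    """Build a tiny in-memory archive (hypothetical events, for the demo only)."""
    events = [
        {"type": "PushEvent", "actor": "alice"},
        {"type": "WatchEvent", "actor": "bob"},
    ]
    payload = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    return io.BytesIO(gzip.compress(payload))


if __name__ == "__main__":
    for event in iter_events(make_sample_archive()):
        # In a real import, each dict would be handed to a MongoDB driver here.
        print(event["type"], event["actor"])
```

Because the function is a generator, a whole hourly file never has to be held in memory at once; the resulting dicts can be batched and passed to a database driver, for example pymongo's `insert_many`.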

GHTorrent project

GHTorrent monitors the GitHub public event timeline. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database. The project offers downloadable archives with database dumps, as well as online query tools for both MySQL and MongoDB. It is documented with a database relationship schema. Still, all the data is based on events, so don’t expect to find everything that is physically available from GitHub.

GitHub BigQuery

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets, working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce. Google BigQuery added the GitHub timeline (data on all events; be sure to check the explanation of what a GitHub event is). Mining a lot of data, especially with sophisticated queries, is expensive and requires setting up a paid Google account. BigQuery (BQ) is reportedly based on Dremel, a scalable, interactive ad hoc query system for the analysis of read-only nested data. To use data in BigQuery, it must first be uploaded to Google Storage and then, in a second step, imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

The GitHub Explore page is a list of trending repositories.
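Before paying for large queries, it can help to try the same kind of aggregation locally on a small sample of events. The sketch below counts push events per repository language, which is roughly what a grouping query over the timeline would compute. It runs entirely in Python on in-memory data and does not call BigQuery; the field names `type` and `repository.language`, the sample events, and the function name are assumptions made for illustration.

```python
from collections import Counter


def pushes_per_language(events):
    """Count PushEvent records per repository language, most frequent first."""
    counts = Counter()
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        # Some events carry no repository or no detected language.
        language = (event.get("repository") or {}).get("language")
        if language:
            counts[language] += 1
    return counts.most_common()


# Hypothetical sample events, shaped only loosely like timeline records.
SAMPLE_EVENTS = [
    {"type": "PushEvent", "repository": {"language": "Python"}},
    {"type": "PushEvent", "repository": {"language": "Java"}},
    {"type": "WatchEvent", "repository": {"language": "Java"}},
    {"type": "PushEvent", "repository": {"language": "Python"}},
    {"type": "PushEvent", "repository": {}},
]

if __name__ == "__main__":
    for language, pushes in pushes_per_language(SAMPLE_EVENTS):
        print(language, pushes)
```

Once the logic gives sensible results on a sample, the equivalent grouping can be expressed as a query and run over the full dataset.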

  1. 3rd party tools

Object-oriented GitHub API for Java – version 0.7.5 as of 1st May 2014. The authors say that despite the fact “there are a few other Java adapters of Github API, our implementation has its advantages, including: all classes are private and implement public interfaces, out-of-the-box in-memory mock of Github server, all classes are truly immutable and thread-safe, every Github object gives GET/PATCH access to its raw JSON, HTTP request is accessible for modifications, and finally: entire Github API is available, at least through a configurable HTTP request”.

Programming languages statistics is a portal where language popularity data is aggregated from many sources, including GitHub, the Google search engine, Google Code, etc. It is a very informative and interesting portal.

Coderwall lets a developer build a profile page, yet it does not aggregate data to create the profile, so it is a limited but interesting data source.

  1. Past and oncoming conferences

36th International Conference on Software Engineering, Hyderabad
OpenSym 2014 in Berlin, FLOSS track

  1. Publications

Here I list the papers I found on open source software research. The only requirement was that the article mentions GitHub in its text and keywords. Later I will make a more precise list that filters out research not using GitHub-driven data (i.e. SourceForge is still a basis for research, but this is changing).


Investigating the geography of open source software through GitHub

Visualizing collaboration and influence in the open-source software community


Social coding in GitHub: transparency and collaboration in an open software repository
Capacitated team formation problem on social networks
Biological Mutualistic Models Applied to Study Open Source Software Development
Social media and success in open source projects
Towards content-driven reputation for collaborative code repositories
GHTorrent: Github’s data from a firehose


GitHub developers use rockstars to overcome overflow of news
Impression formation in online peer production: activity traces and personal profiles in github
Mining GitHub: Why Commit Stops — Exploring the Relationship between Developer’s Commit Pattern and File Version Evolution
Network Structure of Social Coding in GitHub
Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder
StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge
A Study of the Characteristics of Developers’ Activities in GitHub
Discovery of technical expertise from open source code repositories
Impact of social features implemented in open collaboration platforms on volunteer self-organization: case study of open source software development
Popularity, Interoperability, and Impact of Programming Languages in 100,000 Open Source Projects
Creating a shared understanding of testing culture on a social coding site
The GHTorrent dataset and tool suite
Herding in open source software development: an exploratory study
Boa: a language and infrastructure for analyzing ultra-large-scale software repositories
Analyzing the social ties and structure of contributors in open source software community
Bringing ultra-large-scale software repository mining to the masses with boa
Mining Developer Contribution in Open Source Software Using Visualization Techniques
Performance and participation in open source software on GitHub
The GitHub Open Source Development Process


Exploring the ecosystem of software developers on GitHub and other platforms
Population dynamics in open source communities: an ecological approach applied to github
Forge++: The Changing Landscape of FLOSS Development
Software developers are humans, too!
Sourcerer: An infrastructure for large-scale collection and analysis of open-source code
The code-centric collaboration perspective: Evidence from github


Legit – Git Workflow for “Humans”

Legit – simplifying git by reducing its workflow to only a couple of commands

“Legit is a complementary command-line interface for Git, optimized for workflow simplicity. It is heavily inspired by GitHub for Mac.” As I quote these words, in March 2014, this OSS project already has 3,055 stars on the counter. I like the idea of reducing the code-revision workflow to 5 snappy commands, and I recently introduced it to my students. There is a small drawback: installing this add-on requires sudo on the machine, as git-legit is not a part of PyPI packaging.

$ git sync
# Synchronizes current branch. Auto-merge/rebase, un/stash.

$ git switch <branch>
# Switches to branch. Stashes and restores unstaged changes.

$ git publish <branch>
# Publishes branch to remote server.

$ git unpublish <branch>
# Removes branch from remote server.

$ git branches
# Nice & pretty list of branches + publication status.