Why analysis of GitHub’s open-source software developer teams will continue to grow

1) GitHub is *the thing*. It has a modern UI that follows current trends. It is easy to use and has only one version control mechanism, which is of course Git. It has its own culture and fans (e.g. Octocat, gadgets, stickers). Although it is sometimes blocked (e.g. in China) and has short outages, it is highly reliable and refreshes data on its web pages immediately after a single change made from the Git client / protocol (yes, Git is also a protocol).

2) GitHub has the largest number of users and projects, more than SourceForge.

3) GitHub doesn’t have advertisements on its website, and it will never have them, as there is no need for them. While SourceForge is currently packed with wide blocks of various advertisements (probably to keep its funding going), the GitHub website is clean and feature-oriented.

4) Probably most important: a rich API for developers and researchers. It made it possible to create solutions like GHTorrent (http://ghtorrent.org/). It allowed Google BigQuery to use GitHub timeline data. It’s possible to create your own local instance of a MongoDB or MySQL database holding all events from the GitHub timeline. Thanks to fast and secure OAuth for web apps, applications like the Open Source Report Card (https://osrc.dfm.io/) could be created.

5) Trend analysis on Google Scholar supports my point. The number of papers involving GitHub is increasing, while the number of articles on SourceForge is decreasing. A small number of people in the world do high-quality FLOSS* research using only GitHub data, and their work is cited quite often, even though the research is new (papers from 2014 and 2015).

Source: self-made in Feb 2015

6) There are many external apps that support continuous integration and management of OSS teams. An example of an automatic build system is drone.io. There is academic research on possible task-assignment strategies in OSS teams, as well as on creating central planners for work distribution. Most importantly, there are papers on possible quality models for FLOSS teams and results from analyzing teams on GitHub.

7) GitHub employees are present at many important conferences on FLOSS and/or web technologies. Ivan Žužak will be a speaker at one of the workshops at the 11th Intl. Conf. on Open Source Systems (Florence, 2015). They are very keen on making an impact and helping the open-source community.

There are 257 people from all over the globe working at GitHub. Meet them here.

8) There is a high-quality manual for mining GitHub, as well as know-how on avoiding perils in OSS analysis (e.g. forks vs. parent repositories, push model vs. fork-push). Check out:

Eirini Kalliamvakou et al., “The promises and perils of mining GitHub” – http://dl.acm.org/citation.cfm?id=2597074
“Analyzing Millions of GitHub Commits – Ilya Grigorik” – https://www.igvita.com/slides/2012/bigquery-github-strata.pdf

You can read more in my other post from the previous year (https://oskarj.wordpress.com/2014/05/07/playing-with-github-data-and-researching-oss-current-state-of-art/). Feel free to leave a comment below.

*FLOSS – Free/Libre Open Source Software


Playing with GitHub data and researching OSS – current state of art

Report on the current state of the art in researching open source software and teams on GitHub

The main idea of this article is to present the people who research teams and OSS created on GitHub, and the software that uses GitHub-driven data to reveal less visible characteristics of code repositories hosted on GitHub.

– 4th March 2014

  1. Software implementations, tools and practical establishments

“The Open Source Report Card” is a portal available at http://osrc.dfm.io/. It is also an open source project developed on GitHub and licensed under the MIT License. At the top of the page, the authors warn recruiters that GitHub is not a C.V. and that these stats only provide a biased and one-sided view. The website is also powered by a FusionAds template (http://fusionads.net/). In the centre it has a text box where you can enter a valid GitHub user login, which leads to an individual report card. Users are analysed in the following categories: languages, schedule, organization membership, and recent activity. The nickname of the most similar user is shown, along with a list of the 5 users most similar in activity.

GitHub visualizer is a website available at http://ghv.artzub.com/. It lets you create a visualization of the work done in a chosen GitHub repository.

Coderstats.net is another example of a report card; the portal is available at http://coderstats.net. It shows a short summary of repositories and languages used. It gives no information about hours of activity or similarity to other GitHub users.

Ohloh.net is a huge project concerning open source. The portal is available at http://www.ohloh.net/. They say they index 663,168 open source projects. The website lets you view people, projects, and organizations through rankings or a search engine. They collect data not only from GitHub but also from other repository providers. The portal encourages users to create “FLOSS resumes”. It basically works by “claiming” a contribution, which links a person to his or her project.

Octoboard (www.octoboard.com) is a GitHub activity dashboard. Octoboard is based on GitHub Archive: each day, it scans new GitHub event archives and computes a few stats, with a 15-day history. You can see some general data on its main page, or use the menu for more information about languages and history. Octoboard is an open source project built for the GitHub Data Challenge by Denis Roussel.

  2. Persons, companies, and research teams

In the beginning, there was a list of “Most active GitHub users” created by Paul Miller, available at https://gist.github.com/paulmillr/2657075. For some time it was updated regularly, but it recently received negative comments, and there is now a better user list that has received positive recognition, available at https://brainjar.org/ranking. The creator of this list is Mikołaj Pawlikowski (https://github.com/seeker89).

Donnie Berkholz is an analyst at RedMonk and their resident postdoc. Recently he published an article on his company blog: “GitHub language trends and the fragmenting landscape”. He shows programming language trends on GitHub by activity and, separately, by issues. He also analyses new users and new repos. He has also written about GitHub’s popularity in two posts: “GitHub will hit 5 million users within a year” and “BAM! GitHub prediction nailed: 4M users in August, 5M in December”.

  3. GitHub employees and their work

The GitHub Developer Program was announced on 6th March 2014. They state: “By joining the Developer Program, you’ll receive ongoing notifications about changes to our API. You’ll be eligible to receive early access on select feature releases, and can request a development license for GitHub Enterprise. You can also submit your work for consideration on the integrations page.” I believe these are the main advantages of joining the program: official recognition of our work, access to the newest API before public release, and the possibility of a free plan for private repositories. Team members get the “developer” badge on their profile page. Moreover, it gives a licence to use the GitHub and Octocat logos legally. An example of a developer badge owner is Rafał Chmiel (https://github.com/rafalchmiel) (update 02.05.14: it seems he resigned from the program, as the badge is no longer visible). He created github-cheat-sheet (a witty list of lesser-known and useful tricks for GitHub).

On this blog, http://johnnunemaker.com/analytics-at-github/, a team of programmers at GitHub explains how they created a statistics engine (for web traffic).

A duo of Brian Doll (GitHub) and Ilya Grigorik (Google) presented the topic “Analyzing Millions of GitHub Commits – what makes developers happy, angry, and everything in between?”

  4. Data mining, querying, and more

GitHub HTML resources (e.g. using Python’s “Beautiful Soup” package or Selenium). All data is visible in a browser, but some data is rendered on demand. Many elements are loaded through AJAX, so it is not always easy to get rich HTML data, and one needs to simulate browser behaviour. Still, we use this mechanism, for example, to create a dataset of dialogs (software available at https://github.com/wikiteams/linda-nlp).
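To give a rough idea of what HTML scraping involves, here is a minimal sketch that pulls links out of a static HTML snippet. It uses Python’s standard-library html.parser as a stand-in for Beautiful Soup or Selenium, so it runs without extra packages. The class LinkCollector, the function extract_links, and the sample markup are all made up for illustration; real GitHub pages look different, and content loaded through AJAX will not be present in the raw HTML at all.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup; it does not reflect the structure of any real GitHub page.
SAMPLE_HTML = """
<ul>
  <li><a href="/example-user/repo-one">repo-one</a></li>
  <li><a href="/example-user/repo-two">repo <b>two</b></a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

For pages that build their content in the browser, a parser like this only sees the initial markup, which is why a browser-driving tool such as Selenium is sometimes needed.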

GitHub API

Self-explanatory. A programming interface for asking GitHub for details. Limited by quotas (https://developer.github.com/v3/rate_limit/). It has worked well in our projects and is reliable. Quota is no longer a problem: we switch between accounts during script execution.
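The quota bookkeeping can be sketched roughly as follows. This is not our actual script, just a minimal illustration under some assumptions: it relies on the X-RateLimit-Remaining and X-RateLimit-Reset response headers that the GitHub API returns, and the names Quota, quota_from_headers, and pick_credential are hypothetical. No network calls are made; the caller passes in the headers it last saw for each credential.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Rate-limit state for one credential, as last reported by the API."""
    remaining: int  # requests left in the current window
    reset_at: int   # Unix time (seconds) at which the window resets


def quota_from_headers(headers):
    """Build a Quota from the X-RateLimit-* headers of an API response."""
    return Quota(
        remaining=int(headers["X-RateLimit-Remaining"]),
        reset_at=int(headers["X-RateLimit-Reset"]),
    )


def seconds_to_wait(quota, now):
    """Return 0 if the credential can be used now, else seconds until reset."""
    if quota.remaining > 0 or quota.reset_at <= now:
        return 0
    return quota.reset_at - now


def pick_credential(quotas, now):
    """Pick the credential with the shortest wait (ties: most requests left).

    `quotas` maps a credential name (hypothetical labels) to its Quota.
    Returns a (name, wait_seconds) tuple.
    """
    name, quota = min(
        quotas.items(),
        key=lambda item: (seconds_to_wait(item[1], now), -item[1].remaining),
    )
    return name, seconds_to_wait(quota, now)


if __name__ == "__main__":
    import time

    quotas = {
        "account-a": Quota(remaining=0, reset_at=int(time.time()) + 600),
        "account-b": Quota(remaining=120, reset_at=int(time.time()) + 300),
    }
    print(pick_credential(quotas, now=int(time.time())))
```

A script would call quota_from_headers after every response, update the entry for the credential it just used, and sleep for the returned number of seconds when every credential is exhausted.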

GitHub Archive

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. We downloaded the whole GitHub Archive and loaded the JSON documents into a MongoDB database on our servers.
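As an illustration of the first step of such a pipeline, the sketch below reads a gzip-compressed archive and counts events by type. It assumes each archive holds one JSON event per line and that events carry a "type" field; the function names and the sample events are made up, and the archive is built in memory so the example runs without downloading anything. Loading into MongoDB is left out; each parsed dictionary would simply become one document.

```python
import gzip
import io
import json
from collections import Counter


def iter_events(fileobj):
    """Yield one event dict per non-empty line of a gzip-compressed stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_event_types(events):
    """Count events by their "type" field ("unknown" when it is missing)."""
    return Counter(event.get("type", "unknown") for event in events)


def make_sample_archive():
    """Build a tiny in-memory archive with made-up events for the demo."""
    events = [
        {"id": 1, "type": "PushEvent"},
        {"id": 2, "type": "WatchEvent"},
        {"id": 3, "type": "PushEvent"},
    ]
    raw = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    return io.BytesIO(gzip.compress(raw))


if __name__ == "__main__":
    counts = count_event_types(iter_events(make_sample_archive()))
    for event_type, n in counts.most_common():
        print(event_type, n)
```

Because iter_events is a generator, a large archive is processed one line at a time instead of being loaded into memory as a whole.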

GHTorrent project

GHTorrent monitors the GitHub public event timeline. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database. They offer downloadable archives with database dumps, as well as online query tools for both MySQL and MongoDB. The project is documented with a database relationship schema. Still, all the data is based on events, so don’t expect to find everything that is available directly from GitHub.

GitHub BigQuery

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce. Google BigQuery added the GitHub timeline (data on all events; be sure to check the explanation of GitHub event types here: https://developer.github.com/v3/activity/events/types/). Mining a lot of data, especially with sophisticated queries, is expensive and requires setting up a paid Google account. BigQuery (BQ) is reportedly based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data. To use your own data in BigQuery, it must first be uploaded to Google Storage and then, in a second step, imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

GitHub explore page: available at https://github.com/explore, it is a list of trending repositories.

  5. 3rd party tools

Object-oriented GitHub API for Java – http://github.jcabi.com/. Version 0.7.5 as of 1st May 2014. They say that although “there are a few other Java adapters of Github API, our implementation has its advantages, including: all classes are private and implement public interfaces, out-of-the-box in-memory mock of Github server, all classes are truly immutable and thread-safe, every Github object gives GET/PATCH access to its raw JSON, HTTP request is accessible for modifications, and finally: entire Github API is available, at least through a configurable HTTP request”.

Programming language statistics – http://langpop.com/ – is a portal where language popularity data is aggregated from many sources, including GitHub, the Google search engine, Google Code, etc. A very informative and interesting portal.

Coderwall (https://coderwall.com) lets a developer build a profile page, yet it does not aggregate data to create the profile, so it is a limited but interesting data source.

  6. Past and upcoming conferences

36th International Conference on Software Engineering, Hyderabad
OpenSym 2014 in Berlin, FLOSS track

  7. Publications

Here I list papers I found on open-source software research. My only requirement was that the article must mention GitHub in its text and keywords. Later I will make a more precise list that filters out research not using GitHub-driven data (i.e. SourceForge is still a basis for research, but this is changing).


Investigating the geography of open source software through GitHub

Visualizing collaboration and influence in the open-source software community


Social coding in GitHub: transparency and collaboration in an open software repository
Capacitated team formation problem on social networks
Biological Mutualistic Models Applied to Study Open Source Software Development
Social media and success in open source projects
Towards content-driven reputation for collaborative code repositories
GHTorrent: Github’s data from a firehose


GitHub developers use rockstars to overcome overflow of news
Impression formation in online peer production: activity traces and personal profiles in github
Mining GitHub: Why Commit Stops — Exploring the Relationship between Developer’s Commit Pattern and File Version Evolution
Network Structure of Social Coding in GitHub
Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder
StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge
A Study of the Characteristics of Developers’ Activities in GitHub
Discovery of technical expertise from open source code repositories
Impact of social features implemented in open collaboration platforms on volunteer self-organization: case study of open source software development
Popularity, Interoperability, and Impact of Programming Languages in 100,000 Open Source Projects
Creating a shared understanding of testing culture on a social coding site
The GHTorrent dataset and tool suite
Herding in open source software development: an exploratory study
Boa: a language and infrastructure for analyzing ultra-large-scale software repositories
Analyzing the social ties and structure of contributors in open source software community
Bringing ultra-large-scale software repository mining to the masses with boa
Mining Developer Contribution in Open Source Software Using Visualization Techniques
Performance and participation in open source software on GitHub
The GitHub Open Source Development Process


Exploring the ecosystem of software developers on GitHub and other platforms
Population dynamics in open source communities: an ecological approach applied to github
Forge++: The Changing Landscape of FLOSS Development
Software developers are humans, too!
Sourcerer: An infrastructure for large-scale collection and analysis of open-source code
The code-centric collaboration perspective: Evidence from github