Why analysis of GitHub’s open-source software developer teams will continue to grow

1) GitHub is *the thing*. It has a modern UI that follows current trends. It’s easy to use and has only one version-control mechanism, which is of course Git. It has its own culture and fans (e.g. Octocat, gadgets, stickers, etc.). Although it is sometimes blocked (e.g. in China) and has brief outages, it is highly reliable and refreshes data on its web pages immediately after a single change made from the Git client / protocol (yes, Git is also a protocol).

2) GitHub has the largest number of users and projects, more than SourceForge.

3) GitHub doesn’t have advertisements on its website, and it likely never will, as there is no need for them. While SourceForge is currently packed with wide blocks of advertisements (probably to keep itself funded), the GitHub website is clean and feature-oriented.

4) Probably most important: a rich API for developers and researchers. It made solutions like GHTorrent (http://ghtorrent.org/) possible. It allowed Google BigQuery to use GitHub timeline data. It’s possible to create your own local instance of a MongoDB or MySQL database holding all events from the GitHub timeline. Thanks to fast and secure OAuth for web apps, applications like Open Source Report Card (https://osrc.dfm.io/) could be created.
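As a rough illustration of what a local copy of timeline events makes possible, here is a minimal Python sketch that tallies events by type and by repository. The sample records are hypothetical; they only mimic the `type`, `actor.login` and `repo.name` fields found in GitHub event records, and the function name `summarize_events` is my own.

```python
import json
from collections import Counter

# Hypothetical sample shaped like GitHub timeline events (not real data).
SAMPLE_EVENTS = """
[
  {"type": "PushEvent",  "actor": {"login": "alice"}, "repo": {"name": "alice/project"}},
  {"type": "ForkEvent",  "actor": {"login": "bob"},   "repo": {"name": "alice/project"}},
  {"type": "PushEvent",  "actor": {"login": "carol"}, "repo": {"name": "carol/tool"}},
  {"type": "WatchEvent", "actor": {"login": "bob"},   "repo": {"name": "carol/tool"}}
]
"""


def summarize_events(events):
    """Count events per event type and per repository."""
    by_type = Counter(event["type"] for event in events)
    by_repo = Counter(event["repo"]["name"] for event in events)
    return by_type, by_repo


events = json.loads(SAMPLE_EVENTS)
by_type, by_repo = summarize_events(events)
print(by_type.most_common())
print(by_repo.most_common())
```

The same counting works on records read from a MongoDB or MySQL dump, as long as each record is first turned into a dictionary with these fields.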

5) Trend analysis on Google Scholar supports my point. The number of papers involving GitHub is increasing, while the number of articles on SourceForge is decreasing. A small number of people in the world do high-quality FLOSS* research using only GitHub data, and their work is cited quite often, even though it is new research (papers from 2014 and 2015).

Source: self-made in Feb 2015

6) There are many external apps that support continuous integration and management of OSS teams. One example of an automatic build system is drone.io. There is academic research on possible task-assignment strategies in OSS teams, as well as on creating central planners for work distribution. Most important, there are papers on possible quality models for FLOSS teams and on results from analyzing teams on GitHub.

7) GitHub employees are present at many important conferences on FLOSS and / or web technologies. Ivan Žužak will be a speaker at one of the workshops at the 11th Intl. Conf. on Open Source Systems (Florence, 2015). They are very keen on making an impact and helping the open-source community.

There are 257 people from all over the globe working at GitHub. Meet them here.

8) There is a high-quality guide to mining GitHub, as well as know-how on avoiding perils in OSS analysis (e.g. forks vs. mother repositories, push model vs. fork-push). Check out:

Eirini Kalliamvakou et al., “The promises and perils of mining GitHub” – http://dl.acm.org/citation.cfm?id=2597074
“Analyzing Millions of GitHub Commits – Ilya Grigorik” – https://www.igvita.com/slides/2012/bigquery-github-strata.pdf

You can read more in my other post from the previous year (https://oskarj.wordpress.com/2014/05/07/playing-with-github-data-and-researching-oss-current-state-of-art/). Feel free to leave a comment below.

*FLOSS – Free/Libre Open Source Software


A very good article on the Singleton in Java, one of the Gang of Four design patterns

The author gives a comprehensive review of existing ways to implement a singleton: lazy loading, enums, double-checked locking, and the volatile keyword. The 225155 page views between 13.02.2013 and 13.10.2014 speak for themselves. You can visit it here.

A quick glance at the introduction:

Singleton is a part of Gang of Four design pattern and it is categorized under creational design patterns.
In this article we are going to take a deeper look into the usage of the Singleton pattern. It is one of the most simple design pattern in terms of the modelling but on the other hand this is one of the most controversial pattern in terms of complexity of usage.
In Java the Singleton pattern will ensure that there is only one instance of a class is created in the Java Virtual Machine. It is used to provide global point of access to the object. In terms of practical use Singleton patterns are used in logging, caches, thread pools, configuration settings, device driver objects. Design pattern is often used in conjunction with Factory design pattern. This pattern is also used in Service Locator JEE pattern.
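The article itself targets Java. As a language-neutral illustration of two of the techniques it reviews (lazy loading and double-checked locking), here is a minimal Python sketch. The class name `Config` and its `settings` attribute are hypothetical, and this is not the article's code; Java's `volatile` has no direct counterpart here.

```python
import threading


class Config:
    """Hypothetical lazily created singleton (illustration only)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.settings = {}

    @classmethod
    def get_instance(cls):
        # First check without the lock: cheap path once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Second check under the lock: another thread may have
                # created the instance while this one was waiting.
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance


first = Config.get_instance()
second = Config.get_instance()
print(first is second)  # prints True
```

Note that this sketch does not stop callers from invoking `Config()` directly; it only guarantees that `get_instance()` always returns the same object.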

Singleton Design Pattern – An Introspection w/ Best Practices

Choosing the proper machine learning method

The scikit-learn site published a cheat-sheet map for choosing the right estimator for a particular job. On the edges of the map are the most common jobs: clustering, classification, regression and dimensionality reduction. From the start point, the graph asks a couple of questions about the problem you want to solve. First, it suggests getting more data if there are fewer than 50 observations 🙂 For a classification problem, the suggested techniques are: Linear SVC, SGD Classifier or kernel approximation (for large datasets), Naive Bayes, KNeighbors Classifier, and SVC or ensemble classifiers.
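The question-asking idea can be sketched as a small decision function. This is only my reading of the classification branch of the map: the 100,000-sample threshold, the text / non-text split and the order of the fallbacks are assumptions that are not stated above, and `suggest_classifier` is a hypothetical name, not part of scikit-learn.

```python
LARGE_DATASET = 100_000  # assumed threshold for "large datasets"


def suggest_classifier(n_samples, is_text=False, failed_attempts=0):
    """Suggest a technique for a classification problem.

    failed_attempts counts how many earlier suggestions did not work.
    """
    if n_samples < 50:
        return "get more data"
    if n_samples >= LARGE_DATASET:
        if failed_attempts == 0:
            return "SGD Classifier"
        return "kernel approximation"
    if failed_attempts == 0:
        return "Linear SVC"
    if is_text:
        return "Naive Bayes"
    if failed_attempts == 1:
        return "KNeighbors Classifier"
    return "SVC or ensemble classifiers"


print(suggest_classifier(30))
print(suggest_classifier(5_000))
print(suggest_classifier(5_000, is_text=True, failed_attempts=1))
print(suggest_classifier(2_000_000))
```

Each call answers one path through the map; a website like the one wished for below would ask these questions interactively instead of taking them as arguments.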


It would be great if somebody extended this map to more problems and frameworks (not only scikit-learn) and built a website that quickly suggests a robust method by asking questions.

Also, check out a very similar (but much larger!) map on the dlib C++ machine learning library page: http://dlib.net/ml_guide.svg

Source: Jassim Moideen on “Big Data and Analytics” LinkedIn group

7 Bad Naming Practices That Can Hurt a Domain’s Quality and Effectiveness

Evaluating domains is never cut-and-dried. Even with the most important criteria at hand, you can end up with a domain that is somehow still bad. It’s part of why being in the domain industry is challenging.

Some of these problems arise from circumstances that make a good domain not nearly as good as it would otherwise be. Others come from outside-the-box thinking that leads to practices that are rare precisely because they’re bad.

In any event, you should generally try to avoid the practices described in the article (link) below:



Translate.com is a website that offers text translations along with “better” community-enriched ones. The website accepts both text and audio input. Users can propose modifications to translations. It works through many-to-many contributions, but at the receiving end there is always a particular user with a sentence he or she needs help with. Translate.com offers a mobile application for Android and iOS. The mobile app has an OCR module, so it is possible to translate text embedded in pictures. However, the translate.com mobile app doesn’t offer real translations on demand, only machine-generated translation.

Visualizing my university Facebook fanpage – PJWSTK

For a year or more now, Netvizz has anonymized users’ first names and surnames in its output. While Netvizz is of course a tool of choice and works quite well, here we will use NodeXL to create a social network for further analysis.

NodeXL also lets you use additional plugins. If you look at the main NodeXL website (http://nodexl.codeplex.com/), there should be a link called “search available plugins”. We are interested in the “Social Network Importer for NodeXL”. There may be some problems with installing it (at least I encountered them on Win 8.1 and Office 2013; more info here: http://socialnetimporter.codeplex.com/discussions/543672).

Both NodeXL and Netvizz output a GraphML file, an XML-based file format for graphs. It can hold structures including directed, undirected and mixed graphs, hypergraphs, and application-specific attributes. It’s possible to attach a date to represent a dynamic graph (this requires data transformation, because Gephi won’t understand time frames bundled into attributes; you can always consider exporting to the GEXF format).
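To give a feel for the format, here is a minimal Python sketch that reads a tiny GraphML document with the standard library. The sample graph, the `label` attribute key and the helper name `read_graphml` are hypothetical; real exports from NodeXL carry many more attributes.

```python
import xml.etree.ElementTree as ET

# Namespace used by GraphML documents.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical sample graph (not taken from the fanpage data).
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="label" for="node" attr.name="label" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"><data key="label">fanpage</data></node>
    <node id="n1"><data key="label">user A</data></node>
    <node id="n2"><data key="label">user B</data></node>
    <edge source="n1" target="n0"/>
    <edge source="n2" target="n0"/>
    <edge source="n1" target="n2"/>
  </graph>
</graphml>"""


def read_graphml(text):
    """Return (directed?, {node id: label}, [(source, target), ...])."""
    graph = ET.fromstring(text).find("g:graph", NS)
    directed = graph.get("edgedefault") == "directed"
    nodes = {}
    for node in graph.findall("g:node", NS):
        label = None
        for data in node.findall("g:data", NS):
            if data.get("key") == "label":
                label = data.text
        nodes[node.get("id")] = label
    edges = [(e.get("source"), e.get("target"))
             for e in graph.findall("g:edge", NS)]
    return directed, nodes, edges


directed, nodes, edges = read_graphml(SAMPLE)
print(directed, len(nodes), len(edges))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(text)`.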

I’m not entirely satisfied with Gephi because it doesn’t work well with huge networks (even after increasing the memory available to the Java Virtual Machine; yes, Gephi runs on Java).

I uploaded the Gephi files and others that I used for analysing the PJWSTK fanpage (https://www.facebook.com/pjwstk) to a GitHub repository (https://github.com/oskar-j/pjwstk-fanpage-sna), which I plan to update from time to time with more sophisticated, more beautiful networks and their analysis 🙂

Layout with algorithm Force Atlas + YifanHu
Zoom on central interesting part
OpenOrd layout
Zoom on central part
3 times OpenOrd, last with expansion 30%
Zoom in
Labels of one of the communities

I also calculated some simple statistics.

In-degree distribution
Out-degree distribution
Degree distribution
Closeness Centrality Distribution
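The charts above come from Gephi, but the degree distributions are easy to reproduce by hand. Below is a minimal Python sketch, assuming the network is available as a list of directed (source, target) pairs; the sample nodes and edges are hypothetical, and `degree_distributions` is my own helper name.

```python
from collections import Counter


def degree_distributions(nodes, edges):
    """Map each degree value to the number of nodes having it.

    Returns three dicts: in-degree, out-degree and total degree.
    """
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    total = {n: in_deg[n] + out_deg[n] for n in nodes}
    return (dict(Counter(in_deg.values())),
            dict(Counter(out_deg.values())),
            dict(Counter(total.values())))


# Hypothetical directed network.
sample_nodes = ["a", "b", "c", "d"]
sample_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

in_dist, out_dist, deg_dist = degree_distributions(sample_nodes, sample_edges)
print("in-degree:", in_dist)
print("out-degree:", out_dist)
print("degree:", deg_dist)
```

Closeness centrality needs shortest-path lengths between nodes, so it is left out of this sketch.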

I need to contact my senpai because I am not sure about export plugins that can produce great HTML + JavaScript websites. I once saw a plugin that generates a page with a nice toolbar, but I don’t remember the name. The “Seadragon Web” plugin works so-so; it only allows zooming on still images. Maybe you know a good plugin for exporting from Gephi to web pages?

What is open collaboration?

Open collaboration is collaboration that is:

  • egalitarian (everyone can join, no principled or artificial barriers to participation exist),
  • meritocratic (decisions and status are merit-based rather than imposed), and
  • self-organizing (processes adapt to people rather than people adapt to pre-defined processes).

The main places to find open collaboration are:

  • wikis, such as Wikipedia and other Wikimedia Foundation projects;
  • open source, open data and open government initiatives, open innovation, citizen engineering, peer production, and so on.

Source: http://www.opensym.org/2012/09/28/definition-of-open-collaboration/