Choosing the proper machine learning method

The scikit-learn project has published a cheat-sheet map for choosing the right estimator for a particular job. The edges of the map hold the most common tasks: clustering, classification, regression and dimensionality reduction. From the start point, the graph asks a few questions about the problem you want to solve. First, it suggests getting more data if there are fewer than 50 observations 🙂 For a classification problem, the suggested techniques are: Linear SVC, SGD Classifier or kernel approximation (for large datasets), Naive Bayes, KNeighbors Classifier, and SVC or ensemble classifiers.
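The classification branch of the map can be written down as a few nested conditions. The following is only a rough sketch in Python: the function name suggest_classifier and its parameters are hypothetical, the thresholds (50 and 100,000 samples) are read off the published map, and the real cheat sheet has more branches than this one.

```python
def suggest_classifier(n_samples, is_text=False,
                       first_choice_failed=False,
                       second_choice_failed=False):
    """Walk a simplified version of the cheat sheet's classification branch.

    All parameter names are illustrative assumptions, not scikit-learn API.
    """
    if n_samples < 50:
        return "get more data"
    if n_samples >= 100_000:
        # Large datasets: start with SGD, fall back to kernel approximation.
        return "kernel approximation" if first_choice_failed else "SGD Classifier"
    if not first_choice_failed:
        return "Linear SVC"
    if is_text:
        return "Naive Bayes"
    if not second_choice_failed:
        return "KNeighbors Classifier"
    return "SVC or ensemble classifiers"


print(suggest_classifier(1000))
print(suggest_classifier(500_000, first_choice_failed=True))
```

The function only returns the name of a technique; trying the estimator and deciding whether it "failed" is left to the reader.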

It would be great if somebody extended this map to more problems and frameworks (not only scikit-learn) and built a website that quickly suggests a robust method by asking questions.

Also, check out a very similar (but much larger!) map on the dlib C++ machine learning library page: http://dlib.net/ml_guide.svg

Source: Jassim Moideen on “Big Data and Analytics” LinkedIn group

7 Bad Naming Practices That Can Hurt a Domain’s Quality and Effectiveness

Evaluating domains is never cut-and-dry. Even with the most important criteria at hand, you can end up with a domain that is somehow still bad. It’s part of why being in the domain industry is challenging.

Some of these are things that circumstances might cause, making a good domain not nearly as good as it would otherwise be. Some are when going outside-the-box leads to practices that are rare because they’re bad.

In any event, you should generally try to avoid the practices described in the article linked below:

http://domainate.wordpress.com/2011/12/13/7-bad-naming-practices-that-can-hurt-a-domains-quality-and-effectiveness/

translate.com

Translate.com is a website that offers text translation with “better”, community-enriched translations. The website accepts both text and audio input. Users can propose modifications to translations. It works through many-to-many contributions, but at the receiving end there is always a particular user with a sentence he or she needs help with. Translate.com offers mobile applications for Android and iOS. The mobile app has an OCR module, so it can translate text embedded in pictures. However, the mobile app does not offer real (human) translation on demand, only machine-generated translation.

Visualizing my university Facebook fanpage – PJWSTK

For a year or more, Netvizz has anonymized users’ first names and surnames in its output. While Netvizz is of course a tool of choice and works quite well, here we will use NodeXL to create a social network for further analysis.

NodeXL also lets you use additional plugins. If you look at the main NodeXL website (http://nodexl.codeplex.com/), there should be a link called “search available plugins”. We are interested in the “Social Network Importer for NodeXL”. There may be some problems with installing it (at least I encountered them on Windows 8.1 and Office 2013); more info here: http://socialnetimporter.codeplex.com/discussions/543672

Both NodeXL and Netvizz produce GraphML output, an XML-based file format for graphs. It can hold structures including directed, undirected and mixed graphs, hypergraphs, and application-specific attributes. It is possible to attach dates to represent a dynamic graph (this requires data transformation, because Gephi won’t understand time frames bundled into attributes; you can always consider exporting to the GEXF format).
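Because GraphML is plain XML, simple statistics can also be computed outside Gephi. Below is a minimal sketch using only the Python standard library; the sample document is hand-written for illustration (it is not data from the fanpage), and the helper name degree_counts is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by GraphML documents.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# A tiny illustrative directed graph: a -> b, c -> b, b -> a.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="a"/>
    <node id="b"/>
    <node id="c"/>
    <edge source="a" target="b"/>
    <edge source="c" target="b"/>
    <edge source="b" target="a"/>
  </graph>
</graphml>
"""


def degree_counts(graphml_text):
    """Return (in_degree, out_degree) Counters keyed by node id."""
    root = ET.fromstring(graphml_text)
    in_deg, out_deg = Counter(), Counter()
    # Register every node first so isolated nodes show up with degree 0.
    for node in root.iterfind(".//g:node", GRAPHML_NS):
        in_deg[node.get("id")] += 0
        out_deg[node.get("id")] += 0
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        out_deg[edge.get("source")] += 1
        in_deg[edge.get("target")] += 1
    return in_deg, out_deg


in_deg, out_deg = degree_counts(SAMPLE)
print(dict(in_deg), dict(out_deg))
```

The sketch ignores node and edge attributes and assumes every edge is directed; a real export would need to check the edgedefault setting.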

I’m not entirely satisfied with Gephi because it does not work well with huge networks (even after increasing the memory size of the Java Virtual Machine; yes, Gephi runs on Java).

I uploaded the Gephi files and the other files I used for analysing the PJWSTK fanpage (https://www.facebook.com/pjwstk) to a GitHub repository (https://github.com/oskar-j/pjwstk-fanpage-sna), which I plan to update from time to time with more sophisticated, more beautiful networks and their analysis 🙂

Layout with algorithm Force Atlas + YifanHu
Zoom on central interesting part
OpenOrd layout
Zoom on central part
3 times OpenOrd, last with expansion 30%
Zoom in
Labels of one of the communities

I also calculated some simple statistics.

In-degree distribution
Out-degree distribution
Degree distribution
Closeness Centrality Distribution

I need to contact my senpai because I am not sure about export plugins that produce great HTML + JavaScript websites. I once saw a plugin that generates a page with a nice toolbar, but I don’t remember the name. The “Seadragon Web” plugin works so-so; it only allows zooming on still images. Maybe you know a good plugin for exporting from Gephi to web pages?

What is open collaboration?

Open collaboration is collaboration that is:

  • egalitarian (everyone can join, no principled or artificial barriers to participation exist),
  • meritocratic (decisions and status are merit-based rather than imposed),
  • self-organizing (processes adapt to people rather than people adapt to pre-defined processes).

The main places to find open collaboration are:

  • wikis, such as Wikipedia and other Wikimedia Foundation projects;
  • open source, open data and open government initiatives, open innovation, citizen engineering, peer production, and so on.

Source: http://www.opensym.org/2012/09/28/definition-of-open-collaboration/

random data generator for C

Let’s consider whether this random number generator is ‘good’ or not.

random1 = rand() % 2;

There are two problems with this approach. One is that, in many implementations, the low-order bits of the random number generator are not particularly random, so neither will random1 be. On my machine, there’s a slight but measurable bias toward 0 with that. The second problem is that it’s not thread safe, because rand stores hidden state. A better solution, if your compiler and library support it, would be to use the C++11 std::uniform_int_distribution. It looks complex, but it’s actually pretty easy to use. One way to do that (from Stroustrup) is like this:

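#include <random>

// Returns a uniformly distributed integer in the closed range [low, high].
int rand_int(int low, int high)
{
  static std::default_random_engine re {};  // engine keeps its state between calls
  using Dist = std::uniform_int_distribution<int>;
  static Dist uid {};
  return uid(re, Dist::param_type{low, high});
}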

This still stores state, but at least the distribution is correct.

Source: http://codereview.stackexchange.com/questions/49614/text-based-adventure-game-with-too-many-conditional-statements

Playing with GitHub data and researching OSS – current state of the art

Report on the current state of the art in research on open source software and teams on GitHub

The main idea of this article is to present the people who research teams and OSS developed on GitHub, and the software that uses GitHub-driven data to reveal less visible characteristics of the code repositories hosted on GitHub.

– 4th March 2014

  1. Software implementations, tools and practical establishments

“The Open Source Report Card” is a portal available at http://osrc.dfm.io/. It is also an open source project developed on GitHub and licensed under the MIT License. At the top of the page there is a warning: “Dear recruiters, GitHub is not your C.V. and that these stats only provide a biased and one-sided view.” The website is also powered by a FusionAds template (http://fusionads.net/). In the centre it has a text box in which you can enter a valid GitHub user login, which leads to an individual report card. A user is analysed in the following categories: languages, schedule, organization membership, and recent activity. The login of the most similar user is shown, together with a list of the 5 users most similar in activity.

GitHub visualizer is a website available at http://ghv.artzub.com/. It creates a visualization of the work done in a chosen GitHub repository.

Coderstats.net is another example of a report card; the portal is available at http://coderstats.net. It shows a short summary of repositories and languages used. It gives no information about hours of activity or similarity to other GitHub users.

Ohloh.net is a huge project devoted to open source. The portal is available at http://www.ohloh.net/. They say they index 663,168 open source projects. The website lets you browse people, projects and organizations through rankings or a search engine. They collect data not only from GitHub but also from other repository providers. The portal encourages users to create “FLOSS resumes”. It basically works by “claiming” contributions, which ties a person to his or her projects.

Octoboard (www.octoboard.com) is a GitHub activity dashboard. Octoboard is based on GitHub Archive: each day, it scans new GitHub event archives and computes a few stats, with a 15-day history. You can see some general data on its main page, or use the menu for more information about languages and history. Octoboard is an open source project built for the GitHub Data Challenge by Denis Roussel.

  2. Persons, companies, and research teams

In the beginning, there was a list of “Most active GitHub users” created by Paul Miller and available at https://gist.github.com/paulmillr/2657075. For some time it was updated regularly, but the list recently received negative comments, and there is now a better-received user list available at https://brainjar.org/ranking. The creator of this list is Mikołaj Pawlikowski (https://github.com/seeker89).

Donnie Berkholz is an analyst at RedMonk and their resident postdoc. Recently he published an article on his company blog: “GitHub language trends and the fragmenting landscape”. He shows trends in programming languages on GitHub by activity and, separately, by issues. He also analyses new users and new repos. He covered GitHub’s popularity as well, in two posts: “GitHub will hit 5 million users within a year” and “BAM! GitHub prediction nailed: 4M users in August, 5M in December”.

  3. GitHub employees and their work

The GitHub Developer Program was announced on 6 March 2014. They state: “By joining the Developer Program, you’ll receive ongoing notifications about changes to our API. You’ll be eligible to receive early access on select feature releases, and can request a development license for GitHub Enterprise. You can also submit your work for consideration on the integrations page.” I believe the main advantages of joining the program are: official recognition of your work, access to the newest API before public release, and the possibility of a free plan for private repositories. Members get the “developer” badge on their profile page. Moreover, it grants a licence to use the GitHub and Octocat logos legally. An example of a developer badge owner is Rafał Chmiel (https://github.com/rafalchmiel) (update 02.05.14 – it seems he resigned from the program? The badge is no longer visible) – he created github-cheat-sheet (a witty list of lesser-known, useful tricks for GitHub).

In this blog post, http://johnnunemaker.com/analytics-at-github/, programmers at GitHub explain how they built a statistics engine (for web traffic).

Brian Doll (GitHub) and Ilya Grigorik (Google) gave a talk titled “Analyzing Millions of GitHub Commits – what makes developers happy, angry, and everything in between?”

  4. Data mining, querying, and more

GitHub HTML resources (e.g. using the Python package “Beautiful Soup”, or Selenium). All data is visible in a browser, but some of it is rendered on demand. Many elements are loaded through AJAX, so it is not always easy to get the full HTML; one needs to simulate browser behaviour. Still, we use this mechanism, for example, to create a dataset of dialogs (software available at https://github.com/wikiteams/linda-nlp).

GitHub API

Self-explanatory: a programming interface for asking GitHub for details. It is limited by quotas (https://developer.github.com/v3/rate_limit/). We use it heavily in our projects and it is reliable. The quota is no longer a problem for us: we switch between accounts during script execution.
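Whatever the strategy for dealing with the quota, a script first has to notice that it has run out. GitHub reports the quota in the X-RateLimit-Remaining and X-RateLimit-Reset response headers (the latter as a UTC epoch timestamp). The sketch below shows one way to turn them into a waiting time; the function name seconds_until_reset is hypothetical, and headers is assumed to be a plain dict taken from whatever HTTP client you use.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next API request.

    `headers` is assumed to be a plain dict of response headers; a real
    HTTP client may expose them through a case-insensitive mapping instead.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0  # quota not exhausted, no need to wait
    return max(0.0, reset_at - now)


# Example with made-up header values:
print(seconds_until_reset(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "2000"}, now=1000))
```

A script could sleep for the returned number of seconds before retrying.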

GitHub Archive

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. We downloaded the whole GitHub Archive and loaded the JSON documents into a MongoDB database on our servers.
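To give an idea of what processing such an archive looks like, here is a small sketch that counts event types in one gzip-compressed file. It assumes the file holds one JSON event per line with a "type" field; the sample data is made up, the helper name count_event_types is hypothetical, and this is not the import script we actually used (it does not touch MongoDB at all).

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_bytes):
    """Count events by their "type" field in gzip-compressed JSON lines."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line)["type"]] += 1
    return counts


# Made-up sample standing in for a downloaded archive file.
sample = gzip.compress(
    b'{"type": "PushEvent"}\n{"type": "WatchEvent"}\n{"type": "PushEvent"}\n')
print(count_event_types(sample).most_common())
```

The same loop could insert each parsed document into a database instead of counting it.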

GHTorrent project

GHTorrent monitors the GitHub public event timeline. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database. They offer downloadable archives with database dumps, as well as online query tools for both MySQL and MongoDB. The project is documented with a database relationship schema. Still, all the data is based on events, so don’t expect to find everything that is available on GitHub itself.

GitHub Big Query

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets, working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce. Google BigQuery added the GitHub timeline (data on all events; be sure to check the explanation of a GitHub event here: https://developer.github.com/v3/activity/events/types/). Mining a lot of data, especially with sophisticated queries, is expensive and requires setting up a paid Google account. BigQuery (BQ) is reportedly based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data. To use your own data in BigQuery, it must first be uploaded to Google Storage and then, in a second step, imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

The GitHub explore page, available at https://github.com/explore, is a list of trending repositories.

  5. 3rd party tools

Object-oriented GitHub API for Java – http://github.jcabi.com/. Version 0.7.5 as of 1 May 2014. They say that despite the fact “there are a few other Java adapters of Github API, our implementation has its advantages, including: all classes are private and implement public interfaces, out-of-the-box in-memory mock of Github server, all classes are truly immutable and thread-safe, every Github object gives GET/PATCH access to its raw JSON, HTTP request is accessible for modifications, and finally: entire Github API is available, at least through a configurable HTTP request”.

Programming language statistics – http://langpop.com/ – is a portal that aggregates language popularity data from many sources, including GitHub, the Google search engine, Google Code, etc. A very informative and interesting portal.

Coderwall (https://coderwall.com) lets a developer build a profile page, but it does not aggregate data to create the profile, so it is a limited but interesting data source.

  6. Past and upcoming conferences

36th International Conference on Software Engineering, Hyderabad
OpenSym 2014 in Berlin, FLOSS track

  7. Publications

Here I list papers I found on open-source software research. My only requirement was that the article must mention GitHub in its text and keywords. Later I will make a more precise list that filters out research not using GitHub-driven data (i.e. SourceForge is still a basis for research, but this is changing).

2010

Investigating the geography of open source software through GitHub

2011

Visualizing collaboration and influence in the open-source software community

2012

Social coding in GitHub: transparency and collaboration in an open software repository
Capacitated team formation problem on social networks
Biological Mutualistic Models Applied to Study Open Source Software Development
Social media and success in open source projects
Towards content-driven reputation for collaborative code repositories
GHTorrent: Github’s data from a firehose

2013

GitHub developers use rockstars to overcome overflow of news
Impression formation in online peer production: activity traces and personal profiles in github
Mining GitHub: Why Commit Stops — Exploring the Relationship between Developer’s Commit Pattern and File Version Evolution
Network Structure of Social Coding in GitHub
Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder
StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge
A Study of the Characteristics of Developers’ Activities in GitHub
Discovery of technical expertise from open source code repositories
Impact of social features implemented in open collaboration platforms on volunteer self-organization: case study of open source software development
Popularity, Interoperability, and Impact of Programming Languages in 100,000 Open Source Projects
Creating a shared understanding of testing culture on a social coding site
The GHTorrent dataset and tool suite
Herding in open source software development: an exploratory study
Boa: a language and infrastructure for analyzing ultra-large-scale software repositories
Analyzing the social ties and structure of contributors in open source software community
Bringing ultra-large-scale software repository mining to the masses with boa
Mining Developer Contribution in Open Source Software Using Visualization Techniques
Performance and participation in open source software on GitHub
The GitHub Open Source Development Process

2014

Exploring the ecosystem of software developers on GitHub and other platforms
Population dynamics in open source communities: an ecological approach applied to github
Forge++: The Changing Landscape of FLOSS Development
Software developers are humans, too!
Sourcerer: An infrastructure for large-scale collection and analysis of open-source code
The code-centric collaboration perspective: Evidence from github