Choosing the proper machine learning method

The scikit-learn portal published a cheat-sheet map for choosing the right estimator for a particular job. Around the edges of the map are the most common jobs: classification, regression, clustering and dimensionality reduction. From the start point, the graph asks a couple of questions about the problem you want to solve. First, it suggests getting more data if you have fewer than 50 observations 🙂 For classification problems, the suggested techniques are: Linear SVC, SGD Classifier or kernel approximation (for large datasets), Naive Bayes, KNeighbors Classifier and SVC (ensemble classifiers).
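Walking the cheat sheet's classification branch can be sketched in a few lines of scikit-learn; the dataset (iris) and the exact candidate list below are illustrative assumptions of mine, not the map itself:

```python
# Try several of the cheat sheet's suggested classifiers on a toy dataset
# and record each one's test-set accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Fit each candidate and keep its accuracy on the held-out split.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
```

The best-scoring candidate is a starting point, not a final answer; the whole point of the map is that dataset size and problem type narrow the candidates before you ever fit anything.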


It would be great if somebody extended this map to cover more problems and more frameworks (not only scikit-learn) and made a website that quickly suggests a robust method by asking you questions.

Also, check out a very similar (but much larger!) map on the dlib C++ machine learning library page.

Source: Jassim Moideen on “Big Data and Analytics” LinkedIn group


Where to put your scientific big data for sharing online?

Well, first of all, researchers need to exchange their big data somehow, and passing it from hand to hand (the ad-hoc way) is the worst solution 🙂 ; proper sharing is what enables cooperation, teamwork and so on.

Secondly, research grants often require that data and code be open-source and freely available.

Cloud computing, of course, is not always possible, and that is the reason for this post. Researchers mostly use statistical and data-mining tools that operate on local resources or, at most, remote databases (e.g. RapidMiner, R).

Dropbox and its Public folder – an especially good solution if you have gained a lot of extra space through limited-time promotions (e.g. after buying a Samsung smartphone). I had around 100 GB at the peak and was named a Dropbox Guru twice. After putting data into the “Public” folder, the right-click menu offers a “Copy public link” option, which lets anyone in the world download the data.

Amazon AWS cloud – popular among companies, but a paid solution, billed according to the transfer used

Google Cloud Storage – info at Google Developers

Rackspace Cloud Storage, Cloud CDN and Unlimited Online Storage by Rackspace

MS Azure Cloud Services – info on Windows Azure

Amazon Personal cloud

Google Drive


Dedicated machine with cloud software – e.g. Synology NAS drives

If money is short, a reasonable approach seems to be creating a couple of cloud accounts and splitting the data into multiple parts (sized to fit each account) before uploading.
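The splitting itself can be scripted; a minimal sketch, assuming plain binary files, an arbitrary 50 MB default chunk size, and function names that are my own invention:

```python
# Split a large file into fixed-size parts so each part can be uploaded
# to a different cloud account, then reassemble them later.

def split_file(path, chunk_size=50 * 1024 * 1024):
    """Write path.part0, path.part1, ... each at most chunk_size bytes."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part_path = "%s.part%d" % (path, index)
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts

def join_files(parts, out_path):
    """Concatenate the parts, in order, back into a single file."""
    with open(out_path, "wb") as dst:
        for part_path in parts:
            with open(part_path, "rb") as src:
                dst.write(src.read())
```

`join_files(split_file("dataset.bin"), "restored.bin")` would then reproduce the original file byte for byte, with each intermediate part small enough to fit one account's quota.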

What are the best machine learning libraries for Python?

I recently found an oDesk job that matches my interests. The client says: “I would like that predictor to be written in Python only, and leverage only publicly-available libraries (mlpy, scipy, scikit etc.)“. Well, it would be a good idea to use more than one package and compare their F-scores, so I googled the best-known machine learning packages for Python; here they are:

MLPY and PyML seem to be the best-known and most mainstream choices. Regarding the list above, the Anaconda Python distribution seems to include only the scikit-learn package. On the other hand, if your task involves NLP only, the NLTK package may be enough.
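The F-score comparison the client hints at could start within scikit-learn itself before pulling in other packages; a hedged sketch on a synthetic dataset (the data and the two candidates are illustrative assumptions, not the client's spec):

```python
# Compare two classifiers by F-score on synthetic binary data; in a real
# job, estimators from other packages would get the same treatment.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

f1_results = {}
for name, clf in [("GaussianNB", GaussianNB()), ("LinearSVC", LinearSVC())]:
    clf.fit(X_train, y_train)
    f1_results[name] = f1_score(y_test, clf.predict(X_test))
```

Wrapping each package's estimator in a common fit/predict adapter would let the same loop rank mlpy, PyML and scikit-learn side by side on the client's data.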