Thursday, November 05, 2015

PippoProxy reached ~3000 downloads

Ten years ago PippoProxy was born and, although ~6 Java versions have been released since then, I'm very proud of this well-crafted piece of code. The SourceForge release (at the time GitHub wasn't yet born) was called CheSpettacolo in honor of Valentino Rossi, the Italian motorcycle racer and many-time MotoGP world champion, who at that time was winning one of his first MotoGP titles. I was a senior software architect at the no. 1 Italian web portal, working on search engines and text mining projects.

Wednesday, August 19, 2015

Collateral effects of a bad resampling procedure

Resampling technique
To reduce the influence of the randomness introduced by the data split, 4-fold cross-validation has been repeated 3 times, i.e. 3-times 4-fold cross-validation.
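The scheme above can be sketched in Python with scikit-learn (the original work used R, so this is just an illustrative substitution); `RepeatedKFold` reshuffles the data before each repeat:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples

# 3-times 4-fold: 4 folds, repeated 3 times -> 12 train/test splits overall
rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=42)
splits = list(rkf.split(X))
print(len(splits))  # 12
```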

Data transformations, models and meta-models 
I measured the cross-validation score and the Kaggle Public LB score in a wide range of configurations:
  • scaling numerical predictors or not scaling them;
  • removing highly correlated predictors or keeping them;
  • clustering (train/test) data (2/4/8/14-cluster configurations; 2 different predictors for clustering) or not clustering;
  • stacking, i.e. training (k+1)-level learners to combine predictions of k-level learners or meta-learners, or not stacking (in case of stacking, 2 and 3 architectural layers were used);
  • using a wide range of algorithms (knn, cubist, xgbTree, enet, pls, gbm).
The idea is that if I am making some mistake in the process whose effect is to create a difference between the cross-validation score and the Kaggle Public LB score, then such a difference should show substantial variance across configurations. In that case it would not be correct to speak of a bias.
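As an illustrative sketch of the stacking configuration (in Python with scikit-learn rather than the original caret setup; the learners and data here are stand-ins, not the actual competition pipeline), a 2-layer architecture trains a meta-learner on out-of-fold predictions of the level-0 learners:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)

# level-0 learners (stand-ins for the knn / gbm / enet list above)
level0 = [
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
]
# the level-1 meta-learner is trained on out-of-fold predictions of level 0
stack = StackingRegressor(estimators=level0, final_estimator=ElasticNet(), cv=4)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```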


On the choice of the t-test (paired difference test) for this problem, see [Dietterich, 1998]. Under the null hypothesis that the means of the cross-validation score and the Kaggle Public LB score are equal, the difference in means lies, with 95 percent confidence, in the interval [-0.05392342, -0.02951592]. So there is a bias.
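A minimal sketch of the paired t-test (Python/SciPy; the scores below are made-up placeholders, not the actual competition values):

```python
import numpy as np
from scipy import stats

# hypothetical paired scores, one pair per configuration (NOT the real values)
cv_scores = np.array([0.231, 0.240, 0.228, 0.235, 0.233])
lb_scores = np.array([0.271, 0.282, 0.270, 0.279, 0.274])

t, p = stats.ttest_rel(cv_scores, lb_scores)  # paired t-test

# 95% confidence interval for the mean of the paired differences
diff = cv_scores - lb_scores
ci = stats.t.interval(0.95, df=len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))
print(round(t, 2), round(p, 4), ci)
```

An interval that does not contain zero, as here, is what justifies calling the difference a bias.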

Why such a bias
As a tube_id has 3.4 +/- 2.9 different prices on average, if in a cross-validation training holdout I have instances of a tube_id that also occurs in the related cross-validation test holdout (perhaps with different quantities or quote dates), I am training my learner on a train set that is probably more correlated with the related test set than the whole train set is correlated with the whole test set (both public and private). The effect is probably more evident for low RMSLE scores than for high RMSLE scores.
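One way to remove this leakage is to group the cross-validation folds by tube_id, so that no tube appears on both sides of a split; a sketch (Python/scikit-learn, with hypothetical tube ids):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# toy rows: several quotes per tube_id (different quantities / quote dates)
tube_ids = np.array(["TA-1", "TA-1", "TA-1", "TA-2", "TA-2",
                     "TA-3", "TA-3", "TA-4", "TA-4", "TA-4"])
X = np.arange(len(tube_ids), dtype=float).reshape(-1, 1)

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, groups=tube_ids):
    # no tube_id ends up on both sides of a split
    assert set(tube_ids[train_idx]).isdisjoint(tube_ids[test_idx])
print("no tube_id is shared between train and test folds")
```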

Friday, July 24, 2015

Technical features correlation vs Cost - Kaggle's Caterpillar Tube Pricing

I wondered whether, for higher quantity levels, tube technical features are less correlated with the selling price. This is pretty much to be expected: the larger the quantity, the bigger the discount Caterpillar buyers are likely to ask for.
Here I uploaded a script showing this effect for tube diameter (here you can find the data). The output plot is reported below. As you can see, the slope of the linear model becomes flatter for higher quantity levels.
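The flattening-slope effect can be reproduced with synthetic data (Python; the pricing model and discount factors below are pure assumptions for the sake of the example, not the competition data):

```python
import numpy as np

rng = np.random.default_rng(0)

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# invented pricing model: cost grows with diameter, but large-quantity
# orders get a heavy discount, flattening the relationship
diameter = rng.uniform(10, 60, size=500)
slopes = {}
for quantity, discount in [(1, 1.0), (100, 0.3)]:
    cost = discount * (2.0 * diameter) + rng.normal(0, 5, size=500)
    slopes[quantity] = slope(diameter, cost)
    print(quantity, round(slopes[quantity], 2))
```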

Monday, April 27, 2015

Very proud of my 8th place / 504 teams - Kaggle's American Epilepsy Society Seizure Prediction Challenge

Five months have passed since Kaggle's American Epilepsy Society Seizure Prediction Challenge finished and Isaac (my Kaggle alias) placed 8th.

Monday, November 03, 2014

How to paint a Van Gogh with R Caret ... and suicide immediately after!!

There's not yet a paint(as.van.gogh(..), ..) function, but it's already possible to get a beautiful painting in Van Gogh style by training no fewer than 150 models (perhaps after 20/30 hours of computing) with the same resampling scheme and plotting the resampling results across models, where each line corresponds to a common cross-validation holdout (aka parallelplot).
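The data behind such a plot is just a matrix of resampling results, one row per common cross-validation holdout and one column per model; a sketch (Python/scikit-learn stand-ins for the caret models, and far fewer than 150 of them):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=120, n_features=6, noise=1.0, random_state=1)

# the key point: the SAME folds for every model, so each holdout is comparable
cv = KFold(n_splits=5, shuffle=True, random_state=1)
models = {"ridge": Ridge(), "lasso": Lasso(), "knn": KNeighborsRegressor()}
scores = {name: cross_val_score(m, X, y, cv=cv) for name, m in models.items()}

# one row per common holdout, one column per model: the input of a parallelplot
matrix = np.column_stack([scores[n] for n in models])
print(matrix.shape)
```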

Why do that? ... that's another story ... anyway, I find the use of red a bit excessive, so we can sell it as a Van Gogh from his earlier years. Very important: there's no correlation with the problem, as the results don't change.
And what about a Matisse? The same information can be presented with a dotplot ... and the results don't disappoint.

Saturday, October 11, 2014

Comparing Octave based SVMs vs caret SVMs (accuracy + fitting time)

In this post, regression models from the caret R package have been compared. The solubility data can be obtained from the AppliedPredictiveModeling R package, and:
  • models taking more than 15 minutes to fit on the train set have been discarded;
  • accuracy measure: RMSE (Root Mean Squared Error).
From this comparison, the top performing models are Support Vector Machines with and without Box–Cox transformations. Linear Regression / Partial Least Squares / Elastic Net with and without Box–Cox transformations are middle performing. Bagged Trees / Conditional Inference Trees / CART showed modest results.
SVMs with Box–Cox transformations score 0.60797 RMSE on the test set, while without Box–Cox transformations they score 0.61259.
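The with/without comparison can be sketched as follows (Python/scikit-learn rather than caret, and synthetic data rather than the solubility set; scikit-learn's `PowerTransformer` supports the Box–Cox transform for strictly positive predictors):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=2.0, random_state=2)
X = np.exp(X / 4)  # make predictors positive and skewed, so Box-Cox has work to do
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

results = {}
for name, model in [
    ("svm", make_pipeline(StandardScaler(), SVR())),
    ("svm+boxcox", make_pipeline(PowerTransformer(method="box-cox"), SVR())),
]:
    model.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(name, round(results[name], 3))
```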

Let's start the Octave session with Regularized Polynomial Regression, where we get performance pretty similar to caret's Elastic Net: 0.71 RMSE on the test set with polynomial degree 10 and lambda = 0.003. From the validation curve we can see the model is underfitting.
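For reference, a regularized degree-10 polynomial regression with the same lambda can be sketched like this (Python/scikit-learn instead of Octave, on synthetic data, so the scores differ from the ones above):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, size=200)

# degree-10 polynomial features plus an L2 penalty (lambda = 0.003 as above)
model = make_pipeline(PolynomialFeatures(degree=10),
                      StandardScaler(),
                      Ridge(alpha=0.003))
model.fit(X, y)
print(round(model.score(X, y), 3))  # R^2 on the train set
```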

Let's focus on SVMs (from the libsvm package).
epsilon-SVR scores 0.59466 RMSE on the test set with C = 13, gamma = 0.001536 and epsilon = 0.
Time to fit on the train set: 9 secs.

nu-SVR scores 0.594129 RMSE on the test set with C = 13, gamma = 0.001466 and nu = 0.85.
Time to fit on the train set: 8 secs.
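Both variants are also available outside Octave; here is a sketch with scikit-learn's libsvm-backed `SVR` and `NuSVR`, reusing the hyperparameters quoted above (on synthetic data, so the RMSE values will differ):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR, NuSVR

X, y = make_regression(n_samples=300, n_features=8, noise=1.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

# epsilon-SVR and nu-SVR with the hyperparameters quoted above
models = {
    "epsilon-SVR": SVR(C=13, gamma=0.001536, epsilon=0.0),
    "nu-SVR": NuSVR(C=13, gamma=0.001466, nu=0.85),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(name, round(rmse[name], 3))
```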

So, Octave based SVMs have accuracy similar to caret SVMs on this data set (0.59 vs 0.60 RMSE, perhaps a bit better), but they are much faster to train (9 secs vs 424 secs). In my experience, the same considerations hold for memory consumption, but I'm not going to prove that here.

Let's go back to our on-line learning applications. Consider a shipping service website where a user comes and specifies origin and destination; you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1) and sometimes not (y = 0). The features x capture properties of the user, of the origin/destination and of the asking price. We want to learn p(y = 1 | x) to optimize the price.
Clearly, based on the above example, Octave seems a much more performant and scalable choice than R. For instance, our application architecture can be made of:
  • presentation tier: Bootstrap JS + JSP
  • application tier: Octave (Machine Learning) + Java (back office, monitoring tools, etc.)
  • data tier: MongoDB or MySQL

This is a hybrid choice, good for all seasons. It's the aggregation of 2 "pure" architectures:
  • Bootstrap + Octave + MongoDB
  • JSP + Java + Octave + MySQL
For both of them, the question is: is there any (open source?) JavaScript-to-Octave / Java-to-Octave / MySQL-to-Octave / MongoDB-to-Octave interface? Are they stable enough for production? What about the community behind them?