To reduce the influence of randomness introduced by the data split, 4-fold cross-validation has been repeated 3 times, i.e. 3-times 4-fold cross-validation.
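A minimal caret sketch of that resampling scheme (caret is assumed from the model names used below; `train_set`, the target `y` and the `xgbTree` choice are placeholders, not the exact objects used in this project):

```r
library(caret)

# 3-times repeated 4-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 3)

# fit <- train(y ~ ., data = train_set, method = "xgbTree", trControl = ctrl)
```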
Data transformations, models and meta-models
I measured the cross-validation score and the Kaggle Public LB score in a wide range of configurations:
- scaling numerical predictors or leaving them unscaled;
- removing highly correlated predictors or keeping them (a sketch of these first two options follows the list);
- clustering the (train/test) data (configurations with 2/4/8/14 clusters and 2 different sets of predictors for clustering) or not clustering at all;
- stacking, i.e. training (k+1)-level learners to combine the predictions of k-level learners or meta-learners, or no stacking (where stacking was used, 2- and 3-layer architectures were tried);
- using a wide range of algorithms (knn, cubist, xgbTree, enet, pls, gbm).
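A sketch of the first two options (scaling and removal of highly correlated predictors), again assuming the caret workflow; `train_set`, its target column `y` and the 0.90 correlation cutoff are placeholders:

```r
library(caret)

# (a) optionally drop highly correlated predictors
num_preds <- setdiff(names(train_set)[sapply(train_set, is.numeric)], "y")
cor_mat   <- cor(train_set[, num_preds])
drop_cols <- num_preds[findCorrelation(cor_mat, cutoff = 0.90)]
train_red <- train_set[, setdiff(names(train_set), drop_cols)]

# (b) optional centering and scaling of numerical predictors,
#     applied inside each resampling iteration by caret
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 3)
fit  <- train(y ~ ., data = train_red, method = "xgbTree",
              preProcess = c("center", "scale"), trControl = ctrl)
```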
The idea is that if I were making some mistake in the process whose effect is to create a difference between the cross-validation score and the Kaggle Public LB score, then such a difference should show a corresponding variance across these configurations; in that case it would not be correct to speak of a bias.
On the choice of the t-test (paired difference test) for this problem, please see [Dietterich, 1998]. Under the null hypothesis that the means of the cross-validation score and the Kaggle Public LB score are equal, the 95 percent confidence interval for the difference in means is [-0.05392342, -0.02951592]. So, there is a bias.
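The interval quoted above reads like the output of R's paired t.test; a sketch, assuming `cv_score` and `lb_score` are numeric vectors holding one paired observation per configuration (the names are placeholders):

```r
# Paired t-test on the per-configuration score differences; the
# "95 percent confidence interval" in the output is the interval quoted above.
t.test(cv_score, lb_score, paired = TRUE, conf.level = 0.95)
```

Since the interval excludes 0, the null hypothesis of equal means is rejected.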
Why such a bias
As a tube_id has 3.4 +/- 2.9 different prices, if instances of a tube_id in a cross-validation training fold also occur in the corresponding held-out fold (perhaps with different quantities or quote dates), I am training my learner on a train set that is probably more correlated with its test fold than the whole train set is correlated with the whole test set (both Public and Private). The effect is probably also more evident for low rmsle scores than for high rmsle scores.
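One way to remove this leakage (a suggestion under the stated assumptions, not necessarily what was done here) is to build the folds so that all rows of a given tube_id end up in the same fold, e.g. with caret's groupKFold; `train_set$tube_id` is a placeholder column name:

```r
library(caret)

# Group-aware folds: every row of a given tube_id stays in one fold,
# so a tube_id can never appear in both the training and held-out parts.
grp_idx <- groupKFold(train_set$tube_id, k = 4)
ctrl    <- trainControl(method = "cv", index = grp_idx)

# fit <- train(y ~ ., data = train_set, method = "xgbTree", trControl = ctrl)
```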