
# Technology Hyperboles

Self-conscious exaggerations about computing

## Monday, April 27, 2015

### Very proud of my 8th place out of 504 teams - Kaggle's American Epilepsy Society Seizure Prediction Challenge

Five months have passed since Kaggle's *American Epilepsy Society Seizure Prediction Challenge* finished and Isaac (my Kaggle alias) placed 8th.

## Monday, November 03, 2014

### How to paint a Van Gogh with R Caret ... and suicide immediately after!!

There's not yet a *paint(as.van.gogh(..),..)* function, but it's already possible to get a beautiful painting in Van Gogh style: train no fewer than 150 models (perhaps after 20/30 hours of computing) with the same sampling algorithm and paint the resampling results across models, where each line corresponds to a common cross-validation holdout (aka *parallelplot*).
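
For concreteness, here is a minimal sketch of that recipe (the data set, the two models and the fold count are stand-ins, not the original 150-model run): fit several caret models on the same cross-validation folds, collect them with resamples(), and paint.

```
# A minimal sketch of the "Van Gogh" recipe: caret models sharing the same CV
# folds, collected with resamples() and drawn with parallelplot().
library(caret)
library(lattice)
data(BostonHousing, package = "mlbench")   # stand-in regression data set

set.seed(1)
folds <- createFolds(BostonHousing$medv, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds)   # shared holdouts

fitLM   <- train(medv ~ ., data = BostonHousing, method = "lm",    trControl = ctrl)
fitCART <- train(medv ~ ., data = BostonHousing, method = "rpart", trControl = ctrl)

resamps <- resamples(list(LM = fitLM, CART = fitCART))
parallelplot(resamps, metric = "RMSE")   # each line = one common CV holdout
```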

Why do that? ... that's another story ... Anyway, I find the use of red a bit excessive, so we can sell it as a Van Gogh from his earlier years. Very important: there's no correlation with the problem, as the results don't change.

And what about a Matisse? The same information can be presented with *dotplot* ... and the results don't disappoint.

## Saturday, October 11, 2014

### Comparing Octave based SVMs vs caret SVMs (accuracy + fitting time)

In this post, regression models from the **caret R** package have been compared, where the **solubility** data can be obtained from the **AppliedPredictiveModeling R** package, and where:

- models taking more than 15 minutes to fit on the train set have been discarded;
- the accuracy measure is **RMSE** (Root Mean Squared Error).

From this comparison, the top-performing models are Support Vector Machines with and without Box–Cox transformations; Linear Regression, Partial Least Squares and Elastic Net with and without Box–Cox transformations are middle-performing; Bagged Trees, Conditional Inference Trees and CART showed modest results.

**SVMs with Box–Cox transformations perform at 0.60797 RMSE on the test set**, while without Box–Cox transformations at 0.61259 RMSE.
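
For reference, the caret side of that comparison can be reconstructed roughly as below (a sketch, not the original script; the seed, fold count and tuning length are assumptions):

```
# A rough reconstruction of the caret SVM fit with Box–Cox pre-processing on
# the solubility data; the tuning settings are assumptions, not the originals.
library(caret)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainX, solTrainY, solTestX, solTestY

set.seed(100)
svmFit <- train(x = solTrainX, y = solTrainY,
                method = "svmRadial",      # kernlab radial-basis SVM
                preProcess = "BoxCox",     # Box–Cox transform of the predictors
                tuneLength = 10,
                trControl = trainControl(method = "cv", number = 10))

RMSE(predict(svmFit, solTestX), solTestY)  # test-set RMSE
```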

Let's start the Octave session with **Regularized Polynomial Regression**, where we got performance pretty similar to caret's Elastic Net: **0.71 RMSE** on the test set with **polynomial degree 10 and lambda 0.003**. From the validation curve we can see the model is underfitting.
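
The Octave code itself isn't shown here, but the technique is compact enough to sketch; below is a minimal, self-contained R version of ridge-regularized polynomial regression via the normal equations (the helper name and toy data are mine):

```
# Regularized (ridge) polynomial regression solved with the normal equations;
# degree and lambda mirror the values quoted above.
ridgePolyFit <- function(x, y, degree = 10, lambda = 0.003) {
    X <- scale(outer(x, 1:degree, `^`))   # features x, x^2, ..., x^degree
    A <- cbind(1, X)                      # prepend the intercept column
    P <- diag(ncol(A)); P[1, 1] <- 0      # do not penalize the intercept
    theta <- solve(t(A) %*% A + lambda * P, t(A) %*% y)
    list(theta = theta, center = attr(X, "scaled:center"),
         scale = attr(X, "scaled:scale"))
}

# toy usage
set.seed(1)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)
fit <- ridgePolyFit(x, y)
```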

Let's focus on **SVMs** (from the **libsvm** package).

**epsilon-SVR** performs at **0.59466 RMSE** on the test set with C = 13, gamma = 0.001536 and epsilon = 0. Time to fit on the train set: 9 secs.

**nu-SVR** performs at **0.594129 RMSE** on the test set with C = 13, gamma = 0.001466 and nu = 0.85. Time to fit on the train set: 8 secs.
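
For readers who want to try the same libsvm models without leaving R, here is a minimal sketch via the **e1071** package (an R wrapper around libsvm), plugging in the hyper-parameters reported above; timings and RMSE may differ slightly from the Octave run:

```
# epsilon-SVR and nu-SVR via e1071, with the hyper-parameters quoted above.
library(e1071)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainX, solTrainY, solTestX, solTestY

fitEps <- svm(x = solTrainX, y = solTrainY, type = "eps-regression",
              kernel = "radial", cost = 13, gamma = 0.001536, epsilon = 0)
fitNu  <- svm(x = solTrainX, y = solTrainY, type = "nu-regression",
              kernel = "radial", cost = 13, gamma = 0.001466, nu = 0.85)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(predict(fitEps, solTestX), solTestY)
rmse(predict(fitNu,  solTestX), solTestY)
```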

So, **Octave-based SVMs have accuracy similar to caret SVMs** (0.59 vs 0.60 RMSE) on this data set (perhaps a bit better), but **they are much faster in training (9 secs vs 424 secs)**. In my experience, the same considerations hold for memory consumption, but I'm not going to prove it here.

Let's go back to our **on-line learning** applications. In that shipping-service website, a user comes and specifies origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1), sometimes not (y = 0). Features x capture properties of the user, of the origin/destination, and the asking price. We want to learn p(y = 1 | x) to optimize the price.
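
As a toy illustration of the idea (the feature names and the simulated request stream are made up), each incoming example triggers one stochastic-gradient step of on-line logistic regression:

```
# On-line logistic regression for p(y = 1 | x): one SGD step per example.
sigmoid <- function(z) 1 / (1 + exp(-z))

onlineLogitUpdate <- function(theta, x, y, alpha = 0.01) {
    theta + alpha * (y - sigmoid(sum(theta * x))) * x   # gradient ascent step
}

set.seed(1)
theta <- rep(0, 3)                              # intercept, distance, asking price
for (i in 1:10000) {
    x <- c(1, runif(1), runif(1, 5, 50))        # one incoming shipping request
    y <- rbinom(1, 1, sigmoid(2 - 0.05 * x[3])) # users balk at high prices
    theta <- onlineLogitUpdate(theta, x, y)     # learn, then discard the example
}
theta   # predict with sigmoid(sum(theta * x))
```
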
Clearly, based on the above example, Octave seems a much more performant and scalable choice than R. For instance, our **application architecture** can be made of:

- presentation tier: bootstrap js + JSP
- application tier: **Octave** (Machine Learning) + Java (backoffice, monitoring tools, etc.)
- data tier: MongoDB or MySql

This is a hybrid choice, good for all seasons. It's the aggregation of 2 "pure architectures":

- bootstrap + **Octave** + MongoDB
- JSP + Java + **Octave** + MySql

For both of them, the question is: is there any (open source?) JavaScript-to-Octave / Java-to-Octave / MySql-to-Octave / MongoDB-to-Octave interface? Are they stable enough for production? What about the community behind them?

## Sunday, September 28, 2014

### Comparing R caret models in action … and in practice: does model accuracy always matter more than scalability? And how much of this is about models rather than implementations?

The post with code and plots is published on RPubs.

Here I report just parallel-coordinate plot for the resampling results across the models. Each line corresponds to a common cross-validation holdout.


Is this a zero-sum game? As with bias and variance, there seems to be a clear trade-off between accuracy and scalability. On the other hand, continuing the metaphor: just as in machine learning problems I need to check there is no additional noise beyond bias, variance and irreducible error, here it's necessary to check that such a loss of scalability for the top-performing models is intrinsically bound to the problem and not to the implementation.

Is it possible to improve the RMSE performance of linear regressors (which are middle-performing in this contest) with an **Octave**-based model? Similarly, is it possible to build a nu-SVR based model that improves on caret's SVM RMSE performance while fitting on the training set in less than a minute?
…

*stay tuned* …

## Saturday, April 05, 2014

## Tuesday, November 19, 2013

### An example of exploratory analysis in R (lattice package)

## Introduction

### Data set

The data consist of a sample of **2,500 peer-to-peer loans** (the number of observations) issued through the **Lending Club**. The interest rate of these loans is determined by the Lending Club on the basis of characteristics of the person asking for the loan, such as their employment history, credit history, and creditworthiness scores. The data set (loansData.csv) is stored in the working directory.

Let's load the data set and perform some convenient operations to translate fake factor variables (e.g. Debt.To.Income.Ratio) into numeric ones.

```
data <- read.csv("loansData.csv")
# strip the "%" sign and rescale the percentages to [0, 1]
data$MyInterest.Rate <- as.numeric(sub("%", "", data$Interest.Rate))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
# turn an "xxx-yyy" FICO range into the mean of its two endpoints
doMean <- function(x) {
    ret <- vector("numeric", length = length(x))
    for (i in 1:length(x)) {
        ret[i] <- (as.numeric(substr(x[i], 1, 3)) + as.numeric(substr(x[i], 5, 7)))/2
    }
    ret
}
data$FICO.Range.mean <- doMean(data$FICO.Range)
```

### Purpose of analysis

The purpose of the analysis is to identify and quantify associations between the interest rate of the loan and the other variables in the data set; in particular, whether any of these variables have an important association with interest rate after taking into account the applicant's FICO score.

## Methods and Results

### Bivariate analysis

Let's start by considering the association between **Interest Rate** and **FICO range**.

```
## par(mfrow=c(1,2))
plot(data$MyInterest.Rate, data$FICO.Range, pch = 19, col = "blue", cex = 0.5,
    main = "Fig. 1 - The association between Interest Rate and FICO score range",
    xlab = "Interest rate", ylab = "FICO score range")
```

```
boxplot(data$MyInterest.Rate ~ data$FICO.Range, col = terrain.colors(nlevels(data$FICO.Range),
    alpha = 0.8), varwidth = TRUE, main = "Fig. 2 - The association between Interest Rate and FICO score range",
    xlab = "FICO range score", ylab = "Interest rate")
```

As shown by Fig. 1 and Fig. 2, **Interest rate** (quantitative response variable) seems negatively associated with **FICO score range** (categorical explanatory variable). In order to test whether these variables are significantly associated (confidence level 95%) with a Pearson correlation, we consider the variable **FICO score range mean** (quantitative) instead of **FICO score range** (categorical).

```
cor.test(data$MyInterest.Rate, data$FICO.Range.mean, method = "pearson", conf.level = 0.95)
```

```
##
## Pearson's product-moment correlation
##
## data: data$MyInterest.Rate and data$FICO.Range.mean
## t = -50.26, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7281 -0.6891
## sample estimates:
## cor
## -0.7091
```

Hence, we can conclude that **Interest rate** is negatively associated with **FICO score range mean** (p-value < .0001). Moreover, if we know the **FICO score range mean**, we can predict **50.2%** (Adjusted R-squared: 0.5026) of the variability we will see in **Interest rate**.

```
lm1 <- lm(data$MyInterest.Rate ~ data$FICO.Range.mean)
summary(lm1)
```

```
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07988 -0.02136 -0.00455 0.01837 0.10195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.29e-01 1.19e-02 61.2 <2e-16 ***
## data$FICO.Range.mean -8.46e-04 1.68e-05 -50.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0295 on 2498 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.503
## F-statistic: 2.53e+03 on 1 and 2498 DF, p-value: <2e-16
```

Are there other features statistically correlated to Interest rate that explain more variability than FICO range?

```
# regress Interest rate on each column in turn, recording p-value and adjusted R^2
features = dim(data)[2]
pValue <- rep(NA, features)
r2 <- rep(NA, features)
for (i in 1:features) {
lm1sum <- summary(lm(data$MyInterest.Rate ~ data[, i]))
pValue[i] <- lm1sum$coeff[2, 4]
r2[i] <- lm1sum$adj.r.squared
}
data.frame(names(data), pValue, r2)
```

```
## names.data. pValue r2
## 1 Amount.Requested 1.545e-65 0.1101018
## 2 Amount.Funded.By.Investors 1.326e-67 0.1134749
## 3 Interest.Rate 0.000e+00 1.0000000
## 4 Loan.Length 1.772e-109 0.1791880
## 5 Loan.Purpose 1.654e-03 0.0322881
## 6 Debt.To.Income.Ratio 8.447e-01 0.0472510
## 7 State 1.575e-02 0.0036624
## 8 Home.Ownership 2.026e-01 0.0062175
## 9 Monthly.Income 5.396e-01 -0.0002497
## 10 FICO.Range 8.741e-01 0.5382865
## 11 Open.CREDIT.Lines 6.169e-06 0.0077580
## 12 Revolving.CREDIT.Balance 2.246e-03 0.0033351
## 13 Inquiries.in.the.Last.6.Months 1.216e-16 0.0267187
## 14 Employment.Length 3.751e-01 -0.0001105
## 15 MyInterest.Rate 0.000e+00 1.0000000
## 16 MyDebt.To.Income.Ratio 2.733e-18 0.0296123
## 17 FICO.Range.mean 0.000e+00 0.5026398
```

We found that several features are statistically correlated (confidence level 95%) to Interest rate, but **FICO range mean** can **predict its variability better than the other variables**. After FICO range, the features statistically correlated to Interest rate that best predict its variability are:

- **Loan.Length** (18%)
- **Amount.Funded.By.Investors** (11%)
- **Amount.Requested** (11%)

### Potential moderators

Does such a negative association also hold **for each loan purpose**, **with and without home ownership**, and **for each US state**, or do these variables moderate the association between Interest rate and FICO score?

```
library(lattice)
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Loan.Purpose, panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, data = data, as.table = TRUE, xlab = "FICO range score mean", ylab = "Interest rate",
main = "Fig. 3 - Interest rate vs. FICO range (mean) score for each loan purpose")
```

Looking at Fig. 3, we find that the FICO score **explains the variability of interest rate better** in the case of loans for **education** (75% of variability explained, p < 0.001), **vacation** (71%, p < 0.001), **medical** (66%, p < 0.001), **car** (61%, p < 0.001) and **house** (60%, p < 0.001).

```
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Home.Ownership[data$Home.Ownership !=
"NONE"], data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 4 - Interest rate vs. FICO range with and without home ownership")
```

```
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$State[data$State !=
"MS" & data$State != "MD" & data$State != "IA"], data = data, as.table = TRUE,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
if (p > 0.001) {
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))),
x = 770, y = 0.2)
}
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 780, y = 0.15)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 5 - Interest rate vs. FICO range score for each US state")
```

As shown, such a statistically significant negative association is confirmed **for each loan purpose**, **with and without home ownership**, and **for each US state**. So, **these variables don't moderate the association between Interest rate and FICO score**.

Just a note regarding the analysis by state: in some cases there are not enough observations to estimate the coefficients (MS, MD, IA), while in other cases there are enough observations to calculate the coefficients but p > 0.001. For instance, in the case of SD there are just 4 observations.
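
A quick way to spot such thinly populated states:

```
# count observations per state and list the smallest groups
head(sort(table(data$State)), 8)
```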

### Variables associated with interest rate at the same level of applicant's FICO score

```
data$FICO.Range.cut <- equal.count(data$FICO.Range.mean, 15)
xyplot(data$MyInterest.Rate ~ data$Monthly.Income | data$FICO.Range.cut, data = data,
as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 65000, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 65000,
y = 0.2)
}, xlab = "Montly Income", ylab = "Interest rate", main = "Fig. 6 - Interest rate vs. Montly Income in different FICO score levels")
```

```
xyplot(data$MyInterest.Rate ~ data$Open.CREDIT.Lines | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 30, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 30,
y = 0.2)
}, xlab = "Credit lines", ylab = "Interest rate", main = "Fig. 7 - Interest rate vs. Credit lines in different FICO score levels")
```

```
xyplot(data$MyInterest.Rate ~ data$MyDebt.To.Income.Ratio | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 0.2, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 0.2,
y = 0.2)
}, xlab = "Debt To Income Ratio", ylab = "Interest rate", main = "Fig. 8 - Interest rate vs. Debt To Income Ratio in diff. FICO score levels")
```

```
histogram(~data$MyInterest.Rate | data$FICO.Range.cut, data = data, xlab = "Interest rate",
main = "Fig. 9 - Interest rate distribution across diff. FICO score levels")
```

### Missing data or other unusual features

There are 7 missing values in the provided data set.

```
sum(is.na(data))
```

```
## [1] 7
```

Regarding unusual features, the list could be pretty long. Let's mention just the FICO score, which is provided as a factor variable grouped by range rather than as a numeric variable.

### Potential confounders

Credit scores are designed to measure the risk of default by taking into account various factors in a person's financial history. Although the exact formulas for calculating credit scores are secret, FICO has disclosed the following components:

- (30%) **Credit utilization**: the ratio of current revolving debt (such as credit card balances) to the total available revolving credit or credit limit. This component is probably correlated to **Debt.To.Income.Ratio**
- (15%) **Length of credit history**. This component is probably correlated to **Loan.Length**
- (10%) **Types of credit used**. This component is probably correlated to **Loan.Purpose**
- (10%) **Recent searches for credit**. This component is probably correlated to **Inquiries.in.the.Last.6.Months**

**All these hypotheses hold except the one regarding** the correlation between **Loan.Length** and FICO score.

```
summary(lm(data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio))
```

```
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.14 -26.61 -5.82 21.53 116.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 723.56 1.56 464.0 <2e-16 ***
## data$MyDebt.To.Income.Ratio -101.92 9.11 -11.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.2 on 2498 degrees of freedom
## Multiple R-squared: 0.0477, Adjusted R-squared: 0.0473
## F-statistic: 125 on 1 and 2498 DF, p-value: <2e-16
```

```
anova(lm(data$FICO.Range.mean ~ data$Loan.Length))
```

```
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Length 1 459 459 0.37 0.54
## Residuals 2498 3066619 1228
```

```
anova(lm(data$FICO.Range.mean ~ data$Loan.Purpose))
```

```
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Purpose 13 179846 13834 11.9 <2e-16 ***
## Residuals 2486 2887233 1161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
summary(lm(data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months))
```

```
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.23 -28.23 -7.99 21.77 121.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 710.233 0.866 820.15 <2e-16
## data$Inquiries.in.the.Last.6.Months -2.620 0.567 -4.62 4e-06
##
## (Intercept) ***
## data$Inquiries.in.the.Last.6.Months ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.9 on 2496 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.00849, Adjusted R-squared: 0.0081
## F-statistic: 21.4 on 1 and 2496 DF, p-value: 3.95e-06
```

### A more powerful linear model

Let's build a multiple-variable regression model with FICO range and the other features statistically correlated to Interest rate that best predict its variability:

- **FICO Range mean** (50%)
- **Loan.Length** (18%)
- **Amount.Funded.By.Investors** (11%)
- **Amount.Requested** (11%)

```
lm1sum <- summary(lm(data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
data$Amount.Funded.By.Investors + data$Loan.Length))
lm1sum
```

```
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
## data$Amount.Funded.By.Investors + data$Loan.Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09763 -0.01453 -0.00135 0.01271 0.10275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26e-01 8.53e-03 85.01 < 2e-16 ***
## data$FICO.Range.mean -8.75e-04 1.21e-05 -72.40 < 2e-16 ***
## data$Amount.Requested 6.69e-07 2.23e-07 3.00 0.00270 **
## data$Amount.Funded.By.Investors 7.44e-07 2.24e-07 3.33 0.00088 ***
## data$Loan.Length60 months 3.28e-02 1.12e-03 29.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0211 on 2495 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.746
## F-statistic: 1.83e+03 on 4 and 2495 DF, p-value: <2e-16
```

As we can see, every term of this model is **statistically significant** (p-value < .01). Moreover, with this model we can predict **74.6%** (Adjusted R-squared: 0.746) of the variability we will see in **Interest rate**.

## Conclusion

We found that the**features statistically correlated to Interest rate that predict best its variability**are

**FICO Range mean**(50%)**Loan.Length**(18%)**Amount.Funded.By.Investors**(11%)**Amount.Requested**(11%)

**Interest rate**and

**FICO range**, i.e. it is confirmed also

**for each loan purpose**/

**with and without home ownsership**/

**for each US state**(=these variables don't moderate the association between Interest rate and FICO score).

Finally, it's possible to build more powerful linear models with multiple features. As a reference, we built one with the above 4 features; it is **statistically significant** and can predict **74.6% of the variability** of Interest rate.
