Data Analysis Project 1
##
Introduction

###
Data set

The data consists of a sample of

**2,500 peer-to-peer loans (= number of observations/samples)** issued through the

**Lending Club**. The interest rate of these loans is determined by the Lending Club on the basis of characteristics of the person asking for the loan such as their employment history, credit history, and credit worthiness scores. Such a data set (loansData.csv) is stored in working directory.

Let's load the data set and do some convenient operations aimed to traslate fake factor variables (e.g. Debt.To.Income.Ratio) into numeric variables.

```
data <- read.csv("loansData.csv")
data$MyInterest.Rate <- as.numeric(sub("%", "", data$Interest.Rate))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
doMean <- function(x) {
ret <- vector("numeric", length = length(x))
for (i in 1:length(x)) {
ret[i] <- (as.numeric(substr(x[i], 1, 3)) + as.numeric(substr(x[i],
5, 7)))/2
}
ret
}
data$FICO.Range.mean <- doMean(data$FICO.Range)
```

###
Purpose of analysis

The purpose of analysis is to identify and quantify associations between the interest rate of the loan and the other variables in the data set. In particular, considering whether any of these variables have an important association with interest rate after taking into account the applicant's FICO score.

##
Methods and Results

###
Bivariate analysis

Let's start considering the association between

**Interest Rate** and

**FICO range**.

```
## par(mfrow=c(1,2))
plot(data$MyInterest.Rate, data$FICO.Range, pch = 19, col = "blue", cex = 0.5,
main = "Fig. 1 - The association between Interest Rate and FICO score range",
xlab = "FICO score range", ylab = "Interest rate")
```

```
boxplot(data$MyInterest.Rate ~ data$FICO.Range, col = terrain.colors(length(data$FICO.Range),
alpha = 0.8), varwidth = TRUE, main = "Fig. 2 - The association between Interest Rate and FICO score range",
xlab = "FICO range score", ylab = "Interest rate")
```

As showed by Fig. 1 and Fig. 2,

**Interest rate** (quantitative response variable) seems negatively associated with

**FICO score range** (categorical explanatory variable). In order to prove these variable are significantly (confidence level 95%) associated applying Pearson correlation, we consider the variable

**FICO score range mean** (quantitative) instead of

**FICO score range** (categorical).

```
cor.test(data$MyInterest.Rate, data$FICO.Range.mean, method = "pearson", conf.level = 0.95)
```

```
##
## Pearson's product-moment correlation
##
## data: data$MyInterest.Rate and data$FICO.Range.mean
## t = -50.26, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7281 -0.6891
## sample estimates:
## cor
## -0.7091
```

Hence, we can conclude

**Interest rate** is negatively associated with

**FICO score range mean** (p-value < .0001).
Moreover, if we know the

**FICO score range mean**, we can predict

**50,2%** (Adjusted R-squared: 0.5026) of the variability we will see in

**Interest rate**.

```
lm1 <- lm(data$MyInterest.Rate ~ data$FICO.Range.mean)
summary(lm1)
```

```
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07988 -0.02136 -0.00455 0.01837 0.10195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.29e-01 1.19e-02 61.2 <2e-16 ***
## data$FICO.Range.mean -8.46e-04 1.68e-05 -50.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0295 on 2498 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.503
## F-statistic: 2.53e+03 on 1 and 2498 DF, p-value: <2e-16
```

Are there other features statistically correlated to Interest rate explaining more variability than FICO range?

```
features = dim(data)[2]
pValue <- rep(NA, features)
r2 <- rep(NA, features)
for (i in 1:features) {
lm1sum <- summary(lm(data$MyInterest.Rate ~ data[, i]))
pValue[i] <- lm1sum$coeff[2, 4]
r2[i] <- lm1sum$adj.r.squared
}
data.frame(names(data), pValue, r2)
```

```
## names.data. pValue r2
## 1 Amount.Requested 1.545e-65 0.1101018
## 2 Amount.Funded.By.Investors 1.326e-67 0.1134749
## 3 Interest.Rate 0.000e+00 1.0000000
## 4 Loan.Length 1.772e-109 0.1791880
## 5 Loan.Purpose 1.654e-03 0.0322881
## 6 Debt.To.Income.Ratio 8.447e-01 0.0472510
## 7 State 1.575e-02 0.0036624
## 8 Home.Ownership 2.026e-01 0.0062175
## 9 Monthly.Income 5.396e-01 -0.0002497
## 10 FICO.Range 8.741e-01 0.5382865
## 11 Open.CREDIT.Lines 6.169e-06 0.0077580
## 12 Revolving.CREDIT.Balance 2.246e-03 0.0033351
## 13 Inquiries.in.the.Last.6.Months 1.216e-16 0.0267187
## 14 Employment.Length 3.751e-01 -0.0001105
## 15 MyInterest.Rate 0.000e+00 1.0000000
## 16 MyDebt.To.Income.Ratio 2.733e-18 0.0296123
## 17 FICO.Range.mean 0.000e+00 0.5026398
```

We found that several features are statistically correlated (confidence level 95%) to Interest rate but

**FICO range mean** can

**predict its variability better than other variables**. After FICO range, the features statistically correlated to Interest rate that predict best its variability are

**Loan.Length** (18%)
**Amount.Funded.By.Investors** (11%)
**Amount.Requested** (11%)

###
Potential moderators

Such a negative association holds also

**for each loan purpose** /

**with and without home ownsership** /

**for each US state** or do these variables moderate the association between Interest rate and FICO score?

```
library(lattice)
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Loan.Purpose, panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, data = data, as.table = TRUE, xlab = "FICO range score mean", ylab = "Interest rate",
main = "Fig. 3 - Interest rate vs. FICO range (mean) score for each loan purpose")
```

Looking at Fig. 3, we find that FICO scores

**explains better the variability of interest rate** in case of loan for

**education** (explain 75% variability, p < 0.001),

**vacation** (explain 71% variability, p < 0.001),

**medical** (explain 66% variability, p < 0.001),

**car** (explain 61% variability, p < 0.001) and

**house** (explain 60% variability, p < 0.001).

```
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Home.Ownership[data$Home.Ownership !=
"NONE"], data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 4 - Interest rate vs. FICO range with and without home ownership")
```

```
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$State[data$State !=
"MS" & data$State != "MD" & data$State != "IA"], data = data, as.table = TRUE,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
if (p > 0.001) {
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))),
x = 770, y = 0.2)
}
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 780, y = 0.15)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 5 - Interest rate vs. FICO range score for each US state")
```

As showed, such a statistically significant negative association is confirmed also

**for each loan purpose** /

**with and without home ownsership** /

**for each US state**. So,

**these variables don't moderate the association between Interest rate and FICO score**.

Just a note regarding analisys by state: in some cases there'are not enough observations to estimate coefficents (MS,MD,IA), while in some other cases there're enough observations to calculate coefficients but we have p > 0.001. For instance, in case of SD there're just 4 obs.

###
Variables associated with interest rate at the same level of applicant's FICO score

```
data$FICO.Range.cut <- equal.count(data$FICO.Range.mean, 15)
xyplot(data$MyInterest.Rate ~ data$Monthly.Income | data$FICO.Range.cut, data = data,
as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 65000, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 65000,
y = 0.2)
}, xlab = "Montly Income", ylab = "Interest rate", main = "Fig. 6 - Interest rate vs. Montly Income in different FICO score levels")
```

```
xyplot(data$MyInterest.Rate ~ data$Open.CREDIT.Lines | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 30, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 30,
y = 0.2)
}, xlab = "Credit lines", ylab = "Interest rate", main = "Fig. 7 - Interest rate vs. Credit lines in different FICO score levels")
```

```
xyplot(data$MyInterest.Rate ~ data$MyDebt.To.Income.Ratio | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 0.2, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 0.2,
y = 0.2)
}, xlab = "Debt To Income Ratio", ylab = "Interest rate", main = "Fig. 8 - Interest rate vs. Debt To Income Ratio in diff. FICO score levels")
```

```
histogram(~data$MyInterest.Rate | data$FICO.Range.cut, data = data, xlab = "Interest rate",
main = "Fig. 9 - Interest rate distribution across diff. FICO score levels")
```

###
Missing data or other unusual features

There are 7 missing values in the provided data set.

```
sum(is.na(data))
```

```
## [1] 7
```

Regarding unusual features the list could be pretty long. Let's mention just the FICO score that is provided as factor variable and it's by grouped by range and it's not provided as numeric variable.

###
Potential confounders

Credit scores are designed to measure the risk of default by taking into account various factors in a person's financial history. Although the exact formulas for calculating credit scores are secret, FICO has disclosed the following components:

- (30%)
**Credit utilization**: the ratio of current revolving debt (such as credit card balances) to the total available revolving credit or credit. This components is probably correlated to **Debt.To.Income.Ratio**
- (15%)
**Length of credit history**. This components is probably correlated to **Loan.Length**
- (10%)
**Types of credit used**. This components is probably correlated to **Loan.Purpose**
- (10%)
**Recent searches for credit**. This components is probably correlated to **Inquiries.in.the.Last.6.Months**

**All these hypotesis are true except the one regarding** the correlation between

**Loan.Length** and FICO score.

```
summary(lm(data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio))
```

```
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.14 -26.61 -5.82 21.53 116.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 723.56 1.56 464.0 <2e-16 ***
## data$MyDebt.To.Income.Ratio -101.92 9.11 -11.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.2 on 2498 degrees of freedom
## Multiple R-squared: 0.0477, Adjusted R-squared: 0.0473
## F-statistic: 125 on 1 and 2498 DF, p-value: <2e-16
```

```
anova(lm(data$FICO.Range.mean ~ data$Loan.Length))
```

```
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Length 1 459 459 0.37 0.54
## Residuals 2498 3066619 1228
```

```
anova(lm(data$FICO.Range.mean ~ data$Loan.Purpose))
```

```
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Purpose 13 179846 13834 11.9 <2e-16 ***
## Residuals 2486 2887233 1161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
summary(lm(data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months))
```

```
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.23 -28.23 -7.99 21.77 121.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 710.233 0.866 820.15 <2e-16
## data$Inquiries.in.the.Last.6.Months -2.620 0.567 -4.62 4e-06
##
## (Intercept) ***
## data$Inquiries.in.the.Last.6.Months ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.9 on 2496 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.00849, Adjusted R-squared: 0.0081
## F-statistic: 21.4 on 1 and 2496 DF, p-value: 3.95e-06
```

###
A more powerful linear model

Let's build a multiple variable regression model with FICO range and other features statistically correlated to Interest rate that predict best its variability

**FICO Range mean** (50%)
**Loan.Length** (18%)
**Amount.Funded.By.Investors** (11%)
**Amount.Requested** (11%)

```
lm1sum <- summary(lm(data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
data$Amount.Funded.By.Investors + data$Loan.Length))
lm1sum
```

```
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
## data$Amount.Funded.By.Investors + data$Loan.Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09763 -0.01453 -0.00135 0.01271 0.10275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26e-01 8.53e-03 85.01 < 2e-16 ***
## data$FICO.Range.mean -8.75e-04 1.21e-05 -72.40 < 2e-16 ***
## data$Amount.Requested 6.69e-07 2.23e-07 3.00 0.00270 **
## data$Amount.Funded.By.Investors 7.44e-07 2.24e-07 3.33 0.00088 ***
## data$Loan.Length60 months 3.28e-02 1.12e-03 29.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0211 on 2495 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.746
## F-statistic: 1.83e+03 on 4 and 2495 DF, p-value: <2e-16
```

As we can see, this model is

**statistically correlated to Interest rate** (p-value < .01).
Moreover, with this model we can predict

**74,6%** (Adjusted R-squared: 0.746) of the variability we will see in

**Interest rate**.

##
Conclusion

We found that the

**features statistically correlated to Interest rate that predict best its variability** are

**FICO Range mean** (50%)
**Loan.Length** (18%)
**Amount.Funded.By.Investors** (11%)
**Amount.Requested** (11%)

Moreover, there're no moderators between

**Interest rate** and

**FICO range** , i.e. it is confirmed also

**for each loan purpose** /

**with and without home ownsership** /

**for each US state** (=these variables don't moderate the association between Interest rate and FICO score).

Finally, it's possible to build more powerful linear models with multiple features. As reference, we built one with the above 4 features. We found that it is

**statistically correlated to Interest rate** and that we can predict the

**74,6% of its variability**.