Introduction
Data set
The data consists of a sample of 2,500 peer-to-peer loans (= number of observations/samples) issued through the Lending Club. The interest rate of these loans is determined by the Lending Club on the basis of characteristics of the person asking for the loan such as their employment history, credit history, and credit worthiness scores. Such a data set (loansData.csv) is stored in working directory.Let's load the data set and do some convenient operations aimed to traslate fake factor variables (e.g. Debt.To.Income.Ratio) into numeric variables.
data <- read.csv("loansData.csv")
data$MyInterest.Rate <- as.numeric(sub("%", "", data$Interest.Rate))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
doMean <- function(x) {
ret <- vector("numeric", length = length(x))
for (i in 1:length(x)) {
ret[i] <- (as.numeric(substr(x[i], 1, 3)) + as.numeric(substr(x[i],
5, 7)))/2
}
ret
}
data$FICO.Range.mean <- doMean(data$FICO.Range)
Purpose of analysis
The purpose of analysis is to identify and quantify associations between the interest rate of the loan and the other variables in the data set. In particular, considering whether any of these variables have an important association with interest rate after taking into account the applicant's FICO score.Methods and Results
Bivariate analysis
Let's start considering the association between Interest Rate and FICO range.## par(mfrow=c(1,2))
plot(data$MyInterest.Rate, data$FICO.Range, pch = 19, col = "blue", cex = 0.5,
main = "Fig. 1 - The association between Interest Rate and FICO score range",
xlab = "FICO score range", ylab = "Interest rate")
boxplot(data$MyInterest.Rate ~ data$FICO.Range, col = terrain.colors(length(data$FICO.Range),
alpha = 0.8), varwidth = TRUE, main = "Fig. 2 - The association between Interest Rate and FICO score range",
xlab = "FICO range score", ylab = "Interest rate")
As showed by Fig. 1 and Fig. 2, Interest rate (quantitative response variable) seems negatively associated with FICO score range (categorical explanatory variable). In order to prove these variable are significantly (confidence level 95%) associated applying Pearson correlation, we consider the variable FICO score range mean (quantitative) instead of FICO score range (categorical).
cor.test(data$MyInterest.Rate, data$FICO.Range.mean, method = "pearson", conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: data$MyInterest.Rate and data$FICO.Range.mean
## t = -50.26, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7281 -0.6891
## sample estimates:
## cor
## -0.7091
Hence, we can conclude Interest rate is negatively associated with FICO score range mean (p-value < .0001).
Moreover, if we know the FICO score range mean, we can predict 50,2% (Adjusted R-squared: 0.5026) of the variability we will see in Interest rate. lm1 <- lm(data$MyInterest.Rate ~ data$FICO.Range.mean)
summary(lm1)
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07988 -0.02136 -0.00455 0.01837 0.10195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.29e-01 1.19e-02 61.2 <2e-16 ***
## data$FICO.Range.mean -8.46e-04 1.68e-05 -50.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0295 on 2498 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.503
## F-statistic: 2.53e+03 on 1 and 2498 DF, p-value: <2e-16
Are there other features statistically correlated to Interest rate explaining more variability than FICO range? features = dim(data)[2]
pValue <- rep(NA, features)
r2 <- rep(NA, features)
for (i in 1:features) {
lm1sum <- summary(lm(data$MyInterest.Rate ~ data[, i]))
pValue[i] <- lm1sum$coeff[2, 4]
r2[i] <- lm1sum$adj.r.squared
}
data.frame(names(data), pValue, r2)
## names.data. pValue r2
## 1 Amount.Requested 1.545e-65 0.1101018
## 2 Amount.Funded.By.Investors 1.326e-67 0.1134749
## 3 Interest.Rate 0.000e+00 1.0000000
## 4 Loan.Length 1.772e-109 0.1791880
## 5 Loan.Purpose 1.654e-03 0.0322881
## 6 Debt.To.Income.Ratio 8.447e-01 0.0472510
## 7 State 1.575e-02 0.0036624
## 8 Home.Ownership 2.026e-01 0.0062175
## 9 Monthly.Income 5.396e-01 -0.0002497
## 10 FICO.Range 8.741e-01 0.5382865
## 11 Open.CREDIT.Lines 6.169e-06 0.0077580
## 12 Revolving.CREDIT.Balance 2.246e-03 0.0033351
## 13 Inquiries.in.the.Last.6.Months 1.216e-16 0.0267187
## 14 Employment.Length 3.751e-01 -0.0001105
## 15 MyInterest.Rate 0.000e+00 1.0000000
## 16 MyDebt.To.Income.Ratio 2.733e-18 0.0296123
## 17 FICO.Range.mean 0.000e+00 0.5026398
We found that several features are statistically correlated (confidence level 95%) to Interest rate but FICO range mean can predict its variability better than other variables. After FICO range, the features statistically correlated to Interest rate that predict best its variability are - Loan.Length (18%)
- Amount.Funded.By.Investors (11%)
- Amount.Requested (11%)
Potential moderators
Such a negative association holds also for each loan purpose / with and without home ownsership / for each US state or do these variables moderate the association between Interest rate and FICO score?library(lattice)
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Loan.Purpose, panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, data = data, as.table = TRUE, xlab = "FICO range score mean", ylab = "Interest rate",
main = "Fig. 3 - Interest rate vs. FICO range (mean) score for each loan purpose")
Looking at Fig. 3, we find that FICO scores explains better the variability of interest rate in case of loan for education (explain 75% variability, p < 0.001), vacation (explain 71% variability, p < 0.001), medical (explain 66% variability, p < 0.001), car (explain 61% variability, p < 0.001) and house (explain 60% variability, p < 0.001).
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Home.Ownership[data$Home.Ownership !=
"NONE"], data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780,
y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770,
y = 0.2)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 4 - Interest rate vs. FICO range with and without home ownership")
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$State[data$State !=
"MS" & data$State != "MD" & data$State != "IA"], data = data, as.table = TRUE,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
if (p > 0.001) {
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))),
x = 770, y = 0.2)
}
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 780, y = 0.15)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 5 - Interest rate vs. FICO range score for each US state")
As showed, such a statistically significant negative association is confirmed also for each loan purpose / with and without home ownsership / for each US state. So, these variables don't moderate the association between Interest rate and FICO score.
Just a note regarding analisys by state: in some cases there'are not enough observations to estimate coefficents (MS,MD,IA), while in some other cases there're enough observations to calculate coefficients but we have p > 0.001. For instance, in case of SD there're just 4 obs.
Variables associated with interest rate at the same level of applicant's FICO score
data$FICO.Range.cut <- equal.count(data$FICO.Range.mean, 15)
xyplot(data$MyInterest.Rate ~ data$Monthly.Income | data$FICO.Range.cut, data = data,
as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 65000, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 65000,
y = 0.2)
}, xlab = "Montly Income", ylab = "Interest rate", main = "Fig. 6 - Interest rate vs. Montly Income in different FICO score levels")
xyplot(data$MyInterest.Rate ~ data$Open.CREDIT.Lines | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 30, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 30,
y = 0.2)
}, xlab = "Credit lines", ylab = "Interest rate", main = "Fig. 7 - Interest rate vs. Credit lines in different FICO score levels")
xyplot(data$MyInterest.Rate ~ data$MyDebt.To.Income.Ratio | data$FICO.Range.cut,
data = data, as.table = TRUE, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
## panel.loess(x,y)
lm1 <- lm(y ~ x)
lm1sum <- summary(lm1)
r2 <- lm1sum$adj.r.squared
p <- lm1sum$coefficients[2, 4]
panel.abline(lm1)
# panel.text(labels=x,x,y)
panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))),
x = 0.2, y = 0.15)
panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 0.2,
y = 0.2)
}, xlab = "Debt To Income Ratio", ylab = "Interest rate", main = "Fig. 8 - Interest rate vs. Debt To Income Ratio in diff. FICO score levels")
histogram(~data$MyInterest.Rate | data$FICO.Range.cut, data = data, xlab = "Interest rate",
main = "Fig. 9 - Interest rate distribution across diff. FICO score levels")
Missing data or other unusual features
There are 7 missing values in the provided data set.sum(is.na(data))
## [1] 7
Regarding unusual features the list could be pretty long. Let's mention just the FICO score that is provided as factor variable and it's by grouped by range and it's not provided as numeric variable. Potential confounders
Credit scores are designed to measure the risk of default by taking into account various factors in a person's financial history. Although the exact formulas for calculating credit scores are secret, FICO has disclosed the following components:- (30%) Credit utilization: the ratio of current revolving debt (such as credit card balances) to the total available revolving credit or credit. This components is probably correlated to Debt.To.Income.Ratio
- (15%) Length of credit history. This components is probably correlated to Loan.Length
- (10%) Types of credit used. This components is probably correlated to Loan.Purpose
- (10%) Recent searches for credit. This components is probably correlated to Inquiries.in.the.Last.6.Months
summary(lm(data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio))
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.14 -26.61 -5.82 21.53 116.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 723.56 1.56 464.0 <2e-16 ***
## data$MyDebt.To.Income.Ratio -101.92 9.11 -11.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.2 on 2498 degrees of freedom
## Multiple R-squared: 0.0477, Adjusted R-squared: 0.0473
## F-statistic: 125 on 1 and 2498 DF, p-value: <2e-16
anova(lm(data$FICO.Range.mean ~ data$Loan.Length))
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Length 1 459 459 0.37 0.54
## Residuals 2498 3066619 1228
anova(lm(data$FICO.Range.mean ~ data$Loan.Purpose))
## Analysis of Variance Table
##
## Response: data$FICO.Range.mean
## Df Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Purpose 13 179846 13834 11.9 <2e-16 ***
## Residuals 2486 2887233 1161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months))
##
## Call:
## lm(formula = data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.23 -28.23 -7.99 21.77 121.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 710.233 0.866 820.15 <2e-16
## data$Inquiries.in.the.Last.6.Months -2.620 0.567 -4.62 4e-06
##
## (Intercept) ***
## data$Inquiries.in.the.Last.6.Months ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.9 on 2496 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.00849, Adjusted R-squared: 0.0081
## F-statistic: 21.4 on 1 and 2496 DF, p-value: 3.95e-06
A more powerful linear model
Let's build a multiple variable regression model with FICO range and other features statistically correlated to Interest rate that predict best its variability- FICO Range mean (50%)
- Loan.Length (18%)
- Amount.Funded.By.Investors (11%)
- Amount.Requested (11%)
lm1sum <- summary(lm(data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
data$Amount.Funded.By.Investors + data$Loan.Length))
lm1sum
##
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested +
## data$Amount.Funded.By.Investors + data$Loan.Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09763 -0.01453 -0.00135 0.01271 0.10275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26e-01 8.53e-03 85.01 < 2e-16 ***
## data$FICO.Range.mean -8.75e-04 1.21e-05 -72.40 < 2e-16 ***
## data$Amount.Requested 6.69e-07 2.23e-07 3.00 0.00270 **
## data$Amount.Funded.By.Investors 7.44e-07 2.24e-07 3.33 0.00088 ***
## data$Loan.Length60 months 3.28e-02 1.12e-03 29.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0211 on 2495 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.746
## F-statistic: 1.83e+03 on 4 and 2495 DF, p-value: <2e-16
As we can see, this model is statistically correlated to Interest rate (p-value < .01).
Moreover, with this model we can predict 74,6% (Adjusted R-squared: 0.746) of the variability we will see in Interest rate. Conclusion
We found that the features statistically correlated to Interest rate that predict best its variability are- FICO Range mean (50%)
- Loan.Length (18%)
- Amount.Funded.By.Investors (11%)
- Amount.Requested (11%)
Finally, it's possible to build more powerful linear models with multiple features. As reference, we built one with the above 4 features. We found that it is statistically correlated to Interest rate and that we can predict the 74,6% of its variability.
6 comments:
Very good post - I am still going through evaluations of others and don't want to do them before finishing reading this.
My little contribution is another way to assign data$FICO.Range.mean values:
data$FICO.Range.mean = sapply(lapply(strsplit(as.character(data$FICO.Range), '-'), as.numeric), mean)
a little contribution but very tricky
thank you
Awesome stuff! Would be great if you can share the data so that we can use it to reproduce the results.
... back from a terrible flu ...
to whom interested, here it's possible to get data and codebook: https://www.dropbox.com/s/825jo7be7ntzshs/LoansData.zip
Post a Comment