Tuesday, November 19, 2013

An example of exploratory analysis in R (lattice package)

Data Analysis Project 1

Introduction

Data set

The data consists of a sample of 2,500 peer-to-peer loans (= number of observations/samples) issued through the Lending Club. The interest rate of these loans is determined by the Lending Club on the basis of characteristics of the person asking for the loan such as their employment history, credit history, and credit worthiness scores. Such a data set (loansData.csv) is stored in working directory.
Let's load the data set and do some convenient operations aimed to traslate fake factor variables (e.g. Debt.To.Income.Ratio) into numeric variables.
data <- read.csv("loansData.csv")
data$MyInterest.Rate <- as.numeric(sub("%", "", data$Interest.Rate))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
data$MyDebt.To.Income.Ratio <- as.numeric(sub("%", "", data$Debt.To.Income.Ratio))/100
doMean <- function(x) {
    ret <- vector("numeric", length = length(x))
    for (i in 1:length(x)) {
        ret[i] <- (as.numeric(substr(x[i], 1, 3)) + as.numeric(substr(x[i], 
            5, 7)))/2
    }
    ret
}
data$FICO.Range.mean <- doMean(data$FICO.Range)

Purpose of analysis

The purpose of analysis is to identify and quantify associations between the interest rate of the loan and the other variables in the data set. In particular, considering whether any of these variables have an important association with interest rate after taking into account the applicant's FICO score.

Methods and Results

Bivariate analysis

Let's start considering the association between Interest Rate and FICO range.
## par(mfrow=c(1,2))
plot(data$MyInterest.Rate, data$FICO.Range, pch = 19, col = "blue", cex = 0.5, 
    main = "Fig. 1 - The association between Interest Rate and FICO score range", 
    xlab = "FICO score range", ylab = "Interest rate")
plot of chunk unnamed-chunk-3
boxplot(data$MyInterest.Rate ~ data$FICO.Range, col = terrain.colors(length(data$FICO.Range), 
    alpha = 0.8), varwidth = TRUE, main = "Fig. 2 - The association between Interest Rate and FICO score range", 
    xlab = "FICO range score", ylab = "Interest rate")
plot of chunk unnamed-chunk-3
As showed by Fig. 1 and Fig. 2, Interest rate (quantitative response variable) seems negatively associated with FICO score range (categorical explanatory variable). In order to prove these variable are significantly (confidence level 95%) associated applying Pearson correlation, we consider the variable FICO score range mean (quantitative) instead of FICO score range (categorical).
cor.test(data$MyInterest.Rate, data$FICO.Range.mean, method = "pearson", conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  data$MyInterest.Rate and data$FICO.Range.mean
## t = -50.26, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7281 -0.6891
## sample estimates:
##     cor 
## -0.7091
Hence, we can conclude Interest rate is negatively associated with FICO score range mean (p-value < .0001). Moreover, if we know the FICO score range mean, we can predict 50,2% (Adjusted R-squared: 0.5026) of the variability we will see in Interest rate.
lm1 <- lm(data$MyInterest.Rate ~ data$FICO.Range.mean)
summary(lm1)
## 
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07988 -0.02136 -0.00455  0.01837  0.10195 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.29e-01   1.19e-02    61.2   <2e-16 ***
## data$FICO.Range.mean -8.46e-04   1.68e-05   -50.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0295 on 2498 degrees of freedom
## Multiple R-squared:  0.503,  Adjusted R-squared:  0.503 
## F-statistic: 2.53e+03 on 1 and 2498 DF,  p-value: <2e-16
Are there other features statistically correlated to Interest rate explaining more variability than FICO range?
features = dim(data)[2]
pValue <- rep(NA, features)
r2 <- rep(NA, features)
for (i in 1:features) {
    lm1sum <- summary(lm(data$MyInterest.Rate ~ data[, i]))
    pValue[i] <- lm1sum$coeff[2, 4]
    r2[i] <- lm1sum$adj.r.squared
}

data.frame(names(data), pValue, r2)
##                       names.data.     pValue         r2
## 1                Amount.Requested  1.545e-65  0.1101018
## 2      Amount.Funded.By.Investors  1.326e-67  0.1134749
## 3                   Interest.Rate  0.000e+00  1.0000000
## 4                     Loan.Length 1.772e-109  0.1791880
## 5                    Loan.Purpose  1.654e-03  0.0322881
## 6            Debt.To.Income.Ratio  8.447e-01  0.0472510
## 7                           State  1.575e-02  0.0036624
## 8                  Home.Ownership  2.026e-01  0.0062175
## 9                  Monthly.Income  5.396e-01 -0.0002497
## 10                     FICO.Range  8.741e-01  0.5382865
## 11              Open.CREDIT.Lines  6.169e-06  0.0077580
## 12       Revolving.CREDIT.Balance  2.246e-03  0.0033351
## 13 Inquiries.in.the.Last.6.Months  1.216e-16  0.0267187
## 14              Employment.Length  3.751e-01 -0.0001105
## 15                MyInterest.Rate  0.000e+00  1.0000000
## 16         MyDebt.To.Income.Ratio  2.733e-18  0.0296123
## 17                FICO.Range.mean  0.000e+00  0.5026398
We found that several features are statistically correlated (confidence level 95%) to Interest rate but FICO range mean can predict its variability better than other variables. After FICO range, the features statistically correlated to Interest rate that predict best its variability are
  1. Loan.Length (18%)
  2. Amount.Funded.By.Investors (11%)
  3. Amount.Requested (11%)

Potential moderators

Such a negative association holds also for each loan purpose / with and without home ownsership / for each US state or do these variables moderate the association between Interest rate and FICO score?
library(lattice)
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Loan.Purpose, panel = function(x, 
    y, ...) {
    panel.xyplot(x, y, ...)
    lm1 <- lm(y ~ x)
    lm1sum <- summary(lm1)
    r2 <- lm1sum$adj.r.squared
    p <- lm1sum$coefficients[2, 4]
    panel.abline(lm1)
    panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780, 
        y = 0.15)
    panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770, 
        y = 0.2)
}, data = data, as.table = TRUE, xlab = "FICO range score mean", ylab = "Interest rate", 
    main = "Fig. 3 - Interest rate vs. FICO range (mean) score for each loan purpose")
plot of chunk unnamed-chunk-7
Looking at Fig. 3, we find that FICO scores explains better the variability of interest rate in case of loan for education (explain 75% variability, p < 0.001), vacation (explain 71% variability, p < 0.001), medical (explain 66% variability, p < 0.001), car (explain 61% variability, p < 0.001) and house (explain 60% variability, p < 0.001).
xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$Home.Ownership[data$Home.Ownership != 
    "NONE"], data = data, as.table = TRUE, panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    lm1 <- lm(y ~ x)
    lm1sum <- summary(lm1)
    r2 <- lm1sum$adj.r.squared
    p <- lm1sum$coefficients[2, 4]
    panel.abline(lm1)
    panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), x = 780, 
        y = 0.15)
    panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 770, 
        y = 0.2)
}, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 4 - Interest rate vs. FICO range with and without home ownership")
plot of chunk unnamed-chunk-8

xyplot(data$MyInterest.Rate ~ data$FICO.Range.mean | data$State[data$State != 
    "MS" & data$State != "MD" & data$State != "IA"], data = data, as.table = TRUE, 
    panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        lm1 <- lm(y ~ x)
        lm1sum <- summary(lm1)
        r2 <- lm1sum$adj.r.squared
        p <- lm1sum$coefficients[2, 4]
        panel.abline(lm1)
        if (p > 0.001) {
            panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), 
                x = 770, y = 0.2)
        }
        panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), 
            x = 780, y = 0.15)

    }, xlab = "FICO range score", ylab = "Interest rate", main = "Fig. 5 - Interest rate vs. FICO range score for each US state")
plot of chunk unnamed-chunk-8
As showed, such a statistically significant negative association is confirmed also for each loan purpose / with and without home ownsership / for each US state. So, these variables don't moderate the association between Interest rate and FICO score.
Just a note regarding analisys by state: in some cases there'are not enough observations to estimate coefficents (MS,MD,IA), while in some other cases there're enough observations to calculate coefficients but we have p > 0.001. For instance, in case of SD there're just 4 obs.

Variables associated with interest rate at the same level of applicant's FICO score

data$FICO.Range.cut <- equal.count(data$FICO.Range.mean, 15)
xyplot(data$MyInterest.Rate ~ data$Monthly.Income | data$FICO.Range.cut, data = data, 
    as.table = TRUE, panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        ## panel.loess(x,y)
        lm1 <- lm(y ~ x)
        lm1sum <- summary(lm1)
        r2 <- lm1sum$adj.r.squared
        p <- lm1sum$coefficients[2, 4]
        panel.abline(lm1)
        # panel.text(labels=x,x,y)
        panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), 
            x = 65000, y = 0.15)
        panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 65000, 
            y = 0.2)
    }, xlab = "Montly Income", ylab = "Interest rate", main = "Fig. 6 - Interest rate vs. Montly Income in different FICO score levels")
plot of chunk unnamed-chunk-9


xyplot(data$MyInterest.Rate ~ data$Open.CREDIT.Lines | data$FICO.Range.cut, 
    data = data, as.table = TRUE, panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        ## panel.loess(x,y)
        lm1 <- lm(y ~ x)
        lm1sum <- summary(lm1)
        r2 <- lm1sum$adj.r.squared
        p <- lm1sum$coefficients[2, 4]
        panel.abline(lm1)
        # panel.text(labels=x,x,y)
        panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), 
            x = 30, y = 0.15)
        panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 30, 
            y = 0.2)
    }, xlab = "Credit lines", ylab = "Interest rate", main = "Fig. 7 - Interest rate vs. Credit lines in different FICO score levels")
plot of chunk unnamed-chunk-9


xyplot(data$MyInterest.Rate ~ data$MyDebt.To.Income.Ratio | data$FICO.Range.cut, 
    data = data, as.table = TRUE, panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        ## panel.loess(x,y)
        lm1 <- lm(y ~ x)
        lm1sum <- summary(lm1)
        r2 <- lm1sum$adj.r.squared
        p <- lm1sum$coefficients[2, 4]
        panel.abline(lm1)
        # panel.text(labels=x,x,y)
        panel.text(labels = bquote(italic(R)^2 == .(format(r2, digits = 3))), 
            x = 0.2, y = 0.15)
        panel.text(labels = bquote(italic(p) == .(format(p, digits = 3))), x = 0.2, 
            y = 0.2)
    }, xlab = "Debt To Income Ratio", ylab = "Interest rate", main = "Fig. 8 - Interest rate vs. Debt To Income Ratio in diff. FICO score levels")
plot of chunk unnamed-chunk-9

histogram(~data$MyInterest.Rate | data$FICO.Range.cut, data = data, xlab = "Interest rate", 
    main = "Fig. 9 - Interest rate distribution across diff. FICO score levels")
plot of chunk unnamed-chunk-9

Missing data or other unusual features

There are 7 missing values in the provided data set.
sum(is.na(data))
## [1] 7
Regarding unusual features the list could be pretty long. Let's mention just the FICO score that is provided as factor variable and it's by grouped by range and it's not provided as numeric variable.

Potential confounders

Credit scores are designed to measure the risk of default by taking into account various factors in a person's financial history. Although the exact formulas for calculating credit scores are secret, FICO has disclosed the following components:
  • (30%) Credit utilization: the ratio of current revolving debt (such as credit card balances) to the total available revolving credit or credit. This components is probably correlated to Debt.To.Income.Ratio
  • (15%) Length of credit history. This components is probably correlated to Loan.Length
  • (10%) Types of credit used. This components is probably correlated to Loan.Purpose
  • (10%) Recent searches for credit. This components is probably correlated to Inquiries.in.the.Last.6.Months
All these hypotesis are true except the one regarding the correlation between Loan.Length and FICO score.
summary(lm(data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio))
## 
## Call:
## lm(formula = data$FICO.Range.mean ~ data$MyDebt.To.Income.Ratio)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79.14 -26.61  -5.82  21.53 116.52 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   723.56       1.56   464.0   <2e-16 ***
## data$MyDebt.To.Income.Ratio  -101.92       9.11   -11.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.2 on 2498 degrees of freedom
## Multiple R-squared:  0.0477, Adjusted R-squared:  0.0473 
## F-statistic:  125 on 1 and 2498 DF,  p-value: <2e-16
anova(lm(data$FICO.Range.mean ~ data$Loan.Length))
## Analysis of Variance Table
## 
## Response: data$FICO.Range.mean
##                    Df  Sum Sq Mean Sq F value Pr(>F)
## data$Loan.Length    1     459     459    0.37   0.54
## Residuals        2498 3066619    1228
anova(lm(data$FICO.Range.mean ~ data$Loan.Purpose))
## Analysis of Variance Table
## 
## Response: data$FICO.Range.mean
##                     Df  Sum Sq Mean Sq F value Pr(>F)    
## data$Loan.Purpose   13  179846   13834    11.9 <2e-16 ***
## Residuals         2486 2887233    1161                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months))
## 
## Call:
## lm(formula = data$FICO.Range.mean ~ data$Inquiries.in.the.Last.6.Months)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68.23 -28.23  -7.99  21.77 121.77 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          710.233      0.866  820.15   <2e-16
## data$Inquiries.in.the.Last.6.Months   -2.620      0.567   -4.62    4e-06
##                                        
## (Intercept)                         ***
## data$Inquiries.in.the.Last.6.Months ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.9 on 2496 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.00849,    Adjusted R-squared:  0.0081 
## F-statistic: 21.4 on 1 and 2496 DF,  p-value: 3.95e-06

A more powerful linear model

Let's build a multiple variable regression model with FICO range and other features statistically correlated to Interest rate that predict best its variability
  1. FICO Range mean (50%)
  2. Loan.Length (18%)
  3. Amount.Funded.By.Investors (11%)
  4. Amount.Requested (11%)
lm1sum <- summary(lm(data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested + 
    data$Amount.Funded.By.Investors + data$Loan.Length))
lm1sum
## 
## Call:
## lm(formula = data$MyInterest.Rate ~ data$FICO.Range.mean + data$Amount.Requested + 
##     data$Amount.Funded.By.Investors + data$Loan.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09763 -0.01453 -0.00135  0.01271  0.10275 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      7.26e-01   8.53e-03   85.01  < 2e-16 ***
## data$FICO.Range.mean            -8.75e-04   1.21e-05  -72.40  < 2e-16 ***
## data$Amount.Requested            6.69e-07   2.23e-07    3.00  0.00270 ** 
## data$Amount.Funded.By.Investors  7.44e-07   2.24e-07    3.33  0.00088 ***
## data$Loan.Length60 months        3.28e-02   1.12e-03   29.32  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0211 on 2495 degrees of freedom
## Multiple R-squared:  0.746,  Adjusted R-squared:  0.746 
## F-statistic: 1.83e+03 on 4 and 2495 DF,  p-value: <2e-16
As we can see, this model is statistically correlated to Interest rate (p-value < .01). Moreover, with this model we can predict 74,6% (Adjusted R-squared: 0.746) of the variability we will see in Interest rate.

Conclusion

We found that the features statistically correlated to Interest rate that predict best its variability are
  1. FICO Range mean (50%)
  2. Loan.Length (18%)
  3. Amount.Funded.By.Investors (11%)
  4. Amount.Requested (11%)
Moreover, there're no moderators between Interest rate and FICO range , i.e. it is confirmed also for each loan purpose / with and without home ownsership / for each US state (=these variables don't moderate the association between Interest rate and FICO score).
Finally, it's possible to build more powerful linear models with multiple features. As reference, we built one with the above 4 features. We found that it is statistically correlated to Interest rate and that we can predict the 74,6% of its variability.