Predicting Defaults of Loans using Lending Club’s Loan Data
Oleh Dubno
Fall 2014
General Assembly – Data Science
Link to my Developer Notebook (ipynb): http://nbviewer.ipython.org/gist/odubno/0b767a47f75adb382246
Background and Hypothesis:
The data comes from Lending Club (LC), a peer-to-peer lending company headquartered in San Francisco. LC began as an online consumer-lending platform that enables borrowers to obtain a loan funded by individuals and institutions. LC only recently made its loans available to small businesses; I will be focusing on the former, the consumer loans. The 2007-2011 dataset and the associated description of its features are downloadable from the LC site. It contains 188,127 entries and 31 features.
Goal:
Discover the features that are indicative of someone paying or defaulting on their loan.
Tools:
Logistic regression, Naïve Bayes, Decision Tree: to determine which features of the data set contribute towards someone repaying or defaulting on his or her loan, with the Decision Tree used to see how well the model performs against a test set.
Folium: to map the features of the dataset.
Plotting a bar chart of the loan statuses reveals seven unique values, but the logistic regression requires only two. (see figure below) The focus is on predicting who repays or defaults on their loan. As a result, the “Current” category will be removed, the “Fully Paid” category will remain, and the rest of the categories will be grouped together and labeled “Unpaid”. The result is then converted to Boolean values: Unpaid 0 and Paid 1.
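The grouping described above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's code: the function name is hypothetical, and the status labels other than “Current” and “Fully Paid” are assumed examples of the remaining categories.

```python
# Status labels other than "Current" and "Fully Paid" are illustrative assumptions.
def to_binary_status(status):
    """Map a loan status to 1 (paid), 0 (unpaid), or None (still current)."""
    if status == "Current":
        return None  # current loans are dropped from the analysis
    return 1 if status == "Fully Paid" else 0


sample_statuses = [
    "Fully Paid",
    "Current",
    "Charged Off",
    "Late (31-120 days)",
    "Fully Paid",
]

# Convert every status, then drop the loans that are still current.
labels = [to_binary_status(s) for s in sample_statuses]
labels = [y for y in labels if y is not None]
print(labels)
```

Dropping “Current” before training keeps loans with an unknown outcome from being counted as either paid or unpaid.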
The data has now been drastically reduced. Because “Current” is by far the largest category, removing it reduces the dataset to 54,419 entries. This is necessary, since the goal is not to analyze current loans.
Data Overview
The average funded amount of an individual loan is $13,924.27. The minimum loan given out is $1,000.00, with a median amount of $12,000.00 and a maximum amount of just $35,000.00. The funded amount looks roughly normally distributed and the numbers do not appear to be heavily skewed. Good!
The average annual income is $71,833.82, with a minimum income of $4,800, a median of $62,000 and a maximum income of $7,141,778. The maximum value is a clear outlier, so the set will be limited to incomes of $200,000 or less. Not surprisingly, as Annual Income goes up so does the Funded Amount. The sweet spot, after which Annual Income no longer predicts Funded Amount, seems to sit at about the mean annual income itself, roughly $72,000.
I suppose the mean annual income of $72,000 matches the $35,000 cutoff for loans for good reasons. Interestingly, Lending Club seems to have a strict policy of limiting the Amount Funded according to the individual's Annual Income up to $72,000, after which the funded amount begins to vary.
Let's run an OLS regression using Annual Income (the predictor) to predict Amount Funded (the explained variable). OLS (Ordinary Least Squares) fits a line that predicts the dependent variable, Amount Funded, from the independent variable, Annual Income; the regression algorithm “learns” from this data to predict the Amount Funded given the Annual Income. With the dataset limited to income <= $200,000, the OLS regression shows an R^2 of .201. This means that 20% of the variance in Funded Amount is explained by Annual Income, which is a low R^2. With the assistance of the scatter plot, we do see that Annual Income is suggestive of the Funded Amount only up to an Annual Income of about $72,000.
Logistic Regression
Next: 4 Logistic Regressions Determining Loan Status
The first logistic regression uses the length of employment and the grade that the loan received from LC to predict loan status. Below is a chart highlighting the coefficients. In a logistic regression, a coefficient represents the change in the log-odds of the response variable for one unit of change in the predictor variable. In other words, a 1 year increase in employment length increases the log-odds of the loan being paid back by 0.016. A 2 year increase in employment length increases the log-odds by 0.032, and so on.
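Because a logistic regression coefficient acts on the log-odds scale, it is often easier to read after exponentiating it into an odds ratio. The sketch below does this for the 0.016 employment-length coefficient quoted above; the helper name `odds_ratio` is hypothetical and not taken from the notebook.

```python
import math

COEF_EMP_LENGTH = 0.016  # employment-length coefficient quoted in the text


def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds of repayment for a `delta`-unit increase."""
    return math.exp(coef * delta)


# Odds multiplier for 1, 2 and 10 additional years of employment.
for years in (1, 2, 10):
    print(years, round(odds_ratio(COEF_EMP_LENGTH, years), 4))
```

An odds ratio just above 1 means each extra year of employment raises the odds of repayment only slightly, which is in line with the small coefficient.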
It would be interesting to see how effective the grade that LC assigns to its loans is at predicting loan status. Some background: the provided grades range from “A – G”, “A” being the highest and “G” the lowest. As a result I mapped “7”, the highest value, to “A”, “6” to “B”, “5” to “C” and so on until “1” as “G”. As the grade increases by 1 grade value, the log-odds of the loan being paid off increase by 0.31. Given that we're using a binary output of 0 as unpaid and 1 as paid, the larger the product of the grade value and the coefficient, the higher the predicted likelihood of the loan being paid off. For example, a grade of “E” (3) contributes 3 x 0.31 = 0.93 to the log-odds, and every higher grade contributes more, so the predicted chance of payback rises steadily with the grade.
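The grade encoding described above (“A” as 7 down to “G” as 1) and the product of grade value and coefficient can be sketched as follows. The names are hypothetical, and the intercept and the other predictors of the fitted model are deliberately left out, so the result is only the grade's contribution to the log-odds, not a probability.

```python
# Ordinal encoding from the text: "A" -> 7, "B" -> 6, ..., "G" -> 1.
GRADE_TO_SCORE = {grade: 7 - i for i, grade in enumerate("ABCDEFG")}

GRADE_COEF = 0.31  # grade coefficient quoted in the text


def grade_log_odds(grade):
    """Contribution of a loan grade to the log-odds of repayment (no intercept)."""
    return GRADE_COEF * GRADE_TO_SCORE[grade]


for grade in "ABCDEFG":
    print(grade, GRADE_TO_SCORE[grade], round(grade_log_odds(grade), 2))
```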
The second logistic regression uses funded amount and annual income to predict loan status. The coefficients for funded amount and annual income are very small because both predictors are dollar amounts in the thousands, while the explained variable, loan status, is binary (0 or 1).
Let's look at the amount funded. As the amount funded increases by $10,000, the log-odds of it getting paid back decrease by 0.238 (10,000 x -0.0000238 = -0.238). Similarly, as annual income increases so does the chance of the loan being paid off. Intuitive, right?
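To see what a -0.238 change in log-odds means on the probability scale, a baseline probability is needed, because the same shift moves different baselines by different amounts. The sketch below uses the funded-amount coefficient quoted above; the 0.80 baseline repayment probability and the helper name are hypothetical choices for illustration only.

```python
import math

FUNDED_COEF = -0.0000238  # funded-amount coefficient quoted in the text
BASELINE_P = 0.80         # hypothetical baseline repayment probability


def shift_probability(p, log_odds_change):
    """Apply a change in log-odds to a baseline probability `p`."""
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(logit + log_odds_change)))


change = FUNDED_COEF * 10_000  # effect of an extra $10,000 funded
new_p = shift_probability(BASELINE_P, change)
print(round(change, 3), round(new_p, 3))
```

Under this assumed baseline, an extra $10,000 funded lowers the predicted repayment probability by a few percentage points rather than by 0.238 outright.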
The income effect is understandable and supported by the positive annual-income coefficient of 0.0000202. In other words, as the annual income increases by $10,000, the log-odds of the loan being paid back increase by 0.230 (10,000 x 0.0000230).
The third logistic regression uses home ownership status (Rent, Mortgage, Own, None, Other) to predict loan status.
My understanding is that someone who puts “OTHER” for home ownership on the loan application either did not want to reveal their home ownership situation, is hiding something, or is bad at filling out applications. “None” could be an honest answer from someone who may be living with their parents. Regardless, it seems that if someone checks off “OTHER” and gets funded, then there’s a very good chance of that individual defaulting on his or her loan.
The fourth logistic regression is using employment length (<1 year – 10+ years) to predict loan status.
There doesn’t immediately appear to be much variance between the generated coefficients for years employed. It looks like, so long as the person is employed, they will be paying back their loan. However, it holds true that if someone is unemployed or has less than a year of employment then they’ll have a lower chance of repaying their loan. I didn’t investigate what percentage of “<1 year” is employed or unemployed. Interestingly, and probably just a coincidence because the results are really marginal, a person employed for 4 years has the same coefficient for paying back their loan as someone employed for one year or less. Just an observation. I will not be pursuing that point any further.
To conclude the work on logistic regression: the data set still has unexplored features that I lacked the experience and time to explore. From the findings that I got, I can’t speak definitively, but I would say avoid giving loans to people that don’t specify home ownership and do give loans to people with higher income.
Decision Tree and The Confusion Matrix
A confusion matrix allows for more detailed analysis than the mere proportion of correct guesses. For instance, 177 paid loans were incorrectly predicted as unpaid.
Based on the entries in the confusion matrix, the total number of correct predictions made by the model is 31,771 (177 loans + 31,594 loans) and the total number of incorrect predictions is 9,097 (177 loans + 8,920 loans). The confusion matrix provides the information needed to determine how well a classification model performs. The performance metric, accuracy, summarizes this information with a single number: .777. Accuracy takes the total number of correct predictions and divides it by the total number of all predictions made.
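The accuracy figure can be checked directly from the four counts quoted above. In this sketch the variable names are hypothetical, and the assignment of each count to a specific cell of the matrix is an assumption wherever the text does not state it; the arithmetic depends only on which counts are correct and which are incorrect.

```python
# Counts quoted in the text; cell labels marked "assumed" are not stated there.
unpaid_predicted_unpaid = 177   # correct (assumed cell)
paid_predicted_paid = 31_594    # correct (assumed cell)
paid_predicted_unpaid = 177     # incorrect (stated in the text)
unpaid_predicted_paid = 8_920   # incorrect (assumed cell)

correct = unpaid_predicted_unpaid + paid_predicted_paid
incorrect = paid_predicted_unpaid + unpaid_predicted_paid

# Accuracy = correct predictions / all predictions.
accuracy = correct / (correct + incorrect)
print(correct, incorrect, round(accuracy, 3))
```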
Mapping Paid and Unpaid Loans
The above map is a choropleth map, "a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed." (Wikipedia)
As the intensity of the color increases (the value gets closer to 1), a larger share of the loans in that state have, on average, been paid off.
The number near the point indicates the number of loans given in that state.
By the looks of the map, Nebraska, Missouri, Oregon, Virginia, Montana, Wyoming and South Dakota are the states that fare worst in repaying their loans.
Of course, this is an average of individual loans per state that ignores specific regions within the state, and it is not the best estimate of whether a funded individual in that state is likely to repay their loan.
However, maybe the other features could help determine which state is less likely to pay off a loan.
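The two quantities shown on the map, the share of paid loans per state and the number of loans per state, come from a simple group-and-average step. A minimal sketch with hypothetical records follows; the sample data and function name are illustrative only and are not drawn from the LC dataset.

```python
from collections import defaultdict

# Hypothetical (state, paid) records; 1 = paid, 0 = unpaid.
loans = [("NY", 1), ("NY", 1), ("NY", 0), ("MS", 0), ("MS", 1), ("CA", 1)]


def paid_rate_by_state(records):
    """Return {state: (share of loans paid, number of loans)}."""
    totals = defaultdict(lambda: [0, 0])  # state -> [paid count, loan count]
    for state, paid in records:
        totals[state][0] += paid
        totals[state][1] += 1
    return {state: (paid / n, n) for state, (paid, n) in totals.items()}


summary = paid_rate_by_state(loans)
for state, (rate, count) in sorted(summary.items()):
    print(state, round(rate, 2), count)
```

Keeping the loan count next to the rate matters because a state with only a handful of loans can show an extreme average that says little about an individual borrower.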
Mapping Amount Funded
Knowing that the chance of a loan not being paid back rises with the amount funded, we can see that Mississippi is a state with a fairly large funded amount. According to the map on loan status, Mississippi is also a state that doesn’t do too well in repaying its loans.
On average, individuals receiving a loan in Mississippi are much more likely to default on their loan, as they are also likelier to receive bigger loans. Let's look further.
Mapping Annual Income
Several outliers in annual income have been removed from the data. Before removing the outliers, the income ranges from $33,504.72 to $7,241,778, which is an obscene amount. I limit it to $200,000.00. The map range reflects annual income up to $120,000. Interestingly, Mississippi is a state with an average income between 60k – 80k, the lowest payback rate and, on average, the highest loans.
Mapping The Grade Assigned to Individual Loans
Staying with Mississippi, a state I'm not too familiar with: it also happens to have a terrible rating for loans according to the data. I could understand why Lending Club, on average, would give a pretty poor grade to loans in Oregon. The average population there has a fairly good income, but I guess it’s not too predictive of a good grade. We could see that by looking at the income map presented before.
Mapping Employment Length
Mississippi appears to have fairly good employment. It doesn’t appear to be too predictive of their faulty loans.
Conclusion:
Avoid Mississippi. Wish I could go further into this. Don’t give a loan to someone that doesn’t know his or her homeownership status.
Lending Club data download site: https://www.lendingclub.com/info/download-data.action