oleh dubno lending club loan data -...


Page 1: Oleh Dubno Lending Club Loan Data - Cloudinaryres.cloudinary.com/general-assembly-profiles/image/...Thedata!has!now!been!drastically!reduced.!Given!that!“Current”!is!a!heavy!hitter,!removing!it!reduces!

Predicting Defaults of Loans using Lending Club’s Loan Data
Oleh Dubno
Fall 2014
General Assembly – Data Science

Link to my Developer Notebook (ipynb) - http://nbviewer.ipython.org/gist/odubno/0b767a47f75adb382246

Background and Hypothesis:
The data comes from Lending Club (LC), a peer-to-peer lending company headquartered in San Francisco. LC began as an online consumer-lending platform that enables borrowers to obtain loans funded by individuals and institutions. LC only recently made its loans available to small businesses. I will be focusing on the former, the consumer loans.

The 2007 - 2011 dataset and the associated description of its features are downloadable from the LC site. It contains 188,127 entries and 31 features.

Goal:
Discover the features that are indicative of someone paying or defaulting on their loan.

Tools:
Logistic regression, Naïve Bayes, Decision Tree
To determine which features of the data set contribute towards someone repaying or defaulting on his or her loan and, with the Decision Tree, to see how well the model performs against a test set.

Folium
To map the features of the dataset.

Plotting a bar chart of the loan statuses reveals seven unique values. The logistic regression requires only two. (see figure below)

The focus is on predicting who repays or defaults on their loan. As a result, loans with the “Current” status will be removed, the “Fully Paid” status will remain, and the rest of the statuses will be grouped and characterized as “Unpaid”. This is then converted to Boolean values: Unpaid 0 and Paid 1.
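The relabeling described above can be sketched in a few lines of plain Python. This is only an illustration: the field name `loan_status` and the sample status strings other than “Current” and “Fully Paid” are assumptions, not taken from the notebook.

```python
# Hypothetical sample rows; the real dataset has seven distinct statuses.
records = [
    {"loan_status": "Fully Paid"},
    {"loan_status": "Current"},
    {"loan_status": "Charged Off"},
    {"loan_status": "Late (31-120 days)"},
    {"loan_status": "Fully Paid"},
]


def to_binary_status(rows):
    """Drop 'Current' loans and label the rest: 1 = paid, 0 = unpaid."""
    labeled = []
    for row in rows:
        status = row["loan_status"]
        if status == "Current":
            continue  # current loans have neither been repaid nor defaulted yet
        labeled.append({**row, "paid": 1 if status == "Fully Paid" else 0})
    return labeled


labeled = to_binary_status(records)
print([row["paid"] for row in labeled])
```

Every status other than “Fully Paid” and “Current” falls into the unpaid group, which mirrors the grouping described in the text.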

 

 


The data has now been drastically reduced. Given that “Current” is a heavy hitter, removing it reduces the dataset to 54,419 entries. This is necessary, given that the goal is not to focus on current loans.

Data Overview
The average funded amount of an individual loan is $13,924.27. The minimum loan given out is $1,000.00, with a median amount of $12,000 and a maximum amount of just $35,000.00. The funded amount looks roughly normally distributed and the numbers do not appear to be heavily skewed. Good!

The average annual income is $71,833.82, with a minimum income of $4,800, a median of $62,000 and a maximum income of $7,141,778. The maximum value is a definite outlier, so the set will be limited to incomes of $200,000 or less.

Not surprisingly, as Annual Income goes up so does the Funded Amount. The point after which Annual Income no longer predicts Funded Amount seems to be at about the mean annual income of $72,000.
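To see why a single extreme income matters, here is a small sketch of the $200,000 cut using Python’s standard library. Only the cap, the minimum, the median and the maximum come from the text; the other incomes in the list are made-up values for illustration.

```python
from statistics import mean, median

# Hypothetical incomes; 48,000, 75,000 and 110,000 are invented for the example.
incomes = [4_800, 48_000, 62_000, 75_000, 110_000, 7_141_778]
INCOME_CAP = 200_000  # limit stated in the text

# Keep only the rows at or below the cap.
capped = [income for income in incomes if income <= INCOME_CAP]

print("before:", mean(incomes), median(incomes))
print("after: ", mean(capped), median(capped))
```

The mean moves a great deal when the outlier is dropped while the median barely changes, which is the usual argument for trimming before fitting a regression.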

I suppose the mean annual income of $72,000 matches the cutoff for loans at $35,000 for good reasons. Interestingly, Lending Club seems to have a strict policy of limiting the Amount Funded according to the individual’s Annual Income up to $72,000, after which the Amount Funded begins to vary.


Let’s run an OLS regression using Annual Income (the predictor) to predict Amount Funded (the explained variable).

OLS (Ordinary Least Squares) attempts to predict the dependent variable, Amount Funded, using the independent variable, Annual Income. The regression algorithm “learns” from this data to predict the Amount Funded given the Annual Income.

The OLS regression with Annual Income set to predict Amount Funded (limiting the dataset to income <= $200,000) shows an R^2 of .201

This means that 20% of the variance in Funded Amount is explained by Annual Income. This, however, is a low R^2. With the assistance of the scatter plot, we do see that Annual Income is suggestive of the Funded Amount only up to an Annual Income of $72,000.

Logistic Regression
Next: 4 Logistic Regressions Determining Loan Status

The first logistic regression uses the length of employment and the grade that the loan received from LC to predict loan status.

Below is a chart highlighting the coefficients. In a logistic regression, each coefficient represents the change in the log-odds of the response variable for one unit of change in the predictor variable. In other words, a 1 year increase in employment length increases the log-odds of the loan being paid back by 0.016. A 2 year increase in employment length increases them by 0.032, and so on.
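A logistic-regression coefficient is additive on the log-odds scale, so exponentiating it gives the factor by which the odds of repayment are multiplied. The sketch below applies that to the 0.016 employment-length coefficient quoted above; the function name is my own and nothing here is taken from the notebook.

```python
import math

COEF_EMP_LENGTH = 0.016  # coefficient quoted in the text, per year of employment


def odds_multiplier(coef, delta=1.0):
    """Factor by which the odds of repayment change when the predictor rises by delta."""
    return math.exp(coef * delta)


one_year = odds_multiplier(COEF_EMP_LENGTH, 1)
two_years = odds_multiplier(COEF_EMP_LENGTH, 2)
print(round(one_year, 4), round(two_years, 4))
```

Read this way, each extra year of employment raises the odds of repayment by roughly 1.6%, and the effect compounds multiplicatively rather than adding up as a raw probability.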

It would be interesting to see how effective the grade that LC assigns to its loans is at predicting loan status.

Some background. The provided grades range from “A” to “G”: “A” being the highest and “G” the lowest. As a result I mapped “7”, the highest value, to “A”, “6” to “B”, “5” to “C” and so on until “1” as “G”.

As the grade increases by 1 grade value, the log-odds of the loan being paid off increase by 0.31.

We’re using a binary output of 0 as unpaid and 1 as paid, so the larger the product of the grade and the coefficient, the higher the predicted likelihood of the loan being paid off. Pretty much, even if the grade is “E” or “3”, the chance of payback is already high.
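As a rough illustration of the grade mapping and the 0.31 coefficient, the sketch below encodes the grades as 7 down to 1 and pushes grade times coefficient through the logistic function. The intercept is set to zero purely as an assumption, since the text does not report one, so the printed probabilities are indicative only.

```python
import math

# A=7, B=6, ..., G=1, as described in the text.
GRADE_TO_SCORE = {grade: 7 - i for i, grade in enumerate("ABCDEFG")}
COEF_GRADE = 0.31  # per-grade coefficient quoted in the text


def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def repay_probability(grade, intercept=0.0):
    # intercept=0.0 is an assumption; the fitted model may well have one.
    return sigmoid(intercept + COEF_GRADE * GRADE_TO_SCORE[grade])


for grade in "ABCDEFG":
    print(grade, GRADE_TO_SCORE[grade], round(repay_probability(grade), 3))
```

Under that zero-intercept assumption the predicted probability rises steadily from “G” to “A”, which is the direction the coefficient implies.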


The second logistic regression uses funded amount and annual income to predict loan status.

The coefficients for funded amount and annual income are so low because both predictors are dollar amounts in the thousands, while the explained variable, loan status, is binary, taking the value 0 or 1.
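The size of a coefficient depends on the units of its predictor. The sketch below rescales the funded-amount coefficient reported in this section (-0.0000238 per dollar) to a per-$10,000 step and checks that the contribution to the model is unchanged. The $25,000 loan is a made-up example value.

```python
COEF_PER_DOLLAR = -0.0000238  # funded-amount coefficient reported in this section
STEP = 10_000                 # express the predictor in units of $10,000

# The same effect, stated per $10,000 instead of per dollar.
coef_per_step = COEF_PER_DOLLAR * STEP
print(round(coef_per_step, 3))

amount = 25_000  # hypothetical loan size
term_in_dollars = COEF_PER_DOLLAR * amount
term_in_steps = coef_per_step * (amount / STEP)
print(round(term_in_dollars, 3), round(term_in_steps, 3))
```

Rescaling the predictor before fitting would simply produce the larger, easier-to-read coefficient without changing the model’s predictions.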

Let’s look at the amount funded. As the amount funded increases by $10,000, the log-odds of the loan getting paid back change by -0.238 (10,000 x -0.0000238).

Similarly, as annual income increases so does the chance of the loan being paid off. Intuitive, right?

This is understandable and supported by the positive coefficient 0.0000202. In other words, as the annual income increases by $10,000, the log-odds of the loan being paid back increase by 0.230 (10,000 x 0.0000230).

The third logistic regression uses home ownership status (Rent, Mortgage, Own, None, Other) to predict loan status.

My understanding is that someone who puts “OTHER” for home ownership on the loan application either did not want to reveal their home ownership situation, is hiding something, or is bad at filling out applications. “None” could be an honest answer from someone who may be living with their parents.

Regardless, it seems that if someone checks off “OTHER” and gets funded, there’s a very good chance of that individual defaulting on his or her loan.

                           


The fourth logistic regression uses employment length (<1 year – 10+ years) to predict loan status.

There doesn’t immediately appear to be much variance between the generated coefficients for years employed. It looks like, so long as the person is employed, they will be paying back their loan.

However, it holds true that if someone is unemployed or has less than a year of employment, they’ll have a lower chance of repaying their loan. I didn’t investigate what percentage of “<1 year” is employed or unemployed.

Interestingly, and probably just a coincidence because the results are really marginal, a person employed for 4 years has the same coefficient for paying back their loan as someone employed for one year or less. Just an observation. I will not be pursuing that point any further.

To conclude the work on logistic regression: the data set has many features that I did not explore, as I lacked the experience and time to do so.

From the findings that I got, I can’t speak definitively, but I would say avoid giving loans to people that don’t specify home ownership and do give loans to people with higher income.

Decision Tree and The Confusion Matrix

A confusion matrix allows for more detailed analysis than the mere proportion of correct guesses. For instance, 177 paid loans were incorrectly predicted as unpaid.

Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (177 loans + 31,594 loans) and the total number of incorrect predictions is (177 loans + 8,920 loans).

The confusion matrix provides the information needed to determine how well a classification model performs. The performance metric, accuracy, summarizes this information with a single number: .777

Accuracy takes the total number of correct predictions and divides it by the total number of all predictions made.
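The accuracy figure can be reproduced directly from the counts quoted above. This sketch takes the numbers exactly as they appear in the text; the variable names are my own.

```python
# Counts as quoted in the text.
correct = 177 + 31_594    # predictions that matched the true loan status
incorrect = 177 + 8_920   # predictions that did not

# Accuracy = correct predictions / all predictions.
accuracy = correct / (correct + incorrect)
print(round(accuracy, 3))  # 0.777
```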


Mapping Paid and Unpaid Loans

The above map is a choropleth map, "a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed." (wikipedia)

As the intensity of the color increases (gets closer to 1), a larger share of the borrowers residing in that state have paid off their loans.

The number near the point is the number of loans given in that state.
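The two quantities shown on the map, the share of paid loans and the number of loans per state, boil down to a group-by on the state. Below is a minimal sketch in plain Python; the (state, paid) pairs are invented sample data, not figures from the Lending Club dataset.

```python
from collections import defaultdict

# Hypothetical (state, paid) pairs; paid is 1 for repaid and 0 for unpaid.
loans = [("NY", 1), ("NY", 1), ("NY", 0), ("MS", 0), ("MS", 1), ("CA", 1)]


def paid_rate_by_state(rows):
    """Return {state: (share of loans paid, number of loans)}."""
    totals = defaultdict(lambda: [0, 0])  # state -> [paid count, loan count]
    for state, paid in rows:
        totals[state][0] += paid
        totals[state][1] += 1
    return {state: (p / n, n) for state, (p, n) in totals.items()}


rates = paid_rate_by_state(loans)
for state, (rate, count) in sorted(rates.items()):
    print(state, round(rate, 2), count)
```

A state with only a handful of loans can land at either extreme of the color scale, so the loan count is worth reading alongside the shade.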

By the looks of the map, Nebraska, Missouri, Oregon, Virginia, Montana, Wyoming and South Dakota are the states that are not too fortunate in repaying their loans.

Of course this is an average of individual loans per state, ignoring specific regions within the state, and it is not the best estimate of whether a funded individual in that state is likely to repay their loan.

However, maybe the other features could help determine which state is less likely to pay off a loan.


Mapping Amount Funded

Understanding that as the amount funded increases so does the chance of the loan not being paid back, we can see that Mississippi is a state with a fairly large funded amount. Mississippi is also, according to the map on loan status, a state that doesn’t do too well in repaying its loans.

On average, individuals receiving a loan in Mississippi are much more likely to default on their loan, as they are also more likely to receive bigger loans. Let’s look further.


Mapping Annual Income

Several outliers in annual income have been removed from the data.

Before removing the outliers, the income ranges from $33,504.72 to $7,241,778, which is an obscene amount. I limit it to $200,000.00. The map’s range reflects annual income up to $120,000.

Interestingly, Mississippi is a state with an average income between 60k and 80k, the lowest payback rate and, on average, the highest loans taken out.


Mapping The Grade Assigned to Individual Loans

Keeping on track with Mississippi, a state I'm not too familiar with, it also happens to have a terrible rating for loans according to the data.

I could understand why Lending Club, on average, would give a pretty poor grade to loans in Oregon. The average population there has a fairly good income, but I guess income is not too predictive of a good grade. We can see that by looking at the income map presented before.


Mapping Employment Length

Mississippi appears to have fairly good employment length. It doesn’t appear to be too predictive of the state’s faulty loans.

Conclusion:
Avoid Mississippi. Wish I could go further into this.
Don’t give a loan to someone that doesn’t know his or her homeownership status.

Lending Club data download site: https://www.lendingclub.com/info/download-data.action