data-driven investment strategies for peer-to-peer lending...

23
ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending: A Case Study for Teaching Data Science Maxime C. Cohen, 1, * C. Daniel Guetta, 2 Kevin Jiao, 1 and Foster Provost 1 Abstract We develop a number of data-driven investment strategies that demonstrate how machine learning and data an- alytics can be used to guide investments in peer-to-peer loans. We detail the process starting with the acquisition of (real) data from a peer-to-peer lending platform all the way to the development and evaluation of investment strat- egies based on a variety of approaches. We focus heavily on how to apply and evaluate the data science methods, and resulting strategies, in a real-world business setting. The material presented in this article can be used by instruc- tors who teach data science courses, at the undergraduate or graduate levels. Importantly, we go beyond just eval- uating predictive performance of models, to assess how well the strategies would actually perform, using real, publicly available data. Our treatment is comprehensive and ranges from qualitative to technical, but is also mod- ular—which gives instructors the flexibility to focus on specific parts of the case, depending on the topics they want to cover. The learning concepts include the following: data cleaning and ingestion, classification/probability estima- tion modeling, regression modeling, analytical engineering, calibration curves, data leakage, evaluation of model performance, basic portfolio optimization, evaluation of investment strategies, and using Python for data science. Keywords: data science; machine learning; teaching; peer-to-peer lending Goals and Structure The main goal of this article is to present a comprehen- sive case study based on a real-world application that can be used in the context of a data science course.* Teaching data science and machine learning concepts based on a concrete business problem makes the learn- ing process more real, relevant, and exciting, and allows students to understand directly the applicability of the material. While case studies are common practice in many business disciplines (e.g., Marketing and Strategy), it is difficult to find comprehensive case studies to sup- port data science courses, especially cases that span from raw data to actual business outcomes. This article helps to fill this gap. We hope that other authors will continue this effort and develop additional comprehensive teach- ing cases tailored to data science courses. This is espe- cially important as these topics are increasingly taught to less technical student populations, such as MBA stu- dents, who tend to be skeptical of contrived case studies without strong roots in a real business problem. We chose online peer-to-peer lending as our appli- cation for several reasons. First, this is an application area that most people can easily relate to. Second, be- yond the use of predictive models, we can also develop 1 Information, Operations, and Management Sciences, NYU Stern School of Business, New York, New York. 2 Decision, Risk, and Operations Division, Columbia Business School, New York, New York. The authors are listed in alphabetical order. *Address correspondence to: Maxime C. Cohen, Information, Operations, and Management Sciences, NYU Stern School of Business, New York, NY 10012, E-mail: [email protected] ª Maxime C. Cohen et al., 2018; Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons Attribution Noncommercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and the source are cited. *The content of this article reflects only one approach to the problem. The investment strategies and results obtained are by no means the only way to solve the problem at hand, with the goal of providing material for educational purposes. A companion website for this case at guetta.com/lc_case contains ad- ditional teaching material and the Jupyter notebooks that accompany this case. Big Data Volume 6, Number 3, 2018 Mary Ann Liebert, Inc. DOI: 10.1089/big.2018.0092 191

Upload: others

Post on 27-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

ORIGINAL ARTICLE

Data-Driven Investment Strategiesfor Peer-to-Peer LendingA Case Study for Teaching Data ScienceMaxime C Cohen1 C Daniel Guetta2 Kevin Jiao1 and Foster Provost1

AbstractWe develop a number of data-driven investment strategies that demonstrate how machine learning and data an-alytics can be used to guide investments in peer-to-peer loans We detail the process starting with the acquisition of(real) data from a peer-to-peer lending platform all the way to the development and evaluation of investment strat-egies based on a variety of approaches We focus heavily on how to apply and evaluate the data science methodsand resulting strategies in a real-world business setting The material presented in this article can be used by instruc-tors who teach data science courses at the undergraduate or graduate levels Importantly we go beyond just eval-uating predictive performance of models to assess how well the strategies would actually perform using realpublicly available data Our treatment is comprehensive and ranges from qualitative to technical but is also mod-ularmdashwhich gives instructors the flexibility to focus on specific parts of the case depending on the topics they wantto cover The learning concepts include the following data cleaning and ingestion classificationprobability estima-tion modeling regression modeling analytical engineering calibration curves data leakage evaluation of modelperformance basic portfolio optimization evaluation of investment strategies and using Python for data science

Keywords data science machine learning teaching peer-to-peer lending

Goals and StructureThe main goal of this article is to present a comprehen-sive case study based on a real-world application thatcan be used in the context of a data science courseTeaching data science and machine learning conceptsbased on a concrete business problem makes the learn-ing process more real relevant and exciting and allowsstudents to understand directly the applicability of thematerial While case studies are common practice inmany business disciplines (eg Marketing and Strategy)

it is difficult to find comprehensive case studies to sup-port data science courses especially cases that span fromraw data to actual business outcomes This article helpsto fill this gap We hope that other authors will continuethis effort and develop additional comprehensive teach-ing cases tailored to data science courses This is espe-cially important as these topics are increasingly taughtto less technical student populations such as MBA stu-dents who tend to be skeptical of contrived case studieswithout strong roots in a real business problem

We chose online peer-to-peer lending as our appli-cation for several reasons First this is an applicationarea that most people can easily relate to Second be-yond the use of predictive models we can also develop

1Information Operations and Management Sciences NYU Stern School of Business New York New York2Decision Risk and Operations Division Columbia Business School New York New YorkThe authors are listed in alphabetical order

Address correspondence to Maxime C Cohen Information Operations and Management Sciences NYU Stern School of Business New York NY 10012 E-mailmaxcohennyuedu

ordf Maxime C Cohen et al 2018 Published by Mary Ann Liebert Inc This Open Access article is distributed under the terms of the Creative Commons AttributionNoncommercial License (httpcreativecommonsorglicensesby-nc40) which permits any noncommercial use distribution and reproduction in any mediumprovided the original author(s) and the source are cited

The content of this article reflects only one approach to the problem Theinvestment strategies and results obtained are by no means the only way tosolve the problem at hand with the goal of providing material for educationalpurposes A companion website for this case at guettacomlc_case contains ad-ditional teaching material and the Jupyter notebooks that accompany this case

Big DataVolume 6 Number 3 2018Mary Ann Liebert IncDOI 101089big20180092

191

prescriptive tools (in this context investment strate-gies) so we can directly observe the potential businessimpact of our models Third as discussed in theStory Line section the largest two online US platformsmake their data publicly available allowing readers toeasily reproduce and extend our results

This article is structured as follows In the StoryLine section we present the story and backgroundused in our case study This part motivates the con-crete business problem and can be assigned to stu-dents before the first class The Questions andTeaching Material section lists a series of questionsthat can be used as a basis for in-class discussionand provides solutions and teaching notes We divideour treatment into six parts Introduction and Objec-tives Data Ingestion and Cleaning Data ExplorationPredictive Models for Default Investment Strategiesand Optimization The analysis is comprehensive itdescribes the entire beginning-to-end process thatincludes several important concepts such as workingwith real-data predictive modeling machine learningand optimization Depending on the needs of the in-structor and the focus of the course these parts canbe used independently of each other as neededFinally the details of the data used in this case studyand a brief description of the structure of the supple-mental Jupyter notebooks containing the code are rel-egated to the Appendix

This case is intended to complement other teachingresources for data science classes We do not attempt toteach all the fundamental concepts and algorithms norimplementation details For those not familiar with allthe concepts in the case just about all of the data sci-ence concepts including their relationship to similarbusiness problems are covered in Provost and Faw-cett1 Using Python for data science and machine learn-ing is covered in McKinney2 and in Raschka andMirjalili3 Finally a comprehensive introduction tothe theory and practice of convex optimization canbe found in Boyd and Vandenberghe4

Story LineIn this case we follow Jasmin Gonzales a young pro-fessional looking to diversify her investment portfolio

Jasmin graduated with a Masters in Data Science andafter four successful years as a product manager in atech company she has managed to save a sizable

amount of money She now wants to start diversifyingher savings portfolio So far she has focused on tradi-tional investments (stocks bonds etc) and she nowwants to look further afield

One asset class she is particularly interested in ispeer-to-peer loans issued on online platforms Thehigh returns advertised by these platforms seem to bean attractive value proposition and Jasmin is especiallyexcited by the large amount of data these platformsmake publicly available With her data science back-ground she is hoping to apply machine learningtools to these data to come up with lucrative invest-ment strategies In this case we follow Jasmin as shedevelops such an investment strategy

Background on peer-to-peer lendingPeer-to-peer lending refers to the practice of lendingmoney to individuals (or small businesses) via onlineservices that match anonymous lenders with borrow-ers Lenders can typically earn higher returns relativeto savings and investment products offered by bankinginstitutions However there is of course the risk thatthe borrower defaults on his or her loan

Interest rates are usually set by an intermediary plat-form on the basis of analyzing the borrowerrsquos credit(using features such as FICO score employment statusannual income debt-to-income ratio number of opencredit lines) The intermediary platform generates rev-enue by collecting a one-time fee on funded loans(from borrowers) and by charging a loan servicingfee to investors

The peer-to-peer lending industry in the UnitedStates started in February 2006 with the launch of Pros-per followed by LendingClubx In 2008 the Securitiesand Exchange Commission (SEC) required that peer-to-peer companies register their offerings as securitiespursuant to the Securities Act of 1933 Both Prosperand LendingClub gained approval from the SEC tooffer investors notes backed by payments received onthe loans

By June 2012 LendingClub was the largest peer-to-peer lender in the United States based on issued loanvolume and revenue followed by Prosper In Decem-ber 2015 LendingClub reported that $1598 billion inloans had been originated through its platform Withvery high year-over-year growth peer-to-peer lendinghas been one of the fastest growing investments

The story used in this case is fictitious and was chosen to illustrate a real-worldsituation Consequently any details herein bearing resemblance to real peopleor events are purely coincidental

httpswwwprospercomxhttpswwwlendingclubcomhttpsenwikipediaorgwikiLending_Club

192 COHEN ET AL

According to InvestmentZen as of May 2017 the inter-est rates range from 67 to 228 depending on theloan term and the rating of the borrower and defaultrates vary between 13 and 106

LendingClub issues loans between $1000 and$40000 for a duration of either 36 or 60 months Asmentioned the interest rates for borrowers are deter-mined based on personal information such as creditscore and annual income A screenshot of the Lending-Club home page is shown in Figure 1 In additionLendingClub categorizes its loans using a gradingscheme (grades A B C D E F and G where gradeA corresponds to the loans judged to be lsquolsquosafestrsquorsquo byLendingClub) Individual investors can browse loanlistings online before deciding which loans(s) to investin (Fig 2) Each loan is split into multiples of $25called notes (eg for a $2000 loan there will be 80notes of $25 each) Investors can obtain more detailedinformation on each loan by clicking on the loanmdashFigure 3 shows an example of the additional infor-mation available for a given loan Investors can thenpurchase these notes in a similar manner to lsquolsquosharesrsquorsquoof a stock in an equity market Of course the saferthe loan the lower the interest rate and so investorshave to balance risk and return when deciding whichloans to invest in

One of the interesting features of the peer-to-peerlending market is the richness of the historical data

available The two largest US platforms (LendingCluband Prosper) have chosen to give free access to theirdata to potential investors This raises a whole host ofquestions for investors such as Jasmin

Are these data valuable when selecting loans to in-vest in How could an investor use these data to de-

velop tools to guide investment decisions What is the impact of using data-driven tools on

the portfolio performance relative to ad hoc in-vestment strategies What average returns can an investor expect from

informed investments in peer-to-peer loans

The goal of this case study is to provide answers tothe questions above In particular we investigate howdata analytics and machine learning tools can be usedin the context of peer-to-peer lending investmentsWe use the historical data from loans that were issuedon LendingClub between January 2009 and November2017

Data sets and descriptive statisticsAs mentioned the data sets from LendingClub (andProsper) are publicly available online These datasets contain comprehensive information on all loansissued between 2007 and the third quarter of 2017 (anew updated data set is made available every quarter)

FIG 1 Screenshot of the LendingClub home page

wwwinvestmentzencompeer-to-peer-lending-for-investorslendingclub (lastaccessed June 2018)

The analysis in this case study focuses on the LendingClub data However asimilar analysis could be conducted using Prosper data

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 193

The data set includes hundreds of features includingthe following for each loan

1 Interest rate2 Loan amount3 Monthly installment amount4 Loan status (eg fully paid default charged-off)5 Several additional attributes related to the bor-

rower such as type of house ownership annualincome monthly FICO score debt-to-incomeratio and number of open credit lines

The data set used in this case study contains morethan 750000 loan listings with a total value exceeding$107 billion In this data set 998 of the loans werefully funded (at LendingClub partially funded loansare issued only if the borrower agrees to receive a partialloan) Note that there is a significantly larger number oflistings starting from 2016 relative to previous years

The definition of each loan status is summarized inTable 1 Current refers to a loan that is still being reim-

bursed in a timely manner Late corresponds to a loanon which a payment is between 16 and 120 days over-due If the payment is delayed by more than 121 daysthe loan is considered to be in Default If LendingClubhas decided that the loan will not be paid off then it isgiven the status of Charged-Offxx

These dynamics imply that 5 months after the termof each loan has ended every loan ends in one of twoLendingClub statesmdashfully paid or charged-off Wecall these two statuses fully paid and defaulted respec-tively and we refer to a loan that has reached one ofthese statuses as expired

One way to simplify the problem is to consider onlyloans that have expired at the time of analysis For ex-ample for an analysis carried out in April 2018 thisimplies looking at all 36-month loans issued on or

FIG 2 Example of loan listings (source LendingClub website date accessed May 2018)

xxNote that sometimes the lsquolsquoCharged-Offrsquorsquo status will occur before lsquolsquoDefaultrsquorsquo ifwhenthe borrower has filed bankruptcy or has notified the intermediary platformFor example if a borrower defaults on a loan in the last month of a 36-monthloan it would take another 5 months for the loan to be charged-off

194 COHEN ET AL

before October 31 2014 and all 60-month loans issuedon or before October 31 2012

As illustrated in Figure 4 a significant portion(135) of loans ended in Default status dependingon how much of the loan was paid back these loansmight have resulted in a significant loss to investorswho had invested in them The remainder was FullyPaidmdashthe borrower fully reimbursed the loanrsquos out-

standing balance with interest and the investor earneda positive return on his or her investment Therefore toavoid unsuccessful investments our goal is to estimatewhich loans are more (resp less) likely to default andwhich will yield low (resp high) returns To addressthis question we investigate several machine learningtools and show how one can use historical data to con-struct informed investment strategies

Investment strategies and portfolio constructionMaking predictions and constructing a portfolio in thecontext of online peer-to-peer lending can be challeng-ing The volume of data available provides an opportu-nity to develop sophisticated data-driven methods Inpractice an investor such as Jasmin would seek to con-struct a portfolio with the highest possible return

FIG 3 Example of a detailed loan listing for a grade A loan Information available to investors includes thelength of the borrowerrsquos employment the borrowerrsquos credit score range and their gross income among others(source LendingClub website date accessed July 2018)

Table 1 Loan statuses in LendingClub

Number of days past due Status

0 Current16ndash120 Late121ndash150 Default150+ Charged-off

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 195

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 2: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

prescriptive tools (in this context investment strate-gies) so we can directly observe the potential businessimpact of our models Third as discussed in theStory Line section the largest two online US platformsmake their data publicly available allowing readers toeasily reproduce and extend our results

This article is structured as follows In the StoryLine section we present the story and backgroundused in our case study This part motivates the con-crete business problem and can be assigned to stu-dents before the first class The Questions andTeaching Material section lists a series of questionsthat can be used as a basis for in-class discussionand provides solutions and teaching notes We divideour treatment into six parts Introduction and Objec-tives Data Ingestion and Cleaning Data ExplorationPredictive Models for Default Investment Strategiesand Optimization The analysis is comprehensive itdescribes the entire beginning-to-end process thatincludes several important concepts such as workingwith real-data predictive modeling machine learningand optimization Depending on the needs of the in-structor and the focus of the course these parts canbe used independently of each other as neededFinally the details of the data used in this case studyand a brief description of the structure of the supple-mental Jupyter notebooks containing the code are rel-egated to the Appendix

This case is intended to complement other teachingresources for data science classes We do not attempt toteach all the fundamental concepts and algorithms norimplementation details For those not familiar with allthe concepts in the case just about all of the data sci-ence concepts including their relationship to similarbusiness problems are covered in Provost and Faw-cett1 Using Python for data science and machine learn-ing is covered in McKinney2 and in Raschka andMirjalili3 Finally a comprehensive introduction tothe theory and practice of convex optimization canbe found in Boyd and Vandenberghe4

Story LineIn this case we follow Jasmin Gonzales a young pro-fessional looking to diversify her investment portfolio

Jasmin graduated with a Masters in Data Science andafter four successful years as a product manager in atech company she has managed to save a sizable

amount of money She now wants to start diversifyingher savings portfolio So far she has focused on tradi-tional investments (stocks bonds etc) and she nowwants to look further afield

One asset class she is particularly interested in ispeer-to-peer loans issued on online platforms Thehigh returns advertised by these platforms seem to bean attractive value proposition and Jasmin is especiallyexcited by the large amount of data these platformsmake publicly available With her data science back-ground she is hoping to apply machine learningtools to these data to come up with lucrative invest-ment strategies In this case we follow Jasmin as shedevelops such an investment strategy

Background on peer-to-peer lendingPeer-to-peer lending refers to the practice of lendingmoney to individuals (or small businesses) via onlineservices that match anonymous lenders with borrow-ers Lenders can typically earn higher returns relativeto savings and investment products offered by bankinginstitutions However there is of course the risk thatthe borrower defaults on his or her loan

Interest rates are usually set by an intermediary plat-form on the basis of analyzing the borrowerrsquos credit(using features such as FICO score employment statusannual income debt-to-income ratio number of opencredit lines) The intermediary platform generates rev-enue by collecting a one-time fee on funded loans(from borrowers) and by charging a loan servicingfee to investors

The peer-to-peer lending industry in the UnitedStates started in February 2006 with the launch of Pros-per followed by LendingClubx In 2008 the Securitiesand Exchange Commission (SEC) required that peer-to-peer companies register their offerings as securitiespursuant to the Securities Act of 1933 Both Prosperand LendingClub gained approval from the SEC tooffer investors notes backed by payments received onthe loans

By June 2012 LendingClub was the largest peer-to-peer lender in the United States based on issued loanvolume and revenue followed by Prosper In Decem-ber 2015 LendingClub reported that $1598 billion inloans had been originated through its platform Withvery high year-over-year growth peer-to-peer lendinghas been one of the fastest growing investments

The story used in this case is fictitious and was chosen to illustrate a real-worldsituation Consequently any details herein bearing resemblance to real peopleor events are purely coincidental

httpswwwprospercomxhttpswwwlendingclubcomhttpsenwikipediaorgwikiLending_Club

192 COHEN ET AL

According to InvestmentZen as of May 2017 the inter-est rates range from 67 to 228 depending on theloan term and the rating of the borrower and defaultrates vary between 13 and 106

LendingClub issues loans between $1000 and$40000 for a duration of either 36 or 60 months Asmentioned the interest rates for borrowers are deter-mined based on personal information such as creditscore and annual income A screenshot of the Lending-Club home page is shown in Figure 1 In additionLendingClub categorizes its loans using a gradingscheme (grades A B C D E F and G where gradeA corresponds to the loans judged to be lsquolsquosafestrsquorsquo byLendingClub) Individual investors can browse loanlistings online before deciding which loans(s) to investin (Fig 2) Each loan is split into multiples of $25called notes (eg for a $2000 loan there will be 80notes of $25 each) Investors can obtain more detailedinformation on each loan by clicking on the loanmdashFigure 3 shows an example of the additional infor-mation available for a given loan Investors can thenpurchase these notes in a similar manner to lsquolsquosharesrsquorsquoof a stock in an equity market Of course the saferthe loan the lower the interest rate and so investorshave to balance risk and return when deciding whichloans to invest in

One of the interesting features of the peer-to-peerlending market is the richness of the historical data

available The two largest US platforms (LendingCluband Prosper) have chosen to give free access to theirdata to potential investors This raises a whole host ofquestions for investors such as Jasmin

Are these data valuable when selecting loans to in-vest in How could an investor use these data to de-

velop tools to guide investment decisions What is the impact of using data-driven tools on

the portfolio performance relative to ad hoc in-vestment strategies What average returns can an investor expect from

informed investments in peer-to-peer loans

The goal of this case study is to provide answers tothe questions above In particular we investigate howdata analytics and machine learning tools can be usedin the context of peer-to-peer lending investmentsWe use the historical data from loans that were issuedon LendingClub between January 2009 and November2017

Data sets and descriptive statisticsAs mentioned the data sets from LendingClub (andProsper) are publicly available online These datasets contain comprehensive information on all loansissued between 2007 and the third quarter of 2017 (anew updated data set is made available every quarter)

FIG 1 Screenshot of the LendingClub home page

wwwinvestmentzencompeer-to-peer-lending-for-investorslendingclub (lastaccessed June 2018)

The analysis in this case study focuses on the LendingClub data However asimilar analysis could be conducted using Prosper data

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 193

The data set includes hundreds of features includingthe following for each loan

1 Interest rate2 Loan amount3 Monthly installment amount4 Loan status (eg fully paid default charged-off)5 Several additional attributes related to the bor-

rower such as type of house ownership annualincome monthly FICO score debt-to-incomeratio and number of open credit lines

The data set used in this case study contains morethan 750000 loan listings with a total value exceeding$107 billion In this data set 998 of the loans werefully funded (at LendingClub partially funded loansare issued only if the borrower agrees to receive a partialloan) Note that there is a significantly larger number oflistings starting from 2016 relative to previous years

The definition of each loan status is summarized inTable 1 Current refers to a loan that is still being reim-

bursed in a timely manner Late corresponds to a loanon which a payment is between 16 and 120 days over-due If the payment is delayed by more than 121 daysthe loan is considered to be in Default If LendingClubhas decided that the loan will not be paid off then it isgiven the status of Charged-Offxx

These dynamics imply that 5 months after the termof each loan has ended every loan ends in one of twoLendingClub statesmdashfully paid or charged-off Wecall these two statuses fully paid and defaulted respec-tively and we refer to a loan that has reached one ofthese statuses as expired

One way to simplify the problem is to consider onlyloans that have expired at the time of analysis For ex-ample for an analysis carried out in April 2018 thisimplies looking at all 36-month loans issued on or

FIG 2 Example of loan listings (source LendingClub website date accessed May 2018)

xxNote that sometimes the lsquolsquoCharged-Offrsquorsquo status will occur before lsquolsquoDefaultrsquorsquo ifwhenthe borrower has filed bankruptcy or has notified the intermediary platformFor example if a borrower defaults on a loan in the last month of a 36-monthloan it would take another 5 months for the loan to be charged-off

194 COHEN ET AL

before October 31 2014 and all 60-month loans issuedon or before October 31 2012

As illustrated in Figure 4 a significant portion(135) of loans ended in Default status dependingon how much of the loan was paid back these loansmight have resulted in a significant loss to investorswho had invested in them The remainder was FullyPaidmdashthe borrower fully reimbursed the loanrsquos out-

standing balance with interest and the investor earneda positive return on his or her investment Therefore toavoid unsuccessful investments our goal is to estimatewhich loans are more (resp less) likely to default andwhich will yield low (resp high) returns To addressthis question we investigate several machine learningtools and show how one can use historical data to con-struct informed investment strategies

Investment strategies and portfolio constructionMaking predictions and constructing a portfolio in thecontext of online peer-to-peer lending can be challeng-ing The volume of data available provides an opportu-nity to develop sophisticated data-driven methods Inpractice an investor such as Jasmin would seek to con-struct a portfolio with the highest possible return

FIG 3 Example of a detailed loan listing for a grade A loan Information available to investors includes thelength of the borrowerrsquos employment the borrowerrsquos credit score range and their gross income among others(source LendingClub website date accessed July 2018)

Table 1 Loan statuses in LendingClub

Number of days past due Status

0 Current16ndash120 Late121ndash150 Default150+ Charged-off

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 195

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 3: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

According to InvestmentZen as of May 2017 the inter-est rates range from 67 to 228 depending on theloan term and the rating of the borrower and defaultrates vary between 13 and 106

LendingClub issues loans between $1000 and$40000 for a duration of either 36 or 60 months Asmentioned the interest rates for borrowers are deter-mined based on personal information such as creditscore and annual income A screenshot of the Lending-Club home page is shown in Figure 1 In additionLendingClub categorizes its loans using a gradingscheme (grades A B C D E F and G where gradeA corresponds to the loans judged to be lsquolsquosafestrsquorsquo byLendingClub) Individual investors can browse loanlistings online before deciding which loans(s) to investin (Fig 2) Each loan is split into multiples of $25called notes (eg for a $2000 loan there will be 80notes of $25 each) Investors can obtain more detailedinformation on each loan by clicking on the loanmdashFigure 3 shows an example of the additional infor-mation available for a given loan Investors can thenpurchase these notes in a similar manner to lsquolsquosharesrsquorsquoof a stock in an equity market Of course the saferthe loan the lower the interest rate and so investorshave to balance risk and return when deciding whichloans to invest in

One of the interesting features of the peer-to-peerlending market is the richness of the historical data

available The two largest US platforms (LendingCluband Prosper) have chosen to give free access to theirdata to potential investors This raises a whole host ofquestions for investors such as Jasmin

Are these data valuable when selecting loans to in-vest in How could an investor use these data to de-

velop tools to guide investment decisions What is the impact of using data-driven tools on

the portfolio performance relative to ad hoc in-vestment strategies What average returns can an investor expect from

informed investments in peer-to-peer loans

The goal of this case study is to provide answers tothe questions above In particular we investigate howdata analytics and machine learning tools can be usedin the context of peer-to-peer lending investmentsWe use the historical data from loans that were issuedon LendingClub between January 2009 and November2017

Data sets and descriptive statisticsAs mentioned the data sets from LendingClub (andProsper) are publicly available online These datasets contain comprehensive information on all loansissued between 2007 and the third quarter of 2017 (anew updated data set is made available every quarter)

FIG 1 Screenshot of the LendingClub home page

wwwinvestmentzencompeer-to-peer-lending-for-investorslendingclub (lastaccessed June 2018)

The analysis in this case study focuses on the LendingClub data However asimilar analysis could be conducted using Prosper data

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 193

The data set includes hundreds of features includingthe following for each loan

1 Interest rate2 Loan amount3 Monthly installment amount4 Loan status (eg fully paid default charged-off)5 Several additional attributes related to the bor-

rower such as type of house ownership annualincome monthly FICO score debt-to-incomeratio and number of open credit lines

The data set used in this case study contains morethan 750000 loan listings with a total value exceeding$107 billion In this data set 998 of the loans werefully funded (at LendingClub partially funded loansare issued only if the borrower agrees to receive a partialloan) Note that there is a significantly larger number oflistings starting from 2016 relative to previous years

The definition of each loan status is summarized inTable 1 Current refers to a loan that is still being reim-

bursed in a timely manner Late corresponds to a loanon which a payment is between 16 and 120 days over-due If the payment is delayed by more than 121 daysthe loan is considered to be in Default If LendingClubhas decided that the loan will not be paid off then it isgiven the status of Charged-Offxx

These dynamics imply that 5 months after the termof each loan has ended every loan ends in one of twoLendingClub statesmdashfully paid or charged-off Wecall these two statuses fully paid and defaulted respec-tively and we refer to a loan that has reached one ofthese statuses as expired

One way to simplify the problem is to consider onlyloans that have expired at the time of analysis For ex-ample for an analysis carried out in April 2018 thisimplies looking at all 36-month loans issued on or

FIG 2 Example of loan listings (source LendingClub website date accessed May 2018)

xxNote that sometimes the lsquolsquoCharged-Offrsquorsquo status will occur before lsquolsquoDefaultrsquorsquo ifwhenthe borrower has filed bankruptcy or has notified the intermediary platformFor example if a borrower defaults on a loan in the last month of a 36-monthloan it would take another 5 months for the loan to be charged-off

194 COHEN ET AL

before October 31 2014 and all 60-month loans issuedon or before October 31 2012

As illustrated in Figure 4 a significant portion(135) of loans ended in Default status dependingon how much of the loan was paid back these loansmight have resulted in a significant loss to investorswho had invested in them The remainder was FullyPaidmdashthe borrower fully reimbursed the loanrsquos out-

standing balance with interest and the investor earneda positive return on his or her investment Therefore toavoid unsuccessful investments our goal is to estimatewhich loans are more (resp less) likely to default andwhich will yield low (resp high) returns To addressthis question we investigate several machine learningtools and show how one can use historical data to con-struct informed investment strategies

Investment strategies and portfolio constructionMaking predictions and constructing a portfolio in thecontext of online peer-to-peer lending can be challeng-ing The volume of data available provides an opportu-nity to develop sophisticated data-driven methods Inpractice an investor such as Jasmin would seek to con-struct a portfolio with the highest possible return

FIG 3 Example of a detailed loan listing for a grade A loan Information available to investors includes thelength of the borrowerrsquos employment the borrowerrsquos credit score range and their gross income among others(source LendingClub website date accessed July 2018)

Table 1 Loan statuses in LendingClub

Number of days past due Status

0 Current16ndash120 Late121ndash150 Default150+ Charged-off

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 195

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 4: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

The data set includes hundreds of features includingthe following for each loan

1 Interest rate2 Loan amount3 Monthly installment amount4 Loan status (eg fully paid default charged-off)5 Several additional attributes related to the bor-

rower such as type of house ownership annualincome monthly FICO score debt-to-incomeratio and number of open credit lines

The data set used in this case study contains morethan 750000 loan listings with a total value exceeding$107 billion In this data set 998 of the loans werefully funded (at LendingClub partially funded loansare issued only if the borrower agrees to receive a partialloan) Note that there is a significantly larger number oflistings starting from 2016 relative to previous years

The definition of each loan status is summarized inTable 1 Current refers to a loan that is still being reim-

bursed in a timely manner Late corresponds to a loanon which a payment is between 16 and 120 days over-due If the payment is delayed by more than 121 daysthe loan is considered to be in Default If LendingClubhas decided that the loan will not be paid off then it isgiven the status of Charged-Offxx

These dynamics imply that 5 months after the termof each loan has ended every loan ends in one of twoLendingClub statesmdashfully paid or charged-off Wecall these two statuses fully paid and defaulted respec-tively and we refer to a loan that has reached one ofthese statuses as expired

One way to simplify the problem is to consider onlyloans that have expired at the time of analysis For ex-ample for an analysis carried out in April 2018 thisimplies looking at all 36-month loans issued on or

FIG 2 Example of loan listings (source LendingClub website date accessed May 2018)

xxNote that sometimes the lsquolsquoCharged-Offrsquorsquo status will occur before lsquolsquoDefaultrsquorsquo ifwhenthe borrower has filed bankruptcy or has notified the intermediary platformFor example if a borrower defaults on a loan in the last month of a 36-monthloan it would take another 5 months for the loan to be charged-off

194 COHEN ET AL

before October 31 2014 and all 60-month loans issuedon or before October 31 2012

As illustrated in Figure 4 a significant portion(135) of loans ended in Default status dependingon how much of the loan was paid back these loansmight have resulted in a significant loss to investorswho had invested in them The remainder was FullyPaidmdashthe borrower fully reimbursed the loanrsquos out-

standing balance with interest and the investor earneda positive return on his or her investment Therefore toavoid unsuccessful investments our goal is to estimatewhich loans are more (resp less) likely to default andwhich will yield low (resp high) returns To addressthis question we investigate several machine learningtools and show how one can use historical data to con-struct informed investment strategies

Investment strategies and portfolio constructionMaking predictions and constructing a portfolio in thecontext of online peer-to-peer lending can be challeng-ing The volume of data available provides an opportu-nity to develop sophisticated data-driven methods Inpractice an investor such as Jasmin would seek to con-struct a portfolio with the highest possible return

FIG 3 Example of a detailed loan listing for a grade A loan Information available to investors includes thelength of the borrowerrsquos employment the borrowerrsquos credit score range and their gross income among others(source LendingClub website date accessed July 2018)

Table 1 Loan statuses in LendingClub

Number of days past due Status

0 Current16ndash120 Late121ndash150 Default150+ Charged-off

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 195

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 5: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

before October 31 2014 and all 60-month loans issuedon or before October 31 2012

As illustrated in Figure 4 a significant portion(135) of loans ended in Default status dependingon how much of the loan was paid back these loansmight have resulted in a significant loss to investorswho had invested in them The remainder was FullyPaidmdashthe borrower fully reimbursed the loanrsquos out-

standing balance with interest and the investor earneda positive return on his or her investment Therefore toavoid unsuccessful investments our goal is to estimatewhich loans are more (resp less) likely to default andwhich will yield low (resp high) returns To addressthis question we investigate several machine learningtools and show how one can use historical data to con-struct informed investment strategies

Investment strategies and portfolio constructionMaking predictions and constructing a portfolio in thecontext of online peer-to-peer lending can be challeng-ing The volume of data available provides an opportu-nity to develop sophisticated data-driven methods Inpractice an investor such as Jasmin would seek to con-struct a portfolio with the highest possible return

FIG 3 Example of a detailed loan listing for a grade A loan Information available to investors includes thelength of the borrowerrsquos employment the borrowerrsquos credit score range and their gross income among others(source LendingClub website date accessed July 2018)

Table 1 Loan statuses in LendingClub

Number of days past due Status

0 Current16ndash120 Late121ndash150 Default150+ Charged-off

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 195

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 6: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

subject to constraints imposed by her risk tolerancebudget and diversification requirements (eg nomore than 25 of loans with grades E or F) In thiscase study we investigate the extent to which usingpredictive models can increase portfolio perfor-mance

On important thing that Jasmin will encounter as inmany real applications of predictive analytics is that itis far from trivial to progress from building a predictivemodel to using the model to make intelligent decisionsIn her prior classes Jasminrsquos exercises often ended withestimating the predictive ability of models on out-of-sample data She will do that here as wellmdashbut then shewill have to figure out how to estimate the return to ex-pect from an investment She will find that even with aseemingly good predictive model in hand estimatingthe return of an investment requires additional analysis

Questions and Teaching MaterialIn this section we present a number of teaching planseach of which introduces certain data science conceptsin the context of the case study presented in the StoryLine section We divide our analysis into six parts Intro-duction and Objectives Data Ingestion and CleaningData Exploration Predictive Models for Default Invest-ment Strategies and Optimization Each part is self-contained and includes several questions that could beassigned to students Each question is followed by detailedexplanations and pointers for class discussions Severalparts of this section refer to supplementary Jupyternotebooks that can be obtained from the companionwebsite for this case at guettacomlc_case together

with additional teaching materials The details of thedata used in this case study and a brief description of thestructure of the notebooks are provided in the Appen-dix

Part I Introduction and objectivesIn this part of the case study we begin by taking stockof Jasminrsquos problem and create a framework we willlater use to solve it

1 Fundamentally what decisions will Jasmin needto make

Solution We begin with this question to em-phasize the importance of grounding any studyof a data set in reality Indeed the best way towork with a data set will strongly depend onwhat our ultimate goal is

Arguably there are two decisions Jasmin mightneed to make here

She will first need to decide how much of hermoney to invest in LendingClub and howmuch to allocate to other options for investmentThis would of course also require data about herother options We are not considering this deci-sion in this case

Then once she has decided how much to investin LendingClub she will need to decide the exactloans in which to invest her budget This is the de-cision we focus on

We note that depending on which of the twodecisions Jasmin needs to make her data require-ments will be different as will the techniques shemight use

2 What is Jasminrsquos objective when making these de-cisions How will she be able to distinguish lsquolsquobet-terrsquorsquo decisions from lsquolsquoworsersquorsquo ones

Solution For this case we consider this prob-lem to have a clear and simple objectivemdashto makeas much money as possible

A discussion here should begin with a generaldescription of what this kind of performanceevaluation would look like Conceptually this isnot too difficultmdashJasmin should split her datainto two parts She should use the first part tomake her decisions and the second to evaluatethem Specifically her decisions will be whichloans in the second part to invest in and evaluat-ing her decision will require looking at the actualoutcome of those loans and seeing how muchmoney they returned

FIG 4 Proportion of Fully Paid versus Defaultfor terminated loans (by November 2017)

196 COHEN ET AL

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 7: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Students may be tempted to end the conversa-tion here This is an excellent place to emphasizethat things can sometimes sound very simple butcan be fiendishly difficult when the details areconsidered In particular in this case it is difficultto calculate exactly lsquolsquohow much moneyrsquorsquo a givenset of loans will return Consider for examplethe following

Some loans are 36 months long and someare 60 months long Given two loans withthe same interest rate and risk profile butdifferent lengths it is unclear which Jasminshould pick Clearly loan defaulting is a bad outcome

However loans default at different timesmdashsome loans will default soon after they aretaken out and some much later How shouldJasmin consider those variations Some loans might be repaid early before

they expire This presumably is undesir-able since it results in a lost opportunity toearn interest in the remaining period of theloan However how should Jasmin takethat into account

We discuss these issues in greater detail in alater section for now students should get theidea that things are not as simple as they look

A final complexity involves how to split thedata into these two parts Students might intuit

that the data should be split in time for thesetwo partsmdashwe discuss this in more detail later

3 Why would we even think past data would behelpful here How could Jasmin use past data tohelp make these decisions

Solution This is once again a question thatseems simple at first glance but hides somecomplexity

Conceptually Jasmin should look at past dataand use them to figure out what loan characteris-tics tend to indicate a loan will be lsquolsquogoodrsquorsquo

The first implicit assumption here revolvesaround the decision to look at each loan individ-ually rather than groups of loans Indeed if cer-tain groups of loans are correlated to eachother the estimation problem and correspondingstrategy will be far more complicated

Second it is unclear what we mean by a lsquolsquogoodrsquorsquoloan There are at least four things Jasmin mightwant to predict

Whether a loan will default Whether a loan will be paid back early If it defaults how soon will this happen If it is paid back early how soon will it be

paid back

There are however other possibilities for ex-ample combining one or more of these measuresWe discuss some of these later This of course isrelated to the discussion in the previous section

FIG 5 Comparison of the different classification models

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 197

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 8: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

4 Take a look at the data (the student will be askedto download the data in Part II below) Write ahigh-level description of the different lsquolsquoattri-butesrsquorsquomdashthe variables describing the loans Howwould you categorize these attributes Which doyou think are most important to an investorsuch as Jasmin

Solution The idea here is to start talking aboutthe data set understand the different attributestherein and potentially engage in a discussionof what the most important attributes might beIt is also an opportunity to emphasize that anydiscussion of the data set should be grounded inthe purpose of looking at the data in the firstplace (ie the points discussed above)

In particular a discussion could note thefollowing

Some attributes are related to the borrowersrsquocharacteristics (eg FICO score employ-ment status annual income) others are re-lated to the platformrsquos decisions (eg loangrade interest rate) and the rest are relatedto the loan performance (eg status totalpayment)

Students might also note that there issome overlap between some of these vari-ables in that the value of one might informthe value of another (eg all else beingequal a defaulted loan is likely to have alower total payment than a paid off one)

As mentioned above the variables relat-ing to return will need to be worked into aform that is useful for our analysis

Some attributes are categorical (eg employ-ment status) whereas some others are numeric(eg FICO score) The ways to handle categor-ical and numeric variables are different Some attributes are constantly updated

while some are set once and for all This isan important point to observe and we dis-cuss it in greater detail in the solution toquestion 5 (next)

Finally the discussion could mention the factthat investors will be most interested in the returnon their investments (either actual return or an-nualized return) Note that there is no variablein the data set with the return information ofeach loan Instead one needs to carefully calcu-late the return by using the data available (this

is far from trivial as discussed above and ingreat detail in Part III below) The relevant vari-ables for calculating the return are the loan statusthe total payment the funded amount the feesgenerated and the loan duration

5 When looking through the data you might havenoticed that some of these variables seem relatedFor example the total_pymnt variable is likely to bestrongly correlated to the loan status (Why) Whywould this matter and how would you check

Solution The purpose of this question is to in-troduce the concept of leakage Most generallyleakage is a situation in which a model is builtusing data that will not be available at the timethe model will be used to make a predictionLeakage is particularly insidious when thosedata give information on the target being pre-dicted Leakage is a subtle concept and can takemyriad forms This question and the next illus-trate two specific examples of leakage in the con-text of this case and should provide usefulmaterial to discuss the subject

This first example illustrates one form of leak-age in which a variable in the data set is highlycorrelated with the target variable Training amodel using this attribute is likely to lead to ahighly predictive model However at the timepredictions will need to be made (ie when decid-ing which future loans to invest in) these data willnot be available

In the specific example mentioned here the totalpayments made on the loan will trivially be corre-lated to all four of the measures mentioned above(loan default loan early repayment and time ofaforementioned events) Indeed if a loan defaultsor is paid back early total payments on the loanare likely to be lower Thus using that variablefor prediction will likely result in a strong modelperformance However when investing in futureloans Jasmin would not have access to the totalamount that would eventually be repaid on thatloan and thus a model trained using that variablecould not be used to make future predictions

6 Based on the variable names in the data it is un-clear whether the values of these variables are cur-rent as of the date the loan was issued or as of thedate the data were downloaded For examplesuppose you download the data in December

198 COHEN ET AL

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 9: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

2017 and consider the fico_range_low variablefor a loan that was issued in January 2015 It is un-clear whether the score listed was the score in Jan-uary 2015 or the score in December 2017 Whywould this matter and how might you check

Solution This is another more general form ofthe type of leakage observed in the previous ques-tion Many of the variables in the data set are infact updated over time and in some cases the up-date is highly indicative of the outcome For ex-ample if a loan defaults the borrowerrsquos FICOscore might go down Thus for the same reasonsas above using this score in a predictive modelwould result in overly optimistic results

Detecting this form of leakage is difficult givena static data set High correlation with the targetvariable is certainly one diagnostic that could beuseful The best method to eliminate leakage isto obtain data from the time point and context inwhich they would have been used We may haveto simulate this for example going back to data-base files from the point in time where the decisionwould have been made (which may be different forevery decision) Even detecting the possibility ofleakage is very difficult One method is to comparethe data from the time point of decision-makingwith later data to see what variables have changedWe illustrate this approach below

Part II Data ingestion and cleaningIn this part of the case we ingest the LendingClub dataand clean them to make sure they can be used formodel fitting

1 Download the LendingClub data set from theLendingClub website and read it into your pro-gramming language of choice You will noticethe data are provided in the form of many indi-vidual files each spanning a certain time periodCombine the different files into a single data set

Solution See the ingestion_cleaning notebookThe code provided illustrates a few key features

of Pandas

Pandas basic abilities for indexing tablesconcatenating different tables and so on The ability to read from zip files directly

without decompressing them The ability to read comma separated values

(CSV) files in which some lines are irrele-vantmdashthese lines can simply be skipped

When combining the multiple files the codealso carefully

ensures the different files have identical for-mats ensures each file has the same set of primary

keys (loan IDs) and ensures the loan IDs are indeed a unique pri-

mary key once the loans are joined

2 Together with this case you may have received acopy of the data downloaded from the Lending-Club website in 2017 If so compare these twosets of filesmdashdoes anything appear amiss

Solution This is a practical illustration of theissue of leakage described above Students mightbe tempted to use every attribute from the data intheir models Unfortunately this would be ill-advised because LendingClub updates some ofthese variables as time goes by Thus many of thevariables in the data table will contain data thatwere not available at the time the loan was issuedand therefore should not be used to create an invest-ment strategy

As discussed above one common pitfall thatarises in building and assessing predictive modelsis leakage from a situation in which the value ofthe target variable is known back to the settingin which we evaluate the modelmdashwhere we arepretending that the target variable is not knownThis leakage can occur for example through an-other variable correlated with the target variableNote that leakage is insidious because it often af-fects both the training and test data and so typ-ical evaluations will give overly optimistic results

The notebook contains code that looks at thetwo sets of files and compares values that mayhave changed between them It also ensures thatattributes of interest have not changed too much

It is interesting to note that the interest_ratevariable does change occasionally after the loanis issued We will eventually get rid of this vari-able in building our models so this should notbe too worrying but it is worth noting

3 Remove all instances (in our case rows in the datatable) representing loans that are still current (iethat are not in status lsquolsquoFully Paidrsquorsquo lsquolsquoCharged-Offrsquorsquoor lsquolsquoDefaultrsquorsquo) and all loans that were issued be-fore January 1 2009 Discuss the appropriatenessof these filtering steps

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 199

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 10: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Solution This is straightforward to do (see thenotebook) but hides considerable complexity

Indeed looking back at the body of the casethe method we suggested there to select loanswas subtly different We suggested that weshould only look at 36-month loans issued be-fore October 2014 and 60-month loans issuedbefore October 2012 This would ensure thatall loans in the period selected had expired (al-though it would introduce a correlation betweenthe issue date of the loan and its length in ourdata)

Using the method here instead has two effects

We oversample loans issued earliermdashindeedloans that are issued earlier are more likelyto have expired at the time of our analysisThis is less of an issue for two reasons

mdashLendingClub has grown tremendouslyover the past few years and the number ofloans issued by the platform has thereforeincreased Thus a time imbalance alreadyexists in our data set regardless

mdashAs demonstrated below our modelsare remarkably stable across the time hori-zon consideredmdashmodels trained on the2009 data perform just as well on the 2017data as models trained more recently

We oversample shorter loansmdashindeed atany given point in time the set of all expiredloans will by definition contain a largernumber of shorter loans (ie loans thathave expired early)

This is a more worrisome issue sinceshorter loans might be more or less likelyto default This in turn might skew ourevaluation metrics Suppose that in our fulldata set the average default rate is x Add-ing a greater proportion of shorter loansmight change this number Our evaluationwould then be against this new numberand might not reflect the actual perfor-mance of the model out of sample

Nevertheless discarding these loanswould result in our throwing away a largeamount of data that could be used to trainthese models We therefore keep thoseloans in this case but performing the anal-ysis without these loans would form an ex-cellent follow-up exercise

4 Visualize each of the attributes in the file Arethere any outliers If yes remove these

Solution This question is an opportunity todemonstrate different visualization methods ap-propriate for the different attributes The inges-tion_cleaning notebook demonstrates the use ofbox-and-whisker plots for continuous variablesand histograms for discrete variables but addi-tional methods could also be introduced

This data set is also interesting in that many at-tributes do not have a clear point at which outliersshould be cut off The notebook suggests somecutoff points but they are by no means the onlypossible ones

An interesting discussion could be aroundsituations in which outliers appear in target var-iables or in leakage variables (see Question 5 inPart 1 above) Since the values of these variableswould not be known at the time of learning oruse removing instances with outlying valuesin these variables could render the data set un-realistic For simplicity we ignore this issue inour analysis

5 Save the resulting data set in a Python lsquolsquopicklersquorsquo Forthe sake of this case restrict yourself to the follow-ing attributes id loan_amnt funded_amnt termint_rate grade emp_length home_ownershipannual_inc verification_status issue_d loan_statuspurpose dti delinq_2yrs earliest_cr_line open_accpub_rec fico_range_high fico_range_low revol_-bal revol_util total_pymnt and recoveries

Solution This question introduces the use ofpickles a useful tool in Python In a later ques-tion we discuss the rationale for selecting thesespecific attributes

Part III Data explorationIn this part we explore the data we cleaned in the pre-vious part

1 The most important data we will need in deter-mining the return of each loan are the total pay-ments that were received on each loan Thereare two variables related to this informationmdashtotal_pymnt and recoveries Investigate thesetwo variables and for each loan determine thetotal payment made on each loan

Solution The purpose of this question (andindeed of every question in the data preparationpart of this case) is to encourage students to be

200 COHEN ET AL

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 11: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

hypercritical of data they obtain and to fully un-derstand them before analyzing them

In this case there are two variables of concernmdashtotal_pymnt which presumably contains the totalpayment made on that loan and recoveries whichpresumably contains any money recovered afterthe loan defaulted A question students shouldcome to is lsquolsquodoes the total_pymnt variable includethose recoveries or do they need to be added onrsquorsquo

To investigate this matter an additional dataset is required also available from LendingClubThe data set listsmdashfor every loanmdashevery paymentthat was made chronologically on that loanUsing this data set the ingestion_cleaning note-book confirms that the total_pymnt variabledoes include all payments

This question also introduces another impor-tant Python concept the handling of files thatare too large to fit in memory In this case the de-tailed payment file is so large that it cannot beloaded all at once on most personal computersthe notebook uses a Pandas iterator to considerthe file line-by-line in performing this check

2 A key measure we will need in working out an in-vestment strategy is the return on each loandefaulted or otherwise How might you calculatethis return Add this new variable to the data

Solution At first sight this question appearssimple As mentioned at the outset in reality itis anything but Calculating the return is compli-cated by two factors (1) the return should takeinto account defaulted loans which usually arepartially paid off and (2) the return should alsotake into account loans that have been paidearly (ie before the loan term is completed)

This issue could lead to an interesting class dis-cussion A more complex way to handle this taskis to build a dynamic model in which we take intoaccount potential future reinvestments if a loan isrepaid early

Assuming we want to avoid this complexitythe following three methods are among themost obvious ways to convert this to a static prob-lem Before we list these methods we define thefollowing notation

f is the total amount invested in the loan p is the total amount repaid and recovered

from the loan including monthly repay-ments and any recoveries received later

t is the nominal length of the loan in months(ie the time horizon the loan was initiallyissued for this will be 36 or 60 months) m is the actual length of the loan in

monthsmdashthe number of months from thedate the loan was issued to the date the lastpayment was made

Having established this notation the threemethods are as follows Method 1 (M1mdashPessimistic) supposes that

once the loan is paid back the investor isforced to sit with the money without rein-vesting it anywhere else until the term ofthe loan In some sense this is a worst-case scenariomdashinterest is only earned untilthe loan is repaid but the investor cannotreinvest

Under this assumption the (annualized)return can be calculated as

p ff

middot12t

The downside of this method is obvi-ousmdashthe assumptions are hardly realistic

Note that this method handles defaultsgracefully without artificially blowing upthe returns Indeed since the investor wasinitially intending for his or her investmentto remain lsquolsquolocked uprsquorsquo for the term of theloan it is reasonable to spread the resultingloss over that term

For loans that go to term (ie are not re-paid early and do not default) this methodalso treats long- and short-term loans in thesame way For loans that are repaid earlyhowever this method favors short-termloans because the gain realized before theloan is repaid is spread over a shorter timespan Similarly for loans that default earlythis method favors long-term loans becausethe loss is spread out over a greater timespan

Method 2 (M2mdashOptimistic) supposes thatonce the loan is paid back the investorrsquosmoney is returned and the investor can im-mediately invest in another loan with exactlythe same return

In that case the (annualized) return caneasily be calculated as

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 201

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 12: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

p ff

middot12m

The upside of this method is that it issimple and takes into account the fact thatfunds would (or could) be reinvested Themoney is indeed returned to the investorafter the loan is repaid and the investor isable to reinvest the cash

It is worth noting that this method isequivalent to simply considering the annu-alized monthly return of the loan over thetime it was active effectively treating aloan that was repaid early and a loan thatwent to term in the same way

There are however two drawbacks Thefirst is the assumption that the cash canbe reinvested at the same rate (althoughthe method could be modified to assumethe cash can be reinvested at a lowerratemdasheg the prevailing prime rate) Thesecond more worrying drawback is that ifa loan defaults early annualizing the losscan result in a huge overestimate ofthe negative return Indeed if a loan de-faults in the first month the investorloses 100 of the investment This is themaximum loss but annualizing it wouldlead to a 1200 lossmdashin other words wewould be assuming the investor reinvestsin an equally risky loan for the 11 remain-ing months of the year each of which de-faults in 1 month Hardly realistic Wetherefore use the following two-pieceformula

p ff 12

m if p f gt 0p f

f 12t if p f 0

(

One last feature worth noting about thismethod is that it treats short and long notesequally Indeed since it is assumed the in-vestor can always reinvest in a note of equalreturn immediately after this note ends theactual term of the loan does not matterDepending on Jasminrsquos priorities thiscould be appropriate or it could notmdashinparticular whether this is appropriate willdepend on the horizon over which she isplanning her investments

Method 3 (M3) considers a fixed time hori-zon (eg T months) and calculates thereturn on investing in a particular loanunder the assumption that any revenuespaid out from the loan are immediately rein-vested at a yearly rate of i compoundedmonthly until the T-month horizon is over(throughout this case study we consider a5-year horizon ie T = 60)

The upside of this method is that it is clos-est to what would realistically happen Itequalizes all differences between loans of dif-ferent lengths and correctly accounts fordefaulted loans The downside is that it un-dervalues the time value of money (in realitysome investors would be unlikely to reinvestat the prime rate and more likely to invest inhigher grossing securities) Of course thiscould be remedied by adapting the value ofthe rate (Indeed if one had an expectedrate of return for the group of loans fromwhich the loan in question was drawn thenthat return could be used yielding possiblydifferent projected future investment returnsfor different loans)

Assuming our notation above and as-suming the payouts were made uniformlythrough the length m of the loan eachmonthly payment was of size pm Assum-ing these are immediately reinvested wecan use the sum of a geometric series tofind the total return from the f initiallyinvested

12T 1

fpm

1 (1thorn i)m

1 (1thorn i)

(1thorn i)T m f

In this case study we report results for allthree methods

3 As discussed in the case LendingClub assigns agrade to each loan from A through G Howmany loans are in each grade What is the defaultrate in each grade What is the average interestrate in each grade What about the average per-centage (annual) return Do these numbers sur-prise you If you had to invest in one gradeonly which loans would you invest in

Solution This is a great opportunity to illus-trate aggregation and lsquolsquogroup byrsquosrsquorsquo in Python

202 COHEN ET AL

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 13: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Here we also use different definitions of returnsas discussed above

Grade ofloans Default

Avinterest

Mean return

M1 M2 M3 (12) M3 (3)

A 1668 633 722 166 389 205 371B 2886 1348 1085 158 501 202 368C 2799 2241 1407 062 539 139 302D 1544 3037 1754 005 571 092 251E 759 3883 2073 091 595 011 164F 272 4499 2447 143 643 044 105G 073 4816 2712 258 666 140 005

Unsurprisingly the lower the grade the higherthe probability of default and the higher the aver-age interest rate LendingClub charges

It is interesting to note the trend in returns formethods M3 as we move from grade A to grade Band so on In an ideal world we might expect thereturns to be more-or-less identical across gradesEven though default rates are higher for lowergrades the interest rates are much highermdashLendingClub raises the interest rates there tomake up for this increased chance of default Itis therefore good to see that returns do not dropprecipitously for lower grades

The fact that returns do eventually drop faster forlower grades implies either that LendingClub mightfind it harder to set interest rates to exactly offset de-faults for those grades (perhaps as a result of in-creased volatility for lower grade loans) or thatour definition of return is different from Lending-Clubrsquos As we saw there are many ways to definereturn depending on our objective and it is conceiv-able that LendingClubrsquos objectives are different fromthe ones we outline above The rest of this case fo-cuses specifically on making investment decisionswith respect to the objectives highlighted above

Part IV Predictive models for defaultIn this section we use predictive analytics to predicthow likely a loan is to default

1 Using the data provided implement models topredict the probability each loan defaults Youmay want to try the following models decisiontree random forest logistic regression (lsquo1 andlsquo2 penalized) naive Bayes and multilayer percep-tron Carefully explain how you selected optimalmodel parameters and how you evaluated eachmodelmodeling procedure

Solution This question provides an excellentopportunity to teach a number of topics

Each of the models mentioned in thequestion How each of those models is learned from

data Hyperparameter tuning through cross vali-

dation Model evaluation for classification models and Nested cross-validation (ie making sure

that the data used for hyperparameter tun-ing are not used when measuring the ulti-mate performance)In discussing the evaluation of classification

models the following two important measurescould be discussed How well do the modelrsquos estimates of class

probability actually order the loans by theirlikelihood of default This is measured bythe area under the ROC curve (AUC)which is equivalent to the MannndashWhitneyndashWilcoxon statistic Technically the AUCmeasures the following Given two randomlyselected loans one that defaults and one thatdoes not what is the probability that themodel will assign a higher default probabilityto the defaulting loan A model that can per-fectly discriminate defaults from nondefaultswould have an AUC of 10 The calibration of the modelmdashthe extent to

which the probabilities predicted by themodel correspond to the frequency of theevent happening for some natural groupingof events Typically this is measured by con-sidering instances grouped into bins of similarestimated probabilities Since our goal is to usethe probabilities output by the model to calcu-latepredict the expected return of loans thecalibration of the estimated probabilities isimportant here However note that we reallyneed to look at calibration in tandem with ameasure such as AUCmdasha model that forevery example predicted the base rate (thedata set or population default frequency inthis case) would be perfectly calibrated andwould not discriminate cases at allSee the next question for a summary of the

results on a more restricted set of features Asan example using the full set of features in the

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 203

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 14: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

notebook learning a random forest (using theprocedure specified in the notebook) yields anout-of-sample AUC of 070

This is also a good place to discuss calibra-tion and associated diagnostics see the solutionfor the next question for a discussion of this point

One point that bears discussing in more de-tail is the method used for training testing andcross validation In this instance there are atleast two methods we could use Randomly assign each loan to a training set

or a testing set Within the training set ran-domly assign loans to one of the folds for K-fold cross-validation Set a lsquolsquocutoffrsquorsquo date Assign all loans that were

issued before that date to the training set andall loans after that date to the evaluation setWithin the training set create folds using asliding time window

Arguably the latter method is more appropri-ate for this case because it correctly reflects theway the model will in fact be used in practice(trained on an earlier period and then evaluatedon a later period) The first method on theother hand might be more desirable for two rea-sons first it is simpler to implement using stan-dard libraries thus it is likely to be the firstmethod tried by a data scientist even if she orshe were planning to eventually apply the secondmethod Also the first method allows many thou-sands of different traintest sets to be createdusing different random seeds This in turn canbe used to obtain estimates of various errors inthe estimated models as discussed below Thiswould be difficult to do using the second methodFor this reason we use the random assignmentmethod here The intrepid student could be en-couraged to compare with the second method

(NB To ensure that the first method is not tooproblematic we later assess the stability of ourmodels over time by looking at whether a modeltrained in 2009 performs worse in 2017 than amodel trained on more recent data We find thatour modelrsquos performance is remarkably stable)

2 After learning and evaluating these models Jas-min realized that the attributes she used in hermodels were not all underlying facts about theloan applicants but possibly statistics calcu-lated by LendingClub using its own models She

wanted to assess whether the predictive powerof her models came simply from LendingClubrsquosown models Carry out this investigationmdashwhatare your conclusions

Solution This question provides a good op-portunity to discuss the ways information fromsome attributes can be incorporated in othersStudents may be tempted to simply drop thelsquolsquoGradersquorsquo attribute and proceed However in real-ity the attributes describing the interest rate andinstallment amounts also reflect the grade (thehigher the grade the lower the interest rate andinstallment amounts)

Fitting logistic regressions based on grade onlyor interest rate only we obtain models with out-of-sample AUCs of 068 Using only these single var-iables provides just as much predictive power asthe entire set of variables involved as we achievethis same AUC using logistic regression with allthe features

As mentioned Jasmin wanted to assess modelsusing the underlying data only not the attributesderived by LendingClub Removing these attri-butes and refitting the models above we obtainthe following performance values (note thatthese are averages over many different traintestsplitsmdashsee note 1 below)

Model Out-of-sample AUC

Naive Bayes 065lsquo1-penalized logistic regression 069lsquo2-penalized logistic regression 069Decision tree 066Random forest 069Multilayer perceptron 069

The first interesting observation is that the aboveclassifiers uniformly perform substantially betterthan random Logistic regression performing aswell (in this sense) as the nonlinear models suggests(i) that interactions between our variables are notcrucial in modeling probability of default or (ii)that we do not have sufficient training data tolearn the interactions well or (iii) the AUC doesnot reveal the advantage of learning nonlinearities

For simplicity going forward we use randomforest in the rest of this case study The studentmay be challenged to compare the results usingother models

There are three additional points that shouldbe highlighted

204 COHEN ET AL

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 15: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

(a) Examining a modelrsquos performance overa single partition of the data into cross-validation folds can be misleading as it ispossible that the partition gives particularlygood or bad trainingtest splits by chanceTo ensure the robustness of the results it isadvisable to attempt the operations overmany cross-validation runs each using a dif-ferent seed to generate the partitions and thetraintest split The code provided allows thisto be done easily by adjusting the value of theseed We performed 200 independent itera-tions with different seeds and reported theaverage values above The following violinplots (another technique worth discussing)presented in Figure 5 illustrate the differentresults obtained with each method

(b) As discussed in the solution to the pre-vious question the AUC only measuresthe ranking performance of the modelAnother important measure is the calibra-tion which measures whether probabilitiesproduced by the model are correct Thenotebook provided also produces a test ofcalibration for each model and randomforests also perform particularly wellthere In Figure 6 the x-axis correspondsto the default probability predicted by themodel and the y-axis represents the actualproportion of defaulted loans in the test set

3 After modifying her model to ensure she did notinclude data calculated by LendingClub or leak-age affecting the target variable Jasmin wantedto assess the extent to which her scores agreedwith the grades assigned by LendingClub Howmight she do that

Solution This provides a good opportunity tointroduce the idea of rank correlation and a mea-sure for it such as Kendallrsquos tau coefficient Formost of the models considered the coefficient isabove 05 implying a pretty good agreement be-tween the LendingClubrsquos grades and our scores

4 Finally Jasmin had one last concern She wasacutely aware of the fact the data she was usingto train her models dated from as far back as2009 whereas she was hoping to apply themgoing forward She wanted therefore to investigatethe stability of her models over time How mightshe do this

Solution The notebook presents examples ofmodels that are trained over an early periodand evaluated on a later period The performanceis remarkably stable throughout

5 Go back to the original data (before cleaning andattribute selection) and fit a model to predict thedefault probability using all attributes (For thesake of simplicity it will be sufficient to limityourself to the following attributes id loan_amnt funded_amnt funded_amnt_inv termint_rate installment grade sub_grade emp_titleemp_length home_ownership annual_inc veri-fication_status issue_d loan_status purposetitle zip_code addr_state dti total_pymntdelinq_2yrs earliest_cr_line open_acc pub_reclast_pymnt_d last_pymnt_amnt fico_range_highfico_range_low last_fico_range_high last_fico_range_low application_type revol_bal revol_util recoveries) Does anything surprise youabout the performance of this model (out-of-sample) compared with the other models youhave fit in this section

Solution This question strikingly illustrates theconcept of leakage As mentioned above a numberof variables represent data not available at the timethe loan was issued In all the models constructedthus far we have carefully removed these attributesIn this question we reintroduce them (eg last_fi-co_range_high uses the applicantrsquos most recentlyavailable FICO score rather than the score at thetime the loan was issued)

Unsurprisingly using these additional attri-butes improves the modelrsquos performance Whatis truly striking is how much they improve itUsing those attributes we obtain an out-of-sampleAUC of over 099 (see the implementation and re-sults in the modeling_leakage notebook) Indeed anumber of write-ups of analyses of this data setfound online report similar very high performancesof their models without any caveats Guiding stu-dents through the process of getting such a highpredictive performance and then realizing howmisleading the result is could result in a livelyclass discussion

Throughout this case we have instructed stu-dents to keep only certain attributes after cleaningthe data this question provides an opportunity tojustify the set of attributes that was selectedReferring back to Figure 3 the set of attributes

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 205

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 16: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

we chose is precisely the intersection of the attri-butes that are available at the time of investmentand those attributes in the historical data that arenot updated over time Thus the model that re-sults is one that could realistically be used by a po-tential investor such as Jasmin

Part V Investment strategiesJasmin was finally ready to start building investmentstrategies It is worth noting that there are many poten-tial strategies Jasmin could try We consider fourstrategies here but interest students should be en-

couraged to try others too The strategies we con-sider here are as follows

Random strategy (Rand)mdashrandomly picking loansto invest in

Default-based strategy (Def)mdashusing the random forestdefault-predictor models described above Thensorting loans by their estimated default probabilitiesand selecting the loans with the lowest probabilities

FIG 6 Calibration curves of different classification models (a) Naive Bayes (b) logistic regression (Ridge) (c)decision tree and (d) random forest Together with each calibration curve we include histograms representingthe total number of data points with each predicted probability Calibration results are sometimes unstable inregions with very few points It is interesting to note that (as expected) due to the conditional independenceassumption Naive Bayes is more likely to predict extreme scores

Another set of strategies relies on the intuition that any lsquolsquoedgersquorsquo obtained overLendingClub in this case boils down to findingmdashwithin all loans with a certaininterest ratemdashthose that perform best Thus one could imagine a whole class ofstrategies that select the best loans within each grade

206 COHEN ET AL

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 17: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Simple return-based strategy (Ret)mdashtraining a sim-ple regression (eg Lasso random forest regres-sor multilayer perceptron regressor) model topredict the return on loans directly Then sortingloans by their predicted returns and selecting theloans with the highest predicted returns

Default- and return-based strategy (DefRet)mdashtraining two additional modelsmdashone to predictthe return on loans that did not default and one topredict the return on loans that did default Thenusing the probability of default predicted by therandom forest model above to find the expectedvalue of the return from each future loan Invest inthe loans with the highest expected returns

Note that all these strategies make the assumptionthat if Jasmin decides to invest in a loan she must in-vest in the loan in its entirety (ie she cannot invest in a

fraction of a loan) In reality LendingClub does allowinvestors to invest in fractions of loans and incorporat-ing this additional complexity could further improveher returns

1 First consider the three regression models de-scribed above (regressing against all returnsregressing against returns for defaulted loansand regressing against returns for nondefaultedloans) In each case try lsquo1 and lsquo2 penalized linearregression random forest regression and multi-layer perceptron regression

Solution Again this section could lead to a dis-cussion of each of these methodologies as well aswhether the evaluation of the regression resultsusing standard metrics gives insight into whetherthey will be helpful in solving the business prob-lem The following table and Figure 7 comparethe R2 scores for these models Do they performwell (Again these numbers are averages over anumber of traintest splits) Can you tell

FIG 7 Comparison of the different regression models (a) M1PESS (b) M2OPT (c) M3 (12) and (d) M3 (3)

This is an instance of lsquolsquoanalytical engineeringrsquorsquo described in detail by ProvostFawcett1 where problems are decomposed into subproblems that are solved bywell-understood data science methods and then composed to create theultimate solutionmdashoften guided by an expected value framework

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 207

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 18: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Model

R2 scores for each return definition

M1 M2 M3 (12) M3 (3)

lsquo1 regressor 0026 0012 0027 0027lsquo2 regressor 0026 0013 0027 0027Multilayer perceptron regressor 0016 0004 0017 0017Random forest regressor 0028 0018 0029 0032

2 Now implement each of the investment strategiesdescribed above using the best performing regres-sor In particular suppose Jasmin were to investin 1000 loans using each of the four strategieswhat would her returns be

Solution See the notebook for implementationdetails The results are presented in the followingtable (the results are averaged over 200 independentiterations) and in Figure 8 The last row (Best possi-ble) corresponds to the top 1000 performing loansin hindsight that is the best 1000 loans Jasmincould have picked

Return calculation method

M1-PESS M2-OPT M3 (12) M3 (3)

Rand 06 52 12 27Def 19 53 14 30Ret 26 56 16 32DefRet 30 57 17 33Best possible 120 271 101 117

In all cases the two-stage analytical engineer-ing strategy performs best The following pointsmay be relevant in a class discussion

The random returns are positive in all casesthis is to be expected and indicates thatLendingClub does set interest rates to offsetthe default risk at least to some degree In light of the first point it is especially

pleasing to see our methods significantlyoutperforming random investmentsmdashgivenLendingClubrsquos ability to set interest rate tooffset defaults extracting value throughthese investments is nontrivial

One point worth highlighting here is the factit is crucial that the loans used in the simulatedtests are not the loans that were used in the train-ing set for building the model The notebook pro-vided ensures that this is indeed the case

An additional interesting observation fromthe plots above is the following

For a given investment strategy (Rand DefRet DefRet) the return performance im-

proves as the return definition is more opti-mistic (as expected) Under a given return definition (M1-PESS

M2-OPT M3) the return performance im-proves as the investment strategy becomesmore sophisticated The highest perfor-mance is obtained for the DefRet strategy

3 How might Jasmin test the stability of theseresults

Solution See the notebook for implementationdetails As described above we use a variety oftraintest splits

4 The strategies above were devised by investing in1000 loans Jasmin is worried however that thestrategy is not scalablemdashin other words if shewanted to increase the number of loans shewanted to invest in she would eventually lsquolsquorunoutrsquorsquo of good loans to invest in Test this hypoth-esis using the best strategy above

Solution The graph in Figure 9 illustrates theexpected return obtained using the M1 returndefinition and the DefRet strategy (a random for-est classifier and a random forest regressor) as afunction of the portfolio size (which varies be-tween 1000 and 9000 loans) These results wereobtained by running a single run out-of-sample

As one can see from Figure 9 the return de-creases with the portfolio size Why is it thecase One explanation is as follows when theportfolio size increases the proportion of goodperforming loans decreases As a result it ishard to maintain the same return performanceThis observation can also be used to discuss thefact that our analysis here assumes that we canpick any of the historical loans In practice how-ever investors are limited to a small number ofavailable loans at each point in time

Part VI OptimizationIn this section we investigate ways Jasmin might im-prove her investment strategy using optimizationCan you formulate the investment problem as an opti-mization model How does your model perform com-pared with the best method above To solve theoptimization problem you can use a solver such asGurobi or CPLEX (free academic licenses are avail-able) The reference manual for Gurobi is available5

Solution This part of the case provides an opportu-nity to teach basic optimization using Python In

208 COHEN ET AL

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 19: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

FIG 8 Comparison of the different investment strategies (a) M1PESS (b) M2OPT (c) M3 (12) and (d)M3 (3)

FIG 9 Returns of the DefRet strategy for different portfolio sizes

209

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 20: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

addition it touches on several modeling ideas thatcould be discussed in class

We begin by creating a binary variable xi 2 f0 1gfor every loan in our data set In particular

xi = 1 if we invest in loan i0 otherwise

We let ri denote the predicted expected return of loan iand Ai the loan amount We let pi denote the predictedprobability loan i will default Finally we let N denotethe number of loans we want to invest in (ie the de-sired size of the portfolio) Throughout this sectionthe return we use for ri is based on M1-PESS

We begin by formulating a simple integer programequivalent to strategy Def above (ie sorting loans bytheir default probabilities and selecting the loans withthe lowest default probabilities)

minx

+i

xipi

st +i

xi = N

xi 2 f0 1g

As we can see from the notebook this strategy yieldsa 236 return (run over a single trainingtest split)Note that one could further incorporate several (linear)constraints such as imposing a maximal allowed valuefor pi

Similarly the strategies in the previous section canall be written in this optimization form For exampleif ri were computed based on DefRet we could rewritethis strategy as follows

minx

+i

xiri

st +i

xi = N

xi 2 f0 1g

This yields a 299 return identical to the returnobtained above as expected

We now aim to improve our return using optimizationOur first attempt is to directly maximize the total revenueof the portfolio (instead of maximizing the return) Thisleads to the following optimization problem

maxx

+i

xiAiri

st +i

xi = N

xi 2 f0 1g

This strategy yields a 259 return that is below ourbest performance of 3

One way to improve this formulation is to add somecomplexity to the way Jasmin constrains the number ofloans she invests in The program above constrains thenumber of loans but does not take into account theamount of each loan Instead Jasmin might add a bud-get constraint representing the total dollar amount shewants to invest We denote this budget by L This leadsto the following problem

maxx

+i

xiAiri

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a lsquolsquoknapsack problemrsquorsquoNote that we replaced the portfolio size equality con-straint (+ixi = N) by two inequality constraints toallow some additional flexibility By testing differentbudget constraints we can increase the model perfor-mance to 318

Finally a widely used approach in portfolio opti-mization is to explicitly consider the standard devia-tion (or variance) of the return that is to account forthe portfolio risk Unlike stocks in the equity marketit is not clear how to estimate the standard deviationof the return for a particular loan as we do not ob-serve the performance of the same loan repeatedlyWe therefore propose to estimate the standard devi-ation by grouping several loans together using thek-means clustering method The procedure goes asfollows

1 Fit a k-means clustering model on the training set(the parameter k can be tuned by the investor)

2 For each cluster compute the standard deviationbased on predicted returns

3 For each loan in the test set generate the clusterlabel by assigning the loan to the lsquolsquoclosestrsquorsquo clusterSpecifically we compute the Euclidean distancebetween the loan data point and the center ofeach cluster We then assign the loan to the clus-ter with the smallest distance

4 The standard deviation of the return for each loanin the test set would then be approximated by thestandard deviation of its cluster

For each loan i we denote by s(i) its cluster label andby rs(i) the standard deviation of return One possibleformulation is to maximize the total revenue while

210 COHEN ET AL

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 21: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

imposing a standard deviation tolerance G (note thatwe assume independence across the different loans)

maxx

+i

xiAiri

st +i

xiAi L

+i

xirs(i) G

+i

xi N

+i

xi 09N

xi 2 f0 1g

This problem is an instance of a multi-knapsack prob-lem An alternative formulation is a Markowitz-typemodel which seeks to balance the riskndashreturn trade-off

maxx

+i

xi ri brs(i)eth THORNAi

st +i

xiAi L

+i

xi N

+i

xi 09N

xi 2 f0 1g

Here b is the sensitivity parameter that is manually setby the investor depending on the risk aversion toler-ated As before several constraints can be incorporatedinto the formulation (eg diversification constraintsthat impose selecting loans of different grades) Bycarefully tuning the different parameters this modelyields a return of 321 Since extensive tuning is re-quired for the optimization model we do not includebootstrap simulation results here (this could be doneas an optional exercise)

In this case study under the M1-PESS return defini-tion using machine learning tools allowed us to obtaina return of 3 (using the DefRet strategy) By combin-ing machine learning and optimization we could po-tentially increase our return to 321 that is a 7relative improvement

Looking ForwardAs discussed at the outset this case study is intended tosupport learning about data science not to present thebest way to solve this problem On the website for thecase we invite instructors and other readers to discussalternatives and we will continue to add to the accom-panying notebooks as different solutions twists andturns reveal themselves The website will also containa version of the case without the teaching notes incase instructors do not want students to see the lsquolsquosolu-tionsrsquorsquo immediately

Over the past two decades the material available forteaching data science has improved substantially How-ever there is still a lack of real case studies designed forteaching data science that work all the way from pro-curing the data to solving a real problem We hope thatthe work we have done to make this case study avail-able to instructors students and professionals will mo-tivate the creation and dissemination of other similarcase studies

AcknowledgmentsOur sincere thanks go to Marcela Barrios Brian Dales-sandro Tom Fawcett and Claudia Perlich for detailedcomments on prior drafts of this case study

Author Disclosure StatementNo competing financial interests exist Foster Provost iscoauthor of the book Data Science for Business

References1 Provost F Fawcett T Data Science for Business What you need to know

about data mining and data-analytic thinking Sebastopol CA OrsquoReillyMedia Inc 2013

2 McKinney W Python for data analysis Data wrangling with PandasNumPy and IPython Sebastopol CA OrsquoReilly Media Inc 2012

3 Raschka S Mirjalili V Python machine learning Birmingham UK PacktPublishing Ltd 2017

4 Boyd S Vandenberghe L Convex optimization Cambridge UK CambridgeUniversity Press 2004

5 Gurobi Optimization Inc 2014 Gurobi Optimizer Reference Manual 2015Available online at www gurobicom (last accessed July 1 2018)

Cite this article as Cohen MC Guetta CD Jiao K Provost F (2018)Data-driven investment strategies for peer-to-peer lending a casestudy for teaching data science Big Data 63 191ndash213 DOI 101089big20180092

Abbreviations UsedAUC frac14 area under the ROC curveCSV frac14 comma separated valuesSEC frac14 Securities and Exchange Commission

Appendix Data for the Case and NotebooksIn this section we briefly describe the data that can beused to reproduce the analyses above as well as the note-books used throughout The notebooks are available fromthe companion website for this case (guettacomlc_case)

DataAs mentioned before LendingClub makes past loandata freely available for download on its website

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 211

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 22: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

Instructors may download the most recent data herehttpswwwlendingclubcominfodownload-dataaction The loans are organized in multiple files oneper quarter (by date of loan issuance) and can bedownloaded in compressed CSV format Certain vari-ables are only available to logged in LendingClub users

As discussed at length in the body of the case thesefiles are constructed with the most recent data availableFor example the data contain information about thenumber of overdue accounts in the applicantrsquos namemdashif new accounts have become overdue after the loanwas issued they will appear in the file The case con-tains analysis that determines which attributes arestatic and which attributes are updated over timeHowever the analysis we used required old files down-loaded before the current file to detect such changesThese old files are not available for download fromLendingClubmdashthe only way to obtain them is to down-load and then wait for a new version The authorsrsquo copyof the data files can be provided see the website fordetails

A small part of the case requires one additional dataset As we noted the data set contains both a total_-pymnt attribute and a recoveries attribute To ensurethat the former includes the payments detailed in thelatter we need a detailed history of every paymentmade for each loan These are available to logged inusers from LendingClub at httpswwwlendingclubcomcompanyadditional-statistics

The Ingestion_Cleaning NotebookThe first notebook provided with this case is concernedwith the ingestion of the data from LendingClub andtheir cleaning

The steps in this notebook are as follows

We begin by defining the parameters of the inges-tionmdashin particular specifying the attributes wewill be reading from the files as well as those cor-responding to continuous and discrete features We ingest the data directly from the compressed

CSV files downloaded from LendingClub (boththose most recently downloaded in 2018 andthose downloaded earlier in 2017) and carryout consistency checks to ensure that the files con-tain the same set of attributes and disjoint sets ofloan IDs We then combine the files into two mas-ter dataframesmdashone for the data downloaded in2017 and one for the data downloaded in 2018All the data are ingested as strings to ensure Pan-

das does not incorrectly determine the type of anyattribute they will be typecast later We carry out analysis to verify which attributes

are static (ie are not updated over time) andwhich are dynamic We do this by comparingthe 2018 and 2017 versions of the files We typecast each of the attributes We calculate the return for each loan using the

methods described in the case study We visualize each variable and remove any obvi-

ous outliers We perform a more thorough analysis of the full

payment data to verify that the total_pymnt attri-bute includes the payments recorded in the recov-eries attribute

The Modeling NotebookThe second notebook provided carries out the model-ing steps in this case

The notebook contains an integer variable calleddefault_seed This is used as the seed to split the datainto training and test sets and wherever randomnessis required in the building of our models Each timethe notebook runs it logs the results of every operationto an output file (we describe this process in greater de-tail below) The various bootstrap estimates obtained inthis case study are a result of running this notebookwith many different seeds

The first part of our notebook performs two mainfunctions

We split our data set into a training set and a testset We create these training and test sets beforeeven beginning our study to ensure that they re-main identical for every model and technique weinvestigate so as not to introduce any additionalvariance across models Every modeling step wecarry out below will use the trainingtest split gen-erated here although they might be downsampledto ensure faster runtimes We create features from the data cleaned in the

ingestion_cleaning notebookmdashin particular wesplit categorical variables into dummy variables

The next and most substantial part of the notebookdefines four functions prepare_data this function creates downsampled

training and test sets Here we also perform a[01] normalization of all the attributes in thedata (using the MinMaxScaler method) It returnsa dictionary and every other function expects adictionary of this format as input

212 COHEN ET AL

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213

Page 23: Data-Driven Investment Strategies for Peer-to-Peer Lending ...pages.stern.nyu.edu/~mcohen2/BigDataP2P.pdf · ORIGINAL ARTICLE Data-Driven Investment Strategies for Peer-to-Peer Lending:

fit_classification this function accepts a down-sampled training and test set a sklearn classificationmodel object and a grid of parameters to try for themodel object It fits a model using that object on thetraining set uses cross-validation to pick the bestparameter in the grid-and evaluates it on the restof the data It then outputs a number of diagnosticsabout the fit It finally returns the fitted model

This function logs the performance of themodel as described above through the dump_to_output function In addition to the perfor-mance of the model it also logs whether the op-timal set of parameters was on the edge of theparameter grid for error-checking purposes

fit_regression same as above for regression modelsIt returns three fitted modelsmdashone on all loans oneon defaulted loans and one on nondefaulted loans

test_investments this function implements the in-vestment strategies defined in this case study It ex-pects a model returned by one of the two functionsabove

The remaining parts of the notebook test varioussklearn models using the functions above and thentest various investment strategies It also carries outthe various tests described in the case study in regardto stability over time learning rates and so on

The modeling_leakage NotebookThis notebook illustrates the result of fitting a modelusing all attributes including those updated over timeby LendingClub As mentioned in the body of thecase this results in a much highermdashbut misleadingmdashAUC (area under the ROC curve)

DATA-DRIVEN INVESTMENT STRATEGIES FOR PEER-TO-PEER LENDING 213