machine learning for fraud detection

Bigger Data. Better Results.™

Machine Learning for Fraud Detection

Nitesh Kumar, PhD [email protected]


Who Am I?

•  Applied Math PhD

•  Derivative/Options Pricing Background

•  7 years doing analytics

•  Data Science at Skytree for 2 years


Skytree Inc.

•  Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech

•  Software company that provides machine learning software

•  Built to function on top of Hadoop

•  Automation, speed, and scalability

•  Users can interact through a command-line interface, APIs, and a GUI

•  $20 million in Series A funding

•  TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan

What is Skytree?

•  Machine Learning Platform: GBM, K-means, RF, SVD/PCA, linear/logistic regression, SVM, collaborative filtering, etc.

•  Built for Big Data: scales linearly with data size and compute nodes (MapReduce, Hadoop)

•  Usability: SDKs in Python and Java, REST API, even a GUI; data preparation through Spark

•  Automation: 1-click modeling

•  ML on Bigger Data produces Better Results: larger datasets lead to higher accuracy


Outline

•  Introduction: Why Skytree, Big Data, and Machine Learning for Fraud?

•  Machine Learning in Financial Services: issues, methods, and solution

•  Live demo of Skytree on a real-world dataset (command line, API, GUI), time and setup permitting


Introduction

•  Fraud is a Big problem (Big Data, Big Cost)

•  Why is Machine Learning necessary?

•  Comprehensive solution?

Fraud is a Big Data Problem

•  “More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com)

•  Credit card transactions alone generate multiple terabytes of data a year

•  Each transaction has 100-300 attributes

•  Data is distributed across multiple nodes

Fraud is a Big Cost Problem

•  “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.” (Forbes, 2014)

•  “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion” (CreditCards.com, http://www.federalreserve.gov/releases/g19/Current/)

Why Machine Learning?

•  The traditional approach of finding patterns through careful, hand-crafted querying does not scale to large datasets

•  Prior rule-based engines do not make use of information from multiple attributes at the same time

•  Machine Learning is concerned with algorithms that can learn from data

    •  Multivariate statistics

    •  Automated predictive analytics

•  Even a tiny increase in accuracy can lead to millions of dollars in savings

Gap between Machine Learning and Big Data

•  Awakening to Big Data, experimenting with ML?

•  ML is necessary to derive value out of Big Data

ML on Bigger Data produces Better Results

•  Weak and Strong Laws of Large Numbers

•  “We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets.” (Banko and Brill, Proceedings of ACL, 2001)

•  “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.” (Gérard Biau, JMLR, 2012)

•  Sometimes Big Data is all you need!

Experiment: ML on Bigger Data produces Better Results

•  Source dataset: DNA dataset from the Pascal Large Scale Learning Challenge.

•  A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, and 5120M elements, arranged into 200 columns, were used. No featurization was applied.

•  The optimal model for each training dataset size was found by tuning a Gradient Boosting Machine on a holdout dataset with Skytree Smart-Search.

•  AUC (area under the ROC curve) was used for evaluation.

•  Experiment by Skytree Inc, 2015
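For readers unfamiliar with the evaluation metric, here is a minimal sketch of how AUC can be computed from labels and model scores using the rank-sum (Mann-Whitney) formulation. The function name `auc` and the toy labels and scores are illustrative assumptions, not part of the Skytree experiment.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels: list of 0/1 values (1 = positive class, e.g. fraud)
    scores: list of model scores, higher = more likely positive
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # Assign 1-based average ranks so that tied scores share the same rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data (made up): 3 legitimate and 3 fraudulent transactions.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(auc(labels, scores))
```

AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which is why it stays meaningful on heavily imbalanced data such as fraud.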

Bigger Data, Better Results on Real World Data

Dataset size (elements)    AUC
20,000,000                 93.9%
40,000,000                 95.0%
80,000,000                 95.6%
160,000,000                96.2%
320,000,000                96.7%
640,000,000                97.2%
5,120,000,000              98.1%
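The sizes in the table are element counts, so a short calculation helps put them in rows. The sketch below divides each size by the 200 columns stated in the experiment description and reports the AUC change between consecutive sizes. It uses only the numbers from the table; the variable names are illustrative.

```python
# (training elements, AUC in percent), copied from the table above
results = [
    (20_000_000, 93.9),
    (40_000_000, 95.0),
    (80_000_000, 95.6),
    (160_000_000, 96.2),
    (320_000_000, 96.7),
    (640_000_000, 97.2),
    (5_120_000_000, 98.1),
]
COLUMNS = 200  # stated in the experiment description

# Rows per training set = elements / columns.
rows = [elements // COLUMNS for elements, _ in results]

# AUC change (percentage points) from each size to the next.
gains = [round(b[1] - a[1], 1) for a, b in zip(results, results[1:])]

for (elements, auc_pct), n_rows in zip(results, rows):
    print(f"{elements:>13,} elements = {n_rows:>10,} rows, AUC {auc_pct:.1f}%")
print("AUC gains between consecutive sizes:", gains)
```

The smallest training set is therefore 100,000 rows and the largest 25.6 million. Note that the last step is an 8x increase in data rather than a doubling, so its 0.9-point gain is not directly comparable to the earlier steps.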

Machine Learning Solution for Financial Services

Multiple algorithms for higher accuracy

•  Gradient Boosting

•  Random Decision Forest

•  SVM

•  Stacked models (combined models)

•  Mixed models (combine supervised and unsupervised models)

Automatic Parameter Selection

•  Automatically create the best-performing model for any algorithm in fewer iterations

•  Allow for usage by domain experts (non-data scientists)

•  Higher accuracy: a machine can tune better than humans

Speed and Scalability

•  Big Data scale

•  Catch latest trends in fraud

•  Improve accuracy

•  Iterate over multiple algorithms and parameters

•  Faster model creation and model update

Visualization and Optimization

•  Optimize directly for dollars

•  Visualize model performance

•  Provide knobs to choose a model

•  Ensure optimality of models without overfitting

•  Visualize models to interpret results


Machine Learning for Fraud Detection

•  Countering Fraud is a Machine Learning Problem

•  Challenges

•  Solution (GBM and advanced)

Fraud Detection

•  Counter complex and transient fraud patterns

•  Analyze multiple and large datasets to discover and predict fraud

“More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com)

Machine Learning Problem

Supervised Learning: Predict Fraud

•  Collect historical transactions

•  Learn from past examples of fraud

•  Predict fraud (in real time)

Unsupervised Learning: Discover Fraud

•  Segment transactions

•  Investigate potentially new fraud

•  Detect outliers

Mixed Approach: Discover and Predict Fraud

•  Detect “Points of Compromise” to prevent fraud
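As one illustration of the unsupervised "detect outliers" step, here is a minimal sketch that flags unusual transaction amounts with a modified z-score based on the median and the median absolute deviation (MAD). This is a generic textbook technique chosen for illustration, not the method used by Skytree; the function name, the 3.5 cutoff, and the amounts are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, which is far less
    sensitive to the outliers themselves than a mean/standard-deviation score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical: no usable scale estimate.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical transaction amounts in dollars for one card.
amounts = [12.5, 40.0, 23.1, 18.9, 35.2, 27.4, 4200.0, 31.0, 15.8, 22.3]
print(mad_outliers(amounts))  # index of the 4200.0 transaction
```

Flagged transactions are candidates for investigation rather than confirmed fraud; confirmed cases can then be fed back as labels for the supervised model.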

Common Issues

•  Imbalanced datasets: too few examples of ‘known’ fraud

•  What to optimize?

    •  Fraud capture rate

    •  False positive rate: what is the cost associated?

    •  Total loss incurred due to fraud

    •  What loss function to use

•  How to handle missing values?

•  Which algorithm to use?
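The "what to optimize?" question can be made concrete by pricing each kind of error in dollars and picking the score threshold with the lowest total cost. The sketch below assumes a simplified cost model that is not from the slides: a missed fraud costs the full transaction amount, and every flagged transaction costs a fixed review fee. All names and numbers are hypothetical.

```python
def total_cost(labels, scores, amounts, threshold, review_cost):
    """Dollar cost of flagging every transaction with score >= threshold.

    Assumed cost model: each flagged transaction costs a fixed review fee,
    and each fraudulent transaction that is NOT flagged costs its amount.
    """
    cost = 0.0
    for label, score, amount in zip(labels, scores, amounts):
        if score >= threshold:
            cost += review_cost
        elif label == 1:
            cost += amount
    return cost


def best_threshold(labels, scores, amounts, review_cost):
    """Try every distinct score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(labels, scores, amounts, t, review_cost),
    )


# Hypothetical scored transactions: 1 = fraud, 0 = legitimate.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.05, 0.10, 0.30, 0.40, 0.65, 0.70, 0.20, 0.90]
amounts = [20, 35, 15, 60, 500, 80, 25, 1200]
review_cost = 25.0

t = best_threshold(labels, scores, amounts, review_cost)
print(t, total_cost(labels, scores, amounts, t, review_cost))
```

On this toy data the cheapest threshold lets a $15 fraud through, because reviewing it together with the legitimate transactions scored above it would cost more than the loss. A pure capture-rate objective would not make that trade-off.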

[Current] Industry Standard Solution

GBM algorithm (Friedman, 2001 and variants)

•  Sequentially combines simple models, with each “new” model correcting the mistakes of the previous ones

•  The base model in this case is a decision tree

•  Inspired by gradient descent in optimization
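The three bullets above can be sketched in a few lines. The following is a minimal, unoptimized illustration of gradient boosting for squared loss with depth-1 decision trees (stumps) as the base model. It is not Skytree's implementation or a production algorithm; the function names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # every row identical: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    f, thr, left_value, right_value = stump
    return left_value if row[f] <= thr else right_value


def fit_gbm(X, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)  # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual,
        # so each new stump is fit to the mistakes of the current ensemble.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def gbm_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Toy data: the label switches from 0 to 1 when the feature exceeds 4.
X = [[x] for x in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = fit_gbm(X, y)
print(round(gbm_predict(model, [2]), 3), round(gbm_predict(model, [7]), 3))
```

The learning rate shrinks each stump's contribution, which is the step size in the gradient-descent analogy: more rounds with a smaller rate usually generalize better than a few large steps.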

GBM Pros

•  Automatically handles missing values

•  Highly accurate models

•  Captures nonlinearity in the data

•  Does not require deep understanding of the data

GBM Cons

•  Does not handle datasets with high dimensions well

•  Minimizes bias, not necessarily variance

•  Chance of overfitting the training data when data is noisy

•  Not the best at handling very high imbalance in the data

•  Requires extensive parameter tuning

•  Not simple to distribute

GBM: overcoming the odds

•  Does not handle datasets with high dimensions well
    •  SVMs handle datasets with high dimensionality

•  Minimizes bias, not necessarily variance
    •  Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
        •  eGBM: the idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
        •  sGBM: each base learner (decision tree) uses different samples
    •  Mixed models
        •  Combine linear/logistic models with GBM by blending/stacking

•  High chance of overfitting the training data
    •  Carefully check for generalization error
    •  Restrict to simple base learners (shallow decision trees), etc.
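The two sampling ideas mentioned above differ in one detail: a bootstrap sample draws rows with replacement, while stochastic boosting typically draws a fraction of rows without replacement for each base learner. The sketch below shows both, plus averaging an ensemble built on bootstrap samples. It is a generic bagging illustration, not Skytree's eGBM; the trivial mean-predicting base model is a stand-in for a real GBM, and all names are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows WITH replacement (bagging-style resampling)."""
    return [rng.choice(rows) for _ in rows]


def subsample(rows, fraction, rng):
    """Draw a fraction of rows WITHOUT replacement (stochastic boosting style)."""
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def fit_mean_model(rows):
    """Stand-in base model: predicts the mean label of its training sample."""
    mean = sum(label for _, label in rows) / len(rows)
    return lambda features: mean


def fit_ensemble(rows, n_models, fit, rng):
    """Fit one model per bootstrap sample and average their predictions."""
    models = [fit(bootstrap_sample(rows, rng)) for _ in range(n_models)]
    return lambda features: sum(m(features) for m in models) / len(models)


# Hypothetical (features, label) rows with roughly 10% fraud.
rows = [((i,), 1 if i % 10 == 0 else 0) for i in range(100)]
rng = random.Random(0)

ensemble = fit_ensemble(rows, n_models=25, fit=fit_mean_model, rng=rng)
print(ensemble((42,)))
```

Averaging models trained on different resamples reduces variance, which is the weakness of GBM that this slide targets. In practice `fit` would train a full GBM on each sample.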

GBM: overcoming the odds

•  Not the best at handling very high imbalance in the data
    •  Ensemble GBMs, stochastic GBMs, Random Forests, etc.

•  Requires extensive parameter tuning
    •  Smart-Search (Skytree Inc., 2014)
        •  Patent-pending technology
        •  Optimization that iteratively learns from the previous iterations
        •  Successively improves the space in which to search for the best solution
        •  Faster way to obtain the optimal set of parameters

•  Not simple to distribute
    •  Bring in High Performance Computing (HPC) techniques for distribution
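The slides do not describe how Smart-Search works internally, so the following is only a generic illustration of the stated idea of learning from previous iterations and successively narrowing the search space. It is a simple shrinking-interval random search over a single parameter; the function name, defaults, and the made-up objective are all assumptions.

```python
import random


def narrowing_search(objective, low, high, rounds=5, samples_per_round=20,
                     shrink=0.5, seed=0):
    """Minimize objective(x) on [low, high] by random search in a shrinking interval."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples_per_round):
            x = rng.uniform(low, high)
            val = objective(x)
            if val < best_val:
                best_x, best_val = x, val
        # Re-centre the interval on the best point so far and shrink it,
        # never stepping outside the current bounds.
        half_width = (high - low) * shrink / 2
        low, high = max(low, best_x - half_width), min(high, best_x + half_width)
    return best_x, best_val


def validation_loss(learning_rate):
    """Made-up stand-in for a holdout loss that is lowest at 0.1."""
    return (learning_rate - 0.1) ** 2


best_rate, best_loss = narrowing_search(validation_loss, 0.0, 1.0)
print(best_rate, best_loss)
```

In a real tuning run each call to the objective would train and evaluate a model, so spending later evaluations near the best region found so far is what saves iterations compared with a uniform grid.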


Let's see how it works!

•  Skytree Workspace

•  Demo

•  CLI

•  Python SDK

•  GUI

Unified Data Scientist Workspace
