machine learning for fraud detection
Bigger Data. Better Results.™
Machine Learning for Fraud Detection
Nitesh Kumar, PhD [email protected]
Who Am I?
• Applied Math PhD
• Derivative/Options Pricing Background
• 7 years doing analytics
• Data Science at Skytree for 2 years
Skytree Inc.
• Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech
• Software Company that provides Machine Learning Software
• Built to function on top of Hadoop
• Automation, speed, and scalability
• User can interact through command line interface, APIs, and GUI
• 20 million dollars in series A
• TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan
What is Skytree?
• Machine Learning Platform
GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering, etc.
• Built for Big Data
Scales linearly with data size and compute nodes (MapReduce, Hadoop)
• Usability
SDK in Python, Java, REST, even GUI
Data preparation through Spark
• Automation
1-click modeling
• ML on Bigger Data produces Better Results
Larger datasets lead to higher accuracy
Outline
• Introduction
Why Skytree, Big Data, and Machine Learning for Fraud?
• Machine Learning in Financial Services
Issues, methods, and solution
• Live Demo of Skytree on real-world dataset (command line, API, GUI)
Time and setup permitting
Introduction
• Fraud is a Big problem (Big Data, Big Cost)
• Why is Machine Learning necessary?
• Comprehensive solution?
Fraud is a Big Data Problem
• “More than 23 billion credit card transactions are processed annually in USA” CreditCards.com
• Credit card transactions alone generate multiple terabytes of data a year
• Each transaction has 100-300 attributes
• Distributed data across multiple nodes
Fraud is a Big Cost Problem
• “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.” Forbes, 2014
• “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion” CreditCards.com
http://www.federalreserve.gov/releases/g19/Current/
Why Machine Learning?
• Traditional approaches of finding patterns through hand-crafted, careful querying do not scale to large datasets
• Prior rule-based engines do not make use of information from multiple attributes at the same time
• Machine Learning is concerned with algorithms that can learn from data
Multivariate Statistics
Automated predictive analytics
• Even a tiny increase in accuracy can lead to millions of dollars in savings
Gap between Machine Learning and Big Data
• Awakening to Big Data, experimenting with ML?
• ML is necessary to derive value out of Big Data
ML on Bigger Data produces Better Results
• Weak and Strong Law of Large Numbers
• “We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets.” Banko and Brill, Proceedings of ACL, 2001.
• “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.” Gérard Biau, JMLR, 2012
• Sometimes Big Data is all you need!
Experiment: ML on Bigger Data produces Better Results
• Source dataset: DNA dataset from Pascal Large Scale Learning Challenge.
• A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, and 5120M elements, arranged into 200 columns, were used. No featurization was applied.
• The optimal model for each training dataset size was found by tuning a Gradient Boosting Machine on a holdout dataset with Skytree smart-search.
• AUC (area under the ROC curve) was used for evaluation.
• Experiment by Skytree Inc., 2015
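As a rough illustration of the evaluation metric named above, AUC can be computed from model scores with the rank-sum (Mann-Whitney) identity. This is a minimal sketch in plain Python, not the tooling used in the experiment; the function name `auc` and the convention that label 1 means fraud are assumptions made for the example.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = positive class, assumed here to be fraud).
    scores: sequence of model scores; higher means more suspicious.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(pairs[k][1] for k in range(i, j))
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a ranking that places every positive example above every negative one.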
Bigger Data, Better Results on Real World Data
Dataset Size (elements)    AUC
20,000,000                 93.9%
40,000,000                 95.0%
80,000,000                 95.6%
160,000,000                96.2%
320,000,000                96.7%
640,000,000                97.2%
5,120,000,000              98.1%
Machine Learning Solution for Financial Services
Multiple algorithms for higher accuracy
• Gradient Boosting
• Random Decision Forest
• SVM
• Stacked models (combined models)
• Mixed models (combine supervised and unsupervised models)
Automatic Parameter Selection
• Automatically create the best-performing model for any algorithm in fewer iterations
• Allow for usage by domain experts (non-data scientists)
• Higher accuracy: a machine can tune better than humans
Speed and Scalability
• Big Data scale
• Catch latest trends in fraud
• Improve accuracy
• Iterate over multiple algorithms and parameters
• Faster model creation and model update
Visualization and Optimization
• Optimize directly for dollars
• Visualize model performance
• Provide knobs to choose a model
• Ensure optimality of models without overfitting
• Visualize models to interpret results
Machine Learning for Fraud Detection
• Countering Fraud is a Machine Learning Problem
• Challenges
• Solution (GBM and advanced)
Fraud Detection
• Counter complex and transient fraud patterns
• Analyze multiple and large datasets to discover and predict fraud
“More than 23 billion credit card transactions are processed annually in USA” CreditCards.com
Machine Learning Problem
Supervised Learning: Predict Fraud
• Collect historical transactions
• Learn from past examples of fraud
• Predict fraud (in real-time)
Unsupervised Learning: Discover Fraud
• Segment transactions
• Investigate potentially new fraud
• Detect outliers
Mixed Approach: Discover and Predict Fraud
• Detect “Points of Compromise” to prevent fraud
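To make the unsupervised "detect outliers" step concrete, here is a minimal sketch that flags unusual transaction amounts with a modified z-score based on the median and the median absolute deviation. It is one simple option among many, not the method used in the talk; the function name `mad_outliers`, the single-feature setup, and the cutoff of 3.5 are assumptions for illustration.

```python
import statistics


def mad_outliers(values, cutoff=3.5):
    """Return the indices of values whose modified z-score exceeds cutoff.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less distorted by extreme values than
    a mean/standard-deviation z-score.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > cutoff
    ]


# Hypothetical transaction amounts in dollars; the last one is unusual.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = mad_outliers(amounts)
```

Flagged transactions would then go to investigation, since an outlier is only a candidate for new fraud, not a confirmed case.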
Common Issues
• Imbalanced Datasets
Too few examples of ‘known’ fraud
• What to optimize?
Fraud capture rate
False positive rate: what is the cost associated?
Total loss incurred due to fraud
What loss function to use
• How to handle missing values?
• Which algorithm to use?
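One way to reason about the "what to optimize" question is to put a dollar value on each kind of error and choose the score threshold that minimizes the total. The sketch below assumes a deliberately simple cost model: a missed fraud costs the full transaction amount, and every flagged transaction costs a flat, hypothetical `review_cost`. Both the cost model and the function names are assumptions for illustration.

```python
def expected_loss(threshold, scores, is_fraud, amounts, review_cost):
    """Dollar loss if every transaction scoring >= threshold is flagged.

    Assumed cost model: a flagged transaction costs review_cost whether or
    not it is fraud; an unflagged fraud costs its full amount.
    """
    loss = 0.0
    for score, fraud, amount in zip(scores, is_fraud, amounts):
        if score >= threshold:
            loss += review_cost
        elif fraud:
            loss += amount
    return loss


def best_threshold(scores, is_fraud, amounts, review_cost):
    """Pick the candidate threshold with the lowest loss on labeled data."""
    # Every distinct score is a candidate; infinity means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: expected_loss(t, scores, is_fraud, amounts, review_cost),
    )
```

With highly imbalanced data, this kind of dollar-weighted view often picks a different operating point than plain accuracy would, because a single missed fraud can outweigh many false alarms.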
[Current] Industry Standard Solution
GBM algorithm (Friedman, 2001 and variants)
• Sequentially combines simple models, with each “new” model correcting the mistakes of the previous ones
• The base model in this case is a decision tree
• Inspired by gradient descent in optimization
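The sequential idea described above can be sketched in a few lines. This toy version uses squared-error loss, a single numeric feature, and one-split trees (stumps), so each new stump is fit to the residuals of the current ensemble, which are the negative gradient of the loss. It is an illustrative sketch only: a fraud model would normally use a classification loss and deeper trees, and all function names here are made up for the example.

```python
def fit_stump(xs, residuals):
    """Fit the least-squares one-split regression stump on a 1-D feature."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        split = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    if best is None:
        # All feature values are identical: predict the mean everywhere.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    split, left_value, right_value = stump
    return left_value if x <= split else right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as base learners."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

The small learning rate is what makes each stump a partial correction rather than a full one, which is one of the parameters that later needs tuning.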
GBM Pros
• Automatically handles missing values
• Highly accurate models
• Captures nonlinearity in the data
• Does not require deep understanding of the data
GBM Cons
• Does not handle datasets with high dimensions well
• Minimizes bias, not necessarily variance
• Chance of overfitting the training data when data is noisy
• Not the best at handling very high imbalance in the data
• Requires extensive parameter tuning
• Not simple to distribute
GBM: overcoming the odds
• Does not handle datasets with high dimensions well
  • SVMs handle datasets with high dimensionality
• Minimizes bias, not necessarily variance
  • Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
  • eGBM: Idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
  • sGBM: Each base learner (decision tree) uses different samples
  • Mixed Models
    • Combine Linear/Logistic models with GBM by blending/stacking
• High chance of overfitting the training data
  • Carefully check for generalization error
  • Restrict to simple base learners (shallow decision trees) etc.
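The two sampling ideas above differ mainly in where the resampling happens. The sketch below shows one plausible reading: a bootstrap sample (drawn with replacement) per model for an ensemble, and a fractional subsample (drawn without replacement, as in Friedman's stochastic gradient boosting) per base learner. It does not reproduce Skytree's eGBM; `fit_model` and `predict_model` are hypothetical callables standing in for any learner.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample per ensemble member)."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def subsample(rows, fraction, rng):
    """Draw a fraction of rows without replacement (one sample per tree)."""
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def fit_bagged_ensemble(fit_model, rows, n_models, seed=0):
    """Train n_models copies of a learner, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_model(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def predict_ensemble(models, predict_model, x):
    """Average the member predictions, which is what reduces variance."""
    return sum(predict_model(m, x) for m in models) / len(models)
```

Averaging over models trained on different samples targets the variance that a single boosted model does not necessarily minimize.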
GBM: overcoming the odds
• Not the best at handling very high imbalance in the data
  • Ensemble GBMs, stochastic GBMs, Random Forests etc.
• Requires extensive parameter tuning
  • Smart-Search (Skytree Inc., 2014)
  • Patent-pending technology
  • Optimization that iteratively learns from the previous iterations
  • Successively improves the space in which to search for the best solution
  • Faster way to obtain the optimal set of parameters
• Not simple to distribute
  • Bring in High Performance Computing (HPC) techniques for distribution
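The general idea of a search that narrows its space based on earlier iterations can be illustrated with a coarse-to-fine grid search over one parameter. This is only a generic sketch and makes no claim about how Skytree's Smart-Search works; the function name, the `shrink` factor, and the example objective are all assumptions.

```python
def coarse_to_fine_search(objective, low, high, n_rounds=4,
                          points_per_round=11, shrink=0.5):
    """Minimize objective(x) on [low, high] with a repeatedly narrowed grid.

    Each round evaluates an evenly spaced grid, then shrinks the interval
    around the best point found so far. This works best when the objective
    has a single broad minimum; it can miss narrow or multiple minima.
    """
    best_x, best_val = None, float("inf")
    for _ in range(n_rounds):
        step = (high - low) / (points_per_round - 1)
        for i in range(points_per_round):
            x = low + i * step
            val = objective(x)
            if val < best_val:
                best_x, best_val = x, val
        # Narrow the search space around the best point seen so far.
        half_width = (high - low) * shrink / 2.0
        low = max(low, best_x - half_width)
        high = min(high, best_x + half_width)
    return best_x, best_val


# Hypothetical validation-error curve with its minimum at 0.37.
best_x, best_val = coarse_to_fine_search(lambda x: (x - 0.37) ** 2, 0.0, 1.0)
```

In practice the objective would be a validation metric for a trained model, so each evaluation is expensive and spending them in the most promising region matters.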
Let's see how it works!
• Skytree Workspace
• Demo
• CLI
• Python SDK
• GUI
Unified Data Scientist Workspace