2015 feb 24_paytm_labs_intro_ashwin_armandoadam
TRANSCRIPT
Data
Ashwin Tumne
Node.js +
RabbitMQ
Data Science Pipeline
1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition
Theory (Math, Algorithms)
Proof-of-Concept (R, Python, Scala, C++)
Spark Implementation (Scalability, Robustness)
Platform Integration
?
Transaction grade
APIs + MQs
Data Lake
HBase, Cassandra, OceanBase, etc.
Stream Processing
Batch Processing
Model Generator
Decision Engine
(context, event, data)
(event) (data)
Feature Selection
Model Training
Model Evaluation
Model Assembly
Real-Time Layer
Batch Processing Layer
DevOps !!!
Data Science for Fraud Detection
Armando Benitez - @jabenitez - @paytmlabs
[email protected] - @jabenitez
Supervised learning vs Anomaly detection
Anomaly detection:
๏ Very small number of positive examples.
๏ Large number of negative examples.
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
๏ Ideally large number of positive and negative examples.
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.
* Anomaly Detection - Andrew Ng - Coursera ML Course
What approach to follow?
๏ Not so good: One model to rule them all
๏ Better:
๏ Many models competing against each other
๏ 100s or 1000s of rules running in parallel
๏ Know thy customer
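The "many rules running in parallel" idea can be sketched in a few lines. This is only an illustration, not the speakers' implementation: the transaction fields, rule names, and thresholds below are all hypothetical, and a thread pool stands in for whatever parallel runtime is actually used.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction record; field names are illustrative only.
txn = {"amount": 12000.0, "txns_last_hour": 14, "account_age_days": 2}

# Each rule is a small, independent predicate. All thresholds are made up.
def large_amount(t):
    return t["amount"] > 10000

def high_velocity(t):
    return t["txns_last_hour"] > 10

def new_account(t):
    return t["account_age_days"] < 7

def very_large_round_amount(t):
    return t["amount"] >= 50000 and t["amount"] % 1000 == 0

RULES = [large_amount, high_velocity, new_account, very_large_round_amount]

def evaluate(t, rules=RULES):
    """Run every rule concurrently and return the names of those that fired."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves the order of `rules`, so results line up by index.
        results = list(pool.map(lambda rule: rule(t), rules))
    return [rule.__name__ for rule, hit in zip(rules, results) if hit]

fired = evaluate(txn)
print(fired)
```

Because each rule only reads the transaction, rules can be added or retired independently, which is what makes running hundreds of them side by side practical.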
Feature Selection
๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
๏ Apply transformation and variable combinations:
๏ $x_{n+1} = (x_1 + x_4)^2 / x_3$
๏ Focus on variable ratios and transaction velocity
๏ Use deep learning for feature extraction
๏ Dimensionality reduction
๏ your solution here
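A minimal sketch of the derived features listed above: the variable combination from the slide, a ratio, and a transaction velocity. The function names, the one-hour window, and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def combined_feature(x1, x3, x4):
    """x_{n+1} = (x1 + x4)^2 / x3, the combination shown on the slide.
    Assumes x3 is non-zero."""
    return (x1 + x4) ** 2 / x3

def amount_ratio(amount, typical_amount):
    """Ratio of a transaction amount to the customer's typical amount.
    Assumes typical_amount is non-zero."""
    return amount / typical_amount

def velocity(timestamps, now, window=timedelta(hours=1)):
    """Number of transactions in the window ending at `now`."""
    return sum(1 for t in timestamps if now - window < t <= now)

# Hypothetical history: transactions 5, 20, 45, 90 and 300 minutes ago.
now = datetime(2015, 2, 24, 12, 0)
history = [now - timedelta(minutes=m) for m in (5, 20, 45, 90, 300)]

combo = combined_feature(x1=2.0, x3=4.0, x4=6.0)
ratio = amount_ratio(900.0, 300.0)
recent = velocity(history, now)
print(combo, ratio, recent)
```

Ratios and velocities tend to separate normal from anomalous behaviour better than raw amounts, because they are relative to what is usual for that customer.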
What we have tried
๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression
Combine
Anomaly Detection* - Example
๏ Choose features, $x_i$, that are indicative of anomalous examples.
๏ Fit parameters to a normal distribution
๏ Given new example $x$, compute:
$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
๏ Anomaly if $p(x) < \varepsilon$
* Anomaly Detection - Andrew Ng - Coursera ML Course
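The density-based procedure on this slide can be sketched as below, following the per-feature Gaussian model from the cited course. The training rows, the two features, and the threshold value are made up for illustration; the variance uses the 1/m estimate and is assumed to be non-zero for every feature.

```python
import math

def fit_gaussian(rows):
    """Estimate per-feature mean and variance from (assumed normal) training rows."""
    m = len(rows)
    n = len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) as a product of independent univariate Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x when its density falls below the threshold epsilon."""
    return density(x, mu, var) < epsilon

# Toy training data: two hypothetical features per transaction.
train = [[10.0, 1.0], [12.0, 1.2], [11.0, 0.9],
         [9.0, 1.1], [10.5, 1.0], [11.5, 0.8]]
mu, var = fit_gaussian(train)
epsilon = 1e-3  # made-up threshold; in practice tuned on labeled data

typical = is_anomaly([10.8, 1.0], mu, var, epsilon)
unusual = is_anomaly([30.0, 5.0], mu, var, epsilon)
print(typical, unusual)
```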
Algorithm Evaluation
๏ Fit model on training set
๏ On a cross validation/test example $x$, predict $y = 1$ (anomaly) if $p(x)$ is below the threshold $\varepsilon$, $y = 0$ otherwise
๏ Possible evaluation metrics:
๏ True positive, false positive, false negative, true negative
๏ Precision/Recall
๏ F1-score
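The metrics above can be computed directly from labels and predictions. A short sketch with made-up labels, where 1 marks an anomalous example:

```python
def confusion_counts(y_true, y_pred):
    """Return (true positive, false positive, false negative, true negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1, returning 0.0 where a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 anomalies among 10 examples.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(counts, precision, recall, f1)
```

With so few positive examples, plain accuracy says little: a model that never flags anything would already score 70% on this toy set, which is why precision, recall, and F1 are listed instead.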
Extra Slides
Anomaly Detection*
* Anomaly Detection - Andrew Ng - Coursera ML Course
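Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.
Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assume normal examples/not anomalous)
Cross validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$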
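One common use of the labeled cross validation set, as in the cited course, is to pick the density threshold (called epsilon here) that gives the best F1-score. A minimal sketch, assuming the densities p(x) have already been computed; the values and candidate thresholds are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def select_epsilon(p_cv, y_cv, candidates):
    """Return the (epsilon, F1) pair with the best F1 on the cross validation set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        # An example is predicted anomalous when its density is below eps.
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical densities for cross validation examples and their labels.
p_cv = [0.30, 0.25, 0.40, 0.02, 0.35, 0.001, 0.28, 0.0005]
y_cv = [0, 0, 0, 0, 0, 1, 0, 1]
candidates = [0.0001, 0.005, 0.05, 0.5]

best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
print(best_eps, best_f1)
```

The test set is kept aside and only used once, to report how the chosen threshold performs on data it was not tuned on.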
Creating Scalable Architecture
Futures
The lake again
Lake Simcoe going on Lake Superior
Classic Lambda Architecture
Various Processing Frameworks
Near-Realtime Scoring/Alerting*
Fraud Capabilities and Technology
A. Batch Ingest and Analysis of transaction data from Database
B. Batch Behavioural and Portfolio heuristic fraud detection
C. Near-realtime anomaly and heuristic fraud detection
D. Online Model Scoring
A. Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing
B. Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage
C. Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup
D. JPMML/Spark Streaming for realtime model scoring
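The split between batch model generation and near-realtime scoring can be illustrated without any of the frameworks named above. In this sketch a plain dict stands in for the HBase model/rule store and a `queue.Queue` stands in for a Kafka topic; the customers, amounts, and the mean-plus-three-standard-deviations rule are all assumptions made for illustration, not the speakers' design.

```python
import queue
import statistics

# Batch layer input: historical transaction amounts per customer (made up).
history = {
    "cust_a": [100.0, 120.0, 90.0, 110.0, 105.0],
    "cust_b": [2000.0, 2500.0, 1800.0, 2200.0],
}

def build_model_store(history, k=3.0):
    """Batch step: derive an upper amount limit per customer.
    The returned dict stands in for a model/rule lookup table."""
    store = {}
    for customer, amounts in history.items():
        store[customer] = statistics.mean(amounts) + k * statistics.pstdev(amounts)
    return store

def score_stream(events, store):
    """Near-realtime step: look up each event's limit and collect alerts."""
    alerts = []
    while not events.empty():
        event = events.get()
        limit = store.get(event["customer"])
        # Customers without a profile yet are skipped in this sketch.
        if limit is not None and event["amount"] > limit:
            alerts.append(event)
    return alerts

events = queue.Queue()  # stands in for a message topic
for e in [
    {"customer": "cust_a", "amount": 115.0},
    {"customer": "cust_a", "amount": 900.0},
    {"customer": "cust_b", "amount": 2400.0},
    {"customer": "cust_c", "amount": 50.0},
]:
    events.put(e)

store = build_model_store(history)
alerts = score_stream(events, store)
print(alerts)
```

The point of the split is that the expensive fitting runs offline on the full history, while the streaming side only performs a cheap lookup and comparison per event.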
Our framework shopping list
iPython & Scala Notebooks
Explore & Train
Ingest, Store, Score, & Act
Spark ::Core ::MLLib ::Streaming ::GraphX?
Intercept with Storm?
Spark Streaming?
Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3
OpenScoring?
JPMML? R?