2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Data Ashwin Tumne

Upload: adam-muise

Post on 14-Jul-2015


TRANSCRIPT

Page 1: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Data

Ashwin Tumne

Page 6: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Node.js + RabbitMQ

Data Science Pipeline

1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition

Theory (Math, Algorithms)
Proof-of-Concept (R, Python, Scala, C++)
Spark Implementation (Scalability, Robustness)
Platform Integration

?

Page 8: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Transaction grade APIs + MQs

Data Lake
HBase, Cassandra, OceanBase, etc.

Real-Time Layer | Batch Processing Layer

Stream Processing
Batch Processing
Model Generator
Decision Engine

(context, event, data)
(event) (data)

Feature Selection
Model Training
Model Evaluation
Model Assembly

Data Science Pipeline

1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition

Theory (Math, Algorithms)
Proof-of-Concept (R, Python, Scala, C++)
Spark Implementation (Scalability, Robustness)
Platform Integration

DevOps !!!

Page 9: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Data Science for Fraud Detection

Armando Benitez - @jabenitez - @paytmlabs

Page 10: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

[email protected] - @jabenitez

Supervised learning vs Anomaly detection

Anomaly detection:
๏ Very small number of positive examples
๏ Large number of negative examples
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:
๏ Ideally large number of positive and negative examples
๏ Enough positive examples for algorithm to get a sense of what positive examples are like; future positive examples likely to be similar to ones in training set.

* Anomaly Detection - Andrew Ng - Coursera ML Course

Page 11: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

What approach to follow?

๏ Not so good: One model to rule them all
๏ Better:
  ๏ Many models competing against each other
  ๏ 100s or 1000s of rules running in parallel
  ๏ Know thy customer
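The idea of many rules running side by side can be sketched in a few lines. This is only an illustration: the three rules, their thresholds, and the transaction fields are hypothetical, and a thread pool stands in for whatever parallel engine is actually used (CPython threads do not speed up CPU-bound rules).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules: each takes a transaction dict and returns True if it looks suspicious.
def large_amount(txn):
    return txn["amount"] > 10_000

def new_account(txn):
    return txn["account_age_days"] < 2

def many_recent_txns(txn):
    return txn["txns_last_hour"] > 20

RULES = [large_amount, new_account, many_recent_txns]

def fired_rules(txn, rules=RULES, max_workers=4):
    """Evaluate every rule concurrently and return the names of the rules that fired."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `rules`, so results line up with rule names.
        results = list(pool.map(lambda rule: rule(txn), rules))
    return [rule.__name__ for rule, hit in zip(rules, results) if hit]

# Illustrative transaction (made-up values).
txn = {"amount": 25_000, "account_age_days": 1, "txns_last_hour": 3}
flags = fired_rules(txn)
```

Each rule stays small and independent, so rules can be added or retired without touching the others.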

Page 12: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Feature Selection

๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
  ๏ Apply transformation and variable combinations:
    ๏ $x_{n+1} = (x_1 + x_4)^2 / x_3$
  ๏ Focus on variable ratios and transaction velocity
  ๏ Use deep learning for feature extraction
  ๏ Dimensionality reduction
  ๏ your solution here
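A minimal sketch of the derived features listed above, assuming plain numeric inputs. The function names, the one-hour window, and the "amount relative to customer mean" ratio are illustrative choices, not taken from the slides; only the combination $(x_1 + x_4)^2 / x_3$ comes from the slide itself.

```python
def combined_feature(x1, x3, x4):
    """Variable combination from the slide: (x1 + x4)^2 / x3. Assumes x3 != 0."""
    return (x1 + x4) ** 2 / x3

def amount_ratio(amount, customer_mean_amount):
    """Hypothetical ratio feature: this amount relative to the customer's mean amount."""
    return amount / customer_mean_amount

def velocity(timestamps, now, window_seconds=3600):
    """Hypothetical velocity feature: number of transactions in the window ending at `now`."""
    return sum(1 for t in timestamps if 0 <= now - t <= window_seconds)

# Illustrative values only.
new_feature = combined_feature(1.0, 2.0, 3.0)
ratio = amount_ratio(500.0, 50.0)
recent = velocity([0, 1000, 3500, 4000], now=4000)
```

The aim of such features is to pull apart the normal and anomalous distributions when the raw variables overlap.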

Page 13: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Feature Selection

Page 14: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Feature Selection

[Plot: Counts vs. Variable X; legend: BKG, SIG]

Page 15: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

What we have tried

๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression

Combine

Page 16: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Gaussian distribution

Page 17: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Anomaly Detection* - Example

๏ Choose features, $x_i$, that are indicative of anomalous examples.
๏ Fit parameters $\mu_i, \sigma_i^2$ to a normal distribution
๏ Given new example, compute: $p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2)$
๏ Anomaly if $p(x) < \varepsilon$

* Anomaly Detection - Andrew Ng - Coursera ML Course
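The steps above can be sketched in plain Python, following the density-estimation approach of the cited course: fit a mean and variance per feature on normal examples, multiply the per-feature Gaussian densities to get p(x), and flag an example when p(x) falls below a threshold. The sample data and the threshold value are made up for illustration, and the sketch assumes every feature has non-zero variance.

```python
import math

def fit_gaussian(examples):
    """Per-feature mean and variance (maximum-likelihood estimate, dividing by m)."""
    m = len(examples)
    n = len(examples[0])
    mu = [sum(x[i] for x in examples) / m for i in range(n)]
    var = [sum((x[i] - mu[i]) ** 2 for x in examples) / m for i in range(n)]
    return mu, var

def density(x, mu, var):
    """p(x) as the product of independent per-feature Gaussian densities."""
    p = 1.0
    for x_i, mu_i, var_i in zip(x, mu, var):
        p *= math.exp(-((x_i - mu_i) ** 2) / (2 * var_i)) / math.sqrt(2 * math.pi * var_i)
    return p

def is_anomaly(x, mu, var, epsilon):
    """Flag x as anomalous when its density is below the threshold epsilon."""
    return density(x, mu, var) < epsilon

# Illustrative normal examples with two features (made-up values).
normal_examples = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.0, 13.0]]
mu, var = fit_gaussian(normal_examples)
EPSILON = 1e-3  # hypothetical threshold; in practice chosen on a cross validation set
typical_flag = is_anomaly([2.0, 11.0], mu, var, EPSILON)
outlier_flag = is_anomaly([10.0, 11.0], mu, var, EPSILON)
```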

Page 18: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Algorithm Evaluation

๏ Fit model on training set
๏ On a cross validation/test example, predict y (anomalous or not)
๏ Possible evaluation metrics:
  ๏ True positive, false positive, false negative, true negative
  ๏ Precision/Recall
  ๏ F1-score
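The metrics listed above are easy to compute by hand. Below is a small sketch, assuming labels are encoded as 1 for anomalous and 0 for normal; the label vectors are made up for illustration. With heavily skewed classes these metrics are more informative than plain accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means anomalous."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1; each is defined as 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels: 4 anomalies among 10 examples, 3 flagged by the model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```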

Page 19: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Implementation

Page 20: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Extra Slides

Page 21: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Anomaly Detection*

Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.

Training set: (assume normal examples/not anomalous)

Cross validation set:

Test set:

* Anomaly Detection - Andrew Ng - Coursera ML Course
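One way to set up these three sets is sketched below. The 60/20/20 split of normal examples and the even split of anomalies between cross validation and test are assumptions made for illustration (the set definitions did not survive in this transcript), as are the function names. The second function picks the density threshold on the cross validation set by maximizing F1, taking precomputed p(x) values as input.

```python
def split_dataset(normal, anomalous):
    """Training set: normal examples only. CV and test sets: labeled (example, y) pairs."""
    n_train = len(normal) * 6 // 10   # assumed 60% of normal examples
    n_cv = len(normal) * 2 // 10      # assumed 20% of normal examples
    half = len(anomalous) // 2        # anomalies shared between CV and test
    train = normal[:n_train]
    cv = [(x, 0) for x in normal[n_train:n_train + n_cv]] + [(x, 1) for x in anomalous[:half]]
    test = [(x, 0) for x in normal[n_train + n_cv:]] + [(x, 1) for x in anomalous[half:]]
    return train, cv, test

def f1_score(labels, preds):
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_epsilon(densities, labels, candidates):
    """Return (epsilon, f1) for the candidate threshold with the best F1 on labeled data."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        preds = [1 if p < eps else 0 for p in densities]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Illustrative data: integers stand in for feature vectors.
train, cv, test = split_dataset(list(range(100)), list(range(1000, 1010)))

# Made-up p(x) values for six cross validation examples and their labels.
cv_densities = [0.30, 0.25, 0.20, 0.02, 0.001, 0.0005]
cv_labels = [0, 0, 0, 0, 1, 1]
best_eps, best_f1 = select_epsilon(cv_densities, cv_labels, [0.0001, 0.01, 0.1])
```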

Page 22: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Transform, Normalize, Calculate

Page 23: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Scala

Page 24: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Creating Scalable Architecture

Futures

Page 25: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

The lake again

Lake Simcoe going on Lake Superior

Classic Lambda Architecture

Various Processing Frameworks

Near-Realtime Scoring/Alerting*

Page 26: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Fraud Capabilities and Technology

A. Batch Ingest and Analysis of transaction data from Database
   Technology: Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing

B. Batch Behavioural and Portfolio heuristic fraud detection
   Technology: Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage

C. Near-realtime anomaly and heuristic fraud detection
   Technology: Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup

D. Online Model Scoring
   Technology: JPMML/Spark Streaming for realtime model scoring
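Capability C (intercept events, look up rules, raise alerts) can be illustrated with a toy loop. This is only a stand-in under stated assumptions: an in-memory deque plays the role of the Kafka topic, a dict plays the role of the HBase rule store, and the rule names, thresholds, and event fields are hypothetical. It does not use any Kafka, Storm, Spark, or HBase API.

```python
from collections import deque

# Stand-in for the model/rule store (HBase in the slide); thresholds are made up.
RULE_STORE = {"max_amount": 10_000.0, "max_txns_per_hour": 20}

def score_event(event, rule_store=RULE_STORE):
    """Look up the rules and return the names of the checks this event violates."""
    alerts = []
    if event["amount"] > rule_store["max_amount"]:
        alerts.append("amount")
    if event["txns_last_hour"] > rule_store["max_txns_per_hour"]:
        alerts.append("velocity")
    return alerts

def process_stream(events, rule_store=RULE_STORE):
    """Consume events in arrival order and collect (event id, alerts) for flagged events."""
    queue = deque(events)  # stand-in for a message topic
    flagged = []
    while queue:
        event = queue.popleft()
        alerts = score_event(event, rule_store)
        if alerts:
            flagged.append((event["id"], alerts))
    return flagged

# Illustrative events (made-up values).
events = [
    {"id": 1, "amount": 50.0, "txns_last_hour": 2},
    {"id": 2, "amount": 20_000.0, "txns_last_hour": 1},
    {"id": 3, "amount": 10.0, "txns_last_hour": 40},
]
flagged = process_stream(events)
```

Keeping the rules in a lookup store rather than in code means thresholds can be updated by the batch layer without redeploying the scoring loop.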

Page 27: 2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Our framework shopping list

Explore & Train | Ingest, Store, Score, & Act

๏ iPython & Scala Notebooks
๏ Spark ::Core ::MLlib ::Streaming ::GraphX?
๏ Intercept with Storm? Spark Streaming?
๏ Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3
๏ OpenScoring? JPMML? R?