2015 feb 24_paytm_labs_intro_ashwin_armandoadam

Data

Ashwin Tumne

Node.js + RabbitMQ


Data Science Pipeline

1. Fraud Detection
2. Search
3. Recommendations
4. Notifications
5. Ratings
6. Merchant Intelligence
7. Engagement Optimization
8. Marketing Optimization
9. App Personalization
10. Ad Network Support
11. Image / Speech Recognition




Theory (Math, Algorithms)

Proof-of-Concept (R, Python, Scala, C++)

Spark Implementation (Scalability, Robustness)

Platform Integration









Transaction grade

APIs + MQs

Data Lake

HBase, Cassandra, OceanBase, etc.

Stream Processing

Batch Processing

Model Generator

Decision Engine

(context, event, data)

(event)(data)

Feature Selection

Model Training

Model Evaluation

Model Assembly

Real-Time Layer

Batch Processing Layer














DevOps !!!


Data Science for Fraud Detection

Armando Benitez - @jabenitez - @paytmlabs


Supervised learning vs Anomaly detection

Anomaly detection:

๏ Very small number of positive examples.
๏ Large number of negative examples.
๏ Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:

๏ Ideally large number of positive and negative examples.
๏ Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.

* Anomaly Detection - Andrew Ng - Coursera ML Course



What approach to follow?

๏ Not so good: One model to rule them all
๏ Better:
    ๏ Many models competing against each other
    ๏ 100s or 1000s of rules running in parallel
    ๏ Know thy customer
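The "rules running in parallel" idea can be sketched as below. This is only an illustration: the rule names, transaction fields, and thresholds are hypothetical and do not come from the talk, and a thread pool stands in for whatever execution engine is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Transaction = Dict[str, float]
Rule = Callable[[Transaction], bool]

# Hypothetical rules: field names and thresholds are invented for illustration.
RULES: Dict[str, Rule] = {
    "large_amount": lambda t: t["amount"] > 10_000,
    "new_account": lambda t: t["account_age_days"] < 1,
    "high_velocity": lambda t: t["txns_last_hour"] > 20,
}


def fired_rules(txn: Transaction, rules: Dict[str, Rule] = RULES) -> List[str]:
    """Evaluate every rule on one transaction concurrently and
    return the sorted names of the rules that fired."""
    names = list(rules)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(lambda name: rules[name](txn), names))
    return sorted(name for name, hit in zip(names, results) if hit)
```

Each rule is independent of the others, so adding a rule means adding one entry to the dictionary; how the fired rules are combined into a decision is left open here.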



Feature Selection

๏ Want p(x) large (small) for normal examples, p(x) small (large) for anomalous examples
๏ Most common problem: comparable distributions for both normal and anomalous examples
๏ Possible solutions:
    ๏ Apply transformation and variable combinations:
        ๏ x_{n+1} = (x_1 + x_4)^2 / x_3
    ๏ Focus on variable ratios and transaction velocity
    ๏ Use deep learning for feature extraction
    ๏ Dimensionality reduction
    ๏ your solution here
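A minimal sketch of the derived features mentioned on this slide. The combined feature follows the slide's example formula; the ratio and velocity features use hypothetical inputs (`avg_amount`, `txn_count`, `window_hours`) chosen only to illustrate the idea.

```python
from typing import Dict, Sequence


def combined_feature(x: Sequence[float]) -> float:
    """x_{n+1} = (x_1 + x_4)^2 / x_3, using the slide's 1-based indices."""
    x1, x3, x4 = x[0], x[2], x[3]
    if x3 == 0:
        raise ValueError("x_3 must be non-zero")
    return (x1 + x4) ** 2 / x3


def ratio_and_velocity(amount: float, avg_amount: float,
                       txn_count: int, window_hours: float) -> Dict[str, float]:
    """Hypothetical ratio and velocity features for one account."""
    return {
        # How large this transaction is relative to the account's usual size.
        "amount_ratio": amount / avg_amount,
        # How many transactions per hour were seen in the window.
        "txns_per_hour": txn_count / window_hours,
    }
```

The aim of such transformations is to pull apart normal and anomalous distributions that overlap in the raw variables.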



Feature Selection

[Figure: histogram of Counts vs. Variable X; legend: BKG, SIG]



What we have tried

๏ Density estimator
๏ 2D Profiles
๏ Anomaly detection
๏ Clustering
๏ Model ensemble (Random forest)
๏ Deep learning (RBM)
๏ Logistic Regression

Combine



Gaussian distribution
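For reference, the univariate Gaussian density and the usual maximum-likelihood parameter estimates are written out below. The notation (m training examples x^(1), ..., x^(m)) is an assumption borrowed from the Coursera course the deck cites, not something shown in this transcript.

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)},
\qquad
\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2
```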




Anomaly Detection* - Example

๏ Choose features, x_i, that are indicative of anomalous examples.
๏ Fit parameters to a normal distribution.
๏ Given a new example x, compute p(x).
๏ Anomaly if p(x) < ε, for a chosen threshold ε.
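The steps on this slide can be sketched in plain Python as follows. The sketch assumes the features are modelled as independent univariate Gaussians, so p(x) is a product over features, and that an example is flagged when p(x) falls below a threshold `epsilon`. Both assumptions follow the cited Coursera material rather than anything visible in this transcript.

```python
import math
from typing import List, Sequence, Tuple


def fit_gaussian(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate a per-feature mean and variance from (assumed normal) examples."""
    m, n = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / m for j in range(n)]
    return mu, var


def density(x: Sequence[float], mu: Sequence[float], var: Sequence[float]) -> float:
    """p(x) as a product of independent univariate Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p


def is_anomaly(x: Sequence[float], mu: Sequence[float],
               var: Sequence[float], epsilon: float) -> bool:
    """Flag the example when its density falls below the threshold."""
    return density(x, mu, var) < epsilon
```

A feature with zero variance in the training data would cause a division by zero here; a production version would need to guard against that.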




Algorithm Evaluation

๏ Fit model on training set
๏ On a cross validation/test example, predict y
๏ Possible evaluation metrics:
    ๏ True positive, false positive, false negative, true negative
    ๏ Precision/Recall
    ๏ F1-score
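The metrics listed above can be computed from labels and predictions as in this small sketch. It assumes y = 1 marks an anomalous example and y = 0 a normal one; the function name `evaluate` is made up for illustration.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix counts plus precision, recall and F1 (y = 1 is anomalous)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty denominators when nothing is predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```

With very few positive examples, plain accuracy is dominated by the negative class, which is why precision, recall and F1 are the more informative choices here.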



Implementation



Extra Slides


Anomaly Detection*

Assume we have some labeled data, of anomalous and non-anomalous examples: y = 0 if standard behaviour, y = 1 if anomalous.

Training set: (assume normal examples/not anomalous)

Cross validation set:

Test set:
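One way to build the three sets is sketched below. The 60/20/20 split of normal examples and the even split of anomalies between cross validation and test are assumptions, since the slide's own proportions did not survive extraction. The inputs are assumed to be shuffled already.

```python
from typing import List, Sequence, Tuple

Features = Sequence[float]
Labeled = Tuple[Features, int]  # (features, y) with y = 1 for anomalous


def split_sets(normal: Sequence[Features], anomalous: Sequence[Features]
               ) -> Tuple[List[Features], List[Labeled], List[Labeled]]:
    """Training set holds only normal examples; anomalies are reserved
    for the cross validation and test sets."""
    n_train = len(normal) * 6 // 10   # assumed 60% of normal examples
    n_cv = len(normal) * 2 // 10      # assumed 20% of normal examples
    half = len(anomalous) // 2
    train = list(normal[:n_train])
    cv = ([(x, 0) for x in normal[n_train:n_train + n_cv]]
          + [(x, 1) for x in anomalous[:half]])
    test = ([(x, 0) for x in normal[n_train + n_cv:]]
            + [(x, 1) for x in anomalous[half:]])
    return train, cv, test
```

Keeping anomalies out of the training set matches the slide's note that training examples are assumed normal.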



Transform, Normalize, Calculate




Scala



Creating Scalable Architecture

Futures


The lake again

Lake Simcoe going on Lake Superior

Classic Lambda Architecture

Various Processing Frameworks

Near-Realtime Scoring/Alerting*
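As a toy illustration of the lambda architecture named above, a query can merge a view precomputed by the batch layer with a view covering only events that arrived since the last batch run. The event shape and function names are hypothetical, and in-memory counters stand in for the real storage and processing frameworks.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, float]  # (account id, amount): a hypothetical event shape


def build_view(events: Iterable[Event]) -> Counter:
    """Total amount per account for a set of events."""
    view: Counter = Counter()
    for account, amount in events:
        view[account] += amount
    return view


def query(account: str, batch_view: Counter, realtime_view: Counter) -> float:
    """Serving-layer answer: batch result plus events not yet in the batch view."""
    return batch_view[account] + realtime_view[account]
```

The same aggregation runs in both layers; only the set of events it sees differs.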



Fraud Capabilities and Technology

Capabilities:

A. Batch Ingest and Analysis of transaction data from Database

B. Batch Behavioural and Portfolio heuristic fraud detection

C. Near-realtime anomaly and heuristic fraud detection

D. Online Model Scoring

Technology:

A. Traditional ETL tools for transfer, HDFS/S3 for storage, Spark for processing

B. Model analysis with iPython/Scala Notebook, Spark for processing, HDFS/HBase/Cassandra for storage

C. Kafka real-time ingest, introduce Storm/Spark Streaming for near-realtime interception of data, HBase for model/rule storage and lookup

D. JPMML/Spark Streaming for realtime model scoring
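A near-realtime heuristic of the kind listed under C might look like the sliding-window velocity check below. It is a sketch only: the class name, parameters, and in-memory state are assumptions, it expects timestamps to arrive in non-decreasing order per account, and it leaves out the Kafka, Storm/Spark Streaming and HBase pieces entirely.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityRule:
    """Flag an account that exceeds max_txns events within window_seconds."""

    def __init__(self, max_txns: int, window_seconds: float) -> None:
        self.max_txns = max_txns
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, account: str, timestamp: float) -> bool:
        """Record one event and return True if the account is now over the limit."""
        recent = self._seen[account]
        recent.append(timestamp)
        # Drop events that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_txns
```

In a streaming deployment the per-account state would live in the stream processor or an external store rather than in a Python dictionary.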



Our framework shopping list

Explore & Train

iPython & Scala Notebooks

Spark ::Core ::MLlib ::Streaming ::GraphX?

Ingest, Store, Score, & Act

Kafka, Hadoop, HBase, Cassandra, SolrCloud, & S3

Intercept with Storm? Spark Streaming?

OpenScoring? JPMML? R?


@jabenitez

Fin

