How to Create 80% of a Big Data Pilot Project

Meet 80% of the Needs of a Pilot Project With a CC Fraud Detection Example
By Greg Makowski
ACM Data Science Camp, Saturday 10/24/2015
http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
http://kamanja.org/white-papers/
Download, Forums, Docs, Events: http://Kamanja.org
© 2015 ligaDATA, Inc. All Rights Reserved. October 2015

Upload: greg-makowski

Post on 15-Apr-2017



Page 2: How to Create 80% of a Big Data Pilot Project

Summary

Question) How to help an OSS pilot evaluation go faster?

Answer) Develop "design patterns" for applications
• Pick a specific app (Credit Card Fraud Detection)
• Get data (end up generating it)
• Need to vary the architecture configuration (like performance testing)

Given requirements, generate a multi-node example pilot system involving many OSS components.

PMML can abstract the production step from model building.

Page 3: How to Create 80% of a Big Data Pilot Project

Problem

When evaluating any new data mining or big data software, companies want to "try it out" and see how it meets their requirements.

A common step is a pilot project. A pilot would commonly involve integration with related software systems.

Open Source Software (OSS) may come with examples. What is needed is an example "production system".

Q) What can be done to shorten the time to finish a pilot?

Page 4: How to Create 80% of a Big Data Pilot Project

Problem: Questions to be answered from Pilot

How fast is it?
• It depends (yes, that is an annoying answer)

How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)

Page 5: How to Create 80% of a Big Data Pilot Project

Problem: Questions to be answered from Pilot

How fast is it?
• It depends (yes, that is an annoying answer)
• Show example configs with performance results

How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)
• Consider different application "design patterns"

How will the system grow as complexity grows?
• The answer is specific per design pattern

How should DevOps monitor and manage?

Page 6: How to Create 80% of a Big Data Pilot Project

Kamanja Platform

[Architecture diagram. Main blocks: Data Sources, Input Queues, Decision Engine (Kamanja), Storage (Data Store), Admin Management, Output Queues, Decisioning Actions. Other labels: CDC, Logs, Apps; Databases; ESBs; Social, 3rd Party; Batch Stores; Next Best Action; Application Updates; Alerts & Notifications.]

Page 7: How to Create 80% of a Big Data Pilot Project

See Kamanja.org, and github

Kamanja is used as an example. The process in this talk is general and can be broadly applied to other OSS.

Kamanja is a big data continuous decisioning system
• Apache license, available on github

Page 8: How to Create 80% of a Big Data Pilot Project

Application Design Pattern: Departmental Model Scoring Application

Scaling challenges
• transaction growth and type (quantity & speed)
• model complexity (hybrid systems)
• quantity of models: 10's to 10k's
• for most models, most fields, need to access the data store for preprocessing

[Diagram: Financial, Log, Consumer and Business data → Input queue → Model Scoring (PMML) → Real time Output Queue, with a Cache + Data Store (Preprocessing & Scores), Reporting and Analysis, and a Management and Control System. Lambda Architecture: combines real time and batch.]

Page 9: How to Create 80% of a Big Data Pilot Project

Application Design Pattern: Social Network Analysis

Scaling challenges
• transaction growth and source (quantity & speed)
• model: sentiment, graph
• quantity of models: a few
• data store lookup for base user info

[Diagram: Twitter, Facebook, ... → Input queue → Model Scoring (Java, Scala) → Real Time Charting, Alerting, with a Cache + Data Store (User baseline, Network), Trend Analysis and Deep Dive, and a Management and Control System.]

Page 10: How to Create 80% of a Big Data Pilot Project

Application Design Pattern: Text Mining, Search

Scaling challenges
• transaction growth
• some projects: very heavy computing for NLP parsing
• quickly score on tagged results

[Diagram: Pages, Documents, Posts, Tweets → Input queue → Model Scoring (Java, Stanford NLP) → Output Queue, with a Cache + Data Store (Parse trees, Inverted indexes), Trending topics, Update Thesaurus, Docs ↔ Topics, and a Management and Control System.]

Page 11: How to Create 80% of a Big Data Pilot Project

Details on Departmental Scoring: Credit Card Fraud Detection System

How to develop an example system?
• There is no public data. Private data won't be shared.
• Generate the data (then can also test BIG DATA)
• Focus on 5 use cases of "normal" and 5 "fraud"

Configuring architecture can be used for
1) Performance testing for different requirements
2) Pilot system, example included w/ Kamanja

Train models, generate PMML for scoring
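As a rough illustration of "generate the data", here is a minimal Python sketch that draws labeled transactions from a few behavior profiles. The profile names, amounts, rates and field names are assumptions made up for this example; they are not the generator shipped with Kamanja.

```python
import random
from dataclasses import dataclass


@dataclass
class Profile:
    name: str
    is_fraud: bool
    mean_amount: float        # assumed typical purchase size, in dollars
    card_present_rate: float  # assumed share of in-person purchases


# Hypothetical profiles; a fuller generator would cover all 5 normal and 5 fraud cases.
PROFILES = [
    Profile("steady_state", False, 40.0, 0.8),
    Profile("work_travel", False, 180.0, 0.6),
    Profile("card_drain", True, 400.0, 0.1),
]


def generate(n, fraud_rate=0.01, seed=42):
    """Return n synthetic transactions as dicts, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    normal = [p for p in PROFILES if not p.is_fraud]
    fraud = [p for p in PROFILES if p.is_fraud]
    rows = []
    for i in range(n):
        # Pick a fraud profile with probability fraud_rate, else a normal one.
        profile = rng.choice(fraud if rng.random() < fraud_rate else normal)
        rows.append({
            "txn_id": i,
            "profile": profile.name,
            "amount": round(rng.expovariate(1.0 / profile.mean_amount), 2),
            "card_present": rng.random() < profile.card_present_rate,
            "is_fraud": profile.is_fraud,
        })
    return rows
```

Raising `n` is what lets the same generator double as a big-data load source for performance testing.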

         

Page 12: How to Create 80% of a Big Data Pilot Project

Credit Card Fraud Detection System: FRAUD Use Cases

Fraudster extracting value out of hacked card
• Likely a first "test" of CC info: iTunes or unmanned gas pump w/o camera
• Drain account up to CC limit in 15 min, up to 2-3 days
• Purchase things "easy to cash out or resell" (launder money): gift cards, gems, jewelry, small electronics easy to sell, burner phones

F1) Elder abuse: either PII or CC info gets copied
• Fraudster opens first web or mobile account (surprising for grandmother)
• Higher credit limit, long time with no web/mobile
• Long time CC holder (high tenure), little spend variation

F2) Hacker bought PII (Personally Identifiable Information)
• Fraudster used PII to apply for a new account
• New account likely has a lower credit limit
• Over 1st month, slowly changes PII to fraudster's to not alert victim
• Use in "card not present" situations

Page 13: How to Create 80% of a Big Data Pilot Project

Credit Card Fraud Detection System: FRAUD Use Cases

F3) Physical clone
• Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
• Fraudster card use can be concurrent with normal consumer use, or in a very different place and time zone

F4) Rare Behavior (may be part of other use cases)
• Unusual time of day, geography, spending by type of goods / services

F5) Risky Behavior: fraudster may visit blacklisted web page
• Fraudster is engaging with
• Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
• Relate to past labeled cases of CC fraud.
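The "geography changes are not plausible" signal can be sketched as a speed check between two consecutive transactions on the same card. This is a minimal Python sketch under stated assumptions: timestamps are already in UTC, coordinates are approximate city centers, and the 1000 km/h ceiling is an assumed threshold, not a value from the talk.

```python
import math
from datetime import datetime, timezone

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed ceiling, roughly airliner cruise speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_implausible(txn_a, txn_b, max_kmh=MAX_PLAUSIBLE_KMH):
    """Each txn is (utc_datetime, lat, lon). True if the implied speed exceeds max_kmh."""
    seconds = abs((txn_b[0] - txn_a[0]).total_seconds())
    hours = max(seconds, 1.0) / 3600.0  # treat simultaneous events as 1 second apart
    return haversine_km(txn_a[1], txn_a[2], txn_b[1], txn_b[2]) / hours > max_kmh


# Approximate coordinates, one hour apart in UTC (illustrative values).
san_jose = (datetime(2015, 10, 24, 19, 0, tzinfo=timezone.utc), 37.3382, -121.8863)
hong_kong = (datetime(2015, 10, 24, 20, 0, tzinfo=timezone.utc), 22.3193, 114.1694)
san_francisco = (datetime(2015, 10, 24, 20, 0, tzinfo=timezone.utc), 37.7749, -122.4194)
```

Here `is_implausible(san_jose, hong_kong)` flags the pair, while a San Jose to San Francisco hop in the same hour does not.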

       

   

Page 14: How to Create 80% of a Big Data Pilot Project

Credit Card Fraud Detection System: NORMAL Use Cases

1) Steady State use: the CC use by these people is fairly consistent and stable. Can have a die vei

2) New Card, 1st month: this example is set up to make it difficult to compare with fraudulently opened new cards.
• Spending may max out

3) Young and starting singles or newly married.
• These people don't have much of a credit rating
• More likely to use web and mobile channels.
• More likely to wander to dangerous areas of the web.
• Likely to spend in a bigger array of categories
• Possibly many geographic locations

4) Normal Case, Family
• Medium to higher income limit, many don't hit limit
• Low to moderate showing up in new geographies, or spending on new categories

5) Work Travel: work in sales or consulting.
• New locations are no surprise.
• Higher spending limit and amounts, many flight, hotel, car rental, high mobile

Page 15: How to Create 80% of a Big Data Pilot Project

Pilot Project & Performance Testing: Credit Card Fraud Detection

[Diagram: the same pipeline (Input queue → Model Scoring → Real time Output) at increasing scale. Smallest configuration: 1 Kafka, 1 Kamanja, 1 Kafka. Larger configuration: ~3 Kafka, 16 Kamanja, ~3 Kafka. Final step: add preprocessing logic and HBase table lookup (Cache + Data Store, Preprocessing & Scores).]

Page 16: How to Create 80% of a Big Data Pilot Project

Performance Testing: Model Node, Credit Card Fraud Detection

Fields per record: testing network speed between nodes
• 30, 120, 480 fields (yes, could go 10k, 100k)

Single model complexity: testing compute load
• Small, Medium & Large (100, 2k, 32.5k elements)

Preprocessing lookup tables: testing cache to HBase & network
• none, some

Ensemble models per score: testing compute & network
• 1, 5, 20

Number of models in department: 1, 10, 100
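The five test factors above define a full grid of performance-test configurations. A minimal Python sketch of enumerating that grid follows; the dictionary keys are names chosen for this example, while the levels come from the slide.

```python
from itertools import product

# Factor levels from the slide; key names are illustrative.
FACTORS = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2_000, 32_500],
    "lookup_tables": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def build_grid(factors=FACTORS):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


grid = build_grid()
print(len(grid))  # 3 * 3 * 2 * 3 * 3 = 162 configurations
```

In practice a pilot would likely run only a subset of the 162 combinations, for example the corners of the grid plus the configuration closest to the expected production load.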

Page 17: How to Create 80% of a Big Data Pilot Project

Solution to Developer Questions (How Fast, How to Configure?)

Question                               Options              Code
How many fields per record?            30, 120, 480         (SML)
What model complexity?                 100, 2k, 32.5k       (SML)
Is data already preprocessed?          Yes, No              (YN)
Average models / ensemble?             1, 5, 20             (SML)
How many models in the department?     1, 10, 100           (SML)
What language?                         PMML, Java, Scala

(I want to create a table like…)

Requirements    Then need to configure         For speed (rec/s)
S,S,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            1.1mm
M,L,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            200K
L,L,N,L,L       3 Kaf, 16 Kam, 1 Kaf, 3HB      1.6mm

Generate Architecture and run an 80% relevant Pilot
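The requirements-to-configuration table can be read as a simple lookup. The Python sketch below encodes the three illustrative rows from the slide. It assumes "1.1mm" means 1.1 million records per second and that "Kaf, Kam, Kaf" means input Kafka nodes, Kamanja nodes, and output Kafka nodes; the key names are hypothetical.

```python
# Keys: (fields, complexity, preprocessed, ensemble size, models in department).
# Values: node counts and throughput, transcribed from the slide's example rows.
CONFIGS = {
    ("S", "S", "Y", "M", "S"): {
        "kafka_in": 1, "kamanja": 1, "kafka_out": 1, "hbase": 0,
        "records_per_sec": 1_100_000,  # assumes "1.1mm" = 1.1 million
    },
    ("M", "L", "Y", "M", "S"): {
        "kafka_in": 1, "kamanja": 1, "kafka_out": 1, "hbase": 0,
        "records_per_sec": 200_000,
    },
    ("L", "L", "N", "L", "L"): {
        "kafka_in": 3, "kamanja": 16, "kafka_out": 1, "hbase": 3,
        "records_per_sec": 1_600_000,  # assumes "1.6mm" = 1.6 million
    },
}


def lookup(fields, complexity, preprocessed, ensemble, department):
    """Return the example configuration for a requirement profile, or None if not listed."""
    return CONFIGS.get((fields, complexity, preprocessed, ensemble, department))
```

A profile that is not in the table returns `None`, which is the cue to run a new performance test rather than guess.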

Page 18: How to Create 80% of a Big Data Pilot Project

Social Network Analysis: Example System Configuration

[Diagram with numbered steps 1 to 13. Components: Text or Twitter API, Java 1 and GUI, Kafka, Kamanja, Java 2 for features (sentiment or Stanford NLP), Java 3 for analysis, Data Store (Matched_tags_per_text table, Alerts table), Tomcat. Flow labels, in apparent order:
• Java calls API, and Kafka producer
• Tweets returned in JSON
• JSON tweets sent to Kafka
• Kafka JSON to Kamanja
• JSON with features saved in DB
• Java: every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..)
• JSON returns the aggregate query results to Java
• JSON query results to Kafka
• Results to Java 3 for scoring, with thresholds
• JSON results of rule scoring, alert text
• Save results to DB
• Java 1: check for updates to the alerts table
• 13: Tomcat web service displays data and charts]
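The "time window" aggregation and threshold scoring steps in this flow can be sketched in a few lines. This is a minimal in-memory Python sketch, not the Java and database implementation the diagram describes; the event format, function names and threshold are assumptions for illustration.

```python
from collections import Counter


def window_counts(events, window_seconds):
    """Group (epoch_seconds, tag) events into fixed windows and count tags per window."""
    counts = {}
    for ts, tag in events:
        start = int(ts // window_seconds) * window_seconds
        counts.setdefault(start, Counter())[tag] += 1
    return counts


def alerts(counts_by_window, threshold):
    """Return sorted (window_start, tag, count) rows whose count reaches the threshold."""
    return sorted(
        (start, tag, n)
        for start, counter in counts_by_window.items()
        for tag, n in counter.items()
        if n >= threshold
    )


# Hypothetical tagged tweets: (seconds since start, matched tag).
events = [(0, "fraud"), (10, "fraud"), (59, "fraud"), (61, "fraud"), (70, "outage")]
print(alerts(window_counts(events, 60), threshold=3))
```

With a 60-second window, only the first window reaches three "fraud" tags, so it is the only row that would be written to an alerts table.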

Page 19: How to Create 80% of a Big Data Pilot Project

PMML Diagram: Predictive Model Markup Language

[Diagram with two project phases.
Training Project Phase: Training & test data (batch) → Data Mining Tool (PMML Producer, 18 available; File, Save As PMML) → PMML File (full model specification).
Production Scoring Project Phase: PMML File + Scoring data (real time streaming) → Scoring Engine (Kamanja, PMML Consumer) → Output data has new score field.]
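To make the producer/consumer split concrete, here is a toy Python sketch of the consumer side: it reads a small hand-written PMML regression model and scores one record. The model, field names and coefficients are invented for this example, and the function handles only a single linear `RegressionTable`; it is not how Kamanja or any real PMML consumer is implemented.

```python
import xml.etree.ElementTree as ET

# A hand-written, minimal PMML 4.2 regression model (illustrative values).
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="amount" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()

NS = {"p": "http://www.dmg.org/PMML-4_2"}


def score(pmml_text, record):
    """Score one record: intercept plus the sum of coefficient * field value."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        total += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return total


print(score(PMML_DOC, {"amount": 8.0}))  # 1.0 + 0.25 * 8.0 = 3.0
```

The point of the sketch is the separation: the scoring code never sees the training tool, only the model file.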

Page 20: How to Create 80% of a Big Data Pilot Project

Given industry fragmentation, PMML is a solution for Data Mining scoring

PMML Producers (18 data mining packages)
• R (Rattle, PMML)*
• RapidMiner
• KNIME*
• Spark (MLlib)*
• Weka*
• SAS Enterprise Miner

PMML Consumers (12 co)
• Zementis
• IBM SPSS
• KNIME
• Microstrategy
• SAS
• Kamanja* (Open Source)

* = Open Source

PREDICTIVE: Naïve Bayes, Neural Net, Regression, Rules, Scorecard, Sequence, SVM, Time Series, Trees

DESCRIPTIVE / OTHER: Association Rules; Cluster, K-Nearest Neighbor; Text Models; model ensembles & composition (i.e. Gradient Boosting)

Page 21: How to Create 80% of a Big Data Pilot Project

Summary

Question) How to help an OSS pilot evaluation go faster?

Answer) Develop "design patterns" for applications
• Pick a specific app (Credit Card Fraud Detection)
• Get data (end up generating it)
• Need to vary the architecture configuration (like performance testing)

Given requirements, generate a multi-node example pilot system involving many OSS components.

PMML can abstract the production step from model building.

Page 22: How to Create 80% of a Big Data Pilot Project

Try out Kamanja

Download, Forums, Docs, Events: http://Kamanja.org

http://kamanja.org/white-papers/

Page 23: How to Create 80% of a Big Data Pilot Project

Kamanja: One Speed Analysis

Kamanja: 220k to 230k messages / second

CONFIGURATION:
• 16 core box, using Solid State Disk
• Sample tool to generate messages of size 1k (not being reduced)
• Data mining uses 100's to 100k fields, not a 100 byte message
• Kafka Queue
  • 3 input queues, each queue has 8 partitions
• Kamanja Engine
  • Using the remaining 12-13 cores
  • Not saving score results per record in this test

SO WHAT? COMPARISON:
• Storm is currently the lowest latency Apache big data system
• Storm integration, got up to 90k to 100k for same data
• Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
• Spark Streaming works with mini-batches, with higher latency than Storm or Kamanja

Why is Kamanja faster than Storm? Storm reads the data from the input queue (spout) and passes it to bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead.
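The 2.4x figure comes from the midpoints of the two measured ranges. A short Python check of that arithmetic, which also shows how far the ratio could move across the reported ranges, might look like this (the variable names are chosen for this example):

```python
def midpoint(low, high):
    """Midpoint of a reported throughput range, in messages per second."""
    return (low + high) / 2


kamanja_range = (220_000, 230_000)  # messages/second, from the slide
storm_range = (90_000, 100_000)     # messages/second, from the slide

ratio = midpoint(*kamanja_range) / midpoint(*storm_range)  # 225k / 95k
lowest = kamanja_range[0] / storm_range[1]                  # 220k / 100k
highest = kamanja_range[1] / storm_range[0]                 # 230k / 90k

print(round(ratio, 1), round(lowest, 1), round(highest, 1))
```

So the speedup in this single test sits somewhere between about 2.2x and 2.6x, with 2.4x as the midpoint estimate.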