How to Create 80% of a Big Data Pilot Project
TRANSCRIPT
![Page 1: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/1.jpg)
© 2015 ligaDATA, Inc. All Rights Reserved. October 2015
Download, Forums, Docs, Events http://Kamanja.org
Meet 80% of the Needs of a Pilot Project With a CC Fraud Detection Example
By Greg Makowski
ACM Data Science Camp, Saturday 10/24/2015
http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
http://kamanja.org/white-papers/
![Page 2: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/2.jpg)
Summary
Question) How to help an OSS pilot evaluation go faster?
Answer) Develop "design patterns" for applications
• Pick a specific app (Credit Card Fraud Detection)
• Get data (end up generating it)
• Need to vary arch config (like performance testing)
• Given requirements, generate a multi-node example pilot system, involving many OSS components
• PMML can abstract the production step from model building
![Page 3: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/3.jpg)
Problem
When evaluating any new data mining or big data software, companies want to "try it out" and see how it meets their requirements.
• A common step is a pilot project.
• A pilot would commonly involve integration with related software systems.
• Open Source Software (OSS) may come with examples.
• Need an example "production system"
Q) What can be done to shorten the time to finish a Pilot?
![Page 4: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/4.jpg)
Problem: Questions to be answered from Pilot
How fast is it?
• It depends (yes, that is an annoying answer)
How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)
![Page 5: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/5.jpg)
Problem: Questions to be answered from Pilot
How fast is it?
• It depends (yes, that is an annoying answer)
• Show example configs with performance results
How to configure the system with other OSS software?
• It depends (yes, that is an annoying answer)
• Consider different application "design patterns"
How will the system grow as complexity grows?
• The answer is specific per design pattern
How should DevOps monitor and manage?
![Page 6: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/6.jpg)
Kamanja Platform
[Figure: Kamanja platform diagram. Labels: Data Sources (CDC, Logs, Apps; Databases; ESBs; Social, 3rd Party); Input Queues; Decision Engine (kamanja); Storage (Data Store, Batch Stores); Admin Management; Output Queues; Decisioning Actions (Next Best Action, Application Updates, Alerts & Notifications).]
![Page 7: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/7.jpg)
See Kamanja.org, and GitHub
Kamanja is used as an example. The process in this talk is general and can be broadly applied to other OSS.
Kamanja is a big data continuous decisioning system
Apache license, available on GitHub
![Page 8: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/8.jpg)
Application Design Pattern: Departmental Model Scoring Application
Scaling challenges
• transaction growth and type (quantity & speed)
• model complexity (hybrid systems)
• quantity of models: 10's to 10k's
• for most models, most fields, need to access the data store for preprocessing
[Figure: Departmental model scoring pipeline. Labels: data sources (Financial, Log, Consumer, Business); Input queue; Model Scoring (PMML); Real time Output Queue; Cache + Data Store (Preprocessing & Scores); Reporting, Analysis; Management and Control System. Callout: Lambda Architecture, combines real time and batch.]
![Page 9: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/9.jpg)
Application Design Pattern: Social Network Analysis
Scaling challenges
• transaction growth and source (quantity & speed)
• model: sentiment, graph
• quantity of models: a few
• data store lookup for base user info
[Figure: Social network analysis pipeline. Labels: data sources (Twitter, Facebook, ...); Input queue; Model Scoring (Java, Scala); Real Time Charting, Alerting; Cache + Data Store (User baseline, Network); Trend Analysis, Deep Dive; Management and Control System.]
![Page 10: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/10.jpg)
Application Design Pattern: Text Mining, Search
Scaling challenges
• transaction growth
• some projects: very heavy computing for NLP parsing
• quickly score on tagged results
[Figure: Text mining and search pipeline. Labels: data sources (Pages, Documents, Posts, Tweets); Input queue; Model Scoring (Java, Stanford NLP); Output Queue; Cache + Data Store (Parse trees, Inverted indexes); Trending topics, Update Thesaurus, Docs ↔ Topics; Management and Control System.]
![Page 11: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/11.jpg)
Details on Departmental Scoring: Credit Card Fraud Detection System
How to develop an example system?
• There is no public data. Private data won't be shared
• Generate the data (then can also test BIG DATA)
• Focus on 5 use cases of "normal" and 5 "fraud"
• Configuring architecture can be used for
  1) Performance testing for different requirements
  2) Pilot system, example included w/ Kamanja
• Train models, generate PMML for scoring
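The "generate the data" step can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used for the talk: the profile names, amounts, and rates are hypothetical placeholders, and a fuller generator would cover all ten use cases and many more fields per record.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One synthetic cardholder behavior. All numbers are made-up assumptions."""
    name: str
    is_fraud: bool
    mean_amount: float            # average purchase size in dollars
    card_not_present_rate: float  # share of web/mobile purchases


# Three illustrative profiles; the talk describes 5 normal and 5 fraud cases.
PROFILES = [
    Profile("steady_state", False, 40.0, 0.20),
    Profile("work_travel", False, 180.0, 0.40),
    Profile("hacked_card_drain", True, 400.0, 0.70),
]


def generate(profile: Profile, n: int, rng: random.Random) -> list[dict]:
    """Return n labeled transaction records drawn from one profile."""
    records = []
    for _ in range(n):
        records.append({
            "profile": profile.name,
            "label": int(profile.is_fraud),
            # Exponential amounts: many small purchases, a few large ones.
            "amount": round(rng.expovariate(1.0 / profile.mean_amount), 2),
            "card_not_present": rng.random() < profile.card_not_present_rate,
        })
    return records


rng = random.Random(42)  # fixed seed so a run is repeatable
dataset = [rec for p in PROFILES for rec in generate(p, 1000, rng)]
```

Because the record count is just a parameter, the same generator can produce a small labeled training set or a much larger volume for load testing.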
![Page 12: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/12.jpg)
Credit Card Fraud Detection System: FRAUD Use Cases
Fraudster extracting value out of hacked card
• Likely a first "test" of CC info: iTunes or unmanned gas pump w/o camera
• Drain account up to CC limit in 15 min, up to 2-3 days
• Purchase things "easy to cash out or resell" (launder money): gift cards, gems, jewelry, small electronics easy to sell, burner phones
F1) Elder abuse – either PII or CC info gets copied
• Fraudster opens first web or mobile account (surprising for grandmother)
• Higher credit limit, long time with no web/mobile
• Long time CC holder (high tenure), little spend variation
F2) Hacker bought PII (Personally Identifiable Information)
• Fraudster used PII to apply for a new account
• New account likely has a lower credit limit
• Over 1st month, slowly changes PII to fraudster's to not alert victim
• Use in "card not present" situations
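The "drain account up to CC limit in 15 min" pattern suggests a sliding-window spend check. The sketch below is one hypothetical way to express it; the class name, the 15 minute window default, and the 80% trigger fraction are assumptions for illustration, not rules taken from the talk.

```python
from collections import deque


class DrainDetector:
    """Flag a card when spend inside a sliding time window nears the limit."""

    def __init__(self, credit_limit: float, window_s: int = 15 * 60,
                 fraction: float = 0.8):
        self.credit_limit = credit_limit
        self.window_s = window_s    # assumed window: 15 minutes
        self.fraction = fraction    # assumed trigger: 80% of the limit
        self._window = deque()      # (timestamp_s, amount) pairs, oldest first
        self._total = 0.0

    def add(self, ts: float, amount: float) -> bool:
        """Record one purchase; return True if the window looks like a drain."""
        self._window.append((ts, amount))
        self._total += amount
        # Evict purchases that have fallen out of the window.
        while self._window and self._window[0][0] <= ts - self.window_s:
            _, old_amount = self._window.popleft()
            self._total -= old_amount
        return self._total >= self.fraction * self.credit_limit


detector = DrainDetector(credit_limit=1000.0)
flags = [detector.add(0, 300.0), detector.add(300, 300.0), detector.add(600, 300.0)]
```

In this example the third purchase brings the 15 minute total to 900 against a 1000 limit, so only the last call raises the flag.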
![Page 13: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/13.jpg)
Credit Card Fraud Detection System: FRAUD Use Cases
F3) Physical clone
• Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
• Fraudster card use can be concurrent with normal consumer use – or very different place and time zone
F4) Rare Behavior (may be part of other use cases)
• Unusual time of day, geography, spending by type of goods / services
F5) Risky Behavior – fraudster may visit blacklisted web page
• Fraudster is engaging with
• Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
• Relate to past labeled cases of CC fraud.
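The "geography changes are not plausible" check can be sketched as a travel-speed test between consecutive uses of one card. This is a minimal sketch under assumptions: both timestamps are taken to be UTC, the city coordinates are approximate, and the 1000 km/h ceiling (roughly airliner speed) is an illustrative threshold, not a value from the talk.

```python
from datetime import datetime, timezone
from math import asin, cos, radians, sin, sqrt

MAX_PLAUSIBLE_KMH = 1000.0  # assumption: roughly the speed of an airliner


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def is_implausible(prev, curr):
    """prev and curr are (utc_datetime, lat, lon) for consecutive card uses."""
    hours = (curr[0] - prev[0]).total_seconds() / 3600.0
    km = haversine_km(prev[1], prev[2], curr[1], curr[2])
    if hours <= 0:
        return km > 1.0  # same instant, different place
    return km / hours > MAX_PLAUSIBLE_KMH


utc = timezone.utc
san_jose = (datetime(2015, 10, 24, 12, 0, tzinfo=utc), 37.3382, -121.8863)
hong_kong = (datetime(2015, 10, 24, 13, 0, tzinfo=utc), 22.3193, 114.1694)
same_city = (datetime(2015, 10, 24, 13, 0, tzinfo=utc), 37.3382, -121.8863)

flag_far = is_implausible(san_jose, hong_kong)
flag_near = is_implausible(san_jose, same_city)
```

Concurrent use by the real cardholder and a cloned card (F3) would tend to trip the same check, since the two locations alternate faster than anyone could travel.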
![Page 14: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/14.jpg)
Credit Card Fraud Detection System: NORMAL Use Cases
1) Steady State use – the CC use by these people is fairly consistent and stable. Can have a die vei
2) New Card, 1st month – this example is set up to make it difficult to compare with fraudulently opened new cards.
• Spending may max out
3) Young and starting – singles or newly married.
• These people don't have much of a credit rating
• More likely to use web and mobile channels.
• More likely to wander to dangerous areas of the web.
• Likely to spend in a bigger array of categories
• Possibly many geographic locations
4) Normal Case, Family – medium to higher income limit, many don't hit limit
• Low to moderate showing up in new geographies, or spending on new categories.
5) Work Travel – work in sales or consulting. New locations are no surprise.
• Higher spending limit and amounts, many flight, hotel, car rental, high mobile
![Page 15: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/15.jpg)
Pilot Project & Performance Testing: Credit Card Fraud Detection
[Figure: three configurations of the Input queue, Model Scoring, Real time Output pipeline. Labels: "1 Kafka, 1 Kamanja, 1 Kafka" (single node each); "~3 Kafka, 16 Kamanja, ~3 Kafka" (scaled out, many Model Scoring nodes); "Add Preprocessing Logic and HBase table lookup" (Cache + Data Store holding Preprocessing & Scores).]
![Page 16: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/16.jpg)
Performance Testing – Model Node: Credit Card Fraud Detection
• Fields per record: testing network speed between nodes
  30, 120, 480 fields (yes, could go 10k, 100k)
• Single model complexity: testing compute load
  Small, Medium & Large (100, 2k, 32.5k elements)
• Preprocessing lookup tables: testing cache to HBase (HB) & network
  none, some
• Ensemble Models per score: testing compute & network
  1, 5, 20
• Number of Models in department: 1, 10, 100
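The five test dimensions above form a grid that can be enumerated mechanically. The sketch below lists every combination; the dictionary keys are hypothetical names chosen for illustration, while the level values come from the slide.

```python
from itertools import product

# Level values are from the slide; the key names are illustrative.
GRID = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2_000, 32_500],
    "lookup_tables": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def enumerate_configs(grid):
    """Return one dict per combination of levels across all dimensions."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


configs = enumerate_configs(GRID)
print(len(configs))  # 3 * 3 * 2 * 3 * 3 combinations
```

A full sweep is 162 runs per architecture, so in practice a subset (for example, varying one dimension at a time from a baseline) may be enough to see which dimension dominates.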
![Page 17: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/17.jpg)
Solution to Developer Questions (How Fast, How to Configure?)
• How many fields per record? 30, 120, 480 (SML)
• What model complexity? 100, 2k, 32.5k (SML)
• Is data already preprocessed? Yes, No (YN)
• Average models / ensemble? 1, 5, 20 (SML)
• How many models in the department? 1, 10, 100 (SML)
• What language? PMML, Java, Scala

(I want to create a table like…)

| Requirements | Then need to configure | For speed (rec/s) |
|---|---|---|
| S,S,Y,M,S | 1 Kaf, 1 Kam, 1 Kaf | 1.1mm |
| M,L,Y,M,S | 1 Kaf, 1 Kam, 1 Kaf | 200K |
| L,L,N,L,L | 3 Kaf, 16 Kam, 1 Kaf, 3HB | 1.6mm |

Generate Architecture and run an 80% relevant Pilot
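The requirements-to-configuration table could be held as a simple lookup. The sketch below encodes the three rows shown on the slide, which the slide itself presents as the kind of table the author wants to create rather than as finished measurements. The `Cluster` type and function names are hypothetical, and "mm" is read as millions.

```python
from typing import NamedTuple, Optional


class Cluster(NamedTuple):
    kafka_in: int
    kamanja: int
    kafka_out: int
    hbase: int
    rec_per_s: int


# Map raw requirement values to the S/M/L codes used on the slide.
LEVELS = {
    "fields": {30: "S", 120: "M", 480: "L"},
    "complexity": {100: "S", 2_000: "M", 32_500: "L"},
    "ensemble": {1: "S", 5: "M", 20: "L"},
    "models": {1: "S", 10: "M", 100: "L"},
}

# The three example rows from the slide.
TABLE = {
    ("S", "S", "Y", "M", "S"): Cluster(1, 1, 1, 0, 1_100_000),
    ("M", "L", "Y", "M", "S"): Cluster(1, 1, 1, 0, 200_000),
    ("L", "L", "N", "L", "L"): Cluster(3, 16, 1, 3, 1_600_000),
}


def suggest(fields: int, complexity: int, preprocessed: bool,
            ensemble: int, models: int) -> Optional[Cluster]:
    """Return the example cluster for a requirement set, or None if absent."""
    key = (
        LEVELS["fields"][fields],
        LEVELS["complexity"][complexity],
        "Y" if preprocessed else "N",
        LEVELS["ensemble"][ensemble],
        LEVELS["models"][models],
    )
    return TABLE.get(key)


small = suggest(30, 100, True, 5, 1)
large = suggest(480, 32_500, False, 20, 100)
```

Returning `None` for an unlisted combination makes the gaps in the table explicit, which is where a new performance test would be needed.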
![Page 18: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/18.jpg)
Social Network Analysis: Example System Configuration
[Figure: data flow diagram with numbered steps 1 to 13. Components: Text or Twitter API; Java 1 and GUI; Kafka; Kamanja; Java 2 for features (sentiment or Stanford NLP); Java 3 for analysis; Data Store (Matched_tags_per_text table, Alerts table); Tomcat web service.]
Flow labels, in reading order:
• Java calls API, and Kafka producer
• Tweets returned in JSON
• JSON tweets sent to Kafka
• Kafka JSON to Kamanja
• JSON with features saved in DB
• JAVA: every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..)
• JSON returns the aggregate query results to JAVA
• JSON query results to Kafka
• Kafka results to Java 3 for scoring, with thresholds
• JSON results of rule scoring, alert text
• Save results to DB
• JAVA 1: check for updates to the alerts table
• 13) Tomcat web service displays data and charts
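The "every time window, aggregate tag counts, then score with thresholds" portion of this flow can be sketched without any of the real components. The example below uses an in-memory list in place of the database and plain functions in place of the Java and Kamanja stages; the row layout, tag names, and thresholds are all assumptions for illustration.

```python
from collections import Counter


def aggregate_window(rows, window_start, window_end):
    """Count tags across rows whose timestamp falls in [start, end)."""
    counts = Counter()
    for row in rows:
        if window_start <= row["ts"] < window_end:
            counts.update(row["tags"])
    return counts


def score(counts, thresholds):
    """Return alert text for each tag whose count reaches its threshold."""
    return [
        f"ALERT: tag '{tag}' seen {n} times (threshold {thresholds[tag]})"
        for tag, n in sorted(counts.items())
        if tag in thresholds and n >= thresholds[tag]
    ]


# Stand-in for the matched-tags-per-text table (hypothetical rows).
rows = [
    {"ts": 5, "tags": ["outage", "bank"]},
    {"ts": 20, "tags": ["outage"]},
    {"ts": 45, "tags": ["outage", "refund"]},
    {"ts": 70, "tags": ["refund"]},  # falls outside the first window
]

counts = aggregate_window(rows, window_start=0, window_end=60)
alerts = score(counts, thresholds={"outage": 3, "refund": 2})
```

In the first 60 second window "outage" appears three times and reaches its threshold, while "refund" appears once and does not, so one alert is produced.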
![Page 19: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/19.jpg)
PMML Diagram: Predictive Model Markup Language
[Figure: PMML workflow in two phases.
Training Project Phase: Training & test data (batch) -> Data Mining Tool (PMML Producer, 18 available), "File, Save As PMML" -> PMML File (full model specification).
Production Scoring Project Phase: PMML File + Scoring data (real time streaming) -> Scoring Engine (Kamanja, PMML Consumer) -> Output data has new score field.]
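To make the producer/consumer split concrete, the sketch below embeds a tiny hand-written PMML regression model and scores one record from it using only the standard library. It is a toy consumer under assumptions, not how Kamanja or any listed product implements PMML: it handles only `NumericPredictor` terms in a single `RegressionTable`, and the field names and coefficients are invented.

```python
import xml.etree.ElementTree as ET

# Hand-written toy model; field names and coefficients are illustrative.
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy fraud score (illustrative only)"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="txns_last_hour" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="txns_last_hour"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="amount" coefficient="0.01"/>
      <NumericPredictor name="txns_last_hour" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()

NS = {"p": "http://www.dmg.org/PMML-4_2"}


def score_record(pmml_text: str, record: dict) -> float:
    """Evaluate intercept + sum(coefficient * field) from the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        total += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return total


result = score_record(PMML_DOC, {"amount": 200.0, "txns_last_hour": 4.0})
```

The scoring side never sees the training tool: retraining produces a new PMML file, and the consumer picks up the new intercept and coefficients without a code change.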
![Page 20: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/20.jpg)
Given industry fragmentation, PMML is a solution for Data Mining scoring
PMML Producers (18 data mining packages)
• R (Rattle, PMML)*
• RapidMiner
• KNIME*
PMML Consumers (12 co)
• Zementis
• IBM SPSS
• KNIME
• MicroStrategy
• SAS
• Kamanja* (Open Source)
• Spark (MLlib)*
• Weka*
• SAS Enterprise Miner
* = Open Source
PREDICTIVE: Naïve Bayes, Neural Net, Regression, Rules, Scorecard, Sequence, SVM, Time Series, Trees
DESCRIPTIVE / OTHER: Association Rules; Cluster, K-Nearest Nb; Text Models; model ensembles & composition (i.e. Gradient Boosting)
![Page 21: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/21.jpg)
Summary
Question) How to help an OSS pilot evaluation go faster?
Answer) Develop "design patterns" for applications
• Pick a specific app (Credit Card Fraud Detection)
• Get data (end up generating it)
• Need to vary arch config (like performance testing)
• Given requirements, generate a multi-node example pilot system, involving many OSS components
• PMML can abstract the production step from model building
![Page 22: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/22.jpg)
Try out Kamanja
Download, Forums, Docs, Events http://Kamanja.org
http://kamanja.org/white-papers/
![Page 23: How to Create 80% of a Big Data Pilot Project](https://reader031.vdocuments.us/reader031/viewer/2022022412/58f1bef41a28abbb098b4571/html5/thumbnails/23.jpg)
Kamanja: One Speed Analysis
Kamanja: 220k to 230k messages / second
CONFIGURATION:
• 16 core box, using Solid State Disk
• Sample Tool to generate messages of size 1k (not being reduced)
• Data Mining uses 100's to 100k fields – not 100 byte message
• Kafka Queue
• 3 input queues, each queue has 8 partitions
• Kamanja Engine
• Using the remaining 12-13 cores
• Not saving score results per record in this test
SO WHAT? COMPARISON:
• Storm is currently the lowest latency Apache big data system
• Storm integration, got up to 90k to 100k for same data
• Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
• Spark streaming is with mini-batches, with higher latency than Storm or Kamanja
Why is Kamanja faster than Storm?
Storm reads the data from the input queue (spout) and passes that to Bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead.
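The arithmetic behind the comparison can be reproduced in a few lines. The figures are the ones quoted on the slide; taking the midpoint of each range and treating "1k" messages as 1000 bytes are assumptions made here for the calculation.

```python
# Throughput ranges quoted on the slide, in messages per second.
kamanja_range = (220_000, 230_000)
storm_range = (90_000, 100_000)

# Assumption: use the midpoint of each range, as the slide's 225k/95k implies.
kamanja_mid = sum(kamanja_range) / 2
storm_mid = sum(storm_range) / 2
speedup = kamanja_mid / storm_mid

# Kafka layout from the slide: 3 input queues with 8 partitions each.
total_partitions = 3 * 8

# Assumption: "size 1k" means 1000 bytes per message.
megabytes_per_second = kamanja_mid * 1000 / 1_000_000

print(f"speedup: {speedup:.2f}x, partitions: {total_partitions}, "
      f"ingest: {megabytes_per_second:.0f} MB/s")
```

The ratio is about 2.37, which the slide rounds to 2.4, and it applies to this single-box test only.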