local terabytes, remote terabytes, and distributed ... · data mining architecture data updates 1....

43
Local Terabytes, Remote Terabytes, and Distributed Terabytes: Four Case Studies in Data Mining Robert Grossman National Center for Data Mining University of Illinois at Chicago and Open Data Partners

Upload: others

Post on 21-Mar-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Local Terabytes, Remote Terabytes,and Distributed Terabytes:

Four Case Studies in Data Mining

Robert GrossmanNational Center for Data MiningUniversity of Illinois at Chicago

andOpen Data Partners

Page 2: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Introduction: What is datamining?

Page 3: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

What is Data Mining?Short definition:Finding interesting structure in data.

(Interesting implies actionable.)

Long definition: Semi-automatic discovery of patterns, correlations,

changes, associations, anomalies, and otherstatistically significant structures in large data sets.

Page 4: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Why Bother: The Data Gap

Number new Ph.D.s

Amount new disk

The goal of datamining is to close

this gap.

Page 5: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Two Cultures: Data Science vsDecision Support

ROI, improvement overcurrent decision orbusiness process

results publishedEvaluation

data access & cleaning;implementation

data analysisChallenge

lift of model (measurede.g. by ROC)

hypothesis testingMethodology

analyzed until resultsare due

collected & cleaneduntil conclusionsproven

Data

Take an actionGain understandingGoal

Page 6: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Where is the Data?

Units

Thousands

Organization Collaboration Community

Millions

User Base

NumberResources

Web-basedData

RelationalDatabases

Grid-basedDatabases

Centralized Control

Time

Rigidity ofStructure Device-

based Data

Billions

Consumers

Page 7: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Case Study 1:Payments Card Fraud System

Account

Merchant

Issuing Bank

Merchant Bank

Working with your homogeneous terabytes.

PaymentsProcessor

Page 8: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

ChallengesTechnical

– Develop algorithms that scale to out of memorydata.

– Develop algorithms that scale to highdimensional data.

Practical– Develop algorithms that can be quickly

deployed into operational systems.

Page 9: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Classification Trees

Want a function y = f(x), which predicts the redvariable Y using one or more of the blue variables x =(Vel 1, Vel 2, MCC, Ind 1)

Assume each row is classified 0 or 1

Vel 1 Vel 2 MCC Ind 1 Fraud

02 14 33 0 0

24 56 31 0 0

23 51 31 1 1

13 45 28 0 0

Page 10: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Trees Partition Feature Space

Trees partition the feature space into regions by askingwhether an attribute is less than a threshold.

Vel 2

Velocity 1

49.5

7 17.5

Vel 2 > 7

Vel 2 > 17.5?

No Fraud

B

No Fraud

Fraud

No Fraud

Fraud No Fraud

Vel 1 > 49.5?

Page 11: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Key Idea: Combine Weak Learners

It is often better to build several models, and then toaverage them, rather than build one complex model.

Work in the algebra generated by the multipleclassifiers f1(x), f2(x), f3(x), etc.

Model 1

Model 2

Model 3

Page 12: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Combining Weak Learners

1 1 1 1 2 1 1 3 3 1 1 4 6 4 1 1 5 10 10 5 1

76.50%71.00%65%68.20%64.0%60%59.30%57.40%55%

5 Classifiers3 Classifiers 1 Classifier

Page 13: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Building Models over Clusters

Scatter data Build models (e.g.

tree-based model) Gather models

into ensemble(e.g. majority votefor classification& averaging forregression)

data models

Page 14: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Lessons Learned Use tree based classifiers to deal with large number

of attributes. Use ensembles of trees to deal with large amount of

data. Used ensembles with 80+ trees. Implementedusing clusters.

Use column-wise warehouses to speed up statisticaloperations on large data ets.

Big win from using standards-based scoring enginesto deploy models in 24x7x365 systems so that nocustom code is required when updating analytics

Reduced deployment time from months to weeksweek. Important for problems like fraud in whichtarget responds and adapts

Page 15: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Case Study 2: Highway Traffic Data Is the traffic speed

and volume today(Tuesday, Nov. 15,3 pm, conventionevent, no rain)different than thebaseline?

If so, send an alertto a PDA.

• 833 road sensors• weather data (images, xml)• text data about special events

Working with your heterogeneous terabytes.

Page 16: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Challenges Technical

– High volume, complex,multi-modal, distributedstreaming data

– Data highly heterogeneous Pragmatic

– Real time alerts to PDA– Effectively providing

awareness of changes

Page 17: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Change Detection Algorithms

Sequence of events x[1], x[2], x[3], … Question: is the observed distribution different than the

baseline distribution? Used CUSUM & Generalized Likelihood Ratio (GLR) tests

ObservedModel

BaselineModel

β

Page 18: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Key Idea 1: Build 104+ Models

1. Divide & conquer data(segment) usingmultidimensional datacubes

2. For each distinct cube,estimate parameters forseparate statisticalmodel

3. Detect changes frombaselines and sendalerts in real time

Time

Entity(specialevent, etc.)

Geospatialregion

Change Detection using Cubes of Models (CDCM)- separate

baselines for each cell

Page 19: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Greedy Meaningful/ManageableBalancing (GMMB) Algorithm

• Fewer alerts

• Alerts more manageable

•To decrease alerts, remove breakpoint,order by number

of decreased alerts, & select one or more breakpoints to remove

• More alerts

• Alerts more meaningful

• To increase alerts, add breakpoint to split cubes,

order by number of new alerts, &

select one or more new breakpoints

One model for each cell in data cube

Breakpoint

Page 20: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Key Idea 2: Event BasedData Mining Architecture

dataupdates

1. data collection

Sensors, news feeds, weather data, etc.

3. on-line deployment

Alert Management Sys.

alerts

states

events

PMML models

State Database

2. off-line modeling

Data Mining Warehouse

learning sets Data Mining System

Page 21: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Lesson Learned Change Detection using Cubes of Models

(CDCM) is an effective methodology for detectingchanges in highly heterogeneous data

The Greedy Meaningful/Manageable Balancing(GMMB) Algorithm is critical to building afunctional system

An architecture based upon Predictive ModelMarkup Language (PMML), specifically PMML-producers and PMML-consumers and a few basicsegmentation techniques can effectively managethousands to millions of individual statisticalmodels

Page 22: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Case Study 3 - IntegratingStreaming Data

Working with your friends’ terabytes….

Page 23: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Finding Candidate Brown Dwarfs Sloan Digital Sky Survey (SDSS)

– 82 million stars– Visible spectrum

Two Micro All Sky Survey (2MASS)– 208 million stars– Infrared spectrum

Two separate locations - Query at SC 05 in Seattle– SDSS in Tokyo & 2MASS in Chicago

Found 289,283 Candidate Brown dwarfs– Common index structure for each cell in sky (metadata)– Object in both locations; infrared value is 2 degree brighter

Page 24: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Challenges

Technical - Streaming joins not wellunderstood

Practical - Accessing distributed terabytesof data over high bandwidth delay productnetworks is still a problem in practice

Mathis Equation

Page 25: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Current Protocols (TCP) Don’t Work OverHigh Bandwidth Wide Area Networks

0.01%

0.05%

0.1%

0.1%

0.5%

1000

800

600

400

200

1 10 100 200 400

1000

800

600

400

200

Thro

ughp

ut (M

b/s)

Thro

ughp

ut (M

b/s)

Pack

et Loss

Round Trip Time (ms)

LAN US-EU US-ASIAUS

Page 26: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

1. Goal: Exploit availablebandwidth of wide area 10Gbps networks for distributeddata mining.

2. Developed new applicationlevel network protocol - UDT

3. UDT is fair to other highvolume data flows

4. UDT is friendly tocommodity TCP flows.

5. UDT is easy to deploy sinceapplication level.

We developed streaming joinsand data mining primitives

over UDT

Key Idea: Network Protocols Matter(Factors of 10x, 100x, 1000x)

Page 27: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

UDT Introduced AIMD withDecreasing Increases

AIMD (Additive Increases, Multiplicative Decreases)– x = x + α(x), for every constant interval (e.g., RTT)– x = (1 - β) x, when there is a packet loss eventwhere x is the packet sending rate.

TCP– α(x) ≡ 1, and the increase interval is RTT.– β = 0.5

AIMD with Decreasing Increase– α(x) is non-increasing, and limx->+∞ α(x) = 0.

Page 28: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

AIMD with DecreasingIncreases

α(x)

x

AIMD (TCP NewReno)

UDT

HighSpeed TCP

Scalable TCP

Page 29: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Case Study 4 - IntegratingProteomics Data

Stranger’s Gigabytes and Terabytes

ProteomicsGrid

Page 30: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

What is a Chemical Key?Testosterone,

C19H28O2 NSC id 9700 CAS id 58-22-017-hydroxyandrost-

4-en-3-oneAndrolinCristerona THomosteron

CH3

CH3OH

O

A Chemical key is a globallyunique key or ID associatedwith a chemical compound.

Page 31: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Example 1

Page 32: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Example 2

Page 33: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Unique Chemical Key (UCK)Algorithm - Path Labels

The set of pathsis naturallydefined.

Paths can be lexordered.

O

C

C 0

H

1. Set of paths of length less or equal to 2originating from C: {CO, CC, COH}.

2. Lexigraphically order: [CC, CO, COH].

3. Concatenate: CCCOCOH (path label)

CCCOCOH

Page 34: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Universal Chemical Keys (UCKs) -Graph Labels

O

OO

CH3

CH3

O

CH3

CH3

CH3

O

OO

CH3

CH3

OCH3

CH3

CH3

682322 682323

1. Fix depth d.Compute path labelsλ(u), for nodes u.

2. Loop over all pairs ofnodes u and v,compute length ofshortest path n andform λ(u) n λ(v).

3. Lex order.4. Concatenate.5. Hash.

Loop over all pairs ofnodes u and v andform “natural labels”

Page 35: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Example

098900…

C17H18O4682323

132020…

C17H18O4682322

UCKFormulaNSC

O

OO

CH3

CH3

O

CH3

CH3

CH3

O

OO

CH3

CH3

OCH3

CH3

CH3

682322 682323

Page 36: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Application 1: Keys for ChemicalCompunds (Analysis of NCI Database)

RemarkNumberDescription

UCK gave samekey to samecompounds

33,533Number chem.comp. 2 or moreentries

All gave uniqueUCK

202,384Number of chem.comp. withsingle entry

Somecompounds haveduplicate entries

236,917Total number ofchemicalcompounds

Page 37: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Application 2: Keys forMetabolic Pathways

KEGG database : Lysine biosynthesis

Page 38: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Conclusion

Page 39: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Three Trends for theNext Five Years

1. Forget data mining, the real pay-off is dataintegration, especially for distributed data

2. For many problems, streaming algorithmswill be the only choice available, whetherwe like it or not

3. Analytic algorithms for working with morecomplex data, e.g. graphs, semi-structureddata, etc. will become more and moreimportant.

Page 40: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

References (1 of 3)1. Payments Card Fraud

R. L. Grossman, H. Bodek, D. Northcutt, and H. V. Poor, Data Mining and Tree-based Optimization, Proceedings of theSecond International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han and U. Fayyad,editors, AAAI Press, Menlo Park, California, 1996, pp 323-326.

Robert L. Grossman, Alert Management Systems: A Quick Introduction, in Managing Cyber Threats: Issues, Approachesand Challenges, edited by Vipin Kumar, Jaideep Srivastava and Aleksandar Lazarevic, Springer Science+BusinessMedia, Inc., New York, pages 281-291, ISBN 0-387-24226-0.

2. Change Detection for Highway Traffic

Robert L. Grossman, Michal Sabala, Javid Alimohideen, Anushka Aanand, John Chaves, John Dillenburg, Steve Eick, JasonLeigh, Peter Nelson, Mike Papka, Doug Rorem, Rick Stevens, Steve Vejcik, Leland Wilkinson, and Pei Zhang,Real Time Change Detection and Alerts from Highway Traffic Data, ACM/IEEE SC 2005 Conference (SC'05).

L. Wilkinson, A. Anand and R. Grossman. High-dimensional Visual Analytics: Interactive Exploration Guided by PairwiseViews of Point Distribution, IEEE Transactions on Visualization and Computer Graphics, 2006.

Rajmonda Sulo, Anushka Anand, Leland Wilkinson, Robert Grossman, and Stephen Eick, Topographically-Based Real-TimeTraffic Anomaly Detection in a Metropolitan Highway System, submitted for publication.

Page 41: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

References (2 of 3)3. Sloan Distribution and Continuous Discovery

Yunhong Gu, Xinwei Hong, and Robert Grossman, Experiences in Design and Implementation of a HighPerformance Transport Protocol, SC 04

Yunhong Gu and Robert Grossman, Supporting Configurable Congestion Control in Data TransportServices, ACM/IEEE SC 2005 Conference (SC'05).

Robert L. Grossman, Yunhong Gu, David Handley, and Michal Sabala Joe Mambretti, Alex Szalay andAni Thakar, Kazumi Kumazoe and Oie Yuji, Minsun Lee, Yoonjoo Kwon, and Woojin Seok, DataMining Middleware for Wide Area High Performance Networks, Journal of Future GenerationComputer Systems (FGCS), 2006.

Robert L. Grossman, Yunhong Gu, Xinwei Hong, Antony Antony, Johan Blom, Freek Dijkstra, and Ceesde Laat, Teraflows over Gigabit WANs with UDT, Journal of Future Computer Systems, ElsevierPress, Volume 21, Number 4, 2005, pages 501-513.

Marco Mazzucco, Asvin Ananthanarayan, Robert L. Grossman, Jorge Levera, and Gokulnath BhagavanthaRao, Merging Multiple Data Streams on Common Keys over High Performance Networks,Proceedings of the IEEE/ACM SC2002 Conference, 2002, IEEE Computer Society, page 67.

Page 42: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

References (3 of 3)

4. Data Integration and Universal Chemical Keys (UCKs)

Greeshma Neglur and Robert L. Grossman, Assigning Unique Keys to Chemical Compounds for Data Integration: SomeInteresting Counter Examples, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), LaJolla, July 20-22, 2005.

Robert L. Grossman, Pavan Kasturi, Donald Hamelberg, Bing Liu, An Empirical Study of the Universal Chemical KeyAlgorithm for Assigning Unique Keys to Chemical Compounds, Journal of Bioinformatics and ComputationalBiology, 2004, Volume 2, Number 1, 2004, pages 155-171.

5. Data Mining and Data Integration Architectures

Robert Grossman, Mark Hornick, and Gregor Meyer, Data Mining Standards Initiatives, Communications of the ACM,Volume 45-8, 2002, pages 59-61.

Robert Grossman, and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining ofData, IEEE Computing in Science and Engineering, July/August, 2002, pages 44-51.

Page 43: Local Terabytes, Remote Terabytes, and Distributed ... · Data Mining Architecture data updates 1. data collection Sensors, news feeds, weather data, etc. 3. on-line deployment Alert

Thank you.

For more information: www.ncdm.uic.edu