Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Upload: gwendolyn-mckenzie

Post on 02-Jan-2016


Page 1: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Data Mining: What is All That Data Telling Us?

Dave Dickey NCSU Statistics

Page 2: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

What I know
• What “they” can do
• How “they” can do it

What I don’t know
• What is some particular entity doing?
• How safe is your particular information?
• Is big brother watching me right now?

Page 3: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

* Data being created at lightning pace
* Moore’s law: (doubling / 2 years – transistors on integrated circuits)

Internet “hits”
Scanner cards
E-mails
Intercepted messages
Credit scores
Environmental Monitoring
Satellite Images
Weather Data
Health & Birth Records

Page 4: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

So we have some data – now what??

• Predict defaults, dropouts, etc.

• Find buying patterns

• Segment your market

• Detect SPAM (or others)

• Diagnose handwriting

• Cluster

• ANALYZE IT !!!!

Page 5: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Data Mining - What is it?

• Large datasets
• Fast methods
• Not significance testing
• Topics
– Trees (recursive splitting)
– Nearest Neighbor
– Neural Networks
– Clustering
– Association Analysis

Page 6: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Trees

• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Framingham Heart Study

Page 7: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Recursive Splitting

[Figure: scatter plot of non-defaulters (x) and defaulters (D) over X1 = DebtToIncomeRatio and X2 = Age. Recursive splits partition the plane into rectangular regions with Pr{default} = 0.007, 0.012, 0.0001, 0.003, and 0.006.]

Page 8: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Some Actual Data

• Framingham Heart Study

• First Stage Coronary Heart Disease – P{CHD} = Function of:
– Age - no drug yet!
– Cholesterol
– Systolic BP


Page 9: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Example of a “tree”

[Figure: tree diagram. Root node: all 1615 patients. Split # 1 is on Age; a later split is on Systolic BP; one leaf is labeled “terminal node”.]

Page 10: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

How to make splits?

• Which variable to use?
• Where to split?
– Cholesterol > ____
– Systolic BP > _____
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: Everyone with BP>x has problems, nobody with BP<x has problems

Page 11: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Where to Split?

• Maximize “dependence” statistically
• We use “contingency tables”

DEPENDENT
               Heart Disease
               No    Yes   Total
Low BP         95      5     100
High BP        55     45     100

INDEPENDENT
               Heart Disease
               No    Yes   Total
Low BP         75     25     100
High BP        75     25     100

Page 12: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Measuring Dependence
• Expect 100(150/200)=75 in upper left if independent (etc., e.g. 100(50/200)=25)

Observed counts, with expected counts under independence in parentheses:

               Heart Disease
               No         Yes        Total
Low BP         95 (75)     5 (25)     100
High BP        55 (75)    45 (25)     100
Total         150         50          200

How far from expectations is “too far” (significant dependence)?

Page 13: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

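χ² Test Statistic

Observed counts, with expected counts in parentheses:

               No         Yes        Total
Low BP         95 (75)     5 (25)     100
High BP        55 (75)    45 (25)     100
Total         150         50          200

$$\chi^2 = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

χ² = 2(400/75) + 2(400/25) = 42.67

42.67 - So what?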

Page 14: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Use Probability! “P-value”

“Significance Level” (0.05)

Page 15: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Measuring “Worth” of a Split

• P-value is the probability of a χ² as great as that observed if independence is true.

• Pr{χ² > 42.67} is 0.000000000064

• P-values all too small to understand.

• Logworth = -log10(p-value) = 10.19

• Best Chi-square ⇔ max logworth.
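
The numbers on the last few slides can be reproduced with a short sketch. This is only an illustration, not the software used in the talk: the function names are made up here, and the p-value uses the standard identity for a chi-square with 1 degree of freedom (the upper tail equals erfc(sqrt(x/2))), which applies to a 2x2 table.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns are independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def logworth(p_value):
    """Logworth as defined on the slide: -log10(p-value)."""
    return -math.log10(p_value)

# Low BP / High BP by heart disease No / Yes, from the slides
bp_table = [[95, 5], [55, 45]]
chi2 = chi_square_2x2(bp_table)
p = p_value_1df(chi2)
print(round(chi2, 2), p, round(logworth(p), 2))
```

Running the same functions on the INDEPENDENT table (75, 25 in both rows) gives a chi-square of 0, since every cell equals its expected count.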

Page 16: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Logworth for Age Splits

Age 47 maximizes logworth

Page 17: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

How to make splits?

• Which variable to use?

• Where to split?
– Cholesterol > ____
– Systolic BP > _____

• Idea – Pick BP cutoff to minimize p-value for χ²

• What does “significance” mean now?

Page 18: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Multiple testing

• 50 different BPs in data, 49 ways to split

• Sunday football highlights always look good!

• If he shoots enough baskets, even a 95% free throw shooter will miss.

• Tried 49 splits, each has 5% chance of declaring significance even if there’s no relationship.

Page 19: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Multiple testing

α = Pr{ falsely reject hypothesis 1}

α = Pr{ falsely reject hypothesis 2}

Pr{ falsely reject one or the other} ≤ 2α
Desired: 0.05 probability or less
Solution: use α = 0.05/2
Or – compare 2(p-value) to 0.05

Page 20: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Multiple testing

• 50 different BPs in data, m=49 ways to split

• Multiply p-value by 49

• Stop splitting if minimum p-value is large (logworth is small).

• For m splits, logworth becomes

-log10(m*p-value)
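
A minimal sketch of the adjustment, with hypothetical function names. The false-alarm calculation additionally assumes the 49 tests are independent, which the slides do not claim; it is there only to show why trying many splits inflates the chance of a spurious "significant" result.

```python
import math

def adjusted_logworth(p_value, m):
    """Logworth after multiplying the p-value by the m splits tried."""
    return -math.log10(m * p_value)

def chance_of_false_alarm(alpha, m):
    """Pr{at least one false rejection} in m tests, ASSUMING independence."""
    return 1.0 - (1.0 - alpha) ** m

# 49 candidate splits, each tested at the 0.05 level
print(round(chance_of_false_alarm(0.05, 49), 3))

# The BP example: unadjusted p-value 0.000000000064, m = 49
print(round(adjusted_logworth(6.4e-11, 49), 2))
```

The adjusted logworth is smaller than the unadjusted 10.19, so a split has to clear a higher bar when many cutoffs were tried.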

Page 21: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Other Split Evaluations

• Gini Diversity Index
– { E E E E G E G G L G }
– Pick 2, Pr{different} = 1 - Pr{EE} - Pr{GG} - Pr{LL}
  = 1 - [10 + 6 + 0]/45 = 29/45 = 0.64
– { E E G L G E E G L L }
  1 - [6 + 3 + 3]/45 = 33/45 = 0.73
  MORE DIVERSE, LESS PURE

• Shannon Entropy
– Larger ⇒ more diverse (less pure)
– $-\sum_i p_i \log_2(p_i)$
  {0.5, 0.4, 0.1} → 1.36
  {0.4, 0.2, 0.3} → 1.51 (more diverse)
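
Both measures are easy to check by hand or in code. The sketch below uses hypothetical function names, counts same-label pairs exactly as the slide does ("pick 2" without replacement), and evaluates the entropy only for the first set of proportions.

```python
import math
from collections import Counter
from fractions import Fraction

def gini_diversity(labels):
    """Pr{two items drawn without replacement have different labels}."""
    n = len(labels)
    total_pairs = math.comb(n, 2)
    same_pairs = sum(math.comb(count, 2) for count in Counter(labels).values())
    return Fraction(total_pairs - same_pairs, total_pairs)

def shannon_entropy(probs):
    """-sum p_i log2(p_i); zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

first = gini_diversity("EEEEGEGGLG")
second = gini_diversity("EEGLGEEGLL")
print(first, float(first))
print(second, float(second))
print(round(shannon_entropy([0.5, 0.4, 0.1]), 2))
```

Using `Fraction` keeps the 29/45 and 33/45 from the slide exact rather than rounded.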

Page 22: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Goals

• Split if diversity in parent “node” > summed diversities in child nodes

• Observations should be
– Homogeneous (not diverse) within leaves
– Different between leaves
– Leaves should be diverse

• Framingham tree used Gini for splits

Page 23: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Cross validation

• Traditional stats – small dataset, need all observations to estimate parameters of interest.

• Data mining – loads of data, can afford “holdout sample”

• Variation: n-fold cross validation
– Randomly divide data into n sets
– Estimate on n-1, validate on 1
– Repeat n times, using each set as holdout.
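
The n-fold procedure can be sketched as index bookkeeping. The function name and the seed argument are illustrative assumptions; the model fitting itself is left out.

```python
import random

def n_fold_splits(n_obs, n_folds, seed=0):
    """Return a list of (fit_indices, holdout_indices), one pair per fold."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)          # randomly divide the data
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k, holdout in enumerate(folds):
        # Estimate on the other n-1 folds, validate on fold k
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((fit, holdout))
    return splits

splits = n_fold_splits(n_obs=10, n_folds=5)
for fit, holdout in splits:
    print(sorted(holdout), "held out;", len(fit), "used for fitting")
```

Every observation is held out exactly once, so each one contributes to validation without ever being scored by a model that was fit on it.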

Page 24: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Pruning

• Grow bushy tree on the “fit data”
• Classify holdout data
• Likely farthest out branches do not improve, possibly hurt fit on holdout data
• Prune non-helpful branches.
• What is “helpful”? What is good discriminator criterion?

Page 25: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Goals

• Want diversity in parent “node” > summed diversities in child nodes

• Goal is to reduce diversity within leaves
• Goal is to maximize differences between leaves
• Use same evaluation criteria as for splits
• Costs (profits) may enter the picture for splitting or evaluation.

Page 26: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Accounting for Costs

• Pardon me (sir, ma’am) can you spare some change?

• Say “sir” to male +$2.00
• Say “ma’am” to female +$5.00
• Say “sir” to female -$1.00 (balm for slapped face)
• Say “ma’am” to male -$10.00 (nose splint)

Page 27: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Including Probabilities

Leaf has Pr(M)=.7, Pr(F)=.3.

Probability (profit in $) for each combination:

                  You say:
True Gender       “sir” (M)    “ma’am” (F)
M                 0.7 (2)      0.7 (-10)
F                 0.3 (-1)     0.3 (5)

Expected profit is 2(0.7)-1(0.3) = $1.10 if I say “sir”.
Expected profit is -7+1.5 = -$5.50 (a loss) if I say “ma’am”.
Weight leaf profits by leaf size (# obsns.) and sum.
Prune (and split) to maximize profits.
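
The expected-profit arithmetic for a single leaf can be written out as follows. The dictionary layout and function name are assumptions made for this sketch; the dollar amounts and probabilities are the ones on the slides.

```python
# Profit in dollars for (what you say, true gender), from the slides
PROFIT = {
    ("sir", "M"): 2.00,
    ("sir", "F"): -1.00,
    ("ma'am", "M"): -10.00,
    ("ma'am", "F"): 5.00,
}

def expected_profit(greeting, leaf_probs):
    """Average profit of one greeting over the leaf's gender probabilities."""
    return sum(prob * PROFIT[(greeting, gender)]
               for gender, prob in leaf_probs.items())

leaf = {"M": 0.7, "F": 0.3}
for greeting in ("sir", "ma'am"):
    print(greeting, round(expected_profit(greeting, leaf), 2))

# The decision for this leaf is whichever greeting pays best on average
best = max(("sir", "ma'am"), key=lambda g: expected_profit(g, leaf))
print("say:", best)
```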

Page 28: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Additional Ideas

• Forests – Draw samples with replacement (bootstrap) and grow multiple trees.

• Random Forests – Randomly sample the “features” (predictors) and build multiple trees.

• Classify new point in each tree then average the probabilities, or take a plurality vote from the trees

Page 29: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

* Lift Chart
- Go from leaf of most to least response.
- Lift is cumulative proportion responding.

Page 30: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Regression Trees

• Continuous response (not just class)

• Predicted response constant in regions

[Figure: the (X1, X2) plane partitioned into rectangular regions with predictions 50, 80, 100, 130, and 20. The region predicting 50 contains responses {47, 51, 57, 45}; 50 = mean.]

Page 31: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

• Predict P_i in cell i (it is the cell mean)

• Y_ij = jth response in cell i

• Split to minimize $\sum_i \sum_j (Y_{ij} - P_i)^2$

• [sum of squared deviations from cell mean]

[Figure: partition into regions with predictions 50, 80, 100, 130, and 20.]

For the cell predicting 50, the deviations are {-3, 1, 7, -5}, so SSq = 9+1+49+25 = 84
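
As a small check of the criterion, the sketch below computes the cell mean and the sum of squared deviations for the cell shown on the slide. The function names are illustrative, and the second cell in the usage lines is made up purely to show how cells are summed.

```python
def cell_prediction(responses):
    """Predicted value for a cell: the mean of its responses."""
    return sum(responses) / len(responses)

def sum_of_squares(responses):
    """Sum of squared deviations from the cell mean."""
    p = cell_prediction(responses)
    return sum((y - p) ** 2 for y in responses)

def split_criterion(cells):
    """Total over all cells; a candidate split is better when this is smaller."""
    return sum(sum_of_squares(cell) for cell in cells)

cell = [47, 51, 57, 45]
print(cell_prediction(cell), sum_of_squares(cell))

# Hypothetical second cell with no spread adds nothing to the total
print(split_criterion([cell, [20, 20]]))
```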

Page 32: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

• Predict P_i in cell i.

• Y_ij = jth response in cell i.

• Split to minimize $\sum_i \sum_j (Y_{ij} - P_i)^2$

Page 33: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Logistic Regression

• Logistic – another classifier

• Older – “tried & true” method

• Predict probability of response from input variables (“Features”)

• Need to ensure 0 < probability < 1

Page 34: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 35: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Example: Shuttle Missions

• O-rings failed in Challenger disaster
• Low temperature
• Prior flights: “erosion” and “blowby” in O-rings
• Feature: Temperature at liftoff
• Target: problem (1) - erosion or blowby vs. no problem (0)

Page 36: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

• We can easily “fit” lines
• Lines exceed 1, fall below 0
• Model L as linear in temperature
• L = a+b(temp)
• Convert: $p = \frac{e^{L}}{1+e^{L}} = \frac{e^{a+b(\text{temp})}}{1+e^{a+b(\text{temp})}}$
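
The conversion from L to a probability is a one-liner. In the sketch below the coefficients a and b are hypothetical values chosen only to illustrate the shape; they are not the fitted shuttle coefficients, which this transcript does not show.

```python
import math

def logistic(L):
    """Convert a value on the linear scale to a probability."""
    return math.exp(L) / (1.0 + math.exp(L))

def prob_from_temp(temp, a, b):
    """p = e^(a + b*temp) / (1 + e^(a + b*temp))."""
    return logistic(a + b * temp)

# HYPOTHETICAL coefficients, for illustration only
a, b = 10.0, -0.15

for temp in (30, 50, 70, 90):
    L = a + b * temp
    print(temp, round(L, 2), round(prob_from_temp(temp, a, b), 3))
```

The linear predictor L ranges freely above 1 and below 0, while the converted p always stays strictly between 0 and 1.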


Page 37: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 38: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Example: Ignition

• Flame exposure time = X

• Ignited Y=1, did not ignite Y=0
– Y=0, X = 3, 5, 9, 10, 13, 16
– Y=1, X = 11, 12, 14, 15, 17, 25, 30

• Probability of our data is “Q”

• Q=(1-p)(1-p)(1-p)(1-p)pp(1-p)pp(1-p)ppp

• The p’s are all different: p = f(exposure)

• Find a,b to maximize Q(a,b)
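
A minimal sketch of the likelihood Q(a,b) for the 13 observations listed above, assuming the logistic form p = e^L/(1+e^L) with L = a + b·X. It only evaluates Q at two trial points and does not perform the maximization; the second trial point is close to the values shown on the deck's likelihood plot.

```python
import math

# Exposure times from the slide
no_ignite = [3, 5, 9, 10, 13, 16]          # Y = 0
ignited = [11, 12, 14, 15, 17, 25, 30]     # Y = 1

def p_ignite(x, a, b):
    """Probability of ignition at exposure time x under the logistic model."""
    L = a + b * x
    return math.exp(L) / (1.0 + math.exp(L))

def likelihood(a, b):
    """Q(a, b): product of (1 - p) over Y=0 cases and p over Y=1 cases."""
    q = 1.0
    for x in no_ignite:
        q *= 1.0 - p_ignite(x, a, b)
    for x in ignited:
        q *= p_ignite(x, a, b)
    return q

# Trial point 1: no relationship (every p = 0.5)
print(likelihood(0.0, 0.0))
# Trial point 2: close to the maximum shown on the likelihood plot
print(likelihood(-2.6, 0.23))
```

A numerical optimizer would search over (a, b) for the largest Q; in practice the log of Q is maximized instead, because products of many probabilities become very small.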

Page 39: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Likelihood function (Q)

[Figure: plot of Q(a,b), with the maximum marked near a = -2.6 and b = 0.23.]

Page 40: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

IGNITION DATA
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

                             Standard     Wald
Parameter   DF   Estimate    Error        Chi-Square   Pr > ChiSq
Intercept    1   -2.5879     1.8469       1.9633       0.1612
TIME         1    0.2346     0.1502       2.4388       0.1184

Association of Predicted Probabilities and Observed Responses

Percent Concordant   79.2    Somers' D   0.583
Percent Discordant   20.8    Gamma       0.583
Percent Tied          0.0    Tau-a       0.308
Pairs                  48    c           0.792

Page 41: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

5 right, 4 wrong

4 right, 1 wrong

Page 42: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Example: Framingham
• X=age
• Y=1 if heart trouble, 0 otherwise

Page 43: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 44: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Framingham

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates

                             Standard     Wald
Parameter   DF   Estimate    Error        Chi-Square   Pr>ChiSq

Intercept    1   -5.4639     0.5563       96.4711      <.0001
age          1    0.0630     0.0110       32.6152      <.0001

Page 45: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 46: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 47: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Neural Networks

• Very flexible functions
• “Hidden Layers”
• “Multilayer Perceptron”

[Figure: network diagram from inputs to output. The output is a logistic function of logistic functions of the data.]

Page 48: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Arrows represent linear combinations of “basis functions,” e.g. logistics

b1

Example:

Y = a + b1 p1 + b2 p2 + b3 p3

Y = 4 + p1+ 2 p2 - 4 p3
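
To make the example concrete, here is a sketch in which p1, p2, p3 are logistic functions of a single input x. The output coefficients (4, 1, 2, -4) are from the slide; the hidden-unit intercepts and slopes are hypothetical, since the slide does not give them.

```python
import math

def logistic(L):
    """Logistic basis function."""
    return 1.0 / (1.0 + math.exp(-L))

def network_output(p1, p2, p3):
    """Output layer from the slide: Y = 4 + p1 + 2 p2 - 4 p3."""
    return 4 + p1 + 2 * p2 - 4 * p3

# HYPOTHETICAL hidden units: (intercept, slope) for each logistic of x
HIDDEN = [(0.0, 1.0), (-1.0, 0.5), (2.0, -1.5)]

def predict(x):
    """Feed x through the hidden logistics, then combine them linearly."""
    p1, p2, p3 = (logistic(c + d * x) for c, d in HIDDEN)
    return network_output(p1, p2, p3)

for x in (-2, 0, 2):
    print(x, round(predict(x), 3))
```

Because each p lies between 0 and 1, this particular output is confined to the interval (0, 7); changing the hidden coefficients bends the curve within that range, which is where the flexibility comes from.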

Page 49: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

• Should always use holdout sample
• Perturb coefficients to optimize fit (fit data)
• Eliminate unnecessary arrows using holdout data.

Page 50: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Terms
• Train: estimate coefficients
• Bias: intercept a in Neural Nets
• Weights: coefficients b
• Radial Basis Function: Normal density
• Score: Predict (usually Y from new Xs)
• Activation Function: transformation to target
• Supervised Learning: Training data has response.

Page 51: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Hidden Layer
L1 = -1.87 - .27*Age - 0.20*SBP22
H11 = exp(L1)/(1+exp(L1))

L2 = -20.76 - 21.38*H11
Pr{first_chd} = exp(L2)/(1+exp(L2))
“Activation Function”

Page 52: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics
Page 53: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Unsupervised Learning

• We have the “features” (predictors)

• We do NOT have the response even on a training data set (UNsupervised)

• Clustering
– Agglomerative
  • Start with each point separated
– Divisive
  • Start with all points in one cluster, then split

Page 54: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Clustering – political (hypothetical)

• 300 people: “mark line to indicate concern”:
• <-5> ---------0-------------- <+5>
• X1: economy
• X2: war in Iraq
• X3: health care

• 1st person (2.2, -3.1, 0.9)
• 2nd person (-1.6, 1, 0.6)
• Etc.

Page 55: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Clusters as Created

Page 56: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

As Clustered

Page 57: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Association Analysis

• Market basket analysis
– What they’re doing when they scan your “VIP” card at the grocery
– People who buy diapers tend to also buy _________ (beer?)
– Just a matter of accounting but with new terminology (of course)
– Examples from SAS Appl. DM Techniques, by Sue Walsh:

Page 58: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Terminology

• Baskets: ABC ACD BCD ADE BCE

Rule          Support         Confidence
X => Y        Pr{X and Y}     Pr{Y|X}
A => D        2/5             2/3
C => A        2/5             2/4
B&C => D      1/5             1/3
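
Support and confidence really are just counting, as this sketch shows for the five baskets above. The function names are illustrative; each basket is treated as a set of single-letter items.

```python
from fractions import Fraction

BASKETS = ["ABC", "ACD", "BCD", "ADE", "BCE"]

def support(items, baskets=BASKETS):
    """Fraction of baskets that contain every item in `items`."""
    hits = sum(1 for basket in baskets if set(items) <= set(basket))
    return Fraction(hits, len(baskets))

def rule_stats(lhs, rhs, baskets=BASKETS):
    """Return (support, confidence) for the rule lhs => rhs."""
    both = support(lhs + rhs, baskets)          # Pr{X and Y}
    return both, both / support(lhs, baskets)   # Pr{Y | X}

for lhs, rhs in (("A", "D"), ("C", "A"), ("BC", "D")):
    s, c = rule_stats(lhs, rhs)
    print(f"{lhs} => {rhs}: support {s}, confidence {c}")
```

`Fraction` reduces 2/4 to 1/2, so the printed confidence for C => A looks different from the slide but is the same number.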

Page 59: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Don’t be Fooled!
• Lift = Confidence / Expected Confidence if Independent

                    Checking
Saving        No (1500)   Yes (8500)   Total (10000)
No               500         3500         4000
Yes             1000         5000         6000

SVG=>CHKG: expect 8500/10000 = 85% if independent.
Observed confidence is 5000/6000 = 83%.
Lift = 83/85 < 1. Savings account holders are actually LESS likely than others to have a checking account!
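
The lift calculation for the savings and checking table can be sketched as below. The nested-dictionary layout and the function name are assumptions made for the example; the counts are the ones on the slide.

```python
# Outer key: has savings account; inner key: has checking account
TABLE = {
    "no": {"no": 500, "yes": 3500},
    "yes": {"no": 1000, "yes": 5000},
}

def lift_savings_implies_checking(table):
    """Lift of the rule SVG => CHKG: confidence / expected confidence."""
    total = sum(sum(row.values()) for row in table.values())
    checking_yes = sum(row["yes"] for row in table.values())
    savings_yes = sum(table["yes"].values())
    confidence = table["yes"]["yes"] / savings_yes   # Pr{CHKG | SVG}
    expected = checking_yes / total                  # Pr{CHKG} if independent
    return confidence / expected

lift = lift_savings_implies_checking(TABLE)
print(round(lift, 3))
```

A confidence of 83% sounds impressive on its own, which is why it is compared against the 85% base rate before the rule is taken seriously.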

Page 60: Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

Summary

• Data mining – a set of fast stat methods for large data sets
• Some new ideas, many old or extensions of old
• Some methods:
– Decision Trees
– Nearest Neighbor
– Neural Nets
– Clustering
– Association