The Three Analytics Techniques
Decision Trees – Determining Probability
Decision Trees – Chi Square
Example: Chi-squared test
Is the proportion of the outcome class the same in each child node?
It shouldn’t be, or the classification isn’t very helpful
Observed

| | Owns | Rents | Total |
|---|---|---|---|
| Default | 300 | 450 | 750 |
| No Default | 550 | 200 | 750 |
| Total | 850 | 650 | 1500 |
Root (n=1500): Default = 750, No Default = 750
Owns (n=850): Default = 300, No Default = 550
Rents (n=650): Default = 450, No Default = 200
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Expected

| | Owns | Rents | Total |
|---|---|---|---|
| Default | 425 | 325 | 750 |
| No Default | 425 | 325 | 750 |
| Total | 850 | 650 | 1500 |
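The Expected table can be derived from the row and column totals of the Observed table. The sketch below is one way to do it in plain Python; the variable names are illustrative and not from the slides.

```python
# Observed counts from the slide: rows are outcomes, columns are Owns / Rents.
observed = [
    [300, 450],  # Default
    [550, 200],  # No Default
]

row_totals = [sum(row) for row in observed]        # totals per outcome
col_totals = [sum(col) for col in zip(*observed)]  # totals per child node
grand_total = sum(row_totals)

# Expected count for a cell = row total * column total / grand total,
# i.e. what each cell would hold if both child nodes had the same
# proportion of defaults as the root.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Default", "No Default"], expected):
    print(label, row)
```

Because the root node is split 50/50 between Default and No Default, each child node's expected counts are split 50/50 as well.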
Chi-squared test
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
If the groups were the same, you’d expect an even split (Expected)
But we can see they aren’t distributed evenly (Observed)
But is it enough (i.e., statistically significant)?
$$\chi^2 = \frac{(300-425)^2}{425} + \frac{(550-425)^2}{425} + \frac{(450-325)^2}{325} + \frac{(200-325)^2}{325}$$

$$\chi^2 \approx 36.76 + 36.76 + 48.08 + 48.08 = 169.68$$

$$p < 0.0001$$
Small p-values (i.e., less than 0.05) mean it’s very unlikely the groups are the same
So Owns/Rents is a predictor that creates two different groups
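As a check on the arithmetic, the statistic and its p-value can be computed with the standard library alone. This is a minimal sketch: it assumes 1 degree of freedom (a 2x2 table), for which the chi-squared upper-tail probability equals `erfc(sqrt(x / 2))`. The total comes out near 169.7 when the individual terms are kept at full precision.

```python
import math

# Observed and expected counts from the slide (rows: Default, No Default;
# columns: Owns, Rents).
observed = [[300, 450], [550, 200]]
expected = [[425, 325], [425, 325]]

# Sum (O - E)^2 / E over every cell of the table.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Assumption: 1 degree of freedom, so the upper-tail probability of the
# chi-squared distribution reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(f"chi-squared = {chi_squared:.2f}")
print(f"p < 0.0001: {p_value < 0.0001}")
```

For tables larger than 2x2 the shortcut for the p-value no longer applies, and a statistics library would be the more practical route.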
Cluster Analysis – Cohesion and Separation
Cluster Analysis
• What do you look for in the histogram that tells you a variable should not be included in the cluster analysis?
[Figure: two clusters with labeled distances. Cluster 1: 1, 1.3, 2. Cluster 2: 3, 3.3, 1.5.]

$$\text{SSE}_1 = 1^2 + 1.3^2 + 2^2 = 1 + 1.69 + 4 = 6.69$$

$$\text{SSE}_2 = 3^2 + 3.3^2 + 1.5^2 = 9 + 10.89 + 2.25 = 22.14$$
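The two sums of squared errors can be reproduced with a few lines of Python. The helper name `sse` is illustrative; the distances are the ones shown on the slide.

```python
def sse(distances):
    """Sum of squared errors: square each distance and add them up."""
    return sum(d ** 2 for d in distances)

# Distances shown on the slide for each cluster.
cluster_1 = [1, 1.3, 2]
cluster_2 = [3, 3.3, 1.5]

sse_1 = sse(cluster_1)
sse_2 = sse(cluster_2)

print(f"SSE1 = {sse_1:.2f}")
print(f"SSE2 = {sse_2:.2f}")
```

The lower SSE for Cluster 1 indicates that its points sit closer together, so it is the more cohesive of the two.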
Separation and Cohesion
• Which is better?
Distance within clusters is minimized
Distance between clusters is maximized
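To make the two criteria concrete, here is a small sketch that measures cohesion as within-cluster SSE around the centroid and separation as the distance between centroids. The points are made up for illustration and are not from the slides; other definitions of separation exist.

```python
import math

# Hypothetical 2-D points, invented for this example only.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)],
}

def centroid(points):
    """Mean of each coordinate across the points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def within_sse(points):
    """Cohesion: sum of squared distances from each point to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)

centroids = {name: centroid(pts) for name, pts in clusters.items()}
cohesion = {name: within_sse(pts) for name, pts in clusters.items()}

# Separation: distance between the two cluster centroids.
separation = math.dist(centroids["A"], centroids["B"])

print("within-cluster SSE:", cohesion)
print("distance between centroids:", round(separation, 2))
```

A good clustering keeps the within-cluster SSE values small while the distance between centroids stays large.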
Segment Profile Plot
Association Rules Mining
• Support count (σ)
  • In how many baskets does the itemset appear?
  • σ({Milk, Beer, Diapers}) = 2 (i.e., in baskets 3 and 4)
• Support (s)
  • Fraction of transactions that contain all items in X ∪ Y
  • s({Milk, Diapers, Beer}) = 2/5 = 0.4
| Basket | Items |
|---|---|
| 1 | Bread, Milk |
| 2 | Bread, Diapers, Beer, Eggs |
| 3 | Milk, Diapers, Beer, Coke |
| 4 | Bread, Milk, Diapers, Beer |
| 5 | Bread, Milk, Diapers, Coke |
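Support count and support are easy to verify against the five baskets. The sketch below stores each basket as a Python set; the function names `support_count` and `support` are illustrative.

```python
# The five baskets from the slide, keyed by basket number.
baskets = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Coke"},
}

def support_count(itemset, baskets):
    """Number of baskets that contain every item in the itemset."""
    return sum(1 for items in baskets.values() if itemset <= items)

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in the itemset."""
    return support_count(itemset, baskets) / len(baskets)

itemset = {"Milk", "Beer", "Diapers"}
print(support_count(itemset, baskets))
print(support(itemset, baskets))
```

The `<=` operator on sets tests for a subset, which is exactly the "itemset appears in the basket" condition.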
Confidence
• Confidence is the strength of the association
• Measures how often items in Y appear in transactions that contain X
$$c = \frac{\sigma(X \cup Y)}{\sigma(X)} = \frac{\sigma(\text{Milk, Diapers, Beer})}{\sigma(\text{Milk, Diapers})} = \frac{2}{3} \approx 0.67$$
This says that 67% of the time, when you have milk and diapers in the itemset, you also have beer!
c must be between 0 and 1
1 is a complete association
0 is no association
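The confidence of the rule {Milk, Diapers} → {Beer} can be checked directly from the basket data. This is a minimal sketch; the helper name `count` is illustrative.

```python
# The five baskets from the slide.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def count(itemset):
    """Support count: number of baskets containing every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

x = {"Milk", "Diapers"}
y = {"Beer"}

# Confidence = count of baskets with X and Y together / count of baskets with X.
confidence = count(x | y) / count(x)
print(f"confidence = {confidence:.2f}")
```

Three baskets contain both milk and diapers, and two of those also contain beer, which gives 2/3.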
Lift Example
• What’s the lift for the rule {Milk, Diapers} → {Beer}?
• So X = {Milk, Diapers} and Y = {Beer}
s({Milk, Diapers, Beer}) = 2/5 = 0.4
s({Milk, Diapers}) = 3/5 = 0.6
s({Beer}) = 3/5 = 0.6
So
$$\text{Lift} = \frac{0.4}{0.6 \times 0.6} = \frac{0.4}{0.36} \approx 1.11$$
When Lift > 1, the occurrence of X and Y together is more likely than what you would expect by chance
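Lift divides the support of X and Y together by the product of their individual supports. The sketch below recomputes it for {Milk, Diapers} → {Beer} from the basket data; the function name `support` is illustrative.

```python
# The five baskets from the slide.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

x = {"Milk", "Diapers"}
y = {"Beer"}

# Lift = s(X and Y together) / (s(X) * s(Y)).
lift = support(x | y) / (support(x) * support(y))
print(f"lift = {lift:.2f}")
```

A value above 1 means the itemsets co-occur more often than they would if they were independent.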
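Another example

| | Checking Account: No | Checking Account: Yes | Total |
|---|---|---|---|
| Savings Account: No | 500 | 3500 | 4000 |
| Savings Account: Yes | 1000 | 5000 | 6000 |
| Total | 1500 | 8500 | 10000 |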
Are people more inclined to have a checking account if they have a savings
account?
Support({Savings} → {Checking}) = 5000/10000 = 0.5
Support({Savings}) = 6000/10000 = 0.6
Support({Checking}) = 8500/10000 = 0.85
Confidence({Savings} → {Checking}) = 5000/6000 = 0.83
$$\text{Lift} = \frac{0.5}{0.6 \times 0.85} = \frac{0.5}{0.51} \approx 0.98$$
Answer: No
In fact, it’s slightly less than what you’d expect by chance!
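The same measures can be computed straight from the contingency table. In this sketch the counts are keyed by (has savings, has checking); that layout and the variable names are assumptions made for illustration.

```python
# Counts from the slide, keyed by (has_savings, has_checking).
counts = {
    (False, False): 500,
    (False, True): 3500,
    (True, False): 1000,
    (True, True): 5000,
}

total = sum(counts.values())
savings = sum(n for (s, _), n in counts.items() if s)    # customers with savings
checking = sum(n for (_, c), n in counts.items() if c)   # customers with checking
both = counts[(True, True)]                              # customers with both

support_rule = both / total
support_savings = savings / total
support_checking = checking / total

# Confidence of {Savings} -> {Checking} and its lift.
confidence = both / savings
lift = support_rule / (support_savings * support_checking)

print(f"support = {support_rule:.2f}")
print(f"confidence = {confidence:.2f}")
print(f"lift = {lift:.2f}")
```

Lift falls just under 1 because checking accounts are already very common (85% of customers), so knowing that someone has a savings account adds almost no information.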
Final Question
• Can you have high confidence and low lift?