slides text ana - statistics departmentstine/mich/ta_slides_3.pdf · test model • predict color...
TRANSCRIPT
![Page 1: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/1.jpg)
Classification Methods
70
Logistic Regression Partitioning … Trees
![Page 2: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/2.jpg)
Wharton Department of Statistics
Classification Problems• Models for a categorical response
• Hate speech • Supreme Court decisions • Web ratings: Amazon star ratings, filtering phony reviews
• Techniques• Logistic regression for two, multinomial for several
Variable selection (stepwise, lasso)
• Classification trees Boosted trees, random forest
• James text summarizes modern approaches
71
parametric
nonparametric
![Page 3: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/3.jpg)
Wharton Department of Statistics
Where’s the text?• Regression with lots and lots of indicators
• Columns of document term matrix • Presents opportunities, with some evident drawbacks
• Simple choice often works well• Easily interpreted (as easy as any dummy variable) • Sets a baseline for more complex methods
• Combine with other features• No reason not to use other features if available • Examples
wine data: words from tasting notes + alcohol + vintage real estate: words from listing + square footage medicine: doctor’s notes + lab measurements
72
![Page 4: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/4.jpg)
Wharton Department of Statistics
Review: Logistic Regression• Probability model
• Two, mutually exclusive categories • Similar to linear regression in many ways • P(yi = 1| xi) = E(yi = 1| xi) = µi(β0,β1) = 1/(1+exp(-β0 - β1xi)) • Structural form has important implications
probability goes to 0/1 as |X| gets large coefficients describe log odds
• Maximum likelihood• Estimate parameters to maximize joint probability log P(y1,y2,…,yn| X) = Σi (1-yi) log (1-µi) + yi log µi
• Independence • Nonlinear least squares (iteratively reweighted least squares)
73
![Page 5: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/5.jpg)
Wharton Department of Statistics
More than two?• Examples
• Not every election is a two-party contest! Multiple candidates in a primary election
• Wine varieties Think of all the types of red wines that exist.
• Multinomial logistic regression (unordered categories)• Multinomial distribution replaces the binomial • P(yi = k| xi) = µi(β0,β1) = exp(-βk0 - βkixi)/(Σk exp(-βk0 - βkixi)) • Constrained to sum to 1
Reduces to binomial in the case of k=2 categories Interpretation of coefficients is different in this specification
74
![Page 6: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/6.jpg)
Wharton Department of Statistics
Model Selection• Which features belong in the logistic regression?
• Text presents challenge• Suppose we consider picking columns from the document-term matrix as predictive features
• Suppose we consider picking combinations of columns from the document-term matrix
• Feature selection• Selection criteria such as AIC, BIC, or stepwise choices • Number of choices overwhelm design of criteria
e.g. AIC designed to pick order of polynomial or autoregression
• Assumptions not well suited to the problem (eg “true model”) • Speed becomes limiting factor (recall nonlinear estimation)
75
![Page 7: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/7.jpg)
Wharton Department of Statistics
Penalized Selection• Problem
• Goodness-of-fit statistics like R2 always go up as add features • Maximum likelihood behaves the same way • Overfitting results
• Approach• Add a penalty to the likelihood • Adding a parameter must improve the fit more than the penalty added by increasing model complexity
• Question• How much penalty does adding a parameter incur?
76
![Page 8: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/8.jpg)
Wharton Department of Statistics
Lasso• Penalized likelihood
• Choices L0 maxβ loglike(β) – λ #{βj ≠ 0} AIC, BIC L1 maxβ loglike(β) – λ Σ |βj| L2 maxβ loglike(β) – λ Σ βj2 Ridge regr
• λ controls the amount of the penalty
• Lasso = L1 penalty
• Advantages• Fast computing because objective function is convex • Criterion sets many βj = 0, unlike ridge penalty
77
![Page 9: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/9.jpg)
Wharton Department of Statistics
Penalty Parameter• Choice of tuning parameter λ
• Really big: model is parsimonious • Really small: model has many features
• Bias-Variance tradeoff• Big models have little bias, but high variance • Small models reverse this balance
• Choice uses cross validation• Ten-fold cross-validation of the training data • Fit model to 9/10, predict the other 1/10. Repeat • Pick λ that minimizes the error
78
![Page 10: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/10.jpg)
Wharton Department of Statistics
Partitioning Models: Trees• Familiar metaphor
• Biology • Medical diagnosis • Org chart
• Structure at-a-glance
• Properties• Recursive, partitioning items into unique leaf • Increasing specialization
• How to grow a tree from data?• What rules identify the splitting variables, split points?
79
![Page 11: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/11.jpg)
Wharton Department of Statistics
Classical Example• Fisher’s iris data
• Classification tree: categorical response • 50 flowers from 3 species of iris • four variables: length and width of sepal and petal
80
Splitting rules are not unique
Splits are parallel to plot axes
Stop?
Classification tree
![Page 12: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/12.jpg)
Wharton Department of Statistics
CART™• Classification and regression trees
• A sequence of divisions of cases • Goal is to obtain homogeneous subsets • Predict new observations based on “vote” of leaf
• Classification tree• Categorical response (e.g. good/bad/indifferent) • Goal: Cases in leaf belong to one category
• Regression tree • Numerical response (e.g. profitability) • Cases in leaf have similar value of response
• Familiar likelihood objective• Choose leaves to maximize likelihood
81
![Page 13: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/13.jpg)
Wharton Department of Statistics
Simple Foundation• Bins, lots of bins
• Allow variables (characteristics) to define a large “cube” with dimensions given by Age x Employment x Residential
• Insert each observation into a bin • Score for bin is average of observations in bin
• Trade-offs• Don’t have to pick additive form, transformations • Some bins may be nearly empty, sparse • Issues remain Which characteristics? Which attributes?
bias vs
variance
82
![Page 14: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/14.jpg)
Wharton Department of Statistics
Goodness of Fit• Two general approaches
• Classification error• Confusion matrix: Count number wrong
“Millions” of summary stats: sensitivity, specificity, recall, precision, f1
• What does it mean to be wrong? • ROC curve and AUC
• Proper scoring rules• Squared error • Likelihoods
83
![Page 15: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/15.jpg)
Wharton Department of Statistics
Confusion Matrix• Confusion matrix
• Common summary table • Misclassification rate
• Sensitivity & specificity• Sensitivity = P(say positive | positive) = Recall • Specificity = P(say negative | negative) • Precision = P(positive | say positive) • F1 = 2 (precision x recall)/(precision+recall) harmonic mean
• Classification error rate• Common, but ‘coarse’
• What threshold would you use to classify?
84
claimneg pos
actual neg n11 n12
pos n21 n22
![Page 16: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/16.jpg)
Wharton Department of Statistics
ROC Curves• ROC Curve
• True positive (sensitivity) vs false positive (1-specificity) • Equivalent to Gini index • Only order matters, not the calibration
• AUC• Area under ROC curve • Interpret as probabilityfit correctly orders pair
• Points of interest?• Care about whole curve? • Economics of derivative
85
AUC = 0.935
![Page 17: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/17.jpg)
Wharton Department of Statistics
Drawing the ROC• Order cases by probabilities
• Move up if positive case
• Move rightif negative case
86
1 1 0 1 0 0Sort based on predictions
![Page 18: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/18.jpg)
Wharton Department of Statistics
Deviance • Twice the log of the likelihood ratio statistic
• Least squares regression. Assume yi ~ N(0,σ2) • Null model -2 loglike(M0) = Σ(yi)2/σ2 ~ chi-square n df = χ2n
• Regression with k estimated coefficients -2 loglike(Mk) = Σ(yi - ŷi)2/σ2 ~ χ2n-kassuming variables have true coefficient βk=0
• Change in log-likelihood when add nothing useful: -2(loglike(M0) - loglike(Mk)) ~ χ2k
• Deviance• -2 (loglike(base model) - loglike(fitted model)) ~ χ2estimated parms
87
![Page 19: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/19.jpg)
Wharton Department of Statistics
Validation• Necessary when comparing complex models
• Easy to overfit complex models Model might have more potential features than observations Eg: Occurrence of which pairs of words indicate how Justice will decide?
• Keep changing model until it fits the observed data all too well
• Validation?• Assess goodness of fit on a test set, not training data • How many?
Depends on task: are models similar
• Caution: Test set gives optimistic assessment• Population drift
88
![Page 20: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/20.jpg)
Wharton Department of Statistics
Improving Trees• Bias-variance trade-off
• Analogous to choice of smoothing parameter • Trees capture nuanced structure, but (low bias) • Trees have highly irregular structure (high var)
• Model averaging• Rather than fit one model, fit several and combine results
• Classifier: majority vote • Regression: average predictions
• Approaches• Boosting “stumps” or small trees are so-called weak learners • Bagging bootstrap resampling method
89
![Page 21: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/21.jpg)
Wharton Department of Statistics
Boosting• General method for improving any simple model
• Build sequence of predictive models...• Start with initial predictive model • Compute residuals from current fit • Build model for residuals • Repeat • Combine estimates from sequence of models
• Use simpler model at each step• Small tree (stump or bush) • Next response = (current response) - (learning rate) x fit
• Weaknesses• Loss of interpretability, at what gain?
90
Adaboostreweighting
cases
![Page 22: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/22.jpg)
Wharton Department of Statistics
Boosting Trees• Pick depth of tree (stumps), learning rate
• Use cross-validation to pick BAnalogous to picking λ for logistic models
91
James Ch 8
![Page 23: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/23.jpg)
Classification Examples
92
wine_classify.R
![Page 24: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/24.jpg)
Wharton Department of Statistics
Plan• Predicting wine color
• Two-category response • Easy for both logistic regression and tree
• Predicting the type of wine• Four-category problem • More challenging
Harder to distinguish from choices of words Fewer observations to build a model
• Judging models• Common test sample hidden from each method
93
![Page 25: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/25.jpg)
Wharton Department of Statistics
Predicting Wine Color• Red or white?
• Combine columns from DTM with other data• Indicators or counts
Do we care about how often a word was used, or just its presence?
• Lengths and proportions Is the count most relevant, or the relative frequency
• Choice of predictors is up to you!
• Note: missing data in the other features!• 10% missing vintage or price, 2.5% missing alcohol • Use same approach as in linear regression
94
no pun intended
![Page 26: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/26.jpg)
Wharton Department of Statistics
Logistic Model• Exclude test sample from all models
• Set aside 10,000 … • Why: Test accuracy, and this will make modeling harder
• Start with the classic variables• price, alcohol, vintage, missing indicators, and lengths
95
price isn’tbut
missing isInterpretation?
![Page 27: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/27.jpg)
Wharton Department of Statistics
Logistic with Words• Which words
• Start with simply using proportions of 20 most common words • Common words useful … proxies for length?
96
…much less residual deviance
![Page 28: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/28.jpg)
Wharton Department of Statistics
Logistic with Words• Which words
• Add length to the mixture • Effects still strong for common words, conditional on length
97
…
Interpret?
![Page 29: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/29.jpg)
Wharton Department of Statistics
Logistic with Both • Combine two prior models
• Observed quantitative features • Word relative frequencies + length
98
add more?
![Page 30: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/30.jpg)
Wharton Department of Statistics
Logistic with More Words • Extend prior model
• Observed quantitative features • 40 Word relative frequencies + length
99
hints of collinearity
much better fit! add more?
![Page 31: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/31.jpg)
Wharton Department of Statistics
Test Model• Predict color of wines held back in the test sample
100
Red
Whitesensitivity 0.918
specificity 0.893
precision 0.929
missclass 0.092
precision= # Red/# Claim Red recall = sensitivity = #Claim Red/# Red
![Page 32: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/32.jpg)
Wharton Department of Statistics
Calibration• Do predicted probabilities indicate actual probability?
• Hosmer-Lemeshow test • Plot adds high-degree polynomial (or loess smooth curve)
101
Not a problem if threshold at 0.5
![Page 33: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/33.jpg)
Wharton Department of Statistics
ROC Curve• Plot sensitivity on 1-specificity
• Parametric curve as vary the classification threshold
102
AUC =0.969
true positive
false positive
![Page 34: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/34.jpg)
Wharton Department of Statistics
Variable Selection• Which words
• Twenty words was good, forty was better • Keep going… we have thousands
• Try feature selection• Stepwise logistic regression is slow • Lasso in R offers fast alternative
glmnet package is very efficient
• Dimension of the DTM is a challenge these tools Estimation data has 7336 cases with 2659 word columns
• Baseline• Models already achieve in-sample residual deviance 3367
103
![Page 35: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/35.jpg)
Wharton Department of Statistics
Lasso Selection• Start with set of features from prior logistic regression
• Basic variables (alcohol, price, etc) • Proportions of top 40 words
• Fishbone plot• Coefs as reduce penalty λ • Trace each as λ –> 0 • Far right is logistic model
104analogous to ridge trace
![Page 36: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/36.jpg)
Wharton Department of Statistics
How many to use• Pick value of λ using cross validation
• 10-fold cross-validation • 10 splits of training data (not using held back test sample)
distinguish training from tuning from testing
105
number of variables in model
mininum sparse that isclose to minimum
Best model is not very sparse
Again find the “long tail” of signal in text
![Page 37: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/37.jpg)
Wharton Department of Statistics
Performance• Use sparse model within 1 SE of minimum
17 coefficients are zeroed out, leaving 31 estimates
• Similar to prior logistic regression, but with 17 fewer estimates• Not so well calibrated away from 0.5, our threshold • Confusion matrix provides matching results
106
LR Lassosensitivity 0.918 0.915specificity 0.893 0.891precision 0.929 0.928missclass 0.092 0.094
![Page 38: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/38.jpg)
Wharton Department of Statistics
Use More Words!• Cast a bigger net
• Try to use Lasso to pick from wider collection of words • Speed decreases
Initial fitting is fast, but picking λ by 10-fold CV slows the process
107
would like a progress indicator!
111 coefs
![Page 39: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/39.jpg)
Wharton Department of Statistics
What are the coefficients?• Use a word cloud, weighted by the estimates…
108
Nice to see the word
‘red’!
![Page 40: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/40.jpg)
Wharton Department of Statistics
How well did it work?• Comparison in the test set…
• Calibration getting far off target away from 0.5 • Logistic model no longer working
109
LR Lasso 40
200sens 0.918 0.915 0.982spec 0.893 0.891 0.987prec 0.929 0.928 0.991miss 0.092 0.094 0.016
Quite an improvement
![Page 41: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/41.jpg)
Wharton Department of Statistics
More?• Try with 500 words in model…
• Fitting remains fast, with CV slowing the process… but not that much.
• Similar confusion matrix
110
207
AUC ≈ 1
![Page 42: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/42.jpg)
Wharton Department of Statistics
But different words…• Similar fit, but many different words
• Collinearity becoming an issue
111
![Page 43: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/43.jpg)
Wharton Department of Statistics
Change Direction: Trees• Try a different type of model: a classification tree
Example with a few words
• Classify using majority vote
112
deviance in node
![Page 44: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/44.jpg)
Wharton Department of Statistics
Bigger Tree• Use 1000 words
• Fitting a tree is surprisingly fast • Shape conveys the value of certain words
113
![Page 45: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/45.jpg)
Wharton Department of Statistics
Some Details• Inspect the terminal nodes
114
![Page 46: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/46.jpg)
Wharton Department of Statistics
Better Tree-based Classifier• Prune tree
• Use cross-validation to remove nodes • Smaller tree often classifies better, avoiding overfitting • In this case, retains tree with 13 terminal nodes
115
![Page 47: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/47.jpg)
Wharton Department of Statistics
Boosted Trees• Smooth out the discontinuity of tree fits
• Number of distinct predictions = number of terminal nodes • Averaging over many small trees smooths predictions
116
![Page 48: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/48.jpg)
Wharton Department of Statistics
Boosted Results• Using 400 words
• Code is not so fast again as was the case with • Fitting process incorporates CV to control boosting process
That’s where code can die if a word appears in test, but not training Seems to happen in ‘bernoulli’ mode, but not for multinomial
• Fit as learning progresses
117
slower is better, but
slower is slower
![Page 49: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/49.jpg)
Wharton Department of Statistics
Boosted Performance• Using 400 words…
• Predictions range over [0,1]
• Much more competitive,but not up to level of the regression!
118
LR Lasso 40
200 BTs
ens0.918 0.915 0.982 0.969
spec 0.893 0.891 0.987 0.974prec 0.929 0.928 0.991 0.983miss 0.092 0.094 0.016 0.029
![Page 50: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/50.jpg)
Wharton Department of Statistics
Predicting Variety• Predicting wine variety
• Four-category response: cabernet, merlot, pinot, zinfandel • Smaller sample size • Much more similar in nature of descriptions
• Multinomial regression• Generalization of logistic regression to more than two groups • Trees generalize directly… just more labels
• Comparing models• Common test sample hidden from each method
119
![Page 51: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/51.jpg)
Wharton Department of Statistics
Varieties• Possible choices
• Choose top four categories of reds, 4,906 tasting notes
• Set aside validation cases, 250 for each variety• Limited by number of Zinfandels
• Build initial model using numerical features• Baseline for value of adding text • Inspect four linked models, one for each variety
120
![Page 52: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/52.jpg)
Wharton Department of Statistics
Fishbone Plots• Lasso paths for the component models
121
relevant effects vary over the
models
![Page 53: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/53.jpg)
Wharton Department of Statistics
Coefficients• At moderate shrinkage, very different estimates
evident for the different varieties• Need to choose optimal shrinkage • Relatively dense model with 7 estimates reduced to zero
122
![Page 54: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/54.jpg)
Wharton Department of Statistics
Calibration• Models for different varieties are not well calibrated
123
Merlot model is better calibrated, but also not very high probabilities
![Page 55: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/55.jpg)
Wharton Department of Statistics
Classification Results• Classifier accuracy… not very good
• 1000 test cases, 250 of each • Easy to get 25% correct without even trying! • Calls most things Cabernet
For example, it correctly identifies only 10 of the Pinots, labeling 230 Pinots as Cabernet.
124
# correct = 219+40+10+59 = 329
![Page 56: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/56.jpg)
Wharton Department of Statistics
Add Words• First 100 words
• Most common 100 word types • Many more “active” features in models
125
![Page 57: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/57.jpg)
Wharton Department of Statistics
Cross-Validate to Tune• Pick tuning parameter from 10-fold CV
126
![Page 58: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/58.jpg)
Wharton Department of Statistics
Key Words• At optimal choice for shrinkage parameter…
127
Cab Merlot Pinot Zin
Cherry -4 1 3 -1
Currant 14
Plum 8
Raspberry 10
Tannin/s -4
Pear 6
![Page 59: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/59.jpg)
Wharton Department of Statistics
Cloud View of Coefs• Scaled within each model
128
beware of warnings
![Page 60: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/60.jpg)
Wharton Department of Statistics
Classification• Much more accurate than baseline model
• Accuracy increases from 33% correct to 191 + 133 + 145 + 103 = 572 –> 57% correct
• Zinfandel is least accurate, plus fewest in training data • Still tend to classify too many as cabernet… which happens to be most common in the training data!
129
![Page 61: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/61.jpg)
Wharton Department of Statistics
Increase to 200 Words• Choice of shrinkage parameter very clear
• Evident trough indicating best choice for λ
130
![Page 62: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/62.jpg)
Wharton Department of Statistics
Coefficient Clouds• Several new terms not available to prior model
131
![Page 63: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/63.jpg)
Wharton Department of Statistics
Classification• Not much different from prior model (57% correct)
• with 100 words
• with 200 words
132
![Page 64: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/64.jpg)
Wharton Department of Statistics
Go Further?• Lots more words to try
• Tried with 400 words• Takes quite a bit longer to run, but works. Again clear trough • Some new word types appear… looks like we need to be more careful with preparing our data (next slide)
• Plus, have not explore the importance of combinations of words
• 2500 words –> 3,125,000 possible (though many would be 0) • Other features based on the words present
133
![Page 65: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/65.jpg)
Wharton Department of Statistics
New Words?• Surprise, surprise!
134
![Page 66: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/66.jpg)
Wharton Department of Statistics
Classification• No surprising either, this gets better
• Percent correct up from 57% to 64%
135
What about all of the other words that are available?
![Page 67: slides text ana - Statistics Departmentstine/mich/TA_slides_3.pdf · Test Model • Predict color of wines held back in the test sample 100 Red White sensitivity 0.918 specificity](https://reader033.vdocuments.us/reader033/viewer/2022060523/6052ed4de07cc95d8940df87/html5/thumbnails/67.jpg)
Wharton Department of Statistics
Results for Trees• Resemble those obtained from multinomial
regression…
• See the associated commands in the R script.
136