Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 5a, February 24, 2015: Weighted kNN, ~ clustering, trees and Bayesian classification


Page 1:


Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 5a, February 24, 2015

Weighted kNN, ~ clustering, trees and Bayesian classification

Page 2:

Plot tools / tips: http://statmethods.net/advgraphs/layout.html

http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/

pairs, gpairs, scatterplot.matrix, clustergram, etc.

data()

# precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere

More R script fragments will be available on the web site (http://escience.rpi.edu/data/DA).
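Not all of the datasets in that comment ship with base R; a quick sketch of where they live (package locations assumed from standard distributions):

# precip, presidents, iris, swiss, sunspot.month come with base R (datasets)
require(lattice)  # environmental, ethanol
require(kknn)     # ionosphere
data(environmental); data(ethanol); data(ionosphere)
str(environmental) # quick look at one of them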


Page 3:

Weighted kNN?

require(kknn)

data(iris)

m <- dim(iris)[1]

val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))

iris.learn <- iris[-val,]

iris.valid <- iris[val,]

iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "triangular")

summary(iris.kknn)

fit <- fitted(iris.kknn)

table(iris.valid$Species, fit)

pcol <- as.character(as.numeric(iris.valid$Species))

pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit) + 1])
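One extra line worth running (not on the slide): turn the fitted values into a validation accuracy.

# fraction of validation rows classified correctly
sum(fit == iris.valid$Species) / length(fit)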


Page 4:


Look at Lab5b_wknn_2015.R

Page 5:

Ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species

Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46
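A quick sanity check after printing the tree (a sketch, not on the slide): compare the tree's in-sample predictions with the actual species.

# confusion matrix for the tree on its own training data
table(predict(iris_ctree), iris$Species)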

Page 6:

plot(iris_ctree)


Try Lab6b_5_2014.R

> plot(iris_ctree, type="simple") # try this

Page 7:

Swiss - pairs


pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

Page 8:

New dataset – ionosphere

require(kknn)

data(ionosphere)

ionosphere.learn <- ionosphere[1:200,]

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

# vary kernel

(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

#alter distance

(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

Page 9:

Results

ionosphere.learn <- ionosphere[1:200,]

# convenience sampling!!!!

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

     b   g
  b 19   8
  g  2 122
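Given the convenience-sampling warning above, here is a minimal sketch of a random split instead (assumed code, not from the slide):

# random split rather than taking the first 200 rows
set.seed(42) # arbitrary seed, for reproducibility
idx <- sample(nrow(ionosphere), 200)
ionosphere.learn <- ionosphere[idx, ]
ionosphere.valid <- ionosphere[-idx, ]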

Page 10:

(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+   kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

Call:

train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

     b   g
  b 25   4
  g  2 120

Page 11:

(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
+   kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

Call:

train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 2, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

     b   g
  b 20   5
  g  7 119

Page 12:

However… there is more


Page 13:

Bayes

> cl <- kmeans(iris[,1:4], 3)

> table(cl$cluster, iris[,5])

    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0

#

> require(e1071) # naiveBayes() lives in the e1071 package
> m <- naiveBayes(iris[,1:4], iris[,5])

> table(predict(m, iris[,1:4]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Page 14:

Using a contingency table

> data(Titanic)

> mdl <- naiveBayes(Survived ~ ., data = Titanic)

> mdl


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

Page 15:

Using a contingency table

> predict(mdl, as.data.frame(Titanic)[,1:3])

[1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No

[26] No No No Yes Yes Yes Yes

Levels: No Yes
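To see the posterior probabilities behind those class labels (not shown on this slide), predict also accepts type = "raw":

> predict(mdl, as.data.frame(Titanic)[,1:3], type = "raw") # per-class probabilities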


Page 16:

Naïve Bayes – what is it?

• Example: testing for a specific item of knowledge that 1% of the population has been informed of (don't ask how).
• An imperfect test:
  – 99% of knowledgeable people test positive
  – 99% of ignorant people test negative
• If a person tests positive – what is the probability that they know the fact?


Page 17:

Naïve approach…

• We have 10,000 representative people
• 100 know the fact/item, 9,900 do not
• We test them all:
  – 99 knowing people test as knowing
  – 9,801 not-knowing people test as not knowing
  – But 99 not-knowing people test as knowing
• Testing positive (knowing) – equally likely to know or not = 50%


Page 18:

Tree diagram

10,000 ppl
  1% know (100 ppl)
    99% test as knowing (99 ppl)
    1% test as not knowing (1 person)
  99% do not know (9,900 ppl)
    1% test as knowing (99 ppl)
    99% test as not knowing (9,801 ppl)

Page 19:

Relation between probabilities

• For outcomes x and y there are probabilities p(x) and p(y) that either happened
• If there's a connection, then the joint probability – both happen – is p(x,y)
• Or x happens given y happens: p(x|y), or vice versa; then:
  – p(x|y)*p(y) = p(x,y) = p(y|x)*p(x)
• So p(y|x) = p(x|y)*p(y)/p(x) (Bayes' Law)
• E.g. p(know|+ve) = p(+ve|know)*p(know)/p(+ve) = (.99*.01)/(.99*.01 + .01*.99) = 0.5


Page 20:

How do you use it?

• If the population contains x, what is the chance that y is true?
• p(SPAM|word) = p(word|SPAM)*p(SPAM)/p(word)
• Base this on data:
  – p(spam) counts the proportion of spam versus not
  – p(word|spam) counts the prevalence of spam containing the 'word'
  – p(word|!spam) counts the prevalence of non-spam containing the 'word'
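A tiny worked sketch of that spam calculation in R (the numbers here are made up purely for illustration):

# toy counts: 40% of mail is spam; the word appears in 60% of spam, 5% of non-spam
p.spam <- 0.4
p.word.spam    <- 0.60 # p(word|SPAM)
p.word.notspam <- 0.05 # p(word|!SPAM)
p.word <- p.word.spam * p.spam + p.word.notspam * (1 - p.spam)
p.word.spam * p.spam / p.word # p(SPAM|word), about 0.89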

Page 21:

Or…

• What is the probability that you are in one class (i) rather than another class (j), given another factor (X)?
• Invoke Bayes: maximize p(X|Ci)p(Ci)/p(X) (p(X) is ~constant, and the p(Ci) are taken as equal if not known)
• So, assuming conditional independence of the attributes xk: p(X|Ci) = p(x1|Ci) * p(x2|Ci) * … * p(xn|Ci)

Page 22:

• P(xk | Ci) is estimated from the training samples
  – Categorical: estimate P(xk | Ci) as the percentage of samples of class i with value xk
    • Training involves counting the percentage of occurrence of each possible value for each class
  – Numeric: the actual form of the density function is generally not known, so a "normal" density is often assumed


Page 23:

Digging into iris

classifier <- naiveBayes(iris[,1:4], iris[,5])

table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))

classifier$apriori

classifier$tables$Petal.Length

plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")

curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")

curve(dnorm(x, 5.552, 0.5518947), add=TRUE, col = "green")

Page 24:

(Plot: the three petal-length density curves produced by the code on page 23.)

Page 25:

Decision tree (example)

> require(party) # don't get me started!

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)


Page 26:

plot(iris_ctree)


Try Lab6b_5_2014.R

> plot(iris_ctree, type="simple") # try this

Page 27:

Beyond plot: pairs

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])


Try Lab6b_2_2014.R - USJudgeRatings

Page 28:

Try hclust for iris
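A minimal sketch of one way to do it (it mirrors the mtcars example later in the deck; only the numeric columns go into dist):

di <- dist(as.matrix(iris[, 1:4])) # drop the Species factor
hi <- hclust(di)
plot(hi, labels = iris$Species, cex = 0.5)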


Page 29:

require(gpairs) # gpairs() comes from the gpairs package
gpairs(iris)


Try Lab6b_3_2014.R

Page 30:

Better scatterplots


install.packages("car")

require(car)

scatterplotMatrix(iris)

Try Lab6b_4_2014.R

Page 31:

require(lattice) # splom() comes from lattice
splom(iris) # default


Try Lab6b_7_2014.R

Page 32:

splom extra!

require(lattice)

super.sym <- trellis.par.get("superpose.symbol")

splom(~iris[1:4], groups = Species, data = iris,

panel = panel.superpose,

key = list(title = "Three Varieties of Iris",

columns = 3,

points = list(pch = super.sym$pch[1:3],

col = super.sym$col[1:3]),

text = list(c("Setosa", "Versicolor", "Virginica"))))

splom(~iris[1:3]|Species, data = iris,

layout=c(2,2), pscales = 0,

varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),

page = function(...) {

ltext(x = seq(.6, .8, length.out = 4),

y = seq(.9, .6, length.out = 4),

labels = c("Three", "Varieties", "of", "Iris"),

cex = 2)

})

parallelplot(~iris[1:4] | Species, iris)

parallelplot(~iris[1:4], iris, groups = Species,

horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

Try Lab6b_7_2014.R


Page 33:


Page 34:


Page 35:

Using a contingency table

> data(Titanic)

> mdl <- naiveBayes(Survived ~ ., data = Titanic)

> mdl


(Same "Naive Bayes Classifier for Discrete Predictors" model output as page 14.)

Try Lab6b_9_2014.R

Page 36:

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html

require(mlbench)

data(HouseVotes84)

model <- naiveBayes(Class ~ ., data = HouseVotes84)

predict(model, HouseVotes84[1:10,-1])

predict(model, HouseVotes84[1:10,-1], type = "raw")

pred <- predict(model, HouseVotes84[,-1])

table(pred, HouseVotes84$Class)
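A one-liner follow-up (not on the slide) to turn that table into an accuracy figure:

mean(pred == HouseVotes84$Class) # overall accuracy on the training data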

Page 37:

Exercise for you

> data(HairEyeColor)

> mosaicplot(HairEyeColor)

> margin.table(HairEyeColor, 3)
Sex
  Male Female
   279    313

> margin.table(HairEyeColor, c(1,3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81

How would you construct a naïve Bayes classifier and test it?
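One possible sketch (it is an exercise, so treat this as a starting point rather than the answer; it mirrors the Titanic contingency-table usage on page 14):

require(e1071)
mhec <- naiveBayes(Sex ~ ., data = HairEyeColor) # table method, as with Titanic
mhec # inspect a-priori and conditional probabilities
predict(mhec, as.data.frame(HairEyeColor)[, 1:2]) # predicted Sex for each Hair/Eye combination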

Page 38:

Hierarchical clustering> d <- dist(as.matrix(mtcars))

> hc <- hclust(d)

> plot(hc)
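To actually pull cluster assignments out of the dendrogram (not on the slide), cutree is the usual next step:

> groups <- cutree(hc, k = 3) # cut the tree into 3 clusters
> table(groups)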


Page 39:

ctree


require(party)

swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)

plot(swiss_ctree)

Page 40:

Hierarchical clustering


> dswiss <- dist(as.matrix(swiss))

> hs <- hclust(dswiss)

> plot(hs)

Page 41:

scatterplotMatrix


Page 42:

require(lattice); splom(swiss)


Page 43:


Page 44:


Page 45:

At this point…

• You may realize the inter-relation among classification at an absolute and relative level (i.e. hierarchical -> trees…)
  – Trees are interesting from a decision perspective: if this or that, then this….
• Beyond just distance measures (kmeans) to probabilities (Bayesian)
• So many ways to visualize them…