

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 5a, February 24, 2015

Weighted kNN, ~ clustering, trees and Bayesian

classification

Plot tools / tips:

http://statmethods.net/advgraphs/layout.html

http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/

pairs, gpairs, scatterplot.matrix, clustergram, etc.

data()

# precip, presidents, iris, swiss, sunspot.month (!), environmental, ethanol, ionosphere

More script fragments in R will be available on the web site (http://escience.rpi.edu/data/DA )


Weighted kNN?

require(kknn)

data(iris)

m <- dim(iris)[1]

val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))

iris.learn <- iris[-val,]

iris.valid <- iris[val,]

iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "triangular")

summary(iris.kknn)

fit <- fitted(iris.kknn)

table(iris.valid$Species, fit)

pcol <- as.character(as.numeric(iris.valid$Species))

pairs(iris.valid[1:4], pch = pcol, col = c("green3", "red")[(iris.valid$Species != fit)+1])
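What kknn() does with the triangular kernel can be demystified with a small base-R sketch of kernel-weighted voting. The function `weighted_knn_predict` and the choice of k below are hypothetical, for illustration only; they mirror the slide's settings (Manhattan distance, i.e. distance = 1, and a triangular kernel), not the package's internals.

```r
# Minimal kernel-weighted kNN sketch (hypothetical helper, not part of kknn)
weighted_knn_predict <- function(train_x, train_y, query, k = 7) {
  # Manhattan distance from the query to every training row (distance = 1)
  d <- apply(train_x, 1, function(row) sum(abs(row - query)))
  ord <- order(d)[1:(k + 1)]            # k neighbours plus one for scaling
  dk <- d[ord][k + 1]                   # distance to the (k+1)-th neighbour
  w <- pmax(1 - d[ord][1:k] / dk, 0)    # triangular kernel weights in [0, 1]
  votes <- tapply(w, train_y[ord][1:k], sum)  # weighted vote per class
  names(which.max(votes))
}

data(iris)
# Predict the first iris observation from the remaining 149
pred <- weighted_knn_predict(iris[-1, 1:4], iris$Species[-1], unlist(iris[1, 1:4]))
pred  # "setosa"
```

Closer neighbours get larger weights, so a distant neighbour of a different class cannot outvote a near one, which is the point of weighting.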


Look at Lab5b_wknn_2015.R

ctree

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

> print(iris_ctree)

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46

plot(iris_ctree)


Try Lab6b_5_2014.R

> plot(iris_ctree, type="simple") # try this

Swiss - pairs


pairs(~ Fertility + Education + Catholic, data = swiss, subset = Education < 20, main = "Swiss data, Education < 20")

New dataset - ionosphere

require(kknn)

data(ionosphere)

ionosphere.learn <- ionosphere[1:200,]

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

# vary kernel

(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

# alter distance

(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

Results

ionosphere.learn <- ionosphere[1:200,]

# convenience sampling!!!!

ionosphere.valid <- ionosphere[-c(1:200),]

fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)

table(ionosphere.valid$class, fit.kknn$fit)

    b   g
b  19   8
g   2 122
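Accuracy can be read straight off a confusion table like the one above. A quick base-R check, using the counts copied from the slide (19 + 8 + 2 + 122 = 151 held-out rows):

```r
# Counts copied from the confusion table above
tab <- matrix(c(19, 2, 8, 122), nrow = 2,
              dimnames = list(actual = c("b", "g"), predicted = c("b", "g")))
accuracy <- sum(diag(tab)) / sum(tab)  # correct predictions on the diagonal
round(accuracy, 3)  # 0.934
```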

(fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))

Call:

train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)

    b   g
b  25   4
g   2 120

(fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))

Call:

train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 2, kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))

Type of response variable: nominal

Minimal misclassification: 0.12

Best kernel: rectangular

Best k: 2

table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)

    b   g
b  20   5
g   7 119

However… there is more


Bayes

> cl <- kmeans(iris[,1:4], 3)

> table(cl$cluster, iris[,5])

    setosa versicolor virginica
  2      0          2        36
  1      0         48        14
  3     50          0         0

# naiveBayes() is in the e1071 package
> m <- naiveBayes(iris[,1:4], iris[,5])

> table(predict(m, iris[,1:4]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
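The naiveBayes confusion table above implies a 96% resubstitution accuracy, which is quick to verify:

```r
# Diagonal counts copied from the table above: 50 + 47 + 47 correct out of 150
correct <- 50 + 47 + 47
correct / 150  # 0.96
```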

pairs(iris[1:4],main="Iris Data (red=setosa,green=versicolor,blue=virginica)", pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Using a contingency table

> data(Titanic)

> mdl <- naiveBayes(Survived ~ ., data = Titanic)

> mdl


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

Using a contingency table

> predict(mdl, as.data.frame(Titanic)[,1:3])

[1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No

[26] No No No Yes Yes Yes Yes

Levels: No Yes


Naïve Bayes – what is it?

• Example: testing for a specific item of knowledge that 1% of the population has been informed of (don't ask how).

• An imperfect test:
  – 99% of knowledgeable people test positive
  – 99% of ignorant people test negative

• If a person tests positive – what is the probability that they know the fact?


Naïve approach…

• We have 10,000 representative people
• 100 know the fact/item, 9,900 do not
• We test them all:
  – Get 99 knowing people testing knowing
  – Get 99 not knowing people testing not knowing
  – But 99 not knowing people testing as knowing

• Testing positive (knowing) – equally likely to know or not = 50%


Tree diagram

10,000 ppl
├─ 1% know (100 ppl)
│   ├─ 99% test to know (99 ppl)
│   └─ 1% test not to know (1 person)
└─ 99% do not know (9,900 ppl)
    ├─ 1% test to know (99 ppl)
    └─ 99% test not to know (9,801 ppl)

Relation between probabilities

• For outcomes x and y there are probabilities p(x) and p(y) that either happened

• If there's a connection, the joint probability – both happen – is p(x,y)

• If x happens given y happens – p(x|y) – or vice versa, then:
  – p(x|y)*p(y) = p(x,y) = p(y|x)*p(x)

• So p(y|x) = p(x|y)*p(y)/p(x) (Bayes' Law)

• E.g. p(know|+ve) = p(+ve|know)*p(know)/p(+ve) = (.99*.01)/(.99*.01 + .01*.99) = 0.5
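The worked example on this slide is easy to check numerically in R; the variable names below are just for illustration:

```r
# Bayes' Law applied to the knowledge-test example above
p_know <- 0.01          # 1% of the population knows the fact
p_pos_know <- 0.99      # knowledgeable people test positive 99% of the time
p_pos_ignorant <- 0.01  # ignorant people test positive 1% of the time

# Total probability of a positive test, then invert with Bayes' Law
p_pos <- p_pos_know * p_know + p_pos_ignorant * (1 - p_know)
p_know_pos <- p_pos_know * p_know / p_pos
p_know_pos  # 0.5
```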


How do you use it?

• If the population contains x, what is the chance that y is true?

• p(SPAM|word) = p(word|SPAM)*p(SPAM)/p(word)

• Base this on data:
  – p(spam) counts proportion of spam versus not
  – p(word|spam) counts prevalence of spam containing the 'word'
  – p(word|!spam) counts prevalence of non-spam containing the 'word'
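The spam formula above can be sketched with counts. The numbers below (message counts and the word "offer") are entirely hypothetical, chosen only to exercise the arithmetic:

```r
# Hypothetical counts: 40 of 100 messages are spam;
# "offer" appears in 30 spam and 5 non-spam messages
n_spam <- 40; n_ham <- 60
word_in_spam <- 30; word_in_ham <- 5

p_spam      <- n_spam / (n_spam + n_ham)   # p(spam)
p_word_spam <- word_in_spam / n_spam       # p(word | spam)
p_word_ham  <- word_in_ham / n_ham         # p(word | !spam)

# p(word) by total probability, then Bayes' Law
p_word <- p_word_spam * p_spam + p_word_ham * (1 - p_spam)
p_spam_word <- p_word_spam * p_spam / p_word
round(p_spam_word, 3)  # 0.857
```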

Or…

• What is the probability that you are in one class (i) over another class (j), given another factor (X)?

• Invoke Bayes:

• Maximize p(X|Ci)p(Ci)/p(X) (p(X) is ~constant, and the p(Ci) are taken as equal if not known)

• So, assuming conditional independence of the attributes xk within each class:
  p(X|Ci) = p(x1|Ci) * p(x2|Ci) * … * p(xn|Ci)

• P(xk | Ci) is estimated from the training samples
  – Categorical: estimate P(xk | Ci) as the percentage of samples of class i with value xk
    • Training involves counting the percentage of occurrence of each possible value for each class
  – Numeric: the actual form of the density function is generally not known, so a "normal" density is often assumed

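The "normal density" assumption for numeric attributes can be sketched in a few lines of base R: estimate a per-class mean and sd for one attribute, then look up a Gaussian density for a new value. The query value 1.5 below is a hypothetical observation chosen for illustration; the means match the dnorm parameters used on the next slide.

```r
# Per-class Gaussian estimate of P(xk | Ci) for Petal.Length
data(iris)
mu <- tapply(iris$Petal.Length, iris$Species, mean)  # 1.462, 4.260, 5.552
s  <- tapply(iris$Petal.Length, iris$Species, sd)

# Density of a hypothetical Petal.Length = 1.5 under each class model
dens <- sapply(levels(iris$Species), function(sp) dnorm(1.5, mu[[sp]], s[[sp]]))
names(which.max(dens))  # "setosa"
```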

Digging into iris

classifier <- naiveBayes(iris[,1:4], iris[,5])

table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))

classifier$apriori

classifier$tables$Petal.Length

plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")

curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")

curve(dnorm(x, 5.552, 0.5518947), add=TRUE, col="green")


Decision tree (example)

> require(party) # don't get me started!

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)


plot(iris_ctree)


Try Lab6b_5_2014.R

> plot(iris_ctree, type="simple") # try this

Beyond plot: pairs

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])


Try Lab6b_2_2014.R - USJudgeRatings

Try hclust for iris


require(gpairs) # gpairs package
gpairs(iris)


Try Lab6b_3_2014.R

Better scatterplots


install.packages("car")

require(car)

scatterplotMatrix(iris)

Try Lab6b_4_2014.R

splom(iris) # default


Try Lab6b_7_2014.R

splom extra!

require(lattice)

super.sym <- trellis.par.get("superpose.symbol")

splom(~iris[1:4], groups = Species, data = iris,

panel = panel.superpose,

key = list(title = "Three Varieties of Iris",

columns = 3,

points = list(pch = super.sym$pch[1:3],

col = super.sym$col[1:3]),

text = list(c("Setosa", "Versicolor", "Virginica"))))

splom(~iris[1:3]|Species, data = iris,

layout=c(2,2), pscales = 0,

varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),

page = function(...) {

ltext(x = seq(.6, .8, length.out = 4),

y = seq(.9, .6, length.out = 4),

labels = c("Three", "Varieties", "of", "Iris"),

cex = 2)

})

parallelplot(~iris[1:4] | Species, iris)

parallelplot(~iris[1:4], iris, groups = Species,

horizontal.axis = FALSE, scales = list(x = list(rot = 90)))



Using a contingency table

> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl

Try Lab6b_9_2014.R

http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html

require(mlbench)

data(HouseVotes84)

model <- naiveBayes(Class ~ ., data = HouseVotes84)

predict(model, HouseVotes84[1:10,-1])

predict(model, HouseVotes84[1:10,-1], type = "raw")

pred <- predict(model, HouseVotes84[,-1])

table(pred, HouseVotes84$Class)

Exercise for you

> data(HairEyeColor)

> mosaicplot(HairEyeColor)

> margin.table(HairEyeColor, 3)
Sex
  Male Female
   279    313

> margin.table(HairEyeColor, c(1,3))
       Sex
Hair    Male Female
  Black   56     52
  Brown  143    143
  Red     34     37
  Blond   46     81

How would you construct a naïve Bayes classifier and test it?
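One possible starting point for the exercise (a hint, not the full answer): HairEyeColor is a 3-way contingency table, so expand it into a data frame with a Freq column, the same shape naiveBayes() consumed in the Titanic example above.

```r
# Expand the contingency table: one row per Hair/Eye/Sex combination
data(HairEyeColor)
hec <- as.data.frame(HairEyeColor)
head(hec, 2)
sum(hec$Freq)  # 592 people in total (279 male + 313 female)
```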

Hierarchical clustering

> d <- dist(as.matrix(mtcars))

> hc <- hclust(d)

> plot(hc)
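A natural follow-up to the dendrogram (not shown on the slide) is to cut it into a fixed number of groups with cutree(); k = 3 below is an arbitrary choice for illustration.

```r
# Repeat the clustering from the slide, then cut the tree into 3 groups
d <- dist(as.matrix(mtcars))
hc <- hclust(d)
groups <- cutree(hc, k = 3)
table(groups)  # cluster sizes for the 32 cars
```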


ctree


require(party)

swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)

plot(swiss_ctree)

Hierarchical clustering


> dswiss <- dist(as.matrix(swiss))

> hs <- hclust(dswiss)

> plot(hs)

scatterplotMatrix


require(lattice); splom(swiss)


At this point…

• You may realize the inter-relation among classification at an absolute and relative level (i.e. hierarchical -> trees…)
  – Trees are interesting from a decision perspective: if this or that, then this….

• Beyond just distance measures (kmeans) to probabilities (Bayesian)

• So many ways to visualize them…
