
Page 1:

LINEAR CLASSIFICATION METHODS

STAT 597 E

Fengjuan Xuan, Caimiao Wei, Bogdan Ilie

Page 2:

Introduction

• The observations in the dataset we work with ("BUPA liver disorders") were collected by BUPA Medical Research Ltd. They consist of 7 variables measured on 345 subjects, all single male individuals. The first 5 variables are blood-test measurements thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The sixth variable records daily alcohol consumption. The seventh variable is a selector used to split the dataset into two sets and indicates class identity: 145 subjects belong to the liver-disorder group (selector value 2) and 200 to the liver-normal group.

Page 3:

Description of variables

• The description of each variable is below:
• 1. mcv: mean corpuscular volume
• 2. alkphos: alkaline phosphatase
• 3. sgpt: alanine aminotransferase
• 4. sgot: aspartate aminotransferase
• 5. gammagt: gamma-glutamyl transpeptidase
• 6. drinks: number of half-pint equivalents of alcoholic beverages drunk per day
• 7. selector: field used to split the data into two sets. It is a binary categorical variable with indicators 1 and 2 (2 corresponding to liver disorder).
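A minimal R sketch of how the data could be read in (the file name bupa.data and the comma-separated column layout are assumptions, not stated on the slides):

# Hypothetical loading sketch: assumes a comma-separated file "bupa.data"
# in the working directory, with columns in the order described above.
liver <- read.csv("bupa.data", header = FALSE,
                  col.names = c("mcv", "alkphos", "sgpt", "sgot",
                                "gammagt", "drinks", "selector"))
liver$selector <- factor(liver$selector)   # 1 = liver-normal, 2 = liver-disorder
str(liver)                                 # 345 observations of 7 variables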

Page 4:

Matrix Plot of the variables

Page 5:

Logistic regression in full space

• Coefficients:

              Value        Std. Error    t value
(Intercept)    5.99024204  2.684250011    2.231626
mcv           -0.06398345  0.029631551   -2.159301
alk           -0.01952510  0.006756806   -2.889694
sgpt          -0.06410562  0.012283808   -5.218709
sgot           0.12319769  0.024254150    5.079448
gammagt        0.01894688  0.005589619    3.389656
drinks        -0.06807958  0.040358528   -1.686870

• So the classification rule is:

G(x) = 2  if  β1 + β2·X1 + β3·X2 + β4·X3 + β5·X4 + β6·X5 + β7·X6 ≥ 0
G(x) = 1  otherwise,

where X1, …, X6 are the six predictors (mcv, alk, sgpt, sgot, gammagt, drinks) and β1, …, β7 are the fitted coefficients above.
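A minimal R sketch of how such a fit could be obtained (the data frame liver from the loading sketch and the 0/1 recoding of the response are assumptions):

# Hypothetical sketch: full-space logistic regression.
# The response is recoded so that 1 = liver disorder (selector 2), 0 = normal.
liver$y <- as.numeric(liver$selector == "2")
fit.full <- glm(y ~ mcv + alkphos + sgpt + sgot + gammagt + drinks,
                family = binomial, data = liver)
summary(fit.full)                        # coefficients, standard errors, z values
pred.class <- ifelse(predict(fit.full, type = "link") >= 0, 2, 1)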

Page 6:

Classification error rate

• The classification error on the whole training data set:

• error rate: 0.2956
• Sensitivity: 0.825
• Specificity: 0.5379

The error rate and its standard error obtained by 10-fold cross validation:

• error rate (Standard Error): 0.3075 (0.0271)
• Sensitivity (Standard Error): 0.8163 (0.0203)
• Specificity (Standard Error): 0.5311 (0.0699)
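A sketch of how the 10-fold cross-validated rates could be computed (the fold assignment and the 0.5 cut-off are illustrative assumptions):

# Hypothetical 10-fold CV sketch for error rate, sensitivity and specificity.
set.seed(597)
folds <- sample(rep(1:10, length.out = nrow(liver)))
err <- sens <- spec <- numeric(10)
for (k in 1:10) {
  train <- liver[folds != k, ]
  test  <- liver[folds == k, ]
  fit   <- glm(y ~ mcv + alkphos + sgpt + sgot + gammagt + drinks,
               family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  yhat  <- as.numeric(p >= 0.5)              # predicted class (1 = disorder)
  err[k]  <- mean(yhat != test$y)
  sens[k] <- mean(yhat[test$y == 1] == 1)    # true positive rate
  spec[k] <- mean(yhat[test$y == 0] == 0)    # true negative rate
}
c(error = mean(err), se = sd(err) / sqrt(10))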

Page 7:

Backward stepwise model selection based on AIC

• Five variables are selected after stepwise model selection; the first variable, mcv, is deleted.

• error rate (Standard Error): 0.3295 (0.0305)
• Sensitivity (Standard Error): 0.7921 (0.0343)
• Specificity (Standard Error): 0.5073 (0.0386)

• COMMENT:
• This method has a larger classification error rate than the original one. Using stepwise selection doesn't improve the classification.
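A minimal sketch of the backward AIC search (reusing the fit.full object from the earlier sketch):

# Hypothetical sketch: backward stepwise selection by AIC from the full model.
fit.step <- step(fit.full, direction = "backward", trace = FALSE)
formula(fit.step)    # mcv is expected to be dropped, leaving five predictors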

Page 8:

Scree plot for the PCA

[Scree plot of the principal component variances (prin.liv), Comp. 1 through Comp. 6; the cumulative proportions of variance explained are 0.713, 0.851, 0.972, 0.988, 0.996 and 1.]
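A sketch of how the PCA and the scree plot could be produced (the object name prin.liv follows the axis label in the figure; the exact call is an assumption):

# Hypothetical PCA sketch on the six predictors (response excluded).
prin.liv <- princomp(liver[, c("mcv", "alkphos", "sgpt", "sgot",
                               "gammagt", "drinks")])
screeplot(prin.liv)                              # variances of Comp.1 ... Comp.6
cumsum(prin.liv$sdev^2) / sum(prin.liv$sdev^2)   # cumulative proportion explained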

Page 9:

The performance of the logistic regression on the reduced space

• The reduced space is obtained by selecting the first three principal components. The standard errors are obtained by 10-fold cross validation.

• error rate (Standard Error): 0.4563 (0.0234)
• Sensitivity (Standard Error): 0.3729 (0.0317)
• Specificity (Standard Error): 0.7830 (0.0308)

• Comment:
• The classification error rate is about 46%, which is not much better than random guessing.
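A sketch of the reduced-space fit (reusing the prin.liv scores and the 0/1 response y from the earlier sketches):

# Hypothetical sketch: logistic regression on the first three principal components.
pcs <- as.data.frame(prin.liv$scores[, 1:3])
names(pcs) <- c("pc1", "pc2", "pc3")
pcs$y <- liver$y
fit.pca <- glm(y ~ pc1 + pc2 + pc3, family = binomial, data = pcs)
summary(fit.pca)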

Page 10:

[Scatter plot of the observations on the plane of the first two principal components (axes labelled x1 and x2), marked by class.]

The classification plot on the first two principal components plane

Page 11:

Linear Discriminant Analysis

• LDA assumes a multivariate normal distribution, so we apply log transformations to some of the variables (see the sketch below).

• Y1 = mcv & Y2 = log(alk)
• Y3 = log(sgpt) & Y4 = log(sgot)
• Y5 = log(gammagt) & Y6 = log(drinks + 1)
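A minimal sketch of the transformation and the LDA fit (MASS::lda is assumed; the data frame name newliver follows the axis label on the next slide):

# Hypothetical sketch: log-transform the skewed variables and fit LDA.
library(MASS)
newliver <- with(liver, data.frame(
  y1 = mcv,
  y2 = log(alkphos),
  y3 = log(sgpt),
  y4 = log(sgot),
  y5 = log(gammagt),
  y6 = log(drinks + 1),        # +1 guards against log(0) for non-drinkers
  selector = selector))
fit.lda <- lda(selector ~ y1 + y2 + y3 + y4 + y5 + y6, data = newliver)
table(predict(fit.lda)$class, newliver$selector)   # training confusion matrix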

Page 12:

[Histograms of the original sgpt variable (liver$sgpt) and of its log transformation (newliver$sgpt).]

The histogram of the sgpt variable and its log transformation

Page 13:

The performance of the LDA based on transformed data

• Comment: the classification error is the smallest among all methods, and the sensitivity is the largest.

• error rate: 0.2638
• Sensitivity: 0.865
• Specificity: 0.5586

• The log transformations make the assumption of multivariate normality more reasonable, so the classification improves.

Page 14:

LDA after PCA

• error rate: 0.4116
• Sensitivity: 0.88
• Specificity: 0.1862

• Comment: the performance is not improved by PCA.
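A sketch of this variant (reusing prin.liv and MASS::lda from the earlier sketches; taking three components mirrors the reduced-space logistic fit and is an assumption):

# Hypothetical sketch: LDA on the first three principal component scores.
pcs.lda <- data.frame(prin.liv$scores[, 1:3], selector = liver$selector)
fit.lda.pca <- lda(selector ~ ., data = pcs.lda)
mean(predict(fit.lda.pca)$class != pcs.lda$selector)   # training error rate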

Page 15:

Conclusion

• Four different methods are applied to the liver disorder data set. The LDA based on the transformed variables works best, and the logistic regression on the original data comes second.

• The classification methods based on the principal components do not work well. Although the first three principal components contain more than 97% of the variation, we may still lose the information that is most important for classification.

• Transformations can make the LDA method work better in some cases. LDA assumes a multivariate normal distribution, which is a strong assumption for many data sets. In our data, for example, all variables except the first one are seriously skewed; that is why the log transform works.