
International Journal of Advances in Engineering & Technology, May 2013. IJAET, ISSN: 2231-1963, Vol. 6, Issue 2, pp. 573-582

EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS

    Nada Badr, Noureldien A. Noureldien

    Department of Computer Science

    University of Science and Technology, Omdurman, Sudan

    ABSTRACT

Intrusion detection has attracted the attention of both commercial institutions and the academic research community. In this paper, PCA (Principal Components Analysis) is used as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of network traffic. PCA is sensitive to outliers because it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The experimental results show that PCA generates a high number of false alarms due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method undergoes.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is to say, PCA analyzes data sets that represent observations described by several inter-correlated dependent variables. PCA is one of the best known and most widely used multivariate exploratory analysis techniques [5].

Several robust competitors to the classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of PCA's sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find the h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the estimate of scatter is their covariance matrix. Another robust method for principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized.

In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP, by applying PCA to the Abilene dataset and comparing its outlier detection performance to that of MCD and PP.

The rest of this paper is organized as follows. Section 2 is an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. The experimental results are shown in Section 5, and conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of studies have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced in [13].


III. CLASSICAL PCA

    3.1 PCA Advantages

The common advantages of PCA are:

    3.1.1 Exploratory Data Analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected onto factorial planes spanned by pairs of principal components chosen among the first ones (that is, the most significant ones). From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data).

According to most studies [8][11], PCA detects two types of outliers: type 1 outliers, which inflate variance and are detected by the major PCs, and type 2 outliers, which violate the data structure and are detected by the minor PCs.

    3.1.2 Data Reduction Technique

All multivariate techniques are prone to the bias-variance tradeoff, which dictates that the number of variables entering a model should be severely restricted. Data is often described by many more variables than are necessary for building the best model. PCA improves on other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

    3.1.3 Low Computational Requirement

PCA needs little computational effort, since its algorithm consists of simple calculations.

    3.2 PCA Disadvantages

It may be noted that PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation and that most of the information is contained in those directions where the variance of the input data is maximal. As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task), as the sketch below illustrates. From the above, the following disadvantages of PCA can be concluded.
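As a minimal numerical sketch of the hypersphere example (ours, not part of the original paper), the following Python snippet samples points uniformly on the unit sphere in R^3 and shows that the variances along all three principal components are nearly equal, so no linear projection can be dropped:

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample 1000 points uniformly on the unit sphere: normalizing
    # Gaussian vectors gives uniformly distributed directions.
    X = rng.normal(size=(1000, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # Eigenvalues of the covariance matrix = variances along the PCs.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    print(eigvals / eigvals.sum())   # all three ratios close to 1/3

Since the variance is spread almost evenly over all directions, PCA finds no low-variance component to discard.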

3.2.1 Dependence on Linear Algebra

PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than simple linear combinations of the original variables, would describe the data better.

3.2.2 Smallest Principal Components Receive Little Attention in Statistical Techniques

This lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance in the data, the smallest principal components only contain the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection, as the toy example below illustrates.
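As a toy illustration of this point (our sketch; the data are synthetic, not from the paper's experiments), consider observations lying near the line y = x plus one observation that violates that structure. Its magnitude is unremarkable, so the major PC barely notices it, but it dominates the minor PC score:

    import numpy as np

    rng = np.random.default_rng(5)

    # 200 points near the line y = x, plus one structure-violating point.
    t = rng.normal(size=200)
    X = np.column_stack([t, t + 0.05 * rng.normal(size=200)])
    X = np.vstack([X, [[1.0, -1.0]]])          # the injected type-2 outlier

    Y = X - X.mean(axis=0)
    eigvals, E = np.linalg.eigh(np.cov(Y, rowvar=False))
    minor_scores = Y @ E[:, 0]                 # eigh: column 0 = smallest eigenvalue

    print(np.argmax(np.abs(minor_scores)))     # -> 200, the injected point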

    3.2.3 High False Alarms

Principal components are sensitive to outliers, since the directions of the principal components are calculated from classical estimators such as the classical mean and the classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method can be


affected by outliers, so that the PCA model cannot detect all of the actually deviating observations; this is known as the masking effect. In addition, some good data points may even appear to be outliers, which is known as the swamping effect.

Masking and swamping cause PCA to generate a high number of false alarms. To reduce these false alarms, the use of robust estimators was proposed, since outlying points are less likely to enter into the calculation of robust estimators.

The well-known PCA robustification methods are the Minimum Covariance Determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find the h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = (n - h + 1)/n$; hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized. PP is applied where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires the dimension of the dataset not to exceed about 50. Principal Component Analysis (PCA) is itself an example of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; but instead of using the variance as the measure of dispersion, PP uses a robust scale estimator [4]. A small numerical sketch of the masking effect, and of how an MCD-based robust distance resists it, follows.
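To make the masking effect concrete, here is a small sketch (ours; it uses scikit-learn's MinCovDet as one available MCD implementation, not the authors' code) comparing classical and MCD-based squared distances on data contaminated with a cluster of outliers:

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    rng = np.random.default_rng(1)

    # 130 regular bivariate observations plus a cluster of 14 outliers.
    inliers = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=130)
    outliers = rng.multivariate_normal([4, -4], [[0.2, 0], [0, 0.2]], size=14)
    X = np.vstack([inliers, outliers])

    cutoff = chi2.ppf(0.975, df=2)     # 97.5% quantile, 2 degrees of freedom

    # Classical squared Mahalanobis distances: the outlying cluster shifts
    # the mean and inflates the covariance, masking its own members.
    mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    D = X - mean
    d2_classic = np.einsum('ij,jk,ik->i', D, np.linalg.inv(cov), D)

    # MCD-based robust distances resist the cluster's influence.
    mcd = MinCovDet(random_state=0).fit(X)
    d2_robust = mcd.mahalanobis(X)     # squared robust distances

    print("flagged (classical):", int((d2_classic > cutoff).sum()))
    print("flagged (MCD):", int((d2_robust > cutoff).sum()))

The classical estimates let most of the cluster fall inside the cutoff, while the MCD distances flag it.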

V. EXPERIMENTS AND RESULTS

In this section we show how we tested PCA and its robustification methods, MCD and PP, on a dataset. The data used consists of OD (Origin-Destination) flows collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix of 2004-03-01, which Yin Zhang built from the Abilene network. The dataset is available in offline mode, being extracted from an offline traffic matrix.

    5.1 PCA on Dataset

At first, the dataset (the traffic matrix) is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$X_{(144 \times 12)} = [x_1, x_2, \ldots, x_{12}]$

The following steps are considered in applying the PCA method to the dataset.

Centering the dataset to have zero mean: the mean vector is calculated from the equation

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    (1)

and the mean is subtracted off each dimension. The product of this step is a centered data matrix Y, which has the same size as the original dataset:

$Y_{i,j} = X_{i,j} - \bar{x}_j$    (2)

The covariance matrix is calculated from the equation

$C = \frac{1}{n-1}\, Y^T Y$    (3)

Finding the eigenvectors and eigenvalues of the covariance matrix, where the eigenvalues are the diagonal elements of the matrix $\Lambda$, by using the eigen-decomposition in equation (4):

$C E = E \Lambda$    (4)

where E holds the eigenvectors and $\Lambda$ the eigenvalues.

Ordering the eigenvalues in decreasing order and sorting the eigenvectors according to the ordered eigenvalues; the sorted eigenvector matrix is the loadings matrix.

Calculating the scores matrix (the dataset projected onto the principal components), which describes the relations between the principal components and the observations. The scores matrix is calculated from the equation

$S = Y E$    (5)


Applying the 97.5% tolerance ellipse to the bivariate datasets (data projected on the first PCs, data projected on the minor PCs) reveals outliers automatically. The ellipse is defined by the data points whose distance equals the square root of the chi-square 97.5% quantile with 2 degrees of freedom; with scores $s_{i1}, s_{i2}$ on the two plotted components and eigenvalues $\lambda_1, \lambda_2$, its boundary has the form

$d^2(x_i) = \frac{s_{i1}^2}{\lambda_1} + \frac{s_{i2}^2}{\lambda_2} = \chi^2_{2,\,0.975}$    (6)

The screeplot was studied: the first and second principal components accounted for 98% of the total variance of the dataset, so the first two principal components were retained to represent the dataset as a whole. Figure (1) shows the screeplot; the plot of the data projected onto the first two principal components, which reveals the outliers in the dataset visually, is shown in figure (2). Steps (1)-(6) are condensed into a code sketch below.
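A short Python sketch of steps (1)-(6) (ours; the 144x12 matrix below is a random stand-in for the Abilene traffic matrix, and all names are our own):

    import numpy as np
    from scipy.stats import chi2

    def pca_scores(X):
        """Classical PCA via eigen-decomposition of the sample covariance."""
        Y = X - X.mean(axis=0)                    # (1)-(2) center the data
        C = (Y.T @ Y) / (len(X) - 1)              # (3) sample covariance
        eigvals, E = np.linalg.eigh(C)            # (4) eigen-decomposition
        order = np.argsort(eigvals)[::-1]         # sort decreasing
        eigvals, E = eigvals[order], E[:, order]  # E is the loadings matrix
        return Y @ E, eigvals                     # (5) scores matrix

    rng = np.random.default_rng(2)
    X = rng.normal(size=(144, 12))                # stand-in for the 144x12 dataset
    scores, eigvals = pca_scores(X)

    # (6) 97.5% tolerance ellipse on the first two PCs: flag points whose
    # squared scaled distance exceeds the chi-square 97.5% quantile (df = 2).
    d2 = (scores[:, :2] ** 2 / eigvals[:2]).sum(axis=1)
    print("type-1 candidates:", np.where(d2 > chi2.ppf(0.975, df=2))[0])

Using the last two columns of the scores matrix instead gives the minor-PC (type-2) screen.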

    Figure 1: PCA Screeplot Figure 2: PCA Visual outliers

Figure (3) shows the tolerance ellipse on the major PCs, and figures (4) and (5) show, respectively, the scatter plot of the data projected onto the minor principal components (revealing outliers visually) and the outliers detected on the minor principal components tuned by the tolerance ellipse.

Figure 3: PCA Tolerance Ellipse. Figure 4: PCA Type-2 Outliers


Figure 5: PCA Tuned Minor PCs

    5.2 MCD on Dataset

Testing the robust MCD (Minimum Covariance Determinant) estimator yields a robust location measure $T_{mcd}$ and a robust dispersion estimate $\Sigma_{mcd}$. The following steps are applied to test MCD on the dataset in order to reach the robust principal components.

The squared robust distance is calculated from the formula

$R_i^2 = (x_i - T_{mcd}(X))^T\,\Sigma_{mcd}(X)^{-1}\,(x_i - T_{mcd}(X)),\quad i = 1, \ldots, n$    (7)

From the robust covariance matrix $\Sigma_{mcd}$ the following are calculated:

* the robust eigenvalues, as a diagonal matrix, as in equation (4), replacing n with h;
* the robust eigenvectors, as a loadings matrix, as in equation (5).

The robust scores matrix is then calculated in the following form:

$S = (X - T_{mcd})\,E_{mcd}$    (8)

The robust screeplot, which led to retaining the first two robust principal components (again accounting for above 98% of the total variance), is shown in figure (6). Figures (7) and (8) show, respectively, the visual recording of outliers from scatter plots of the data projected onto the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures (9) and (10) show, respectively, the visual recording of outliers from scatter plots of the data projected onto the robust minor principal components, and the outliers detected by the robust minor principal components tuned by the tolerance ellipse. A compact code sketch of this robust-PCA construction follows.
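A compact sketch of this construction (ours; it relies on scikit-learn's MinCovDet rather than the implementation used in the paper, with h entering through the support_fraction argument):

    import numpy as np
    from sklearn.covariance import MinCovDet

    def mcd_pca_scores(X, support_fraction=0.75):
        """Robust PCA: eigen-decompose the MCD scatter matrix instead of
        the classical covariance, then project the robustly centered data."""
        mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(X)
        T_mcd, S_mcd = mcd.location_, mcd.covariance_   # robust mean / scatter
        eigvals, E = np.linalg.eigh(S_mcd)              # robust eigenpairs, as in (4)
        order = np.argsort(eigvals)[::-1]
        eigvals, E = eigvals[order], E[:, order]        # robust loadings, as in (5)
        return (X - T_mcd) @ E, eigvals                 # robust scores, as in (8)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(144, 12))       # stand-in for the 144x12 traffic matrix
    scores, eigvals = mcd_pca_scores(X)
    print(eigvals[:2] / eigvals.sum())   # variance share of the first two robust PCs

Tolerance-ellipse tuning on the robust scores then proceeds exactly as in equation (6), using the robust eigenvalues.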

    Figure 6: MCD screeplot Figure 7: MCD Visual Outliers


Figure 8: MCD Tolerance Ellipse. Figure 9: MCD Type-2 Outliers

    Figure 10: MCD Tuned Minor PCs

    5.3 Projection Pursuit on Dataset

Testing the projection pursuit method on the dataset involves the following steps:

Center the data matrix X(n,p) around the L1-median to reach the centered data matrix Y(n,p):

$Y_{i,j} = X_{i,j} - L_1(X)_j$    (9)

where $L_1(X)$ is a highly robust estimator of multivariate data location that resists up to 50% of outliers [11].

Construct the candidate directions $p_i$ as the normalized rows of the centered data matrix; the singular value decomposition (SVD) of Y is used here to restrict the search to the subspace actually spanned by the data:

$p_i = y_i / \lVert y_i \rVert,\quad i = 1, \ldots, n$    (10)-(12)

Project the whole dataset on all possible directions:

$z_{i,j} = y_i^T p_j$    (13)

Calculate a robust scale estimator for all of the projections and find the direction that maximizes the $Q_n$ estimator:

$q = \arg\max_{p_j} Q_n(y_1^T p_j, \ldots, y_n^T p_j)$    (14)

$Q_n$ is a scale estimator; essentially, it is the first quartile of all pairwise distances between two data points [5]. These steps yield the robust eigenvectors (PCs), and the square of the value of the robust scale estimator gives the corresponding eigenvalues.

Project all the data on the selected direction q to obtain the robust principal component scores:

$t_i = y_i^T q$    (15)

Update the data matrix by the orthogonal complement of the selected direction:

$P = I - q\,q^T$    (16)


Project all of the data on the orthogonal complement,

$Y' = Y P$    (17)

and repeat the search on the deflated data for the next direction. The plot of the data projected onto the first two robust principal components, to detect outliers visually, is shown in figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in figure (12). Figures (13) and (14) show, respectively, the plot of the data projected onto the minor robust principal components to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse. A simplified code sketch of the whole projection-pursuit loop is given after the figure captions.

    Figure 11: PP Visual Outliers Figure 12: PP Tolerance Ellipse

Figure 13: PP Type-2 Outliers. Figure 14: PP Tuned Minor PCs
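A simplified sketch of the projection-pursuit loop (ours; it scans the data points themselves as candidate directions in the spirit of Croux and Ruiz-Gazen [3], uses the coordinate-wise median in place of the L1-median for brevity, and implements Qn naively as the scaled first quartile of pairwise distances):

    import numpy as np

    def qn_scale(z):
        """Naive Qn: first quartile of the pairwise distances |z_i - z_j|,
        scaled by the consistency constant 2.2219."""
        pair = np.abs(z[:, None] - z[None, :])[np.triu_indices(len(z), k=1)]
        return 2.2219 * np.quantile(pair, 0.25)

    def pp_pca(X, k=2):
        """Robust PCs: pick the candidate direction maximizing Qn,
        record it, then deflate the data onto its orthogonal complement."""
        Y = X - np.median(X, axis=0)     # (9) robust centering (coordinate-wise)
        comps, scales = [], []
        for _ in range(k):
            dirs = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # (10)-(12)
            spreads = [qn_scale(Y @ d) for d in dirs]             # (13)-(14)
            q = dirs[int(np.argmax(spreads))]                     # best direction
            comps.append(q)
            scales.append(qn_scale(Y @ q))
            Y = Y - np.outer(Y @ q, q)   # (16)-(17) orthogonal complement
        return np.array(comps), np.array(scales) ** 2

    rng = np.random.default_rng(4)
    X = rng.normal(size=(144, 12))       # stand-in for the traffic matrix
    V, lam = pp_pca(X)
    print("robust eigenvalues:", lam)

Each squared Qn value plays the role of a robust eigenvalue, matching the statement in the steps above.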

    5.4 Results

Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The MCD and PP results reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with slight differences, since we use all 12 dimensions of the dataset.

Table 1: Outliers Detection

Outliers detected by the major and minor PCs of each method, and the resulting false-alarm effects:

    PCA       MCD       PP        Masking   Swamping
    66        66        66        No        No
    99        99        99        No        No
    100       100       100       No        No
    116       116       116       No        No
    117       117       117       No        No
    118       118       118       No        No
    119       119       119       No        No
    120       120       120       No        No
    129       129       129       No        No
    131       131       131       No        No
    135       135       135       No        No
    Normal    Normal    69        Yes       No
    Normal    Normal    70        Yes       No
    71        Normal    Normal    No        Yes
    76        Normal    Normal    No        Yes
    81        Normal    Normal    No        Yes
    101       Normal    Normal    No        Yes
    104       Normal    Normal    No        Yes
    111       Normal    Normal    No        Yes
    144       Normal    Normal    No        Yes
    Normal    84        Normal    Yes       No
    Normal    96        Normal    Yes       No
    Normal    97        97        Yes       No
    Normal    98        98        Yes       No

VI. CONCLUSION AND FUTURE WORK

This study has examined the performance of PCA and of its robustification methods (MCD, PP) for intrusion detection, by presenting the bi-plots and extracting the outlying observations that are very different from the bulk of the data. The study showed that the tuned results are identical to the visualized ones, and it attributes the false alarms of PCA to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with a slight difference on type-2 outliers, since these are considered a source of noise. Our future work will apply the hybrid method ROBPCA, which takes PP as a reduction technique and MCD as a robust measure, for further performance, and will apply a dynamic robust PCA model with regard to online intrusion detection.

    REFERENCES

[1]. Abilene TMs, collected by Zhang. www.cs.utexas.edu/yzhang/research, visited on 13/07/2012.

[2]. Khalid Labib and V. Rao Vemuri. "An application of principal components analysis to the detection and visualization of computer network attacks". Annals of Telecommunications, pages 218-234, 2005.

[3]. C. Croux and A. Ruiz-Gazen. "A fast algorithm for robust principal components based on projection pursuit". COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, pp. 211-217.

[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. "A novel anomaly detection scheme based on principal component classifier". In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03).

[5]. J. Edward Jackson. "A User's Guide to Principal Components". Wiley-Interscience, 1st edition, 2003.

[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot. "Diagnosing network-wide traffic anomalies". In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM). ACM, 2004.

[7]. Yacine Bouzida, Frédéric Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault. "Efficient Intrusion Detection Using Principal Component Analysis". La Londe, France, June 2004.

[8]. R. Gnanadesikan. "Methods for Statistical Data Analysis of Multivariate Observations". Wiley-Interscience, New York, 2nd edition, 1997.

[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Z. Zhu, and A. Nobel. "Multivariate SVD analysis for network anomaly detection". In Proceedings of the ACM SIGCOMM Conference, 2005.

[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati. "Network traffic analysis using singular value decomposition and multiscale transforms". Information Sciences: An International Journal, 2007.

[11]. I. T. Jolliffe. "Principal Component Analysis". Springer Series in Statistics, Springer, New York, 2nd edition, 2002.

[12]. Wei Wang, Xiaohong Guan, and Xiangliang Zhang. "Processing of massive audit data streams for real-time anomaly intrusion detection". Computer Communications, Elsevier, 2008.

[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft. "Structural Analysis of Network Traffic Flows". In Proceedings of SIGMETRICS, New York, NY, USA, 2004.

    AUTHORS BIOGRAPHIES

Nada Badr earned her B.Sc. in Mathematical and Computer Sciences at the University of Gezira, Sudan. She received her M.Sc. in Computer Science at the University of Science and Technology, and she is pursuing her Ph.D. in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor of Computer Science in the Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his Ph.D. in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.