
International Journal of Advances in Engineering & Technology, May 2013. IJAET, ISSN: 2231-1963, Vol. 6, Issue 2, pp. 573-582

EXAMINING OUTLIER DETECTION PERFORMANCE FOR PRINCIPAL COMPONENTS ANALYSIS METHOD AND ITS ROBUSTIFICATION METHODS

    Nada Badr, Noureldien A. Noureldien

    Department of Computer Science

    University of Science and Technology, Omdurman, Sudan

    ABSTRACT

Intrusion detection has attracted the attention of both commercial institutions and the academic research community. In this paper, PCA (Principal Components Analysis) is used as an unsupervised technique to detect multivariate outliers in a dataset covering one hour of network traffic. PCA is sensitive to outliers because it depends on non-robust estimators. This led us to use MCD (Minimum Covariance Determinant) and PP (Projection Pursuit) as two different robustification techniques for PCA. The experimental results show that PCA generates a high number of false alarms due to masking and swamping effects, while the MCD and PP detection rates are much more accurate, and both reveal the masking and swamping effects that the PCA method undergoes.

KEYWORDS: Multivariate Techniques, Robust Estimators, Principal Components, Minimum Covariance Determinant, Projection Pursuit.

I. INTRODUCTION

Principal Components Analysis (PCA) is a multivariate statistical method concerned with analyzing and understanding data in high dimensions; that is to say, PCA analyzes data sets that represent observations described by several inter-correlated dependent variables. PCA is one of the best known and most widely used multivariate exploratory analysis techniques [5].

Several robust competitors to the classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of PCA's sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter. Its objective is to find the h observations out of n whose covariance matrix has the lowest determinant. The MCD location estimate is then the mean of these h points, and the estimate of scatter is their covariance matrix. Another robust method for principal component analysis uses the Projection-Pursuit (PP) principle. Here, one projects the data onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized.

In this paper we investigate the effectiveness of the robust estimators provided by MCD and PP, by applying PCA to the Abilene dataset and comparing its outlier detection performance to that of MCD and PP.

The rest of this paper is organized as follows. Section 2 is an overview of related work. Section 3 is dedicated to classical PCA. The PCA robustification methods, MCD and PP, are discussed in Section 4. The experimental results are shown in Section 5, and conclusions and future work are drawn in Section 6.

II. RELATED WORK

A number of studies have utilized principal components analysis to reduce dimensionality and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced in [13].


III. CLASSICAL PCA

    3.1 PCA Advantages

The common advantages of PCA are:

    3.1.1 Exploratory Data Analysis

PCA is mostly used for making 2-dimensional plots of the data for visual examination and interpretation. For this purpose, the data is projected onto factorial planes spanned by pairs of principal components chosen among the first ones (that is, the most significant ones). From these plots, one tries to extract information about the data structure, such as the detection of outliers (observations that are very different from the bulk of the data).

According to most studies [8][11], PCA detects two types of outliers: type 1 outliers, which inflate variance and are detected by the major PCs, and type 2 outliers, which violate the data structure and are detected by the minor PCs.

    3.1.2 Data Reduction Technique

All multivariate techniques are prone to the bias-variance tradeoff, which dictates that the number of variables entering a model should be severely restricted. Data is often described by many more variables than are necessary for building the best model. PCA improves on other statistical reduction techniques in that it selects and feeds the model with a reduced number of variables.

    3.1.3 Low Computational Requirement

PCA needs little computational effort, since its algorithm consists of simple calculations.

    3.2 PCA Disadvantages

It may be noted that PCA is based on the assumptions that the dimensionality of the data can be efficiently reduced by a linear transformation and that most of the information is contained in those directions where the variance of the input data is maximal. As is evident, these conditions are by no means always met. For example, if the points of an input set are positioned on the surface of a hypersphere, no linear transformation can reduce the dimension (a nonlinear transformation, however, can easily cope with this task), as the sketch below illustrates. From the above, the following disadvantages of PCA can be concluded.
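As a minimal numerical sketch of the hypersphere example (ours, not part of the original paper), the following Python snippet samples points uniformly on the unit sphere in R^3 and shows that the variances along all three principal components are nearly equal, so no linear projection can be dropped:

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample 1000 points uniformly on the unit sphere: normalizing
    # Gaussian vectors gives uniformly distributed directions.
    X = rng.normal(size=(1000, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # Eigenvalues of the covariance matrix = variances along the PCs.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    print(eigvals / eigvals.sum())   # all three ratios close to 1/3

Since the variance is spread almost evenly over all directions, PCA finds no low-variance component to discard.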

3.2.1 Dependence on Linear Algebra

PCA relies on simple linear algebra as its main mathematical engine and is quite easy to interpret geometrically. But this strength is also a weakness, for it might very well be that other synthetic variables, more complex than simple linear combinations of the original variables, would describe the data better.

3.2.2 Smallest Principal Components Receive Little Attention in Statistical Techniques

This lack of interest is due to the fact that, compared with the largest principal components, which contain most of the total variance in the data, the smallest principal components only contain the noise of the data and therefore appear to contribute minimal information. However, because outliers are a common source of noise, the smallest principal components should be useful for outlier detection, as the toy example below illustrates.
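As a toy illustration of this point (our sketch; the data are synthetic, not from the paper's experiments), consider observations lying near the line y = x plus one observation that violates that structure. Its magnitude is unremarkable, so the major PC barely notices it, but it dominates the minor PC score:

    import numpy as np

    rng = np.random.default_rng(5)

    # 200 points near the line y = x, plus one structure-violating point.
    t = rng.normal(size=200)
    X = np.column_stack([t, t + 0.05 * rng.normal(size=200)])
    X = np.vstack([X, [[1.0, -1.0]]])          # the injected type-2 outlier

    Y = X - X.mean(axis=0)
    eigvals, E = np.linalg.eigh(np.cov(Y, rowvar=False))
    minor_scores = Y @ E[:, 0]                 # eigh: column 0 = smallest eigenvalue

    print(np.argmax(np.abs(minor_scores)))     # -> 200, the injected point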

    3.2.3 High False Alarms

Principal components are sensitive to outliers, since the directions of the principal components are calculated from classical estimators such as the classical mean and the classical covariance or correlation matrices.

IV. PCA ROBUSTIFICATION

In real datasets it often happens that some observations are different from the majority; such observations are called outliers, intrusions, discordant observations, etc. The classical PCA method can be


affected by outliers, so that the PCA model cannot detect all of the actually deviating observations; this is known as the masking effect. In addition, some good data points may even appear to be outliers, which is known as the swamping effect.

Masking and swamping cause PCA to generate a high number of false alarms. To reduce these false alarms, the use of robust estimators was proposed, since outlying points are less likely to enter into the calculation of robust estimators.

The well-known PCA robustification methods are the Minimum Covariance Determinant (MCD) and the Projection-Pursuit (PP) principle. The objective of the raw MCD is to find the h > n/2 observations out of n whose covariance matrix has the smallest determinant. Its breakdown value is $b_n = (n - h + 1)/n$; hence the number h determines the robustness of the estimator. In the Projection-Pursuit principle [3], one projects the data onto a lower-dimensional space such that a robust measure of the variance of the projected data is maximized. PP is applied where the number of variables or dimensions is very large, so PP has an advantage over MCD, since MCD requires the dimension of the dataset not to exceed about 50. Principal Component Analysis (PCA) is itself an example of the PP approach, because both search for directions with maximal dispersion of the data projected onto them; but instead of using the variance as the measure of dispersion, PP uses a robust scale estimator [4]. A small numerical sketch of the masking effect, and of how an MCD-based robust distance resists it, follows.
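To make the masking effect concrete, here is a small sketch (ours; it uses scikit-learn's MinCovDet as one available MCD implementation, not the authors' code) comparing classical and MCD-based squared distances on data contaminated with a cluster of outliers:

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    rng = np.random.default_rng(1)

    # 130 regular bivariate observations plus a cluster of 14 outliers.
    inliers = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=130)
    outliers = rng.multivariate_normal([4, -4], [[0.2, 0], [0, 0.2]], size=14)
    X = np.vstack([inliers, outliers])

    cutoff = chi2.ppf(0.975, df=2)     # 97.5% quantile, 2 degrees of freedom

    # Classical squared Mahalanobis distances: the outlying cluster shifts
    # the mean and inflates the covariance, masking its own members.
    mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    D = X - mean
    d2_classic = np.einsum('ij,jk,ik->i', D, np.linalg.inv(cov), D)

    # MCD-based robust distances resist the cluster's influence.
    mcd = MinCovDet(random_state=0).fit(X)
    d2_robust = mcd.mahalanobis(X)     # squared robust distances

    print("flagged (classical):", int((d2_classic > cutoff).sum()))
    print("flagged (MCD):", int((d2_robust > cutoff).sum()))

The classical estimates let most of the cluster fall inside the cutoff, while the MCD distances flag it.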

V. EXPERIMENTS AND RESULTS

In this section we show how we tested PCA and its robustification methods, MCD and PP, on a dataset. The data used consists of OD (Origin-Destination) flows collected and made available by Zhang [1]. The dataset is an extraction of sixty minutes of traffic flows from the first week of the traffic matrix of 2004-03-01, which Yin Zhang built from the Abilene network. The dataset is available in offline mode, being extracted from an offline traffic matrix.

    5.1 PCA on Dataset

At first, the dataset (the traffic matrix) is arranged into the data matrix X, where rows represent observations and columns represent variables or dimensions:

$X_{(144 \times 12)} = [x_1, x_2, \ldots, x_{12}]$

The following steps are considered in applying the PCA method to the dataset.

Centering the dataset to have zero mean: the mean vector is calculated from the equation

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    (1)

and the mean is subtracted off each dimension. The product of this step is a centered data matrix Y, which has the same size as the original dataset:

$Y_{i,j} = X_{i,j} - \bar{x}_j$    (2)

The covariance matrix is calculated from the equation

$C = \frac{1}{n-1}\, Y^T Y$    (3)

Finding the eigenvectors and eigenvalues of the covariance matrix, where the eigenvalues are the diagonal elements of the matrix $\Lambda$, by using the eigen-decomposition in equation (4):

$C E = E \Lambda$    (4)

where E holds the eigenvectors and $\Lambda$ the eigenvalues.

Ordering the eigenvalues in decreasing order and sorting the eigenvectors according to the ordered eigenvalues; the sorted eigenvector matrix is the loadings matrix.

Calculating the scores matrix (the dataset projected onto the principal components), which describes the relations between the principal components and the observations. The scores matrix is calculated from the equation

$S = Y E$    (5)


Applying the 97.5% tolerance ellipse to the bivariate datasets (data projected on the first PCs, data projected on the minor PCs) reveals outliers automatically. The ellipse is defined by the data points whose distance equals the square root of the chi-square 97.5% quantile with 2 degrees of freedom; with scores $s_{i1}, s_{i2}$ on the two plotted components and eigenvalues $\lambda_1, \lambda_2$, its boundary has the form

$d^2(x_i) = \frac{s_{i1}^2}{\lambda_1} + \frac{s_{i2}^2}{\lambda_2} = \chi^2_{2,\,0.975}$    (6)

The screeplot was studied: the first and second principal components accounted for 98% of the total variance of the dataset, so the first two principal components were retained to represent the dataset as a whole. Figure (1) shows the screeplot; the plot of the data projected onto the first two principal components, which reveals the outliers in the dataset visually, is shown in figure (2). Steps (1)-(6) are condensed into a code sketch below.
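A short Python sketch of steps (1)-(6) (ours; the 144x12 matrix below is a random stand-in for the Abilene traffic matrix, and all names are our own):

    import numpy as np
    from scipy.stats import chi2

    def pca_scores(X):
        """Classical PCA via eigen-decomposition of the sample covariance."""
        Y = X - X.mean(axis=0)                    # (1)-(2) center the data
        C = (Y.T @ Y) / (len(X) - 1)              # (3) sample covariance
        eigvals, E = np.linalg.eigh(C)            # (4) eigen-decomposition
        order = np.argsort(eigvals)[::-1]         # sort decreasing
        eigvals, E = eigvals[order], E[:, order]  # E is the loadings matrix
        return Y @ E, eigvals                     # (5) scores matrix

    rng = np.random.default_rng(2)
    X = rng.normal(size=(144, 12))                # stand-in for the 144x12 dataset
    scores, eigvals = pca_scores(X)

    # (6) 97.5% tolerance ellipse on the first two PCs: flag points whose
    # squared scaled distance exceeds the chi-square 97.5% quantile (df = 2).
    d2 = (scores[:, :2] ** 2 / eigvals[:2]).sum(axis=1)
    print("type-1 candidates:", np.where(d2 > chi2.ppf(0.975, df=2))[0])

Using the last two columns of the scores matrix instead gives the minor-PC (type-2) screen.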

    Figure 1: PCA Screeplot Figure 2: PCA Visual outliers

Figure (3) shows the tolerance ellipse on the major PCs, and figures (4) and (5) show, respectively, the scatter plot of the data projected onto the minor principal components (revealing outliers visually) and the outliers detected on the minor principal components tuned by the tolerance ellipse.

Figure 3: PCA Tolerance Ellipse. Figure 4: PCA Type-2 Outliers


Figure 5: PCA Tuned Minor PCs

    5.2 MCD on Dataset

Testing the robust MCD (Minimum Covariance Determinant) estimator yields a robust location measure $T_{mcd}$ and a robust dispersion estimate $\Sigma_{mcd}$. The following steps are applied to test MCD on the dataset in order to reach the robust principal components.

The squared robust distance is calculated from the formula

$R_i^2 = (x_i - T_{mcd}(X))^T\,\Sigma_{mcd}(X)^{-1}\,(x_i - T_{mcd}(X)),\quad i = 1, \ldots, n$    (7)

From the robust covariance matrix $\Sigma_{mcd}$ the following are calculated:

* the robust eigenvalues, as a diagonal matrix, as in equation (4), replacing n with h;
* the robust eigenvectors, as a loadings matrix, as in equation (5).

The robust scores matrix is then calculated in the following form:

$S = (X - T_{mcd})\,E_{mcd}$    (8)

The robust screeplot, which led to retaining the first two robust principal components (again accounting for above 98% of the total variance), is shown in figure (6). Figures (7) and (8) show, respectively, the visual recording of outliers from scatter plots of the data projected onto the robust major principal components, and the outliers detected by the robust major principal components tuned by the tolerance ellipse. Figures (9) and (10) show, respectively, the visual recording of outliers from scatter plots of the data projected onto the robust minor principal components, and the outliers detected by the robust minor principal components tuned by the tolerance ellipse. A compact code sketch of this robust-PCA construction follows.
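A compact sketch of this construction (ours; it relies on scikit-learn's MinCovDet rather than the implementation used in the paper, with h entering through the support_fraction argument):

    import numpy as np
    from sklearn.covariance import MinCovDet

    def mcd_pca_scores(X, support_fraction=0.75):
        """Robust PCA: eigen-decompose the MCD scatter matrix instead of
        the classical covariance, then project the robustly centered data."""
        mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(X)
        T_mcd, S_mcd = mcd.location_, mcd.covariance_   # robust mean / scatter
        eigvals, E = np.linalg.eigh(S_mcd)              # robust eigenpairs, as in (4)
        order = np.argsort(eigvals)[::-1]
        eigvals, E = eigvals[order], E[:, order]        # robust loadings, as in (5)
        return (X - T_mcd) @ E, eigvals                 # robust scores, as in (8)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(144, 12))       # stand-in for the 144x12 traffic matrix
    scores, eigvals = mcd_pca_scores(X)
    print(eigvals[:2] / eigvals.sum())   # variance share of the first two robust PCs

Tolerance-ellipse tuning on the robust scores then proceeds exactly as in equation (6), using the robust eigenvalues.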

    Figure 6: MCD screeplot Figure 7: MCD Visual Outliers


Figure 8: MCD Tolerance Ellipse. Figure 9: MCD Type-2 Outliers

    Figure 10: MCD Tuned Minor PCs

    5.3 Projection Pursuit on Dataset

Testing the projection pursuit method on the dataset involves the following steps:

Center the data matrix X(n,p) around the L1-median to reach the centered data matrix Y(n,p):

$Y_{i,j} = X_{i,j} - L_1(X)_j$    (9)

where $L_1(X)$ is a highly robust estimator of multivariate data location that resists up to 50% of outliers [11].

Construct the candidate directions $p_i$ as the normalized rows of the centered data matrix; the singular value decomposition (SVD) of Y is used here to restrict the search to the subspace actually spanned by the data:

$p_i = y_i / \lVert y_i \rVert,\quad i = 1, \ldots, n$    (10)-(12)

Project the whole dataset on all possible directions:

$z_{i,j} = y_i^T p_j$    (13)

Calculate a robust scale estimator for all of the projections and find the direction that maximizes the $Q_n$ estimator:

$q = \arg\max_{p_j} Q_n(y_1^T p_j, \ldots, y_n^T p_j)$    (14)

$Q_n$ is a scale estimator; essentially, it is the first quartile of all pairwise distances between two data points [5]. These steps yield the robust eigenvectors (PCs), and the square of the value of the robust scale estimator gives the corresponding eigenvalues.

Project all the data on the selected direction q to obtain the robust principal component scores:

$t_i = y_i^T q$    (15)

Update the data matrix by the orthogonal complement of the selected direction:

$P = I - q\,q^T$    (16)


Project all of the data on the orthogonal complement,

$Y' = Y P$    (17)

and repeat the search on the deflated data for the next direction. The plot of the data projected onto the first two robust principal components, to detect outliers visually, is shown in figure (11), and the tuning of the first two robust principal components by the tolerance ellipse is shown in figure (12). Figures (13) and (14) show, respectively, the plot of the data projected onto the minor robust principal components to detect outliers visually, and the tuning of the last robust principal components by the tolerance ellipse. A simplified code sketch of the whole projection-pursuit loop is given after the figure captions.

    Figure 11: PP Visual Outliers Figure 12: PP Tolerance Ellipse

Figure 13: PP Type-2 Outliers. Figure 14: PP Tuned Minor PCs
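A simplified sketch of the projection-pursuit loop (ours; it scans the data points themselves as candidate directions in the spirit of Croux and Ruiz-Gazen [3], uses the coordinate-wise median in place of the L1-median for brevity, and implements Qn naively as the scaled first quartile of pairwise distances):

    import numpy as np

    def qn_scale(z):
        """Naive Qn: first quartile of the pairwise distances |z_i - z_j|,
        scaled by the consistency constant 2.2219."""
        pair = np.abs(z[:, None] - z[None, :])[np.triu_indices(len(z), k=1)]
        return 2.2219 * np.quantile(pair, 0.25)

    def pp_pca(X, k=2):
        """Robust PCs: pick the candidate direction maximizing Qn,
        record it, then deflate the data onto its orthogonal complement."""
        Y = X - np.median(X, axis=0)     # (9) robust centering (coordinate-wise)
        comps, scales = [], []
        for _ in range(k):
            dirs = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # (10)-(12)
            spreads = [qn_scale(Y @ d) for d in dirs]             # (13)-(14)
            q = dirs[int(np.argmax(spreads))]                     # best direction
            comps.append(q)
            scales.append(qn_scale(Y @ q))
            Y = Y - np.outer(Y @ q, q)   # (16)-(17) orthogonal complement
        return np.array(comps), np.array(scales) ** 2

    rng = np.random.default_rng(4)
    X = rng.normal(size=(144, 12))       # stand-in for the traffic matrix
    V, lam = pp_pca(X)
    print("robust eigenvalues:", lam)

Each squared Qn value plays the role of a robust eigenvalue, matching the statement in the steps above.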

    5.4 Results

Table (1) summarizes the outliers detected by each method. The table shows that PCA suffers from both masking and swamping. The MCD and PP results reveal the masking and swamping effects of the PCA method. The PP results are similar to those of MCD, with slight differences, since we use all 12 dimensions of the dataset.

Table 1: Outliers Detection

Outliers detected by the major and minor PCs of each method, and the resulting false-alarm effects:

    PCA       MCD       PP        Masking   Swamping
    66        66        66        No        No
    99        99        99        No        No
    100       100       100       No        No
    116       116       116       No        No
    117       117       117       No        No
    118       118       118       No        No
    119       119       119       No        No
    120       120       120       No        No
    129       129       129       No        No
    131       131       131       No        No
    135       135       135       No        No
    Normal    Normal    69        Yes       No
    Normal    Normal    70        Yes       No
    71        Normal    Normal    No        Yes
    76        Normal    Normal    No        Yes
    81        Normal    Normal    No        Yes
    101       Normal    Normal    No        Yes
    104       Normal    Normal    No        Yes
    111       Normal    Normal    No        Yes
    144       Normal    Normal    No        Yes
    Normal    84        Normal    Yes       No
    Normal    96        Normal    Yes       No
    Normal    97        97        Yes       No
    Normal    98        98        Yes       No

VI. CONCLUSION AND FUTURE WORK

This study has examined the performance of PCA and of its robustification methods (MCD, PP) for intrusion detection, by presenting the bi-plots and extracting the outlying observations that are very different from the bulk of the data. The study showed that the tuned results are identical to the visualized ones, and it attributes the false alarms of PCA to the masking and swamping effects. The comparison showed that the PP results are similar to those of MCD, with a slight difference on type-2 outliers, since these are considered a source of noise. Our future work will apply the hybrid method ROBPCA, which takes PP as a reduction technique and MCD as a robust measure, for further performance, and will apply a dynamic robust PCA model with regard to online intrusion detection.

    REFERENCES

[1]. Abilene TMs, collected by Zhang. www.cs.utexas.edu/yzhang/research, visited on 13/07/2012.

[2]. Khalid Labib and V. Rao Vemuri. "An application of principal components analysis to the detection and visualization of computer network attacks". Annals of Telecommunications, pages 218-234, 2005.

[3]. C. Croux and A. Ruiz-Gazen. "A fast algorithm for robust principal components based on projection pursuit". COMPSTAT: Proceedings in Computational Statistics, Physica-Verlag, Heidelberg, 1996, pp. 211-217.

[4]. Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. "A novel anomaly detection scheme based on principal component classifier". In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03).

[5]. J. Edward Jackson. "A User's Guide to Principal Components". Wiley-Interscience, 1st edition, 2003.

[6]. Anukool Lakhina, Mark Crovella, and Christophe Diot. "Diagnosing network-wide traffic anomalies". In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM). ACM, 2004.

[7]. Yacine Bouzida, Frédéric Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault. "Efficient Intrusion Detection Using Principal Component Analysis". La Londe, France, June 2004.

[8]. R. Gnanadesikan. "Methods for Statistical Data Analysis of Multivariate Observations". Wiley-Interscience, New York, 2nd edition, 1997.

[9]. J. Terrell, K. Jeffay, L. Zhang, H. Shen, Z. Zhu, and A. Nobel. "Multivariate SVD analysis for network anomaly detection". In Proceedings of the ACM SIGCOMM Conference, 2005.

[10]. Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati. "Network traffic analysis using singular value decomposition and multiscale transforms". Information Sciences: An International Journal, 2007.

[11]. I. T. Jolliffe. "Principal Component Analysis". Springer Series in Statistics, Springer, New York, 2nd edition, 2002.

[12]. Wei Wang, Xiaohong Guan, and Xiangliang Zhang. "Processing of massive audit data streams for real-time anomaly intrusion detection". Computer Communications, Elsevier, 2008.

[13]. A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft. "Structural Analysis of Network Traffic Flows". In Proceedings of SIGMETRICS, New York, NY, USA, 2004.

    AUTHORS BIOGRAPHIES

Nada Badr earned her B.Sc. in Mathematical and Computer Sciences at the University of Gezira, Sudan. She received her M.Sc. in Computer Science at the University of Science and Technology, and she is pursuing her Ph.D. in Computer Science at the University of Science and Technology, Omdurman, Sudan. She is currently serving as a lecturer at the University of Science and Technology, Faculty of Computer Science and Information Technology.

Noureldien A. Noureldien is working as an associate professor of Computer Science in the Department of Computer Science and Information Technology, University of Science and Technology, Omdurman, Sudan. He received his B.Sc. and M.Sc. from the School of Mathematical Sciences, University of Khartoum, and received his Ph.D. in Computer Science in 2001 from the University of Science and Technology, Khartoum, Sudan. He has many papers published in journals of repute. He is currently working as the dean of the Faculty of Computer Science and Information Technology at the University of Science and Technology, Omdurman, Sudan.