A Significance Test-Based Feature Selection Method for the Detection of Prostate Cancer from Proteomic Patterns

M.A.Sc. Candidate: Qianren (Tim) Xu
Supervisors: Dr. M. Kamel, Dr. M. M. A. Salama
2
Highlight

• STFS can be used generally for any supervised pattern recognition problem
• Very good performance has been obtained on several benchmark datasets, especially with a large number of features
Significance Test-Based Feature Selection (STFS):
Proteomic Pattern Analysis for Prostate Cancer Detection
STFS → Neural Networks → ROC analysis
• Sensitivity 97.1%, Specificity 96.8%
• Suggestion of mistaken label by prostatic biopsy
3
Outline of Part I
Significance Test-Based Feature Selection (STFS) on Supervised Pattern Recognition
• Introduction
• Methodology
• Experiment Results on Benchmark Datasets
• Comparison with MIFS
4
Introduction

Problems with features:

• Large number
• Irrelevance
• Noise
• Correlation

These increase computational complexity and reduce the recognition rate.
5
Mutual Information Feature Selection

• One of the most important heuristic feature selection methods; it can be very useful in any classification system.

• But estimation of the mutual information is difficult:

    I(X; Y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy

  • Large number of features and large number of classes
  • Continuous data
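One concrete way to see the difficulty: mutual information between continuous variables is usually estimated by discretizing the data, which makes the estimate sensitive to the binning and to the sample size. A minimal histogram-based sketch (NumPy assumed; the `bins` value is an illustrative choice, not from the thesis):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in nats.

    Discretizes two continuous variables and applies the discrete
    MI formula; the result depends on the bin count, which is one
    reason MI estimation is hard in practice.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                      # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
print(mutual_information(x, x + 0.1 * rng.normal(size=5000)))  # strongly dependent: large
print(mutual_information(x, rng.normal(size=5000)))            # independent: near zero
```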
6
Problems with Feature Selection Methods

Two key issues:

• Computational complexity
• Suboptimality
7
Proposed Method

Criterion of feature selection:

    Significance of feature = Significant difference × Independence

where significant difference is the pattern separability on individual candidate features, and independence is the noncorrelation between the candidate feature and the already-selected features.
8
Measurement of Pattern Separability of Individual Features

Statistical significant difference, chosen by data type and number of classes:

  Continuous data, normal distribution:
    two classes → t-test
    more than two classes → ANOVA
  Continuous data, non-normal distribution, or rank data:
    two classes → Mann-Whitney test
    more than two classes → Kruskal-Wallis test
  Categorical data:
    Chi-square test
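The table above can be wrapped in a small dispatcher. A sketch using SciPy's standard implementations of these tests (the function name `significant_difference` and the `data_type` argument are illustrative, not from the thesis):

```python
from scipy import stats

def significant_difference(groups, data_type="normal"):
    """Pick the significance test from the table above.

    groups: for the continuous/rank cases, a list of per-class samples
    of one feature; for the categorical case, a contingency table.
    Returns the p-value (smaller = more separable).
    """
    if data_type == "normal":                       # continuous, normal
        if len(groups) == 2:
            stat, p = stats.ttest_ind(*groups)      # t-test
        else:
            stat, p = stats.f_oneway(*groups)       # ANOVA
    elif data_type == "rank":                       # non-normal or rank data
        if len(groups) == 2:
            stat, p = stats.mannwhitneyu(*groups)   # Mann-Whitney
        else:
            stat, p = stats.kruskal(*groups)        # Kruskal-Wallis
    else:                                           # categorical data
        stat, p, dof, expected = stats.chi2_contingency(groups)
    return p
```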
9
Independence

Independence measure, chosen by data type:

  Continuous data, normal distribution:           Pearson correlation
  Continuous data, non-normal, or rank data:      Spearman rank correlation
  Categorical data:                               Pearson contingency coefficient

    independence = 1 − correlation²
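A small sketch of this measure, assuming the reconstructed formula independence = 1 − correlation² and SciPy's correlation functions (the function name and `data_type` argument are illustrative):

```python
from scipy import stats

def independence(f_candidate, f_selected, data_type="normal"):
    """independence = 1 - correlation^2 between a candidate feature
    and an already-selected feature, with the correlation coefficient
    chosen per the table above."""
    if data_type == "normal":
        r, _ = stats.pearsonr(f_candidate, f_selected)   # continuous, normal
    else:
        r, _ = stats.spearmanr(f_candidate, f_selected)  # non-normal or rank
    return 1.0 - r ** 2
```

A candidate that duplicates a selected feature scores 0 and is effectively suppressed; an uncorrelated candidate scores near 1 and keeps its full significance.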
10
Selecting Procedure
MSDI: Maximum Significant Differenceand Independence Algorithm
MIC: Monotonically Increasing Curve Strategy
11
Maximum Significant Difference and Independence (MSDI) Algorithm

1. Compute the significant difference (sd) of every initial feature
2. Select the feature with maximum sd as the first feature
3. Compute the independence level (ind) between every candidate feature and the already-selected feature(s)
4. Select the feature with maximum feature significance (sf = sd × ind) as the new feature, and repeat from step 3
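The steps above can be sketched for two-class continuous data, assuming the magnitude of the two-sample t statistic as the significant-difference measure and 1 − r² as independence (the thesis' exact implementation may differ in these choices):

```python
import numpy as np

def msdi_select(X, y, k):
    """Greedy MSDI sketch for two-class continuous data.

    sd  : |two-sample t statistic| per feature (significant difference)
    ind : 1 - r^2 against the already-selected features (independence)
    sf  : sd * ind, the feature-significance criterion
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # Welch-style t statistic for every feature at once
    sd = np.abs(X0.mean(0) - X1.mean(0)) / np.sqrt(
        X0.var(0, ddof=1) / n0 + X1.var(0, ddof=1) / n1)
    selected = [int(np.argmax(sd))]          # step 2: maximum sd first
    while len(selected) < k:
        sf = np.full(X.shape[1], -np.inf)
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # independence against the most-correlated selected feature
            r2 = max(np.corrcoef(X[:, j], X[:, s])[0, 1] ** 2
                     for s in selected)
            sf[j] = sd[j] * (1.0 - r2)       # step 4: sf = sd * ind
        selected.append(int(np.argmax(sf)))
    return selected
```

Note how a near-duplicate of an already-selected feature is skipped even if its own separability is high, because its independence factor collapses toward zero.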
12
Monotonically Increasing Curve (MIC) Strategy

[Figure: performance curve — rate of recognition (0.4–1) vs. number of features (0–30)]
• Start from the feature subset selected by MSDI
• Plot the performance curve
• Delete the features that contribute nothing to the increase in recognition
• Repeat until the curve is monotonically increasing
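A single-pass sketch of the pruning idea, assuming a user-supplied `evaluate` function that returns the recognition rate for a feature subset (the thesis' exact pruning loop may differ; this shows only the "keep a feature iff it lifts the curve" principle):

```python
def mic_prune(ranked_features, evaluate):
    """Monotonically Increasing Curve sketch.

    ranked_features: feature indices in the order produced by MSDI.
    evaluate(subset): recognition rate of a classifier trained on
    that subset (assumed, supplied by the caller).
    Returns the subset whose performance curve increases monotonically.
    """
    kept, best = [], -float("inf")
    for f in ranked_features:
        rate = evaluate(kept + [f])
        if rate > best:          # keep only features that lift the curve
            kept.append(f)
            best = rate
    return kept
```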
13
Example I: Handwritten Digit Recognition

• 32×32 bitmaps are divided into 8×8 = 64 blocks
• The set pixels in each block are counted
• Thus an 8×8 matrix is generated, i.e. 64 features
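The block-counting step above can be written directly with a reshape; a small sketch:

```python
import numpy as np

def block_features(bitmap):
    """Count set pixels in each 4x4 block of a 32x32 binary bitmap,
    producing the 8x8 = 64 features described above."""
    b = np.asarray(bitmap).reshape(8, 4, 8, 4)   # 8x8 grid of 4x4 blocks
    return b.sum(axis=(1, 3)).ravel()            # 64 block counts

demo = np.zeros((32, 32), dtype=int)
demo[:4, :4] = 1                                 # fill the top-left block
print(block_features(demo)[0])                   # 16 set pixels in block 0
```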
14
Performance Curve

[Figure: rate of recognition (0.4–1) vs. number of features (0–60) for MSDI, MIFS (β = 0.2, 0.4, 0.6, 0.8, 1.0), and random ranking]
Battiti’s MIFS criterion:

    I(C; f) − β Σ_{s∈S} I(f; s)

β needs to be determined.

MSDI: Maximum Significant Difference and Independence
MIFS: Mutual Information Feature Selector
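A sketch of Battiti's greedy MIFS selection for discretized (integer-coded) features, illustrating that β must be supplied by the user; the plug-in histogram MI estimator here is a simple illustrative choice:

```python
import numpy as np

def discrete_mi(a, b):
    """I(a;b) in nats for discrete (integer-coded) variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa = joint.sum(1, keepdims=True)
    pb = joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mifs_select(X, y, k, beta=0.5):
    """Greedily maximize I(C; f) - beta * sum_{s in S} I(f; s).
    beta trades relevance against redundancy and must be chosen
    by the user, as the slide notes."""
    n_feat = X.shape[1]
    relevance = [discrete_mi(X[:, j], y) for j in range(n_feat)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = [
            -np.inf if j in selected else
            relevance[j] - beta * sum(discrete_mi(X[:, j], X[:, s])
                                      for s in selected)
            for j in range(n_feat)
        ]
        selected.append(int(np.argmax(scores)))
    return selected
```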
15
Computational Complexity

Selecting 15 features from the 64-feature original set:

  MSDI: 24 seconds
  Battiti’s MIFS: 1110 seconds (5 values of β searched in the range 0–1)
16
Example II: Handwritten Digit Recognition

The 649 features are distributed over the following six feature sets:

• 76 Fourier coefficients of the character shapes
• 216 profile correlations
• 64 Karhunen-Loève coefficients
• 240 pixel averages in 2×3 windows
• 47 Zernike moments
• 6 morphological features
17
Performance Curve

[Figure: rate of recognition (0.2–1) vs. number of features (0–50) for MSDI + MIC, MSDI, and random ranking]

MSDI: Maximum Significant Difference and Independence
MIC: Monotonically Increasing Curve
18
Comparison with MIFS

[Figure: rate of recognition (0.4–1) vs. number of features (0–50) for MSDI, MIFS (β = 0.2), and MIFS (β = 0.5)]
• MSDI is much better with a large number of features
• MIFS is better with a small number of features

MSDI: Maximum Significant Difference and Independence
MIFS: Mutual Information Feature Selector
19
Summary on Comparing MSDI with MIFS

• MSDI is much more computationally efficient:
  • MIFS needs to estimate the pdfs
  • Even the computationally efficient criterion (Battiti’s MIFS) still needs β to be determined
  • MSDI involves only simple statistical calculation
• MSDI can select a more nearly optimal feature subset from a large number of features, because it is based on relevant statistical models
• MIFS is more suitable for a small volume of data and a small feature subset
    I(x_v; y) = ∫∫ p(x_v, y) log [ p(x_v, y) / (p(x_v) p(y)) ] dx_v dy

    I′(x_v; y) = I(x_v; y) − β Σ_{u=1}^{p} I(x_v; x_u)
20
Outline of Part II

Mass Spectrometry-Based Proteomic Pattern Analysis for Detection of Prostate Cancer

• Problem Statement
• Methods
  • Feature selection
  • Classification
  • Optimization
• Results and Discussion
21
Problem Statement

1. Very large number of features
2. Electronic and chemical noise
3. Biological variability of human disease
4. Little knowledge of the proteomic mass spectrum
[Figure: four mass spectra, intensity vs. M/Z (0–18000), each with 15154 points (features): mean of the non-cancer training set, the same with the n = 100 most significant features marked, mean of the cancer training set, and the same with the n = 100 most significant features marked]
22
The System of Proteomic Pattern Analysis

Training dataset (initial features > 10^4)
  → most significant features selected by STFS
  → RBFNN / PNN learning
  → trained neural classifier
  → optimization of the size of the feature subset and the parameters of the classifier by minimizing the ROC distance
  → mature classifier

STFS: Significance Test-Based Feature Selection
PNN: Probabilistic Neural Network
RBFNN: Radial Basis Function Neural Network
23
Feature Selection: STFS

MSDI: Significance of feature = Significant difference × Independence
      (here: Student’s t-test and Pearson correlation)
MIC: Monotonically Increasing Curve strategy

STFS: Significance Test-Based Feature Selection
MSDI: Maximum Significant Difference and Independence Algorithm
MIC: Monotonically Increasing Curve Strategy
24
Classification: PNN / RBFNN

[Figure: PNN and RBFNN network structures — inputs x1 … xn, two pattern pools with summation units S1 and S2, and outputs y(1), y(2), y_d]
PNN is a standard structure with four layers; RBFNN is a modified four-layer structure.

PNN: Probabilistic Neural Network
RBFNN: Radial Basis Function Neural Network
    y_d = y(1)  if ||y(1)|| ≥ λ ||y(2)||
          y(2)  otherwise
25
Optimization: ROC Distance

Minimize the ROC distance to optimize:
• the feature subset size m
• the Gaussian spread σ
• the RBFNN pattern decision weight λ
ROC: Receiver Operating Characteristic
[Figure: ROC curve — true positive rate (sensitivity) vs. false positive rate (1 − specificity); d_ROC is the distance from the operating point to the ideal corner (0, 1)]
    d_ROC = min √[ (1 − TP)² + FP² ]
          = min √[ (1 − sensitivity)² + (1 − specificity)² ]
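The distance and its minimization over candidate operating points can be sketched as follows; the candidate parameter values and their sensitivity/specificity pairs are made up for illustration:

```python
import math

def roc_distance(sensitivity, specificity):
    """Distance from the ideal ROC point (0, 1):
    d_ROC = sqrt((1 - sensitivity)^2 + (1 - specificity)^2)."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)

# Hypothetical classifier settings mapped to (sensitivity, specificity);
# pick the setting whose operating point is closest to the ideal corner.
candidates = {0.3: (0.99, 0.80), 0.5: (0.97, 0.96), 0.7: (0.88, 0.99)}
best = min(candidates, key=lambda t: roc_distance(*candidates[t]))
print(best, roc_distance(*candidates[best]))
```

The same minimization can run jointly over the feature subset size m, the Gaussian spread σ, and the decision weight λ by enumerating candidate combinations.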
26
Results: Sensitivity and Specificity

                    Sensitivity   Specificity
  Our results       97.1%         96.8%
  Petricoin (2002)  94.7%         75.9%
  DRE               55–68%        6–33%
  PSA               29–80%        --
27
Pattern Distribution

[Figure: histograms of RBFNN output for non-cancer and cancer samples, with the decision cut-point marked]

Pattern recognized by RBFNN, against the labels from biopsies:

                        Recognized non-cancer    Recognized cancer
  Labelled non-cancer   True negative 96.8%      False positive 3.2%
  Labelled cancer       False negative 2.9%      True positive 97.1%
28
The Possible Causes of the Unrecognizable Samples

1. The classifier algorithm is not able to recognize all the samples
2. The proteomic data do not provide enough information
3. The prostatic biopsy mistakenly labels the cancer status
29
Possibility of Mistaken Diagnosis by Prostatic Biopsy

[Figure: histograms of RBFNN output with the cut-point marked, separating true non-cancer and false non-cancer samples on one side, and false cancer and true cancer samples on the other]
• Biopsy has limited sensitivity and specificity
• The proteomic classifier has very high sensitivity and specificity as measured against biopsy
• The results of the proteomic classifier are not exactly the same as those of biopsy
• All unrecognizable samples are outliers
31
Summary (1)

Significance Test-Based Feature Selection (STFS):

• STFS selects features by maximum significant difference and independence (MSDI); it aims to determine the minimum possible feature subset that achieves the maximum recognition rate
• Feature significance (the selection criterion) is estimated from statistical models chosen in accordance with the properties of the data
• Advantages:
  • Computationally effective
  • Optimality
32
Summary (2)

Proteomic Pattern Analysis for Detection of Prostate Cancer

• The system consists of three parts: feature selection by STFS, classification by PNN/RBFNN, and optimization and evaluation by minimum ROC distance
• With sensitivity of 97.1% and specificity of 96.8%, it would be an asset for detecting prostate cancer early and accurately, and for sparing a large number of aging men unnecessary prostatic biopsies
• The suggestion of mistaken labels by prostatic biopsy through pattern analysis may lead to a novel direction in prostate cancer diagnostic research
33
Thanks for your time
Questions?