University of São Paulo
Institute of Mathematics and Statistics
Computer Science Department
Introduction to Pattern Recognition
A Bioinformatics Viewpoint
Roberto Marcondes Cesar Junior (IME-USP)
http://www.ime.usp.br/~cesar/
[email protected]
Organization
Introduction
Case Study
Generalizing the Concepts
Concluding Remarks
Introduction
Pattern Recognition
To recognize is to classify.
To classify an object is to label the object.
An object is anything we want to recognize.
Applications: computer vision, speech recognition, bioinformatics, ...
Case Study
We are interested in studying some disease, which we will call disease X.
Hypothesis: There are several different types of disease X, which will be called A, B, C, ...
Question: What is the expression behaviour of a given set of genes g1, g2, ..., gn with respect to A, B, C, ...?
Case Study
First step: gathering some sick people (cases C1, C2, C3, C4, C5, C6)
Case Study
Second step: Each case will be analyzed based on the gene expression with respect to g1, g2, ..., gn.
Therefore, we have to measure the expression of the genes of interest for each case C1, C2, ..., C6.
Ex: Microarrays
Case Study
[Figure: the six cases, labeled 1 to 6]
Case Study
[Figure: array of expression values for one case; the legible lines include (..., 1, 7.0, 2) and (..., 0, 9, 20)]
Case Study
[Figure: one array of expression values per case, labeled M1, M2, M3, M4, M5, M6 for cases 1 to 6]
Case Study
Expression vector: stacking the array lines
[Figure: the lines of the array, e.g. (..., 1, 7.0, 2) and (..., 0, 9, 20), stacked into a single column vector (..., 1, 7.0, 2, ..., 0, 9, 20)]
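Stacking the array lines can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the 3x3 array and the names `array_lines` and `stack_lines` are assumptions made for the example, and the values are placeholders rather than real measurements.

```python
# Hypothetical 3x3 array of expression values for one case.
# The entries are illustrative placeholders, not real measurements.
array_lines = [
    [3.0, 7.0, 10.0],
    [1.0, 7.0, 2.0],
    [0.0, 9.0, 20.0],
]


def stack_lines(lines):
    """Concatenate the lines (rows) of an array into one flat vector."""
    return [value for line in lines for value in line]


expression_vector = stack_lines(array_lines)
print(len(expression_vector))
print(expression_vector)
```

An array with r lines and c columns always yields a vector with r * c coordinates, so every case measured on the same array layout ends up in the same vector space.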
Case Study
[Figure: the six expression vectors, one per case; the legible entries include (..., 1, 7.0, 2, ..., 0, 9, 20) and (..., 4.0, 1, 20, ..., 1, 5, 30)]
Case Study
In brief: Each case C1, ..., C6 is represented by a vector v1, v2, ..., v6.
Each coordinate of an expression vector corresponds to the expression of a given gene gi.
Case Study
Some PR terminology:
Feature: one coordinate of the vector, i.e. the expression of one gene
Feature vector: the whole vector, e.g. (..., 1, 7.0, 2, ..., 0, 9, 20)
Case Study
Training set: the set of feature vectors v1, ..., v6
Sample: one feature vector of the training set
Case Study
Let’s simplify things: we’re only interested in two genes, g1 and g2.
Feature vectors (g1, g2):
v1 = (15, 13), v2 = (12, 16), v3 = (17, 15), v4 = (1, 4), v5 = (5, 2), v6 = (4, 1)
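One way to see how the six two-gene feature vectors relate is to compute their pairwise distances. The sketch below assumes the Euclidean distance, which the slides do not name explicitly. The helper `distance` is introduced here only for illustration.

```python
import math

# The six two-gene feature vectors (g1, g2) listed on the slide.
vectors = {
    "v1": (15, 13), "v2": (12, 16), "v3": (17, 15),
    "v4": (1, 4), "v5": (5, 2), "v6": (4, 1),
}


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


names = sorted(vectors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(distance(vectors[a], vectors[b]), 2))
```

With these values, v1, v2 and v3 lie within about 5.1 of one another, and so do v4, v5 and v6, while every distance between the two groups is larger than 14.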
Case Study
[Figure: v1, ..., v6 plotted with g1 and g2 as axes, both from 0 to 20]
Feature space: the plane with axes g1 and g2
Classes: Type A, Type B
Case Study: the classifier
Input: training set with unlabelled samples
[Figure: feature-space plot, both axes from 0 to 20]
Output: classes of the feature space
[Figure: feature-space plot, both axes from 0 to 20]
Unsupervised classifier: clustering algorithm
Case Study: Linkage Algorithm
[Figure: feature-space plot, both axes from 0 to 20]
[Figure: dendrogram over v2, v1, v3, v4, v6, v5]
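The slides do not say which linkage rule is used, so the following is only a sketch under the assumption of single linkage (cluster distance = distance between the closest members) with the Euclidean metric. The function `single_linkage` and its return format are hypothetical names chosen for this example. It records the sequence of merges, which is the information a dendrogram draws.

```python
import math

# The six two-gene feature vectors (g1, g2) from the case study.
vectors = {
    "v1": (15, 13), "v2": (12, 16), "v3": (17, 15),
    "v4": (1, 4), "v5": (5, 2), "v6": (4, 1),
}


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points):
    """Agglomerative clustering; returns the merges in the order they occur.

    Each merge is (cluster_a, cluster_b, distance), where a cluster is a
    tuple of sample names.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = single_linkage(vectors)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

Stopping before the last merge leaves two clusters, {v1, v2, v3} and {v4, v5, v6}. The last merge happens at a much larger distance than the earlier ones, which is what a long branch in a dendrogram indicates.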
Case Study: Visualization
Intermezzo: vectors as signals
[Figure: an expression vector (..., 0, 9, 20) plotted as a signal; horizontal axis 0 to 18, vertical axis 0 to 20]
Case Study: Visualization
Intermezzo: signals as images
[Figure: signal plot; horizontal axis 0 to 18, vertical axis 0 to 20]
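One plausible reading of "signals as images" is that each expression vector becomes one row of pixels, with the gray level encoding the expression value. The sketch below assumes a linear mapping to 0..255. The function `to_gray_levels` is a hypothetical helper, and the two signals contain only the legible entries of the vectors shown on the earlier slides, with the elided entries left out.

```python
def to_gray_levels(vector, lo, hi):
    """Map each value in [lo, hi] linearly to a gray level in 0..255."""
    scale = 255.0 / (hi - lo)  # assumes hi > lo
    return [round((v - lo) * scale) for v in vector]


# Legible entries of two expression vectors; elided entries are omitted.
signals = [
    [1.0, 7.0, 2.0, 0.0, 9.0, 20.0],
    [4.0, 1.0, 20.0, 1.0, 5.0, 30.0],
]

# Use one common scale so that gray levels are comparable across rows.
lo = min(min(s) for s in signals)
hi = max(max(s) for s in signals)

# Each signal becomes one row of the image.
image = [to_gray_levels(s, lo, hi) for s in signals]
for row in image:
    print(row)
```

Using a single scale for all rows is a design choice: scaling each row separately would make every row span the full gray range and hide differences in overall expression level between cases.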
Generalizing the concepts
Putting it all together: data mining
Concluding remarks
Supervised classification
Which classifier should be used?
Be careful: clustering algorithms always find clusters!
Normalization issues
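The normalization issue can be illustrated with a small sketch. When one gene is measured on a much larger scale than another, it dominates any distance computed between samples. The example below assumes z-score standardization per feature, which is only one of several possible normalizations. The data and the function `zscore_columns` are hypothetical.

```python
import statistics


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit variance.

    Assumes no column is constant (its standard deviation must be nonzero).
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical data: the second feature lives on a much larger scale.
data = [
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
]

normalized = zscore_columns(data)
for row in normalized:
    print([round(x, 3) for x in row])
```

After standardization both features contribute on the same scale, so a clustering algorithm no longer groups samples by the second feature alone.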
Concluding remarks
A key problem: which genes should be used?
Or: which features should be selected?
Well-known problem in PR: Dimensionality Reduction
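As a rough illustration of feature selection, genes can be ranked by how well each one separates two known classes. The score used below, |mean_A - mean_B| / (std_A + std_B), is a simple choice made for this sketch and is not a method taken from the slides. The class labels (v1 to v3 as Type A, v4 to v6 as Type B) and the third gene g3 are assumptions added for the example.

```python
import statistics

# Columns: g1 and g2 (values from the case study) plus a hypothetical
# gene g3 that is added here only for illustration.
samples = [
    (15, 13, 7), (12, 16, 9), (17, 15, 8),  # assumed Type A
    (1, 4, 8), (5, 2, 7), (4, 1, 9),        # assumed Type B
]
labels = ["A", "A", "A", "B", "B", "B"]
genes = ["g1", "g2", "g3"]


def separation_score(values, labels):
    """|mean_A - mean_B| / (std_A + std_B); larger means better separated."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    spread = statistics.pstdev(a) + statistics.pstdev(b)  # assumed nonzero
    return abs(statistics.mean(a) - statistics.mean(b)) / spread


scores = {
    gene: separation_score([s[i] for s in samples], labels)
    for i, gene in enumerate(genes)
}

# Rank genes from most to least discriminative.
ranked = sorted(genes, key=scores.get, reverse=True)
print(ranked)
```

Here g3 has the same mean in both assumed classes, so its score is zero and it is ranked last. Keeping only the top-ranked genes reduces the dimension of the feature space.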
Concluding remarks
[Figure: plot with axes Y1 and Y2]
Concluding remarks
[Figures: Feature space 1, Feature space 2, Feature space 3]