gene family classification using a semi-supervised learning method
![Page 1: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/1.jpg)
Gene family classification using a semi-supervised learning method
Nan Song
Advisors: John Lafferty, Dannie Durand
![Page 2: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/2.jpg)
Outline
• Introduction
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
![Page 3: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/3.jpg)
The Genome
The complete genetic material of an organism or species
![Page 4: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/4.jpg)
Key genomic component: genes
ACCCTTAGCTAGACCTTTAGGAGG...
A gene is a DNA subsequence
![Page 5: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/5.jpg)
Key genomic component: genes
Genes encode proteins, the building blocks of the cell
ACCCTTAGCTAGACCTTTAGGAGG...
A gene is a DNA subsequence
A protein is an amino acid sequence
V H L T P E...
![Page 6: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/6.jpg)
Whole Genome Sequencing
413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria
In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes
Source: www.genomesonline.org
![Page 7: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/7.jpg)
• atgcaccttg
![Page 8: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/8.jpg)
Gene prediction and annotation
International Human Genome Consortium, Nature 2001
Known genes: 14,882
Predicted genes: 16,896
Total: 31,778
![Page 9: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/9.jpg)
Gene annotation
• We are given a new genome sequence with predicted genes.
• A few genes are well studied.
• Identify other genes in the same family to predict function.
• Verify predictions experimentally.
Two contexts:
– Individual scientist
– High throughput
![Page 10: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/10.jpg)
Outline
• Introduction
  – Molecular biology
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
![Page 11: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/11.jpg)
Evolutionarily related genes have related functions
[Figure: an ancestral gene (atgccaggactcccagtga…) gives rise, through two duplications, to β-globin (atgcgccgtctggcatgt…, adult), γ-globin (atgcaaggagtcccagagc…, fetal), and ε-globin (atgcgaggtctcccatgt…, embryonic).]
![Page 12: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/12.jpg)
Evolutionarily related genes have related functions
Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes.
[Figure: an ancestral gene (atgccaggactcccagtga…) duplicating twice into β-globin, γ-globin, and ε-globin.]
![Page 13: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/13.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
![Page 14: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/14.jpg)
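A graphical model of sequence relatedness
G = (V, E)
V: vertices represent sequences
E: the weight of an edge is proportional to the similarity between the two sequences it joins.
[Figure: two example sequences, …atgcaaggagtcccagagcc… and …atgcgaggtctcccagtgtc…, drawn as vertices xi and xj joined by a weighted edge.]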
![Page 15: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/15.jpg)
![Page 16: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/16.jpg)
Gene family classification
Goal: Given known genes, identify genes in the same family.
Biological scenario:
• small number of known genes
• large number of unknown genes
![Page 17: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/17.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
![Page 18: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/18.jpg)
Framework: binary classification
Determine which unlabeled genes belong to the family.
Machine learning scenario:
• small number of labeled data
  – genes known to be in family
  – genes clearly not in family
• large number of unlabeled data
![Page 19: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/19.jpg)
Several challenging problems of gene family classification
Traditionally, similarity is represented by sequence comparison.
[Figure: an ancestral gene duplicates twice into atgcgccgtctggcatgt…, atgcaaggagtcccagagc… and atgcgaggtctcccatgt…; the copies then diverge through mutations (atgcgccccccggcatgt…) and DNA shuffling (atgcgccgtctggcatgt…ggctcgta).]
![Page 20: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/20.jpg)
![Page 21: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/21.jpg)
Several challenging problems of gene family classification
Families
– may not form a clique
– may not form a connected component
– may have edges to sequences outside the family.
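The three properties listed above can be checked directly on a similarity graph. The following is a minimal sketch, not the author's code: the function name, the edge-list format `(u, v, weight)`, and the choice to test connectivity using only edges between family members are all assumptions made for illustration.

```python
from collections import deque

def family_graph_properties(edges, family):
    """Return (is_clique, is_connected, outside_edges) for a non-empty family.

    edges  : iterable of (u, v, weight) tuples, undirected (hypothetical format)
    family : collection of vertex ids known to belong to the family
    """
    family = set(family)
    inside = {v: set() for v in family}   # adjacency among family members only
    outside_edges = []                    # edges leaving the family
    for u, v, w in edges:
        if u in family and v in family:
            inside[u].add(v)
            inside[v].add(u)
        elif u in family or v in family:
            outside_edges.append((u, v, w))

    # Clique: every member is adjacent to every other member.
    is_clique = all(len(nbrs) == len(family) - 1 for nbrs in inside.values())

    # Connected: breadth-first search over family-internal edges reaches everyone.
    start = next(iter(family))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in inside[u] - seen:
            seen.add(v)
            queue.append(v)
    is_connected = seen == family

    return is_clique, is_connected, outside_edges
```

For example, a three-member family joined as a path (a-b, b-c) with one edge from c to a nonfamily sequence is connected but not a clique, and reports that single outside edge.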
![Page 22: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/22.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
  – Semi-supervised learning algorithm
  – Supervised learning algorithm
• Empirical evaluation
• Conclusion
![Page 23: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/23.jpg)
Gene family classification
Goal: Binary classification
Machine learning scenario:
• large number of unlabeled data
• small number of labeled data
Semi-supervised learning:
• Exploits information from both labeled and unlabeled data
• Has performed well in many applications
![Page 24: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/24.jpg)
Graphical semi-supervised learning (Binary classification)
Notation:
• V: the whole data set
• L: labeled data set
• U: unlabeled data set
• Each vertex: (xi, yi) or (xk, f(k))
[Figure: a graph with a labeled vertex (xi, yi = 1), a labeled vertex (xj, yj = 0), and an unlabeled vertex (xk, f(k)).]
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)
![Page 25: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/25.jpg)
Graphical semi-supervised learning (Binary classification)
• Input:
  – family members: (xi, yi = 1)
  – nonfamily members: (xj, yj = 0)
• Output:
  – Assign a real value to every vertex in the graph
  – Find a cutoff to separate the two classes
[Figure: a graph with labeled vertices (xi, yi = 1) and (xj, yj = 0) and an unlabeled vertex (xk, f(k)).]
![Page 26: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/26.jpg)
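Graphical semi-supervised learning (Binary classification)
G = (V, E)
L: labeled data set
U: unlabeled data set
[Figure: a graph with labeled vertices (y = 0 and y = 1) and an unlabeled vertex (xk, f(k)).]
Assign real values to all vertices in the graph, to minimize E(f):
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L$$
where $W_{ij} = \exp(\cdot)$ is a function of the similarity score $S_{ij}$.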
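One way to see what minimizing E(f) does: setting the derivative with respect to an unlabeled f(k) to zero gives f(k) = Σ_j W_kj f(j) / Σ_j W_kj, so each unlabeled value is the weighted average of its neighbours. The sketch below iterates that averaging until it settles. It is an illustration under assumptions: the function name and the list-of-lists weight matrix are hypothetical, W is assumed symmetric, and the weights are taken as given because the slide's kernel expression is not fully legible.

```python
def propagate_labels(W, labels, n_iter=1000, tol=1e-9):
    """Approximate the minimizer of sum_ij W[i][j] * (f[i] - f[j])**2
    with f fixed to the given labels on labeled vertices.

    W      : symmetric n x n list of lists of nonnegative edge weights
    labels : dict {vertex index: 0 or 1} for the labeled vertices
    """
    n = len(W)
    f = [float(labels.get(i, 0.5)) for i in range(n)]  # unlabeled start at 0.5
    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(n_iter):
        max_change = 0.0
        for k in unlabeled:
            total = sum(W[k][j] for j in range(n) if j != k)
            if total == 0:
                continue  # isolated vertex: no neighbours to average over
            new = sum(W[k][j] * f[j] for j in range(n) if j != k) / total
            max_change = max(max_change, abs(new - f[k]))
            f[k] = new
        if max_change < tol:
            break
    return f
```

On a four-vertex path with unit weights whose ends are labeled 1 and 0, the two interior vertices settle at 2/3 and 1/3: the labels diffuse along the unlabeled vertices, which is the behaviour the semi-supervised objective is designed to exploit.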
![Page 27: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/27.jpg)
Graph-based semi-supervised learning
f(xk)
http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
Works well
![Page 28: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/28.jpg)
Graph-based semi-supervised learning
f(xk)
http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
Works well Works well ?
![Page 29: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/29.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
  – Semi-supervised learning
  – Supervised learning
• Empirical evaluation
• Conclusion
![Page 30: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/30.jpg)
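Semi-supervised vs kernel-based supervised learning
• Semi-supervised learning:
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L$$
• Supervised learning:
$$E(f) = \sum_{i \in L} \sum_{j \in U} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2$$
where L is the labeled data set and U is the unlabeled data set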
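In the supervised objective only labeled-to-unlabeled edges appear, so each unlabeled vertex can be solved on its own. Assuming f(i) = y_i is also held fixed on the labeled set (the slide states the constraint only under the semi-supervised objective), the minimizer is the weighted average of the labels, f(j) = Σ_{i∈L} W_ij y_i / Σ_{i∈L} W_ij. A minimal sketch, with a hypothetical function name and input format:

```python
def supervised_scores(W, labels):
    """Minimize sum over labeled i, unlabeled j of W[i][j] * (y_i - f[j])**2.

    W      : n x n list of lists of nonnegative edge weights
    labels : dict {vertex index: 0 or 1} for the labeled vertices
    Returns a list f; an unlabeled vertex with no edge to any labeled
    vertex gets None, because the objective does not constrain it.
    """
    n = len(W)
    f = [None] * n
    for i, y in labels.items():
        f[i] = float(y)
    for j in range(n):
        if j in labels:
            continue
        total = sum(W[i][j] for i in labels)
        if total > 0:
            f[j] = sum(W[i][j] * y for i, y in labels.items()) / total
    return f
```

On a four-vertex path with unit weights and end labels 1 and 0, the interior vertices score exactly 1 and 0, because each sees only its single labeled neighbour. Edges between unlabeled vertices never enter the calculation, which is the structural difference from the semi-supervised objective.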
![Page 31: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/31.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
  – Methodology
  – Results
• Conclusion
![Page 32: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/32.jpg)
Graph construction
G = (V, E)
V: all mouse sequences from SwissProt (n = 7439)
E: edge weights based on a newly designed sequence similarity measure, with 0 < S(i, j) < 1
![Page 33: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/33.jpg)
Methodology
• Graph construction
• Test set construction
• Experiments performed
• Basis for evaluation
![Page 34: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/34.jpg)
Test set construction
18 well studied protein families
– Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.
Families: ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, T-box, TNFR, TRAF, USP, WNT
![Page 35: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/35.jpg)
Test set construction
• Retrieved all complete mouse sequences from SwissProt database (7,439)
• Identified sequences for each test family based on
  – Nomenclature committee reports
  – Structural properties
  – Literature surveys

| Family | Size |
|---|---|
| ACSL | 5 |
| ADAM | 26 |
| DVL | 3 |
| FGF | 20 |
| FOX | 30 |
| GATA | 6 |
| Kinase | 293 |
| Kinesin | 18 |
| Laminin | 11 |
| Myosin | 12 |
| Notch | 4 |
| PDE | 15 |
| SEMA | 16 |
| TNFR | 24 |
| TRAF | 9 |
| Tbox | 6 |
| USP | 18 |
| WNT | 19 |
![Page 36: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/36.jpg)
![Page 37: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/37.jpg)
Experiments performed
• Compare semi-supervised with supervised learning algorithm
• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)
![Page 38: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/38.jpg)
Tested parameters
[Figure: parameter space with three axes: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), and σ]
For each set of parameters, 20 tests were performed
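The slides do not say how the 20 tests for a parameter set differ from one another. One plausible reading, stated here as an assumption, is that each test draws a fresh random subset of LF family members and LN nonfamily members to act as the labeled data. A sketch of that sampling step, with a hypothetical function name and a fixed seed for repeatability:

```python
import random

def draw_labeled_sets(family, nonfamily, lf, ln, n_trials=20, seed=0):
    """Draw n_trials random labeled subsets (assumed protocol, not from the slides).

    family, nonfamily : collections of sequence ids
    lf, ln            : number of labeled family / nonfamily members per trial
    Returns a list of (labeled_family, labeled_nonfamily) pairs.
    """
    rng = random.Random(seed)
    family = sorted(family)        # fixed order so the seed fully determines draws
    nonfamily = sorted(nonfamily)
    trials = []
    for _ in range(n_trials):
        trials.append((rng.sample(family, lf), rng.sample(nonfamily, ln)))
    return trials
```

With a family of 4 sequences out of 7439 (the Notch sizes given in the deck), `draw_labeled_sets(range(4), range(4, 7439), lf=1, ln=100)` returns 20 pairs, each with 1 labeled family member and 100 labeled nonfamily members.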
![Page 39: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/39.jpg)
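Tested parameters (1)
Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100
[Plot: edge weight W (0 to 1) against similarity S (0 to 1), one curve per σ; curve labels: σ = 0.02, 0.05, 0.08, 0.1, 0.2, 0.5, 1, 10, 100]
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2, \quad \text{where } W_{ij} = \exp(\cdot) \text{ is a function of } S_{ij}$$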
![Page 40: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/40.jpg)
Tested parameters (2)
• Labeled Family members (LF): 10-70% of family size
• Labeled Nonfamily members (LN): 100, 500, 1000 (about 1 - 10% of nonfamily size)
Database size: 7439

| Family | Size | LF |
|---|---|---|
| ACSL | 5 | 1, 3 |
| ADAM | 26 | 3, 5, 7, 9, 15 |
| DVL | 3 | 1 |
| FGF | 20 | 3, 5, 7, 11, 15 |
| FOX | 30 | 3, 9, 15 |
| GATA | 6 | 1, 3 |
| Kinase | 293 | 3, 7, 11, 15, 20, 50, 150 |
| Kinesin | 18 | 2, 6, 9 |
| Laminin | 11 | 1, 3, 5, 7 |
| Myosin | 12 | 2, 4, 6, 9 |
| Notch | 4 | 1, 2 |
| PDE | 15 | 2, 5, 7, 10 |
| SEMA | 16 | 2, 5, 8 |
| TNFR | 24 | 2, 4, 8, 12, 18 |
| TRAF | 9 | 1, 3 |
| Tbox | 6 | 2, 5 |
| USP | 18 | 2, 4, 6, 9, 13 |
| WNT | 19 | 2, 9 |
![Page 41: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/41.jpg)
![Page 42: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/42.jpg)
Semi-supervised learning
Goal:
f(i) > f(j) when xi is a family member and xj is not.
Evaluation criteria:
• Visualization
• AUC score
• False negatives
![Page 43: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/43.jpg)
Visualization
Sort all unlabeled data by f(x)
[Rank plot: f(x) against rank, with family members and nonfamily members marked separately]
![Page 44: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/44.jpg)
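AUC (Area Under ROC Curve)
[Figure: ROC curve (sensitivity against 1 - specificity) beside a rank plot (f(x) against rank, family members and nonfamily members marked separately)]
$$AUC = \frac{1}{n_a n_b} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} \mathbf{1}\,[a_i > b_j]$$
(taking $a_i$ and $b_j$ to be the f values of the $n_a$ family members and the $n_b$ nonfamily members)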
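AUC can be computed straight from the ranking as the fraction of (family, nonfamily) pairs in which the family member has the higher score. The sketch below does exactly that. The function name is hypothetical, and counting tied pairs as one half is a common convention adopted here as an assumption, not something the slides specify.

```python
def auc(family_scores, nonfamily_scores):
    """Fraction of (family, nonfamily) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for a in family_scores:
        for b in nonfamily_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(family_scores) * len(nonfamily_scores))
```

With family scores 0.9, 0.8, 0.3 and nonfamily scores 0.5, 0.2, five of the six pairs are ordered correctly, so the AUC is 5/6. The pairwise loop is quadratic; for a database of several thousand sequences a sort-based rank sum would give the same value faster.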
![Page 45: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/45.jpg)
Advantages of rank plot
AUC = 0.9382
![Page 46: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/46.jpg)
AUC scores do not reflect all information we need
• False negatives after the first false positive
• The number of missed data after the first false positive
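The count described above can be read off the sorted list: walk down the ranking until the first nonfamily sequence appears, then count the family members that rank below it. A minimal sketch with hypothetical names, assuming higher f(x) means more family-like:

```python
def missed_after_first_false_positive(scores, is_family):
    """Count family members ranked below the highest-ranked nonfamily member.

    scores    : list of f(x) values for the unlabeled sequences
    is_family : list of booleans, True where the sequence is a family member
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, i in enumerate(order):
        if not is_family[i]:
            # First false positive found; everything after it is "below the cut".
            return sum(1 for k in order[rank + 1:] if is_family[k])
    return 0  # no nonfamily sequence in the list
```

For a ranking of family, family, nonfamily, family, nonfamily the function returns 1: a biologist who stops at the first wrong prediction would miss one family member, whatever the AUC says.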
![Page 47: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/47.jpg)
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
  – Methodology
  – Results
• Conclusion
![Page 48: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/48.jpg)
Several challenging problems of gene family classification
Families
– may not form a clique
– may not form a connected component
– may have edges to sequences outside the family.
Edges to sequences outside the family are mainly a problem if they have strong edge weights
![Page 49: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/49.jpg)
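Test families have different graph properties
Source column headers: Family, size, Clique, Connected, NOT Connected

| Family | size | Mark(s) |
|---|---|---|
| ACSL | 5 | W |
| DVL | 3 | W |
| FOX | 30 | W |
| GATA | 6 | W |
| SEMA | 16 | W |
| TRAF | 9 | W |
| Tbox | 6 | W |
| WNT | 19 | W |
| Kinesin | 18 | S |
| Laminin | 12 | S |
| Myosin | 12 | S |
| Notch | 4 | S |
| ADAM | 26 | X |
| FGF | 20 | X |
| Kinase | 293 | X |
| PDE | 15 | X X |
| TNFR | 24 | X X |
| USP | 18 | X X |

W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights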
![Page 50: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/50.jpg)
Results
• Compare semi-supervised with supervised learning algorithm
• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)
![Page 51: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/51.jpg)
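Tested parameters
[Figure: parameter space with three axes: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), and σ]
[Plot: average AUC (axis 0.9996 to 1) against σ (ticks at 0.1, 0.2, 0.5, 1, 10) for Notch, LF = 1, LN = 1000]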
![Page 52: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/52.jpg)
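The effect of σ
[Plot: edge weight W (0 to 1) against raw similarity score s (0 to 1), one curve per σ; curve labels: σ = 0.02, 0.05, 0.08, 0.1, 0.2, 0.5, 1, 10, 100]
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2, \quad \text{where } W_{ij} = \exp(\cdot) \text{ is a function of } S_{ij}$$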
![Page 53: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/53.jpg)
![Page 54: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/54.jpg)
Edges to sequences outside the family are mainly a problem if they have strong edge weights
![Page 55: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/55.jpg)
Edges to sequences outside the family are mainly a problem if they have strong edge weights
[Histograms for FOX and Notch: number of edges against raw edge weight]
![Page 56: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/56.jpg)
Case study: rank plots for semi-supervised learning in FOX
LF = 3, LN = 100, family size: 30
[Four rank plots, for σ = 0.1, σ = 1, σ = 10, and σ = 100]
![Page 57: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/57.jpg)
Case study: rank plots for semi-supervised learning in Notch
Labeled family seqs: 1 (out of 4); labeled nonfamily seqs: 100 (out of 7435)
[Four rank plots, labeled σ = 0.1, σ = 1, σ = 10, and σ = 10]
![Page 58: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/58.jpg)
[Two plots: average AUC (axis 0.9996 to 1) against σ (ticks at 0.1, 0.2, 0.5, 1, 10), one for Notch, LF = 1, LN = 1000, and one for FOX, LF = 3, LN = 1000]
![Page 59: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/59.jpg)
Summary on σ
• For most families, the performance is not very sensitive to σ
• For almost all families that form a clique, there is at least one value of σ (usually many) such that both semi-supervised and supervised learning algorithms achieve perfect classification performance
![Page 60: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/60.jpg)
![Page 61: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/61.jpg)
![Page 62: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/62.jpg)
The connection among sequences in ADAM family
[Histogram: count (axis 0 to 16) against # of connected ADAM sequences]
![Page 63: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/63.jpg)
The connection among sequences in ADAM family
![Page 64: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/64.jpg)
Tested parameters
[Figure: parameter space with three axes: number of Labeled Family members (LF), number of Labeled Nonfamily members (LN), and σ]
By taking the maximum over σ for each combination of number of Labeled Family members and number of Labeled Nonfamily members, we obtain the best average AUC score
![Page 65: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/65.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, ADAM: AUC (axis 0.992 to 1) against # labeled family seqs, LF (3, 5, 7, 9, 15); curve: Supervised, LN = 100]
![Page 66: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/66.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, ADAM: AUC (axis 0.992 to 1) against # labeled family seqs, LF (3, 5, 7, 9, 15); curves: Supervised, LN = 100; Semi-supervised, LN = 100]
A paired t-test was performed to detect the difference between the semi-supervised and supervised methods for a set of parameters
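A paired t-test fits this setting when both methods are scored on the same trials, so that each trial yields one AUC difference. The statistic is t = mean(d) / (sd(d) / √n) with n − 1 degrees of freedom, where d is the list of per-trial differences. The sketch below computes it with the standard library only; the function name is hypothetical, and converting t to a p-value is left to a statistics table or package.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for paired samples x and y.

    x, y : equal-length sequences, e.g. per-trial AUC of two methods
    Needs at least two pairs and non-identical differences, otherwise the
    standard deviation is undefined or zero.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = statistics.mean(d)
    sd_d = statistics.stdev(d)            # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1
```

As a purely illustrative input (made-up numbers, not results from the talk), x = [1.0, 2.0, 3.0, 4.0] and y = [0.5, 1.0, 2.0, 2.5] give differences with mean 1.0 and a t statistic of √24 ≈ 4.90 on 3 degrees of freedom.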
![Page 67: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/67.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, ADAM: AUC (axis 0.992 to 1) against # labeled family seqs (3, 5, 7, 9, 15); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100]
![Page 68: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/68.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, ADAM: AUC (axis 0.992 to 1) against # labeled family seqs (3, 5, 7, 9, 15); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100; Semi-supervised, LN = 1000]
![Page 69: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/69.jpg)
Graph structure of ADAM
• Troublemaker: ADAMTS10 matches with only 8 out of 26 sequences in the ADAM family.
• ADAMTS10 is often misclassified.
• ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.
![Page 70: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/70.jpg)
Semi-supervised method
Supervised method
![Page 71: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/71.jpg)
Several challenging problems of gene family classification
• Sequences in the same family
  – do not form a clique
  – do not exist in the same connected component
• Sequences in different families
  – have significant matches
![Page 72: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/72.jpg)
![Page 73: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/73.jpg)
The connection among sequences in TNFR family
![Page 74: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/74.jpg)
The connection among sequences in TNFR family
[Histogram: count (axis ticks 2, 4, 6) against # of connected TNFR sequences (10 to 20)]
20 TNFR in this connected component
![Page 75: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/75.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, TNFR (family size 24): AUC (axis 0.92 to 1) against number of labeled family members (2, 4, 8, 12, 18); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi, LN = 100; Semi, LN = 1000]
![Page 76: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/76.jpg)
Summary for Number of labeled family members
• The performance of both semi-supervised and supervised learning improves as LF increases for all families.
• In non-clique families, semi-supervised learning performs better than supervised when LF is small.
![Page 77: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/77.jpg)
Rank plots for semi-supervised learning in TNFR
σ = 0.1, LF = 2, LN = 100
AUC values do not reflect all information that we need
![Page 78: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/78.jpg)
The impact of number of labeled family and nonfamily members on the performance
[Plot, TNFR (family size 24): number of missed TNFR against number of labeled family members (2, 4, 8, 12, 18); curves: Supervised, LN = 100; Supervised, LN = 1000; Semi, LN = 100; Semi, LN = 1000]
![Page 79: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/79.jpg)
![Page 80: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/80.jpg)
Summary for Number of labeled non-family members (LN)
• The performance of supervised learning improves as LN increases for all families.
• For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.
![Page 81: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/81.jpg)
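Summary of results
Row group labels in the source: Clique, Connected
Column headers as listed in the source: Small LF, Large LF; LN: 100, 1000, 100, 1000, 100, 1000, 100, 1000; method: Super, Semi, Super, Semi, Super, Semi, Super, Semi

| Family | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Laminin | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 1 | 1 | 1 | 1 |
| ADAM | 0.9951 | 1 | 0.9989 | 1 | 1 | 1 | 1 | 1 |
| Kinase | 0.9549 | 0.9644 | 0.9745 | 0.9738 | 0.9656 | 0.9666 | 0.9804 | 0.9771 |
| PDE | 0.9181 | 0.9364 | 0.9644 | 0.9589 | 0.9612 | 0.9603 | 0.9850 | 0.9769 |
| TNFR | 0.9297 | 0.9420 | 0.9537 | 0.9526 | 0.9628 | 0.9671 | 0.9845 | 0.9866 |
| USP | 0.9792 | 0.9798 | 0.9907 | 0.9895 | 0.9827 | 0.9875 | 0.9900 | 0.9895 |

Families listed without values: ACSL, DVL, FOX, GATA, Kinesin, Myosin, Notch, SEMA, TRAF, Tbox, WNT, FGF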
![Page 82: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/82.jpg)
Insights - 1
• SSL is most effective for families that are not cliques but are connected.
• In test set, 12/18 cliques, 3/18 not connected.
• What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?
![Page 83: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/83.jpg)
Insights - 2
• Performance evaluation measures should match the needs of the user.
– AUC scores penalize all FNs and FPs.
– For experimental biologists, top ranked predictions are of interest
– The number of FNs after the first false positive can reveal some information
![Page 84: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/84.jpg)
Insights - 3
• The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small
![Page 85: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/85.jpg)
Acknowledgements
• John Lafferty
• Dannie Durand
• Jerry Zhu
Durand Lab
• Robbie Sedgewick
• Rose Hoberman
• Ben Vernot
• Narayanan Raghupathy
• Aiton Goldman
• Jacob Joseph
• Annette McLeod
• Maureen Stolzer
![Page 86: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/86.jpg)
Thank You !
![Page 87: Gene family classification using a semi-supervised learning method](https://reader035.vdocuments.us/reader035/viewer/2022070416/56815015550346895dbdfb78/html5/thumbnails/87.jpg)