hierarchical text categorization and its application to bioinformatics
DESCRIPTION
Hierarchical Text Categorization and its Application to Bioinformatics. Stan Matwin and Svetlana Kiritchenko, joint work with Fazel Famili (NRC) and Richard Nock (Université Antilles-Guyane). School of Information Technology and Engineering, University of Ottawa.
![Page 1: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/1.jpg)
Hierarchical Text Categorization and its Application to Bioinformatics
Stan Matwin and Svetlana Kiritchenko
joint work with Fazel Famili (NRC), and
Richard Nock (Université Antilles-Guyane)
School of Information Technology and Engineering
University of Ottawa
![Page 2: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/2.jpg)
2
Outline
• What is hierarchical text categorization (HTC)
• Functional gene annotation requires HTC
• Ensemble-based learning and AdaBoost
• Multi-class multi-label AdaBoost
• Generalized local hierarchical learning method
• New global hierarchical learning algorithm
• New hierarchical evaluation measure
• Application to Bioinformatics
![Page 3: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/3.jpg)
3
Text categorization
• Given: $d_j \in D$ – textual documents; $C = \{c_1, \dots, c_{|C|}\}$ – predefined categories
• Task: $\langle d_j, c_i \rangle \in D \times C \to \{\text{True}, \text{False}\}$
[Figure: categories c1–c7 (TC)]
![Page 4: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/4.jpg)
4
Hierarchical text categorization
• Hierarchy of categories: $\le\ \subseteq C \times C$ – a reflexive, anti-symmetric, transitive binary relation on $C$
[Figure: categories c1–c7 arranged in a hierarchy (HTC)]
![Page 5: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/5.jpg)
5
Advantages of HTC
• Additional, potentially valuable information
– Relationships between categories
• Flexibility
– High levels: general topics
– Low levels: more detail
![Page 7: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/7.jpg)
7
Text classification and bioinformatics
• Clustering and classification of gene expression data
– DNA chip time series – performance data
– Gene function, process, … – genetic knowledge – GO
– Literature will connect the two – domain knowledge
• Validation of results from performance data
![Page 8: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/8.jpg)
8
Example: Gene Ontology
![Page 9: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/9.jpg)
9
From data to knowledge via literature
• Functional annotation of genes from biomedical literature
![Page 10: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/10.jpg)
10
Other applications
• Web directories
• Digital libraries
• Patent databases
• Biological ontologies
• Email folders
![Page 12: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/12.jpg)
12
Boosting
• not a learning technique on its own, but a method in which a family of “weak” learners (simple learners) is combined
• based on the fact that multiple classifiers that disagree with one another can together be more accurate than any of the component classifiers
• if there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 hypotheses being wrong
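The binomial argument in the last bullet can be checked numerically. The sketch below is only an illustration: the function name `majority_vote_error` is made up here, and it assumes all classifiers share the same error rate and err independently, which real ensembles rarely do.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that more than half of the classifiers are wrong at once.

    Assumes identical, independent error rates (an idealization).
    """
    first_losing_count = n_classifiers // 2 + 1  # smallest count above L/2
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1.0 - error_rate) ** (n_classifiers - k)
        for k in range(first_losing_count, n_classifiers + 1)
    )


if __name__ == "__main__":
    for size in (1, 3, 11, 21):
        print(size, majority_vote_error(size, 0.3))
```

With three classifiers at an error rate of 0.3, the vote is wrong when two or three of them err: 3 · 0.3² · 0.7 + 0.3³ = 0.216, already below the individual rate.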
![Page 13: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/13.jpg)
13
Why do we have committees (ensembles)?
![Page 14: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/14.jpg)
14
Boosting – the very idea
• Train an ensemble of classifiers, sequentially
• Each next classifier focuses more on the training instances on which the previous one has made a mistake
• The “focusing” is done through the weighting of the training instances
• To classify a new instance, make the ensemble vote
![Page 15: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/15.jpg)
15
![Page 16: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/16.jpg)
16
Boosting - properties
• If each hl is only better than chance, boosting can attain ANY accuracy!
• No need for new examples, additional knowledge, etc
• Original AdaBoost is on single-labeled data
![Page 18: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/18.jpg)
18
AdaBoost.MH [Schapire and Singer, 1999]
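• $(d_i, C_i) \to ((d_i, l), C_i[l])$, $l \in C$ (m documents, k categories, $C_i[l] \in \{-1, +1\}$)
• Initialize distribution $P_1(i,l) = 1/(mk)$.
• For $t = 1, \dots, T$:
– Train weak learner using distribution $P_t$.
– Get weak hypothesis $h_t: D \times C \to \mathbb{R}$.
• Update: $P_{t+1}(i,l) = \dfrac{P_t(i,l)\,\exp(-\alpha_t\, C_i[l]\, h_t(d_i,l))}{Z_t}$, where $Z_t$ normalizes $P_{t+1}$ to a distribution
• The final hypothesis: $H(d,l) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(d,l)\right)$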
![Page 19: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/19.jpg)
19
BoosTexter [Schapire and Singer, 2000]
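• “Weak” learner: decision stump on a word w, with one confidence per category l for each outcome

If w occurs:
$$q_{1l} = \frac{1}{2}\ln\frac{\sum_{i:\, w \in d_i,\ C_i[l]=1} P_t(i,l)}{\sum_{i:\, w \in d_i,\ C_i[l]=-1} P_t(i,l)}$$

If w doesn’t occur:
$$q_{0l} = \frac{1}{2}\ln\frac{\sum_{i:\, w \notin d_i,\ C_i[l]=1} P_t(i,l)}{\sum_{i:\, w \notin d_i,\ C_i[l]=-1} P_t(i,l)}$$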
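A minimal sketch of AdaBoost.MH with word stumps, written from the description in Schapire and Singer rather than from BoosTexter's code. Assumptions made here: documents are sets of words, `labels[i][l]` is +1 or -1, the stump outputs are used directly (so the step size is 1), the word for each round is the one with the smallest normalizer Z, and `eps` is a small smoothing constant that keeps the logarithm finite. All function names are illustrative.

```python
import math


def stump_confidences(docs, labels, weights, word, n_labels, eps=1e-9):
    """Return (q0, q1): per-label confidences when `word` is absent / present."""
    q = {False: [], True: []}
    for present in (False, True):
        for l in range(n_labels):
            w_pos = w_neg = 0.0
            for i, doc in enumerate(docs):
                if (word in doc) == present:
                    if labels[i][l] == 1:
                        w_pos += weights[i][l]
                    else:
                        w_neg += weights[i][l]
            q[present].append(0.5 * math.log((w_pos + eps) / (w_neg + eps)))
    return q[False], q[True]


def train_adaboost_mh(docs, labels, vocabulary, n_rounds):
    """Train an ensemble of (word, q0, q1) stumps."""
    m, k = len(docs), len(labels[0])
    weights = [[1.0 / (m * k)] * k for _ in range(m)]  # P_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for word in vocabulary:
            q0, q1 = stump_confidences(docs, labels, weights, word, k)
            # Normalizer Z for this stump; smaller Z means a better fit.
            z = sum(
                weights[i][l]
                * math.exp(-labels[i][l] * (q1 if word in docs[i] else q0)[l])
                for i in range(m)
                for l in range(k)
            )
            if best is None or z < best[0]:
                best = (z, word, q0, q1)
        z, word, q0, q1 = best
        ensemble.append((word, q0, q1))
        # Re-weight: pairs the stump gets wrong gain weight, the rest lose it.
        for i in range(m):
            h = q1 if word in docs[i] else q0
            for l in range(k):
                weights[i][l] *= math.exp(-labels[i][l] * h[l]) / z
    return ensemble


def predict_scores(ensemble, doc, n_labels):
    """Sum the stump confidences; a positive score means 'assign the label'."""
    scores = [0.0] * n_labels
    for word, q0, q1 in ensemble:
        h = q1 if word in doc else q0
        for l in range(n_labels):
            scores[l] += h[l]
    return scores
```

The exhaustive scan over the vocabulary in every round is kept for readability; it is far too slow for a realistic vocabulary.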
![Page 20: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/20.jpg)
20
Thresholds for AdaBoost
• AdaBoost often underestimates its confidences
• 3 approaches to selecting better thresholds
– single threshold for all classes
– individual thresholds for each class
– separate thresholds for each subtree rooted in the children of a top node (for tree-hierarchies only)
![Page 21: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/21.jpg)
21
Thresholds for AdaBoost
![Page 23: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/23.jpg)
23
Hierarchical consistency
• if $(d_j, c_i) \to \text{True}$, then $(d_j, \text{Ancestor}(c_i)) \to \text{True}$
[Figure: two labelings of the category hierarchy c1–c7, one consistent and one inconsistent]
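The consistency condition can be written as a small check. In this sketch the hierarchy is a child-to-parent map, and the tree shape (c1 at the root, c2 and c3 below it, c4 and c5 under c2, c6 and c7 under c3) is an assumption inferred from the worked example later in the deck; the function names are illustrative.

```python
# Assumed tree: c1 -> {c2, c3}, c2 -> {c4, c5}, c3 -> {c6, c7}.
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}


def ancestors(category, parent=PARENT):
    """All categories on the path from `category` up to the root."""
    found = set()
    while category in parent:
        category = parent[category]
        found.add(category)
    return found


def is_consistent(assigned, parent=PARENT):
    """True if every assigned category comes with all of its ancestors."""
    assigned = set(assigned)
    return all(ancestors(c, parent) <= assigned for c in assigned)


if __name__ == "__main__":
    print(is_consistent({"c1", "c2", "c4"}))  # full path to c4 is present
    print(is_consistent({"c1", "c4"}))        # c2 is missing
```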
![Page 24: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/24.jpg)
24
Hierarchical local approach
[Figure: category hierarchy c1–c9]
![Page 28: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/28.jpg)
28
Hierarchical local approach
[Figure: category hierarchy c1–c9 with the resulting consistent classification]
![Page 29: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/29.jpg)
29
Generalized hierarchical local approach
• stop classification at an intermediate level if none of the child categories seems relevant
• a category node can be assigned only after all its parent nodes have been assigned
[Figure: category hierarchy c1–c9]
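The two rules above amount to a top-down walk that descends only below categories already accepted. The sketch below assumes a tree (each category has one parent) and a caller-supplied `is_relevant(doc, category)` callback standing in for the local classifier at each node; both the callback and the function name are hypothetical.

```python
def classify_top_down(doc, children, is_relevant, root="c1"):
    """Assign categories level by level, stopping where no child is relevant.

    children: dict mapping a category to the list of its child categories.
    is_relevant: callable(doc, category) -> bool, a stand-in local classifier.
    The root itself is treated as always assigned and is not returned.
    """
    assigned = set()
    frontier = list(children.get(root, []))
    while frontier:
        node = frontier.pop()
        if is_relevant(doc, node):
            assigned.add(node)
            # Children become candidates only after their parent is assigned.
            frontier.extend(children.get(node, []))
    return assigned
```

Because a subtree is never entered unless its root was accepted, the output is consistent by construction, and it may end at an intermediate node when none of that node's children is accepted.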
![Page 31: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/31.jpg)
31
New global hierarchical approach
• Make a dataset consistent with a class hierarchy
– add ancestor category labels
• Apply a regular learning algorithm
– AdaBoost
• Make prediction results consistent with a class hierarchy
– for inconsistent labeling make a consistent decision based on confidences of all ancestor classes
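The first and last steps can be sketched as below. The slides do not spell out the decision rule for inconsistent predictions, so the rule used here (keep a class when the mean confidence over the class and its non-root ancestors exceeds a threshold, then add those ancestors) is only one plausible choice, not the authors' method. The tree, the exclusion of the root, and all names are assumptions.

```python
# Assumed tree: c1 -> {c2, c3}, c2 -> {c4, c5}, c3 -> {c6, c7}.
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}


def path_without_root(category, parent=PARENT, root="c1"):
    """The category followed by its ancestors, leaving out the root."""
    path = [category]
    while category in parent and parent[category] != root:
        category = parent[category]
        path.append(category)
    return path


def expand_with_ancestors(labels, parent=PARENT, root="c1"):
    """Step 1: make a training label set consistent with the hierarchy."""
    expanded = set()
    for c in labels:
        expanded.update(path_without_root(c, parent, root))
    return expanded


def make_consistent(confidences, parent=PARENT, root="c1", threshold=0.0):
    """Step 3 (illustrative rule): resolve inconsistent flat predictions.

    confidences: dict mapping every non-root category to a real-valued score.
    """
    accepted = set()
    for c, score in confidences.items():
        if score <= threshold:
            continue
        path = path_without_root(c, parent, root)
        mean = sum(confidences[a] for a in path) / len(path)
        if mean > threshold:
            accepted.update(path)
    return accepted
```

Accepting a whole path at a time guarantees a consistent output; whether averaging is the right way to combine ancestor confidences is an open design choice.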
![Page 32: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/32.jpg)
32
New global hierarchical approach
• Hierarchical (shared) attributes
– sports: team, game, winner, etc.
– hockey: NHL, Senators, goalkeeper, etc.
– football: Super Bowl, Patriots, touchdown, etc.
![Page 34: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/34.jpg)
34
Evaluation in TC
[Figure: category hierarchy c1–c7 with a correct category and an incorrect category marked]
$$\text{precision } P = \frac{\text{correctly predicted}}{\text{total predicted}} \qquad \text{recall } R = \frac{\text{correctly predicted}}{\text{total in category}}$$
$$F\text{-measure} = \frac{(\beta^2 + 1)\cdot P \cdot R}{\beta^2 \cdot P + R}, \quad \beta \in [0, +\infty)$$
35
Weaknesses of standard measures
P(H1) = P(H2) = P(H3)
R(H1) = R(H2) = R(H3)
F(H1) = F(H2) = F(H3)
[Figure: classifiers H1, H2 and H3 shown on the category hierarchy c1–c7]
Ideally, M(H1) > M(H3) and M(H2) > M(H3)
![Page 36: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/36.jpg)
36
Requirements for a hierarchical measure
1. to give credit to partially correct classification
[Figure: classifiers H1 and H2 shown on the category hierarchy c1–c11]
M(H1) > M(H2)
![Page 37: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/37.jpg)
37
Requirements for a hierarchical measure
2. to punish distant errors more heavily:
– to give higher evaluation for correctly classifying one level down compared to staying at the parent node
[Figure: classifiers H1 and H2 shown on the category hierarchy c1–c11]
M(H1) > M(H2)
![Page 38: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/38.jpg)
38
Requirements for a hierarchical measure
2. to punish distant errors more heavily:
– to give lower evaluation for incorrectly classifying one level down compared to staying at the parent node
[Figure: classifiers H1 and H2 shown on the category hierarchy c1–c11]
M(H1) > M(H2)
![Page 39: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/39.jpg)
39
Requirements for a hierarchical measure
3. to punish errors at higher levels of a hierarchy more heavily
[Figure: classifiers H1 and H2 shown on the category hierarchy c1–c11]
M(H1) > M(H2)
![Page 40: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/40.jpg)
40
Advantages of the new measure
• Simple, straight-forward to calculate
• Based solely on a given hierarchy (no parameters to tune)
• Satisfies all three requirements
• Has high discriminating power
• Allows a trade-off between classification precision and classification depth
![Page 41: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/41.jpg)
41
Our new hierarchical measure
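[Figure: category hierarchy c1–c7 with a correct category and an incorrect category marked]
Each category set is extended: correct category + all its ancestors (excluding root)
$$\text{precision } P = \frac{\text{correctly predicted}}{\text{total predicted}} \qquad \text{recall } R = \frac{\text{correctly predicted}}{\text{total in category}}$$
$$F\text{-measure} = \frac{(\beta^2 + 1)\cdot P \cdot R}{\beta^2 \cdot P + R}, \quad \beta \in [0, +\infty)$$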
![Page 42: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/42.jpg)
42
Our new hierarchical measure
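[Figure: classifiers H1, H2 and H3 shown on the category hierarchy c1–c7]

H1: correct: {c4} → {c2, c4}; predicted: {c2} → {c2}
$$hP(H_1) = \frac{|\{c_2\}|}{|\{c_2\}|} = 1 \qquad hR(H_1) = \frac{|\{c_2\}|}{|\{c_2, c_4\}|} = \frac{1}{2}$$

H2: correct: {c4} → {c2, c4}; predicted: {c5} → {c2, c5}
$$hP(H_2) = \frac{|\{c_2\}|}{|\{c_2, c_5\}|} = \frac{1}{2} \qquad hR(H_2) = \frac{|\{c_2\}|}{|\{c_2, c_4\}|} = \frac{1}{2}$$

H3: correct: {c4} → {c2, c4}; predicted: {c7} → {c3, c7}
$$hP(H_3) = \frac{|\{\}|}{|\{c_3, c_7\}|} = 0 \qquad hR(H_3) = \frac{|\{\}|}{|\{c_2, c_4\}|} = 0$$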
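The three cases above can be reproduced with a short sketch of the hierarchical measure for a single document. The tree is the one implied by the example ({c4} extends to {c2, c4}, {c7} to {c3, c7}); the function names and the choice to return 0 when a denominator is empty are assumptions of this sketch.

```python
# Tree implied by the example: c1 -> {c2, c3}, c2 -> {c4, c5}, c3 -> {c6, c7}.
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}


def with_ancestors(categories, parent=PARENT, root="c1"):
    """Extend a category set with all ancestors, excluding the root."""
    extended = set()
    for c in categories:
        while c != root:
            extended.add(c)
            c = parent[c]
    return extended


def hierarchical_prf(correct, predicted, beta=1.0):
    """Hierarchical precision, recall and F-measure for one document."""
    correct_ext = with_ancestors(correct)
    predicted_ext = with_ancestors(predicted)
    overlap = len(correct_ext & predicted_ext)
    hp = overlap / len(predicted_ext) if predicted_ext else 0.0
    hr = overlap / len(correct_ext) if correct_ext else 0.0
    denominator = beta ** 2 * hp + hr
    hf = (beta ** 2 + 1) * hp * hr / denominator if denominator else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    for name, predicted in (("H1", {"c2"}), ("H2", {"c5"}), ("H3", {"c7"})):
        print(name, hierarchical_prf({"c4"}, predicted))
```

For a whole test set, the overlap and set sizes would be summed over documents before dividing; the per-document version is kept here to match the slide.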
![Page 43: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/43.jpg)
43
Measure consistency
• Definition [Huang & Ling, 2005]: f, g – measures on domain Ψ
R = {(a,b) | a,b ∈ Ψ, f(a) > f(b), g(a) > g(b)}
S = {(a,b) | a,b ∈ Ψ, f(a) > f(b), g(a) < g(b)}
f is statistically consistent with g if |R| > |S|
• Experiment:
– 100 randomly chosen hierarchies
– New hierarchical F-measure and standard accuracy were consistent on 85% of random classifiers (|R| > 5|S|)
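Counting the two sets is straightforward once both measures have been evaluated on the same classifiers. In this sketch `f_values[i]` and `g_values[i]` are the two scores of classifier i; the function name and the sample numbers are made up for illustration.

```python
from itertools import permutations


def consistency_counts(f_values, g_values):
    """Return (|R|, |S|) over all ordered pairs of classifiers."""
    agree = disagree = 0
    for a, b in permutations(range(len(f_values)), 2):
        if f_values[a] > f_values[b]:
            if g_values[a] > g_values[b]:
                agree += 1      # pair belongs to R
            elif g_values[a] < g_values[b]:
                disagree += 1   # pair belongs to S
    return agree, disagree


if __name__ == "__main__":
    r, s = consistency_counts([0.9, 0.5, 0.1], [0.8, 0.6, 0.7])
    print(r, s, "consistent" if r > s else "not consistent")
```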
![Page 44: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/44.jpg)
44
Measure discriminancy
• Definition [Huang & Ling, 2005]: f, g – measures on domain Ψ
P = {(a,b) | a,b ∈ Ψ, f(a) > f(b), g(a) = g(b)}
Q = {(a,b) | a,b ∈ Ψ, f(a) = f(b), g(a) > g(b)}
f is statistically more discriminating than g if |P| > |Q|
• Examples:
[Figure: classifiers H1, H2 and H3 shown on the category hierarchy c1–c7]
For one accuracy value – 3 different hierarchical values
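The discriminancy counts follow the same pattern as the definition. As a hedged illustration, the example below scores three classifiers that all have accuracy 0 but hierarchical F-values of 2/3, 1/2 and 0 (the values obtained for H1, H2 and H3 when the correct category is c4 and β = 1); the function name is illustrative.

```python
from itertools import permutations


def discriminancy_counts(f_values, g_values):
    """Return (|P|, |Q|): pairs told apart by f only, and by g only."""
    only_f = only_g = 0
    for a, b in permutations(range(len(f_values)), 2):
        if f_values[a] > f_values[b] and g_values[a] == g_values[b]:
            only_f += 1
        if f_values[a] == f_values[b] and g_values[a] > g_values[b]:
            only_g += 1
    return only_f, only_g


if __name__ == "__main__":
    hierarchical_f = [2 / 3, 0.5, 0.0]
    accuracy = [0.0, 0.0, 0.0]
    print(discriminancy_counts(hierarchical_f, accuracy))
```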
![Page 45: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/45.jpg)
45
Results: Hierarchical vs. Flat
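Synthetic data (hierarchical attributes)

| levels | branching | Flat | H. AdaBoost |
|---|---|---|---|
| 2 | 2 | 68.30 | 76.22 |
| 3 | 2 | 58.35 | 74.21 |
| 4 | 2 | 44.90 | 73.22 |
| 5 | 2 | 20.88 | 72.70 |
| 2 | 3 | 53.47 | 63.45 |
| 3 | 3 | 29.51 | 60.69 |
| 4 | 3 | 2.67 | 58.22 |
| 2 | 4 | 41.35 | 55.25 |
| 3 | 4 | 6.98 | 50.70 |
| 2 | 5 | 29.99 | 47.87 |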
![Page 46: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/46.jpg)
46
Results: Hierarchical vs. Flat
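Synthetic data (no hierarchical attributes)

| levels | branching | Flat | H. AdaBoost |
|---|---|---|---|
| 2 | 2 | 61.69 | 65.95 |
| 3 | 2 | 42.47 | 51.53 |
| 4 | 2 | 24.49 | 40.18 |
| 5 | 2 | 8.45 | 32.61 |
| 2 | 3 | 41.53 | 48.02 |
| 3 | 3 | 14.50 | 29.97 |
| 4 | 3 | 0.79 | 21.91 |
| 2 | 4 | 26.72 | 35.01 |
| 3 | 4 | 2.46 | 19.70 |
| 2 | 5 | 17.14 | 27.12 |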
![Page 47: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/47.jpg)
47
Results: Hierarchical vs. Flat
Real data

| dataset | Flat | H. AdaBoost |
|---|---|---|
| newsgroups | 75.51 | 79.26 |
| reuters | 87.06 | 88.31 |
![Page 48: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/48.jpg)
48
Results: Hierarchical vs. Local
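Synthetic data (hierarchical attributes)

| levels | branching | Local | H. AdaBoost |
|---|---|---|---|
| 2 | 2 | 73.42 | 76.22 |
| 3 | 2 | 69.40 | 74.21 |
| 4 | 2 | 68.18 | 73.22 |
| 5 | 2 | 68.44 | 72.70 |
| 2 | 3 | 61.99 | 63.45 |
| 3 | 3 | 58.81 | 60.69 |
| 4 | 3 | 57.40 | 58.22 |
| 2 | 4 | 54.26 | 55.25 |
| 3 | 4 | 50.66 | 50.70 |
| 2 | 5 | 47.26 | 47.87 |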
![Page 49: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/49.jpg)
49
Results: Hierarchical vs. Local
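Synthetic data (no hierarchical attributes)

| levels | branching | Local | H. AdaBoost |
|---|---|---|---|
| 2 | 2 | 59.83 | 65.95 |
| 3 | 2 | 44.00 | 51.53 |
| 4 | 2 | 33.44 | 40.18 |
| 5 | 2 | 26.03 | 32.61 |
| 2 | 3 | 43.87 | 48.02 |
| 3 | 3 | 26.33 | 29.97 |
| 4 | 3 | 17.97 | 21.91 |
| 2 | 4 | 32.51 | 35.01 |
| 3 | 4 | 17.96 | 19.70 |
| 2 | 5 | 26.04 | 27.12 |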
![Page 50: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/50.jpg)
50
Results: Hierarchical vs. Local
Real data

| dataset | Local | H. AdaBoost |
|---|---|---|
| newsgroups | 80.01 | 79.26 |
| reuters | 89.11 | 88.31 |
![Page 52: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/52.jpg)
Application to Bioinformatics
• Functional annotation of genes from biomedical literature
![Page 53: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/53.jpg)
53
Learning (from fully-annotated genes in the db)
Step 1: retrieve GO codes and IDs of Medline entries from the db records

Genomic database (SGD)

| ID | Symbol | Name | Medline reference | Evidence | GO ID | … |
|---|---|---|---|---|---|---|
| S0007287 | 15S_RRNA | | PMID:6261980 | ISS | GO:0003735 | |
| S0007287 | 15S_RRNA | | PMID:6280192 | IGI | GO:0006412 | |
| S0004660 | AAC1 | ADP/ATP translocator | PMID:2167309 | TAS | GO:0005743 | |
| … | … | … | … | … | … | … |
![Page 54: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/54.jpg)
54
Learning (from fully-annotated genes in the db)
Medline
Step 2: retrieve the corresponding Medline abstracts

| PMID | Abstract |
|---|---|
| PMID:6261980 | Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast. Sor F, Fukuhara H. We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. … |
| PMID:6280192 | Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA. Fox TD, Staempfli S. A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. … |
| PMID:2167309 | Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast. Gawaz M, Douglas MG, Klingenberg M. AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. … |
| … | … |
![Page 55: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/55.jpg)
55
Learning (from fully-annotated genes in the db)
Step 3: form the training set: words from Medline abstracts (features) and GO codes (categories)

Training set

| Abstract | GO ID |
|---|---|
| Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast. Sor F, Fukuhara H. We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. … | GO:0003735 |
| Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA. Fox TD, Staempfli S. A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. … | GO:0006412 |
| Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast. Gawaz M, Douglas MG, Klingenberg M. AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. … | GO:0005743 |
| … | … |
![Page 56: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/56.jpg)
56
Learning (from fully-annotated genes in the db)
Step 4: learn a classifier from the training set
Learning Algorithm → Classifier: Abstracts → GO codes
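Steps 1 to 3 can be sketched as a join between annotation records and abstracts. The records below are the three rows shown in the slides, with abstracts shortened to their titles; the tuple layout, the whitespace tokenizer and the function name are assumptions of this sketch, and no real SGD or Medline access is shown.

```python
from collections import defaultdict

# (gene ID, symbol, Medline reference, evidence code, GO ID), as in step 1.
ANNOTATIONS = [
    ("S0007287", "15S_RRNA", "PMID:6261980", "ISS", "GO:0003735"),
    ("S0007287", "15S_RRNA", "PMID:6280192", "IGI", "GO:0006412"),
    ("S0004660", "AAC1", "PMID:2167309", "TAS", "GO:0005743"),
]

# Abstract text keyed by PMID, as in step 2 (titles only, for brevity).
ABSTRACTS = {
    "PMID:6261980": "Nucleotide sequence of the gene for the mitochondrial "
                    "15S ribosomal RNA of yeast",
    "PMID:6280192": "Suppressor of yeast mitochondrial ochre mutations that "
                    "maps in or near the 15S ribosomal RNA gene of mtDNA",
    "PMID:2167309": "Structure-function studies of adenine nucleotide "
                    "transport in mitochondria",
}


def build_training_set(annotations, abstracts):
    """Step 3: pair the words of each abstract with its GO codes."""
    go_codes = defaultdict(set)
    for _gene_id, _symbol, pmid, _evidence, go_id in annotations:
        go_codes[pmid].add(go_id)
    training_set = []
    for pmid, codes in go_codes.items():
        if pmid in abstracts:  # skip references without a retrieved abstract
            words = set(abstracts[pmid].lower().split())
            training_set.append((words, sorted(codes)))
    return training_set


if __name__ == "__main__":
    for words, codes in build_training_set(ANNOTATIONS, ABSTRACTS):
        print(codes, sorted(words)[:5])
```

Each resulting pair has the shape a multi-label learner expects: a set of word features and a list of category labels, to which ancestor GO codes could be added for hierarchical training.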
![Page 57: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/57.jpg)
57
Classification (for genes with missing annotation)
Step 1: retrieve Medline abstracts mentioning the gene

Medline

| Gene | Abstract |
|---|---|
| YLL057C | Cloning and characterization of a sulfonate/alpha-ketoglutarate dioxygenase from Saccharomyces cerevisiae. Hogan DA, Auchtung TA, Hausinger RP. The Saccharomyces cerevisiae open reading frame YLL057c is predicted to encode a gene product with 31.5% amino acid sequence identity to Escherichia coli taurine/alpha-ketoglutarate dioxygenase and 27% identity to Ralstonia eutropha TfdA, a herbicide-degrading enzyme. … |
![Page 58: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/58.jpg)
58
Classification (for genes with missing annotation)
Step 2: classify these abstracts into GO codes
Classifier: Abstracts → GO codes

| Gene | GO code | GO function |
|---|---|---|
| YLL057C | GO:0006790 | sulfur metabolism |
![Page 59: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/59.jpg)
59
Results
| dataset | levels | branching | Flat | Local | H. AdaBoost |
|---|---|---|---|---|---|
| biol. process | 12 | 5.41 | 15.06 | 59.27 | 59.31 |
| mol. function | 10 | 10.29 | 8.78 | 43.36 | 38.17 |
| cell. component | 8 | 6.45 | 44.18 | 72.07 | 73.35 |
![Page 60: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/60.jpg)
60
Conclusion
We have presented:
• hierarchical categorization task (categories are partially ordered)
• generalized hierarchical local approach
• new hierarchical global approach (hierarchical AdaBoost)
• new hierarchical evaluation measure
• application to gene annotation task
![Page 61: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/61.jpg)
61
Future work
• to try the global hierarchical approach with other learning algorithms
• to extend the gene annotation training sets with similar documents from Medline
• to perform a similar task for other organisms
• to use gene annotations in gene classification and clustering
![Page 62: Hierarchical Text Categorization and its Application to Bioinformatics](https://reader034.vdocuments.us/reader034/viewer/2022042718/56813419550346895d9b0503/html5/thumbnails/62.jpg)
62
Gene expression analysis with functional annotations
GO:0006790
GO:0006798
GO:0007315
GO:0007289
GO:0002132
GO:0002166