
Principal Investigator/Program Director (Last, First, Middle): Lu, Xinghua

INTRODUCTION TO AMENDMENT

We would like to thank the reviewers for their insightful and constructive critiques. Based on the reviewers' comments, we have substantially modified the application, with modifications indicated by the bars on the right margin. A few critiques were raised in common by all three reviewers, while others are distinct individual concerns. In this introduction, we first summarize and address the common concerns and comments from all three reviewers, and then address the specific individual critiques.

Combined Comments

1. Significance, innovativeness, investigators and environment.

We are highly encouraged by the reviewers' unanimous positive comments on the significance, innovativeness, investigators and environment of the proposed studies. The reviewers all pointed out that the proposed work is significant for its "obvious potential impact" on a field that "is unmistakably significant." The reviewers also agreed that "all aspects of the proposed work are innovative" and that the work reflects an "exciting approach that clearly deserves to be tested." We especially appreciate the expressed trust that the investigators are "an excellent team" with concrete collaborative productivity and are "well qualified" for the project.

2. Evaluation of the annotation assistant system. All the reviewers shared a common concern regarding the evaluation of the annotation assistant system proposed as Specific Aim 3 in the previous submission. The major concern was the lack of usability evaluation of the system, an area in which the investigators do not have strong experience. The second reviewer also pointed out that the effort devoted to developing such a system was premature, given its uncertain impact. We agree with the reviewers that, at the current stage, it is premature to propose implementation of an automatic annotation assistant system. We have therefore removed the original Specific Aim 3 and reduced the duration of the study to 3 years accordingly. We have kept the research components of the original Specific Aim 3 and folded them into the current Specific Aim 2. We believe that this restructuring of the specific aims will allow us to concentrate on statistical semantic analysis and on developing annotation and information retrieval algorithms as the essential building blocks for a future implementation of the system. We will relegate implementation and evaluation of such a system to the competitive renewal period of this project.

Specific Comments from Individual Reviewers

In the following paragraphs, we address the specific concerns raised by individual reviewers. Where space allows, we copy the reviewer's comments verbatim (shown in italic font); otherwise, we show the beginning and end of the paragraph of comments to indicate which comment we are addressing.

Critique 1:

1. One minor point was that they seem to have some confusion in their related work where they implied that a binary classifier equates to each document getting only one classification, but that is not true of binary classifiers – it just means each document is evaluated for one class at a time.

We would like to thank the reviewer for clarifying this point. We understand that multiple binary classifiers can be trained and then applied to a document to perform multiple one-vs-rest classifications; thus a document can be labeled (annotated) with multiple classes (see the sketch below). Indeed, the naïve Bayes text classifier was proposed as the baseline annotation algorithm for comparison with the proposed methods in Sections D.1.4 and D.2.1 of the original proposal. However, judging from the concerns raised by this and the second reviewer, we believe that the design was not well presented. In this amendment, we devote a new subsection (Section D.1.6) to describing this experiment in detail.
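To make the baseline design concrete, the following is a minimal sketch of such a one-vs-rest setup with naïve Bayes base classifiers. The toy documents, labels, and the scikit-learn library choice are illustrative assumptions, not the proposal's actual experimental setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: each abstract may carry several GO-style labels (hypothetical data).
docs = ["caspase activation triggers programmed cell death",
        "cytochrome c release from the mitochondrial membrane",
        "ribosomal rna processing occurs in the nucleolus"]
labels = [["apoptosis"], ["apoptosis", "mitochondrion"], ["nucleolus"]]

X = CountVectorizer().fit_transform(docs)        # bag-of-words counts
Y = MultiLabelBinarizer().fit_transform(labels)  # one 0/1 column per label

# One binary naive Bayes classifier per label: each document is evaluated
# for each class independently, so it can receive multiple annotations.
clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)
print(clf.predict(X))  # indicator matrix, one column per label
```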

2. Another minor point was that they propose to use full text only on articles in PubMed Central, but they likely could obtain many more full-text articles at their institution through MEDLINE, although those articles would not necessarily be publicly available.

We would like to thank the reviewer for pointing out this interesting direction, which we have adopted in the amended proposal.


Critique #2:

1. Although impressive claims are made for this technology compared to other approaches, the preliminary evidence provided for these claims is quite modest. … For example, the top GO terms associated with the strongest topic (table 2) had mutual information measures in the range of 10^-3 and below, not particularly impressive.

We would like to point out that MI is not a normalized quantity, and its absolute value usually depends on the sample size through the empirical estimation of the joint probability mass P(X, Y) (see the MI equation in Section C.4). In most cases, the larger the sample size, the smaller the absolute MI value of a given joint event (X, Y). Thus, the absolute MI value cannot be used alone as a criterion for judging the goodness of an association.
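As a concrete illustration (our own numerical sketch, not part of the proposal): the plug-in MI estimate for two independent binary variables, whose true MI is exactly zero, is positive for finite samples and shrinks toward zero as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_mi(x, y):
    """Plug-in estimate of I(X; Y) in bits for two binary arrays."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x == a) & (y == b))
            p_x, p_y = np.mean(x == a), np.mean(y == b)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# True MI is 0, yet the estimate is positive and decays roughly as 1/N,
# so absolute MI values from corpora of different sizes are not comparable.
for n in (100, 1_000, 10_000, 100_000):
    x = rng.integers(0, 2, size=n)
    y = rng.integers(0, 2, size=n)
    print(n, empirical_mi(x, y))
```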

2. The use of the term "concept" and "semantic content" as a synonym for the probabilistic topic definition is idiosyncratic and sometimes deceptive.

We have "overloaded" the term "semantic" following the convention of latent semantic indexing (LSI) (Deerwester et al., 1990), which is widely used in the information retrieval field and is closely related to our approach. In this context, "semantic content" refers to the overall topicality of a document, a somewhat higher-level abstraction than "semantics" in the conventional natural language processing context, which concentrates more at the word and phrase level (Jurafsky and Martin, 2000).

3. A rethinking of precisely what hypotheses need to be tested to demonstrate the value of this approach, and a reorganization of the research plan to focus on clear tests of these hypotheses would strengthen the proposal.

We appreciate and have adopted this insightful suggestion. We have restructured the specific aims of the proposal, clearly specifying the goals of the experiments and connecting them with the overall specific aims.

4. Another general problem is the failure to provide reasonable points of comparison to the proposed algorithms. Many generic statements are made about "improvements," but appropriate baselines are not provided.

The reorganization of the specific aims (see above) leads to more clearly defined subtasks. In the amended proposal, we have specified baselines for most specific experiments.

5. Another general problem is the consistent failure to acknowledge possible problems in the research plan, and to provide suitable alternative approaches.

In the original proposal, the alternative approaches for achieving the overall goal of each Specific Aim were implicitly embodied in the variety of proposed experiments. For example, for identifying descriptive semantic topics, we proposed fine-tuned LDA, a mixture of LDA analyzers, and the multivariate information bottleneck; these experiments complement one another and thus serve as alternative approaches for each other. To further strengthen the study, we now also explicitly discuss the potential difficulties associated with each experiment and provide alternative approaches to address them.

6. The paragraph beginning with: The training corpora described in section D.1.1 are potentially problematic …. Neither of the latter corpora contains this information.

We have re-stated the goals of using the large MEDLINE and full-text journal article corpora and specified the hypotheses to be tested in the experiments utilizing these corpora (see Section D.1.1 for detail). Briefly, the goals of utilizing the large MEDLINE corpus are: (1) developing and evaluating probabilistic topic models capable of identifying semantic topics across various domains of biomedical knowledge; (2) identifying more specific semantic topics as the basis for protein annotation; and (3) training and evaluating novel information retrieval algorithms. Development of these tools has broader impact beyond protein annotation; e.g., they can be applied to biomedical literature indexing with MeSH. The reviewer's concerns regarding the difficulty of evaluating the biological relevance of topics and the lack of GO annotation are well warranted. However, these potential problems can be addressed by utilizing the MeSH terms associated with every MEDLINE document. With the recent increase in studies on the lexicon of the GO system (McCray et al., 2002; Ogren et al., 2005) and on mapping between GO and the Unified Medical Language System (UMLS) (Lomax and McCray, 2004), it is possible to map the topics (each a distribution over the lexicon) to GO terms either manually or automatically. As for training the correspondence LDA (Corr-LDA) and multivariate IB models for automatic annotation, we will continue to utilize the GOA corpus as proposed in the original proposal.


7. The mixture of LDAs idea in D.1.3 is not well developed. … The "partial manual evaluation of their quality" is not an adequate evaluation plan.

We have rewritten the subsection to make it clearer and to explain the claim. The claim that a mixture of LDAs may outperform flat LDA is based on a theoretical observation: the Dirichlet prior of a flat LDA model assigns non-zero prior probability to every topic in every document (illustrated numerically below). Thus, a flat LDA entertains the possibility that all topics exist in a document and will assign words to them, even topics entirely unrelated to the document's main content. This leads to deteriorated performance when the number of topics used to model the corpus becomes large. The mixture of LDA analyzers model alleviates this problem by grouping documents with similar topic content into clusters and modeling the documents within a cluster with a relatively small number of topics; the overall diversity of topics is captured by the different LDA components of the mixture model.
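A minimal numerical illustration of this observation (the topic count T = 300 and the hyperparameter value are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Topic proportions for a single document under a flat LDA prior with T = 300.
theta = rng.dirichlet(np.full(300, 0.1))

print((theta > 0).all())  # True: every topic receives non-zero prior mass,
print(theta.min())        # so words can be assigned even to unrelated topics.
```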

8. Although the idea of changing the task so that texts are mapped to groups of GO terms (section D.1.4) is interesting in many ways, … e.g. the category associated with nucleolus has as its most frequent term the stem "ribosom", an entirely separate cell component!

We agree that it is highly desirable to annotate proteins with GO terms as specific as those adopted by human curators; however, such annotation is not readily achievable for a contemporary computational agent, due both to data sparseness and to the agent's lack of full semantic and syntactic understanding of natural language (Hunter and Cohen, 2006). Thus, rather than giving up, or futilely attempting to learn classifiers for specific annotations from very few training cases, it is sensible to reach a middle ground by relaxing the goal to annotation with representative concepts and their corresponding GO terms. The concepts extracted by the proposed methods should represent the recurring concepts of the studied corpora, with information content as distinct and specific as the data support. Achieving this goal would lay the basis for moving toward more specific annotation by combining probabilistic topic models, natural language processing, and information extraction. The reviewer's concerns regarding at what level, and with which GO term, a concept should be represented are the crux of the problem. To address the first question, we believe that the proposed Bayesian model selection and information bottleneck provide principled approaches for determining the level of the concepts based on the observed data and the information content of the concepts. For the second question, we propose to apply both manual and automatic approaches; the latter uses the GO categorizer (Joslyn et al., 2004), a principled approach that selects the most representative GO term from a cluster of GO terms by studying their relationships in the GO graph and producing an ordered list of candidate terms. As for the specific example mentioned by the reviewer, we believe that "ribosom" is the stemmed form of "ribosomal" rather than "ribosome," which correctly fits the context of the nucleolus, where ribosomal RNA is processed. This example further supports our notion that, without human-like understanding of free text, probabilistically detecting the topical context is more robust than deterministically assigning highly specific annotations based on a few specific words.

9. The comparison described in section D.2.3 to "an algorithm similar to the best-performing... A fairer approach might look at multiple test corpora (e.g. using the GENIA corpus), or use the system in future competitions (such as future TREC Genomics tasks).

We agree with the reviewer regarding the difficulty of this experiment and the uncertainty of its value; we have therefore removed it from the proposal.

Critique 3: We would like to thank the reviewer for the overall positive comments on the proposal. The reviewer's specific concern regarding evaluation of the annotation system is addressed at the beginning of this introduction.


A. SPECIFIC AIMS

The long-term goal of this project is to develop methods that facilitate automatic protein annotation based on the biomedical literature. Understanding the function of proteins has been, and remains, a major task of biomedical research. Knowledge of proteins is fundamental to understanding biological systems, mechanisms of disease, and human health. A perpetual task of biomedical informatics is to acquire and represent the current and future knowledge resulting from biomedical research. This knowledge should be represented in languages that are understandable to computational agents, so that it can be stored, retrieved, and used for reasoning and for discovering new knowledge. Currently, all literature-based protein annotation is performed manually, which, unfortunately, is extremely labor-intensive and cannot keep up with the pace of the growth of information. Indeed, with the completion of the genome sequences of several organisms, manual annotation of proteins has already become a rate-limiting step that hinders acquisition of critical functional information about the large number of proteins described in the exploding amount of biomedical literature. In this study, we propose to develop and apply novel approaches based on probabilistic semantic analysis and information theory to enhance the current effort toward automatic annotation.

Specific Aim 1: Identify/extract descriptive biological concepts from the biomedical literature. We will develop and extend algorithms based on advanced statistical semantic analysis and information theory to identify a set of descriptive biological concepts from the biomedical literature. To achieve this aim, we propose the following complementary approaches:

1. Identify descriptive and specific biological concepts by enhancing the latent Dirichlet allocation (LDA) model through fine-tuning the model parameters, incorporating more flexible parameter settings and training with large text corpora.

2. Develop a mixture of LDA analyzers to model the literature from various biomedical domains.

3. Identify informative biological topics using information bottleneck (IB) approaches.

4. Associate the identified biological topics with the most representative Gene Ontology (GO) terms through manual and automatic annotation.

The overall hypotheses for this specific aim are: (1) the proposed methods improve over the reported latent Dirichlet allocation (LDA) model (Zheng et al., 2006) in terms of language modeling and the quality of the identified topics. This hypothesis will be tested by evaluating and comparing the goodness-of-fit of these models on the studied corpora; the quality of the extracted concepts will be manually evaluated and compared through statistical significance analysis. (2) The identified biological concepts can be used to enhance automatic protein annotation by alleviating data sparseness and annotation inconsistency. To test this hypothesis, we will re-annotate the corpus from the Gene Ontology Annotation (GOA) project with these concepts. The original and re-annotated GOA corpora will be used to train state-of-the-art text classification algorithms, and the impact of re-annotation will be evaluated in terms of classification/annotation accuracy.

Specific Aim 2: Develop automatic annotation algorithms. We will develop and evaluate various algorithms serving as building blocks for a future automatic protein annotation system. More specifically, we will concentrate on the following subtasks:

1. Develop novel information retrieval methodologies to efficiently retrieve documents relevant to specific proteins through structured query expansion and probabilistic language modeling.

2. Extend the multivariate IB methods to perform automatic GO annotation based on free text.

3. Develop a novel text-based correspondence LDA model to perform automatic GO annotation based on free text.

4. Develop a novel probabilistic text segmentation algorithm to extract the supporting evidence for automatic GO annotation.

The overall hypotheses for this aim are: (1) the query expansion methods will improve over the current state-of-the-art information retrieval techniques in this domain, which will be tested by evaluating and comparing retrieval performance. (2) The correspondence LDA and IB approaches for automatic annotation will improve over the baseline text classification approach; this will be tested by evaluating annotation accuracy using standard information retrieval metrics. (3) The novel text segmentation algorithm combines the strengths of LDA and the hidden Markov model, and the enhanced text segmentation is useful for extracting supporting evidence.


B. BACKGROUND AND SIGNIFICANCE

B.1. Need for automatic protein annotation

Recent years have witnessed an increasing interest in the field of biomedical language processing (Hunter and Cohen, 2006; Shatkay and Feldman, 2003), a discipline that applies computational language processing techniques to extract and represent biomedical knowledge. In this application, we address a critical and rate-limiting step in acquiring and representing knowledge of protein function: annotation of proteins with a controlled vocabulary. This is a process that transforms biomedical knowledge or concepts from the free text of a human language into a language that is understandable to computational agents, so that the knowledge can be stored, retrieved, reasoned upon, and used for discovering new knowledge. The most widely used controlled vocabulary in the bioinformatics field is the Gene Ontology (GO) (Ashburner et al., 2000). The GO is a dynamic controlled vocabulary consisting of terms that represent biological concepts and/or objects. The vocabulary is divided into three categories: biological process, molecular function, and cellular component, which respectively describe (1) in which biological processes a protein participates; (2) what molecular functions a protein has; and (3) in which subcellular component a protein resides. The relationships of the GO terms (concepts) form a directed acyclic graph (DAG), where each node is a GO term and the directed edges represent the relationships between nodes. Currently, most knowledge databases of organisms, e.g., the yeast, human, and mouse genome databases, use GO to annotate proteins.

Annotations based on the biomedical literature are typically performed by PhD-level human curators (Hersh et al., 2004). This is a critical step in acquiring and representing knowledge about the ever-increasing number of known and putatively predicted proteins from the exploding amount of information in the biomedical literature. Manual annotation clearly cannot keep up with the pace of the growth of information and is hampering this knowledge acquisition process; there is thus an urgent need for methods that facilitate the process, either by extracting the essential relevant functional information or, better, by directly performing automatic annotation. We aim to develop new methods that achieve both goals.

B.2 Existing approaches to protein annotation

Knowledge transfer. Some automatic annotation techniques transfer/translate previous knowledge of proteins from other forms into GO terms. For example: (1) the Gene Ontology Annotation (GOA) project maintains a mapping from Swiss-Prot (Bairoch et al., 2005) database keywords and enzyme commission (EC) numbers to GO terms, so GO terms can be assigned to proteins with known keywords or EC numbers (Camon et al., 2004); (2) the GO terms of a well-annotated protein can be transferred to proteins with high sequence similarity (Xie et al., 2002); (3) conserved protein functional motifs have been manually annotated with GO terms, which can then be assigned to proteins matching those motifs (Biswas et al., 2002; Camon et al., 2004; Hayete and Bienkowska, 2005); and (4) the PI and the Co-PI have developed algorithms to automatically associate GO terms with newly found protein motifs (Lu et al., 2004b; Tao et al., 2004), and such knowledge can be further transferred to other proteins matching these motifs.
In our previous work, we demonstrated that well-represented knowledge can be used to acquire new knowledge. This observation motivated us to undertake the current project on automatic acquisition and representation of knowledge. However, these knowledge transfer approaches cannot sustain the ongoing process of protein annotation, which requires the capability to automatically acquire new knowledge from future biomedical literature.

Literature-based. Literature-based methods have been reported in several previous studies (Aubry et al., 2006; Blaschke and Valencia, 2002; Raychaudhuri et al., 2002; Shatkay et al., 2000; Xie et al., 2002) in which various techniques were tested. However, due to the lack of a unified evaluation, it is hard to compare the results of these studies. Recently, the demand for literature-based annotation techniques prompted several special conferences dedicated to the task. For example, the Critical Assessment of Information Extraction Systems in Biology (BioCreative) conference in 2004 (Hirschman et al., 2005) and the Genomics Track of the Text Retrieval Conference (TREC) in 2003, 2004 and 2005 (Hersh et al., 2005; Hersh et al., 2004) had specific tasks related to assigning GO terms to text documents. The tasks of these contests were designed based on the suggestions and requests of the curators who perform annotations at the European Bioinformatics Institute (EBI) and Mouse Genome Informatics (MGI). Thus, the tasks reflected urgent real-world needs in the field. Furthermore, the results of these conferences arguably reflect the current state-of-the-art approaches, in that the participants were researchers from all over the world who work on related topics.

1. BioCreative Conference. Task 2 of the contest is related to automatic annotation. In subtask 2.2, given a protein and the full text of the associated papers, participants were asked to assign appropriate GO terms to the protein according to the literature and to return the text providing evidence for the assignment (Hirschman et al., 2005); the results were judged by human curators at the EBI. This subtask is highly relevant to our overall aims. Various techniques, mainly conventional rule-based or classification-based methods, were tested by nine participating groups from around the world, and the organizer of the contest commented that the task was "the most difficult" one among all the tasks of the contest.

2. TREC Conference. The organizers of TREC 2004 and TREC 2005 designed a series of text categorization tasks that required participants to classify documents into different groups, a task similar to automatic annotation. In TREC 2004, the task was only to determine whether a document contains experimental evidence that warrants the assignment of a GO term, without requiring assignment of a specific GO term (Hersh et al., 2004). In TREC 2005, the text categorization tasks were expanded to four categories, retrieving documents related to the following topics: allele mutations, embryonic development, GO-containing documents as in TREC 2004, and tumors. Although these categories are very general, the difficulty of training classifiers with sparse, high-dimensional training cases already began to emerge. We participated in the TREC 2005 categorization task and found that semantic analysis significantly enhances the performance of a state-of-the-art text categorization algorithm (Lu et al., 2006). Thus, our results demonstrate that the capability of extracting semantic information not only lays the foundation for the methods proposed in Section D.2, but can also be used to enhance the performance of annotation based on conventional classification techniques.

B.3 Limitations of existing annotation approaches

The major weakness of the knowledge transfer approaches is that they cannot capture future knowledge in the literature. Literature-based approaches, on the other hand, have not yet delivered results of sufficiently high precision. A fundamental limitation of the existing literature-based approaches is that they are inadequate at capturing the semantic content of text. As a result, their performance is promising but not good enough to bring real benefits to biologists through automatic annotation. We now discuss two specific difficulties encountered by many participants of the two contests.

1. Ambiguity of Natural Language. Polysemy and synonymy are two common phenomena that cause ambiguity in natural language. Most existing approaches rely on exact keyword matching, which makes it difficult to deal with polysemy and synonymy. For example, a common approach adopted by many participants for Task 2 of BioCreative was to cast the annotation task as an information extraction (IE) task: identify the sentences containing the protein entity name; use the words of each sentence to search/align against GO term definitions; rank the GO terms according to some scheme; and assign the GO terms with the highest scores to the article. A common observation made by these participants was that the words defining the GO terms were rarely observed directly within the sentences containing the gene entity, while quite often their synonyms were observed. This observation reflects the fact that existing methods cannot extract and utilize semantic information to resolve the ambiguities common in natural language; e.g., direct word matching is simply unable to capture the semantic similarity of different expressions (see the toy example below). In our methods, semantic context information will be exploited in a principled way to resolve these ambiguities.
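A toy example of this failure mode (ours, for illustration only): exact word matching assigns a zero score to two expressions of the same concept that share no surface vocabulary.

```python
def word_overlap(a, b):
    """Exact keyword matching: Jaccard overlap of word sets, no semantics."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Synonymy: the same biological concept, disjoint vocabularies, zero match.
print(word_overlap("programmed cell death", "induction of apoptosis"))  # 0.0
```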

2. Multiple subtopics and biological concepts in one document. In general, a MEDLINE abstract may contain multiple biological concepts and multiple subtopics, and this creates additional difficulty for many existing approaches. A commonly used categorization approach for annotation was to: use the MEDLINE abstracts with GO annotations as a training set; train one binary (yes/no) classifier per GO term; segment the test documents into paragraphs; invoke each GO classifier on each paragraph; and assign the GO terms that pass a certain threshold, so that potentially multiple GO terms can be assigned. One advantage of the classification approach is that it captures evidence outside the sentence containing the gene name. Unfortunately, the approach also has the following shortcomings: (1) one classifier per GO term does not scale well; (2) many GO terms are associated with very sparse training documents (see Section C.5); and (3) the approach ignores the fact that most training documents were annotated with multiple GO terms because each document contains multiple biological concepts. Thus, the same document is used as a training case for different classifiers, even though only part of the document is relevant to any given GO term. This renders the training inherently inaccurate, because the features (words) that provide discriminative power for one class are noise for another. In short, a MEDLINE abstract often contains multiple concepts, and the binary classification approach is inherently non-optimal given the multiple topicalities of documents. In our methods, probabilistic models will be used to directly model the underlying multiple biological concepts and subtopics in a text document, allowing prediction at a finer granularity of topics.

Clearly, both difficulties are related to the failure to capture the semantic context of documents. The methods we will develop differ from all these previous works in that we will exploit advanced probabilistic models to extract and reveal the underlying biological concepts in the literature. Indeed, to understand our ideas, it is instructive to first analyze how a human curator performs annotation based on the biomedical literature:

When given an article related to a protein, a human curator needs to read the literature, understand the semantic meanings of the text, extract the biological concepts within the document, find the GO terms that most closely represent the biological concepts, and finally assign the GO terms to the protein. The two key steps of the process are extracting biological concepts and mapping the concepts to controlled vocabulary.

Our general hypothesis is that the algorithms that explicitly simulate this process will give rise to better automatic annotation results.

B.4. Related Work in Literature-based Indexing

Automatic Indexing Initiative of NLM. The Automatic Indexing Initiative project of the NLM is closely related to our specific aims. NLM maintains the MEDLINE/PubMed database, in which all records are indexed with Medical Subject Headings (MeSH). Indexing of MEDLINE with MeSH is performed by experienced human curators at NLM. Foreseeing the need for automatic indexing, NLM researchers started the Automatic Indexing Initiative in the late 1990s (Aronson et al., 2000), which resulted in a sophisticated ensemble system known as the Medical Text Indexer (MTI) (Aronson et al., 2004). A characteristic of the system is that it uses rigorous NLP, linguistic, and knowledge representation approaches to simulate the human indexing process: extracting concepts and mapping the concepts to the controlled vocabulary. The system contains multiple subsystems; its major components and workflow are as follows: (1) MEDLINE titles and abstracts are fed into a rigorous natural language processing (NLP) subsystem that parses the text into phrases; (2) the parsed phrases are mapped to the biomedical concepts stored in the Unified Medical Language System (UMLS) Metathesaurus (Browne et al., 2003; Campbell et al., 1998; McCray et al., 1993) by the MetaMap subsystem (Lindberg et al., 1993; McCray et al., 1993); (3) the UMLS concepts are further mapped to MeSH (Bodenreider et al., 1998); and (4) MeSH terms are assigned to the text by combining information from several other subsystems according to rules. This is arguably the most sophisticated automatic annotation system in the biomedical informatics field; however, because of this complexity, such a system is difficult to adapt to a new task like automatic annotation of proteins.

We observe that the existing mapping algorithms are based on a rigorous linguistic approach which, although powerful, does not handle the uncertainty and ambiguity of natural language well. For example, one key idea underpinning the MetaMap (Aronson, 2001) program is to generate a large number of lexical variants for a phrase of interest and use them to map to the UMLS Metathesaurus. The program has high recall but relatively low precision (Aronson et al., 2004; Pratt and Yetisgen-Yildiz, 2003), because it cannot use the overall semantic context of a document to resolve which of the matched UMLS concepts fits the context. Thus, methods capable of detecting overall semantic context will enhance such mapping.

B.5 Summary

Automatic annotation of proteins based on the biomedical literature is a new frontier of biomedical informatics research. The task is difficult, but of both theoretical and practical importance. Current attempts cannot meet the accuracy requirements of real-world automatic annotation. Analysis of the shortcomings of existing approaches indicates that their unsatisfactory results are due to the inability to explicitly extract biological concepts from text, a critical step in acquiring knowledge. In this application, we hypothesize that probabilistic topic models can be used to explicitly extract biological concepts from the biomedical literature and that the information of semantic context can be used to address the aforementioned difficulties.

C. PRELIMINARY STUDIES

C.1. Previous Collaboration

The PI and the co-investigator have already collaborated in studying protein function annotation and biology literature mining, and have numerous joint conference and journal publications (Lu et al., 2004b; Lu et al., 2006; Tao et al., 2003; Tao et al., 2004; Zhai et al., 2005). Our algorithm for automatically annotating the function of protein motifs has been adopted by the ProDom database (ProDom). Our combined expertise covers the wide range of disciplines required for the complex task of literature-based automatic protein annotation.

Dr. Lu is an assistant professor in the Department of Biostatistics, Bioinformatics and Epidemiology. Dr. Lu has broad training in medicine, biology, biomedical informatics, and statistical learning, and extensive scientific computing experience from his biomedical research career. During his biomedical informatics training at the University of Pittsburgh (supported by an NLM training grant), Dr. Lu acquired a strong background in probabilistic graphical models and general statistical learning. He is very familiar with advanced statistical computation techniques such as variational methods and sampling-based methods. He has developed novel machine learning algorithms and applied existing ones in bioinformatics settings (Lu et al., 2004a; Lu et al., 2004b; Lu et al., 2006; Zheng and Lu, 2006; Zheng et al., 2006). During the NLM training period, Dr. Lu also became familiar with statistical natural language processing (NLP) and information retrieval (IR); his visit to the Lister Hill National Center for Biomedical Communications in 2003 especially enhanced his knowledge and expertise in the field. His current working environment provides ample support on the statistical side.

Co-investigator Dr. Zhai is an assistant professor in a top-ranked computer science department (University of Illinois at Urbana-Champaign) and has a joint appointment in the Institute for Genomic Biology of the same university. He is a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed upon young scientists in the nation. He has ten years of research and development experience with information retrieval systems in both academia and industry. He is the main architect and designer of the Lemur information retrieval toolkit, which is downloaded and used by people all around the world (available at http://www.cs.uiuc.edu/~lemur). His research on information retrieval won the best paper award at ACM SIGIR 2004, the top conference in information retrieval (Fang et al., 2004). Dr. Zhai also has significant bioinformatics knowledge and taught a course on "Algorithms in Bioinformatics" in Spring 2004 at UIUC. He is a Co-PI of a $5M National Science Foundation FIBR (Frontiers in Integrative Biology Research) grant, which supports a project on developing a large-scale information system for studying the social behavior of honeybees; we will be able to leverage and reuse many of the resources and utility tools developed in that project. Moreover, in 2003 and 2005, our groups collaboratively participated in the TREC genomics information retrieval evaluation tasks and proposed improved retrieval algorithms that are more effective for retrieving information about genes from MEDLINE abstracts (Lu et al., 2006; Zhai et al., 2005; Zhai et al., 2003).
C.2 Statistical Representation of Text Documents

When writing an article about a protein, a biologist usually follows three steps: (1) choose the topics; (2) for each topic, select appropriate words to convey the concepts; and (3) grammatically organize the selected words. All these steps can be represented or simulated with probabilistic approaches, as discussed below.

The first step is to decide what topics to discuss in the article, e.g., the function of the protein, its interactions with other proteins, the cellular location of the protein, etc. In general, an article can be thought of as a mixture of multiple concepts/topics. In addition, certain biological concepts are correlated; e.g., the concepts transcription factor activity and nucleus are more likely to co-occur in an article (Lu and Hunter, 2005). From the viewpoint of statistical modeling, these characteristics can be captured by a statistical mixture model equipped with a topic distribution (context) parameter, which specifies what topics to include and how words are distributed among those topics. Note that allowing multiple topics to exist in a document effectively resolves one difficulty encountered by existing methods.


The second step is to choose words to represent the concepts. For example, when discussing the concept apoptosis, the author is very likely to use words such as "apoptosis," "programmed," "death," and "cytochrome." From a statistical point of view, a topic/concept can be represented by a word-usage distribution. Such a representation captures the dual relationship between words and concepts: concepts are expressed by the choice of words, and the meaning of a word depends on its context. Figure 1 illustrates how a word-usage pattern can represent topics. Such a representation effectively resolves the ambiguities introduced by polysemy and synonymy, which plague previously studied annotation systems.

The third step is to grammatically organize the words to fully convey the ideas. The syntax and ordering of words can be represented by various statistical sequential models, such as the hidden Markov model or other language models (Manning and Schutze, 1999).

Understanding a written article is the reverse process, in which a reader needs to fully understand the lexical meaning of words and the syntax of the language, resolve any ambiguity based on the context, and extract the concepts/topics within the article. While current state-of-the-art methods for computational natural language processing (NLP) cannot yet achieve full semantic understanding (Hunter and Cohen, 2006), it is possible to infer the existence of a topic simply from observing a coherent cluster of words, without requiring them to be arranged syntactically. For example, most biologists would have no trouble inferring the existence of the concept apoptosis upon seeing "cell," "death," and "caspase" co-occurring in the same text. Explicit representation of the topic distribution and word distributions allows a probabilistic topic model to perform such inference.

C.3 Probabilistic Topic Model

Probabilistic topic models are a family of statistical generative models in which a text document is represented as a mixture of words from different topics (see Figure 3 for an example), and topics are represented by word distributions. Given a corpus of documents, these models can learn/extract the topics of the corpus by capturing word usage patterns; with a trained model, one can infer which topics exist in a new text document. Several probabilistic topic models have been proposed (Griffiths and Steyvers, 2004; Hofmann, 1999a; Hofmann, 1999b; Zheng et al., 2006), including the recently developed latent Dirichlet allocation (LDA) model. The LDA model was developed by Blei et al. (Blei et al., 2003) to explicitly simulate the document "generation" process discussed in the previous subsection. We have implemented a Gibbs sampling based inference algorithm for LDA (Zheng et al., 2006) and demonstrated that the model is capable of identifying biologically meaningful topics from a collection of protein-related MEDLINE documents.

[Figure 1. Representing concepts with word distributions. Bar length indicates the probability.]

[Figure 2. DAG representation of the LDA model. A node represents a variable and a shaded node indicates an observed variable. Each rectangular plate represents a replica of a data structure; Nd and |C| indicate the numbers of copies of the replicas.]

LDA Model Specification and Statistical Inference. The LDA model is a probabilistic graphical model whose graphical representation in "plate" notation (Buntine, 1994) is shown in Figure 2. In this notation, an instance of a data structure is represented as a rectangular plate; e.g., a document is represented by the large plate in the figure. The total number of instances of a data structure is indicated by a variable at the bottom right of the plate; e.g., |C| stands for the total number of documents in the corpus. The nodes of the graph represent the variables of the model, and a directed edge indicates the probabilistic relationship between the two variables it connects. The LDA model simulates the process of "generating" a text document d as follows: (1) the topic content parameter vector θ for the document is first sampled from a Dirichlet distribution governed by α; (2) for each word in the document, a topic indicator (label) variable z is sampled from the multinomial distribution governed by the topic content parameter θ; (3) given the topic label, a word is sampled from the multinomial distribution governed by the topic-specific word distribution φz. This process is repeated until all Nd words of the document are generated.
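For clarity, the three-step generative process can be written down directly. The following is a minimal sketch with toy dimensions and probabilities chosen for illustration, not the proposal's implementation:

```python
import numpy as np

def generate_document(alpha, phi, n_words, rng):
    """Generate one document under the LDA generative process.

    alpha : Dirichlet parameter vector of length T (number of topics)
    phi   : T x V matrix; row z is the word distribution of topic z
    """
    theta = rng.dirichlet(alpha)                # (1) sample topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)     # (2) sample a topic label z
        w = rng.choice(phi.shape[1], p=phi[z])  # (3) sample a word from topic z
        words.append(w)
    return words

rng = np.random.default_rng(0)
phi = np.array([[0.50, 0.40, 0.05, 0.05],   # a topic favoring words 0 and 1
                [0.05, 0.05, 0.50, 0.40]])  # a topic favoring words 2 and 3
print(generate_document(np.array([0.5, 0.5]), phi, 10, rng))
```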

Training of the LDA model is an unsupervised process that reverses the generation process. Given a corpus without information about which topics generated it, the goal of the inference algorithm is to (a) identify the topics (word distributions) that generated the corpus; (b) find out which topics exist within each document (topic distribution); and (c) infer to which topic (represented by zi) each word wi in a text belongs. We applied a Gibbs sampling inference algorithm to estimate the parameters and latent variables, which is discussed in detail in our recent report (Zheng et al., 2006). The strategy of the algorithm is to estimate/instantiate the topic labels z by iteratively sampling the topic label zi for each word wi from the posterior distribution p(zi | z-i, w) using the Markov chain Monte Carlo (MCMC) approach (Andrieu et al., 2003). The algorithm is as follows: at the beginning, the latent topic variables z are randomly instantiated; a Markov chain is started, which updates zi iteratively; the chain is run until it converges to the target distribution p(z | w) ("burn in"); and then samples are collected from the chain. The conditional distribution p(zi | z-i, w), in the standard collapsed form (cf. Griffiths and Steyvers, 2004), is shown below; more detailed notation is available from (Zheng et al., 2006):

p(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} is the number of times word w_i is assigned to topic j and n^{(d_i)}_{-i,j} is the number of words in document d_i assigned to topic j (both counts excluding the current token i); W is the vocabulary size and T is the number of topics.

This equation has an intuitive explanation of how the inference algorithm determines the topic label zi for a word wi. The left side asks: "what is the probability that wi was generated by topic j, given all the observed words and the other topic labels in this corpus?" The first term on the right side tells us to consider the likelihood of observing wi if its topic is j, e.g., the likelihood of observing the word "programmed" if the topic is apoptosis. Furthermore, we need to consider whether topic j is a major topic of the document, judging from the topic context of the text, which we learn from the second term. Translated into English, the second term reads: "the label zi is more likely to be topic j if many other words in the document belong to topic j." This is equivalent to saying that the word "programmed" is more likely to belong to the topic apoptosis if there are other words in the document belonging to the same topic, such as "apoptosis," "death," and "cell." Thus, the topic inference process of the LDA model captures the key relationship between words and concepts and agrees well with the human inference process. A compact sketch of this sampler is shown below.
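The following is a compact illustration of the collapsed Gibbs sampler in Python (not the authors' C implementation; hyperparameter values are arbitrary). Each token's topic is resampled from the product of the word-likelihood term and the document-context term:

```python
import numpy as np

def lda_gibbs(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    nwt = np.zeros((V, T))          # word-topic counts
    ndt = np.zeros((len(docs), T))  # document-topic counts
    nt = np.zeros(T)                # total tokens per topic
    z = [rng.integers(0, T, size=len(d)) for d in docs]  # random initialization

    for d, doc in enumerate(docs):  # seed the count tables
        for i, w in enumerate(doc):
            nwt[w, z[d][i]] += 1; ndt[d, z[d][i]] += 1; nt[z[d][i]] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                nwt[w, j] -= 1; ndt[d, j] -= 1; nt[j] -= 1  # the "-i" counts
                # word-likelihood term * document-context term (unnormalized)
                p = (nwt[w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j
                nwt[w, j] += 1; ndt[d, j] += 1; nt[j] += 1
    return nwt, ndt, z
```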

C.4 Identifying Biological Concepts from Text

Data. We implemented the Gibbs sampling inference algorithm of the LDA model in the C language. We constructed a corpus of MEDLINE documents based on information from the Gene Ontology Annotation project (Camon et al., 2004), which we refer to as the GOA corpus. The GOA project maintains GO annotations for the proteins in the UniProt (Bairoch et al., 2005) database. Each protein is annotated with one or more GO terms, and some of these annotations are associated with PubMed identification numbers (PMIDs), indicating that the annotation is based on reading the literature. We extracted all the PMIDs and associated GO terms from the data set. The corresponding MEDLINE records were downloaded from the NLM web site, and the titles and abstracts were extracted. The corpus consists of 25,005 MEDLINE records. We preprocessed the corpus by removing "stop words," stemming the words with Porter's stemmer (Porter, 1980), and discarding words that occurred fewer than 5 times in the corpus (a minimal sketch of this pipeline is shown below).

Extracting Semantic Concepts. When applied to the GOA corpus, the LDA model not only identified semantic concepts/topics from the corpus but also probabilistically labeled the words of each document to indicate which topic likely generated them. Figure 3 shows an example of a MEDLINE abstract labeled by the LDA model. In this figure, the words not on the "stop words" list are labeled with a topic index number, and we highlight the words from the two major topics of the abstract: #73 and #147. The 10 words with the highest probability for topic #73 are: "mitochondri, mitochondria, cytochrom, inner, outer, respiratori, carrier, mtdna, space, nadh"; the 10 highest-probability words for topic #147 are: "apopotosi, death, caspas, apoptot, induc, bcl, fa, surviv, program, bax". Clearly, #73 is a topic regarding the mitochondrion and #147 is related to apoptosis. Note that the words "outer" and "space" would usually be considered neutral words that may refer to different objects in different contexts. Here, they were correctly labeled with topic #73 due to their association with the context of mitochondrion in the literature. The example clearly demonstrates the algorithm's ability to extract concepts and provide information on the semantic context of the document. Based on such labeling, it is not difficult to recommend GO terms related to the concepts of mitochondrion and apoptosis.
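A minimal sketch of the preprocessing pipeline described above, assuming NLTK's implementation of the Porter stemmer; the abbreviated stop-word list is an illustrative assumption, not the list actually used:

```python
from collections import Counter
from nltk.stem import PorterStemmer  # Porter (1980) algorithm

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "that", "we"}

def preprocess(abstracts, min_count=5):
    """Stop-word removal, Porter stemming, and rare-stem filtering."""
    stem = PorterStemmer().stem
    docs = [[stem(w) for w in text.lower().split()
             if w.isalpha() and w not in STOP_WORDS] for text in abstracts]
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]

# Note how stemming conflates related surface forms: both "ribosomal" and
# "ribosome" reduce to the stem "ribosom" (the stem seen in Table 1 below).
print(PorterStemmer().stem("ribosomal"), PorterStemmer().stem("ribosome"))
```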

Mitochondria[73] play a key part[160] in the regulation[113] of apoptosis[147] (cell[200] death[147]). Their intermembrane[73] space[73] contains[131] several proteins[265] that are liberated[224] through the outer[73] membrane[219] in order[294] to participate[87] in the degradation[299] phase[209] of apoptosis[147]. Here we report[33] the identification[208] and cloning of an apoptosis-inducing[147] factor[19], AIF[147], which is sufficient[3] to induce[147] apoptosis[147] of isolated[76] nuclei[191]. AIF[147] is a flavoprotein[73] of relative[122] molecular[177] mass[185] 57,000 which shares[168] homology[212] with the bacterial[213] oxidoreductases[73]; it is normally[122] confined[123] to mitochondria[73] but translocates[166] to the nucleus[191] when apoptosis[147] is induced[147]. Recombinant[279] aif[147] causes[141] chromatin[51] condensation[279] in isolated[76] nuclei[191] and large-scale[41] fragmentation[174] of dna[126]. It induces[147] purified[213] mitochondria[73] to release[5] the apoptogenic[147] proteins[265] cytochrome[73] c and caspase9[147]. Microinjection[217] of aif[147] into the cytoplasm[81] of intact[257] cells[200] induces[147] condensation[279] of chromatin[51], dissipation[292] of the mitochondrial[73] transmembrane[206] potential[64], and exposure[280] of phosphatidylserine[68] in the plasma[219] membrane[219]. None of these effects[257] is prevented[147] by the wide-ranging[132] caspase[147] inhibitor[170] known[140] as zvad.fmk[172]. Overexpression[150] of bcl2[147], which controls[113] the opening[101] of mitochondrial[73] permeability transition[209] pores[191], prevents[147] the release[5] of aif[147] from the mitochondrion[73] but does not affect[257] its apoptogenic[147] activity[23]. These results[150] indicate[144] that aif[147] is a mitochondrial[73] effector[147] of apoptotic[147] cell[200] death[147].

Figure 3. An example of the semantic analysis of a MEDLINE abstract (PMID 9989411). The topic of each word, as labeled by the LDA model, is shown as a bracketed index; the words from topics #73 and #147 are highlighted. The abstract is associated with the following GO terms: (1) GO:0008630, DNA damage response, signal transduction resulting in induction of apoptosis; (2) GO:0009055, electron carrier activity; (3) GO:0005739, mitochondrion; and (4) GO:0006309, DNA fragmentation during apoptosis.

Table 1. Topic-GO association MI as an indicator of the biological relevance of topics. In the high-MI pairs, the topic words are all closely related to the corresponding GO terms.

Topic # | GO ID | MI | GO Category | GO Term | Most Frequent Topic Words
278 | GO:0005730 | 0.001439 | Component | nucleolus | ribosom rrna pre deplet process small nucleolar biogenesi accumul nucleolu
105 | GO:0005816 | 0.00119 | Component | spindle pole body | microtubul spindl mitot tubulin kinetochor mitosi centrosom pole centromer bodi
236 | GO:0006935 | 0.00186 | Process | chemotaxis | lymphocyt macrophag chemokin monocyt neutrophil inflammatory leukocyt peripher mcp cd8
156 | GO:0006468 | 0.001514 | Process | protein amino acid phosphorylation | kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20
156 | GO:0004674 | 0.001148 | Function | protein serine/threonine kinase activity | kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20
267 | GO:0008248 | 0.001463 | Function | pre-mRNA splicing factor activity | splice altern pre snrnp mrna spliceosom u2 step sap snrna
224 | GO:0015671 | 5.05E-06 | Process | oxygen transport | uniqu characterist featur extens character typic possess unusu exhibit
227 | GO:0015213 | 5.00E-06 | Function | uridine transporter activity | function defin unknown perform wide thei tissu repress consist creat

Association of Semantic Topics with GO. Each document in our corpus is associated with one or more GO terms. Although the training of the LDA model was performed without using GO terms, the availability of this information offered us an opportunity to study the correlation between the latent topics and the GO annotations. If the inference process of the LDA model simulates the first step of human annotation (extracting semantic concepts from text), then studying the relationship between topics and GO annotations simulates the second step (mapping semantic concepts to GO terms). During inference, the LDA model labels the topic of each word, as shown in Figure 3. Labeling words not only provides information on the semantic context of the document, but also allows us to tease out the topic-GO relationship by studying the many-to-many associations between multiple topics and GO annotations. In comparison, a single-class classification approach works at the document level and would fail to explain the multiple GO annotations associated with a text. The correlation between the latent topics and GO terms can be quantified by mutual information (MI). Mutual information is a symmetric, non-negative quantity that measures the relevance (amount of information) of one variable with respect to another, and it equals zero if and only if the variables are statistically independent. The MI between a latent topic and a GO term was calculated as follows:

    I(A_g; L_t) = \sum_{a \in \{0,1\}} \sum_{l \in \{0,1\}} p(A_g = a, L_t = l) \log \frac{p(A_g = a, L_t = l)}{p(A_g = a)\, p(L_t = l)}

where I(Ag; Lt) is the mutual information between annotating a word with GO term g and labeling the word with topic t; Ag and Lt are binary variables indicating whether a word is annotated with the GO term g and assigned to the topic t, respectively. The joint and marginal probabilities in the equation were estimated empirically by counting events.
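As a concrete illustration of this estimation, the following minimal Python sketch computes I(Ag; Lt) from a 2x2 table of word-event counts; the counts shown are hypothetical:

```python
# Minimal MI estimator over binary (A_g, L_t) word events; counts are toy data.
import numpy as np

def mutual_information(counts):
    """counts: 2x2 joint event counts over binary (A_g, L_t)."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()                        # empirical joint p(a, l)
    pa = p.sum(axis=1, keepdims=True)   # marginal p(a)
    pl = p.sum(axis=0, keepdims=True)   # marginal p(l)
    mask = p > 0                        # skip zero cells (0 log 0 = 0)
    return float((p[mask] * np.log(p[mask] / (pa * pl)[mask])).sum())

# Hypothetical counts: rows index A_g (0/1), columns index L_t (0/1).
print(mutual_information([[90000, 4000], [3000, 3000]]))
```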

We calculated the mutual information, I(Ag; Lt), for all topic-GO pairs. Table 1 shows examples of topic-GO associations. The top 6 rows are topic-GO associations with high MI values, while the bottom two rows are examples of associations with low MI. Overall, when the MI value of an association is high, the GO term matches the semantic content of the latent topic very well, whereas the low-MI topic-GO associations are semantically mismatched. Note that MI is not a normalized quantity and its numeric value is sensitive to the size of the data, which determines the empirical probability estimates; therefore the absolute MI value is sample dependent.

C.5 Sparse Training Cases Make Automatic Annotation Difficult
Our preliminary experiments on learning automatic annotation with a classification approach were not satisfactory in terms of accuracy (unpublished data). We further investigated the reasons for this difficulty. The Gene Ontology consortium adopts the principle of annotating a protein with GO terms that are as specific as possible. This leads to an interesting phenomenon, as shown in Figure 4. Here, we plot the histogram of the number of associated MEDLINE documents per GO term in the GOA corpus. As we can see, out of 6,565 unique GO terms, more than 2,000 GO terms have only one record associated with them, and over 5,000 GO terms are associated with 5 records or fewer. In the current training corpus, the GO terms are assigned by human curators, and the annotations are based on a thorough semantic understanding of the documents as well as domain knowledge outside the documents; thus they tend to be very specific. It is this pursuit of specificity that leads to the sparse training data, which makes it difficult for most machine learning methods to mimic human annotation. We understand that it is highly desirable to assign GO terms that are as specific as possible to describe protein functions. However, the pursuit of specificity leads to sparseness of the training data, which renders automatic annotation extremely difficult, if not impossible. First, learning to predict GO terms from such sparse training data is very unreliable. Second, specific annotation usually requires thorough semantic understanding of a few specific statements, and understanding such statements may require knowledge that is not present in the text. A good example of the latter is shown in Figure 3, where the abstract is annotated with the GO term GO:0008630, whose definition is "DNA damage response, signal transduction resulting in induction of apoptosis". In fact, the words "DNA damage response" are not explicitly mentioned in the abstract, and the curator must have deduced the concept based on the sentence: "Microinjection of aif into the cytoplasm of intact cells induces condensation of chromatin."


Table 2. Top 5 GO terms associated with Topic #147

GO ID | MI | Term Name
GO:0006917 | 1.09e-3 | induction of apoptosis
GO:0006915 | 1.00e-3 | apoptosis
GO:0008632 | 2.81e-4 | apoptotic program
GO:0006309 | 2.37e-4 | DNA fragmentation during apoptosis
GO:0008637 | 1.03e-4 | apoptotic mitochondrial changes

Most frequent words of Topic #147: apoptosis, death, caspas, induc, apoptot, fa, program, surviv, ic, ced

[Figure 4: histogram; x-axis: # Associated Docs; y-axis: Frequency.]
Figure 4. Histogram of the number of associated MEDLINE records per GO term. Note that over 2,000 GO terms have only one record associated with them.


At the current stage, computational agents based on contemporary biomedical language processing techniques cannot achieve such inference and specific understanding of human language (Hirschman et al., 2005; Hunter and Cohen, 2006).

Another potential difficulty introduced by pursuing specific annotation is that documents with similar word (feature) composition may be labeled with different classes, which renders these classes inseparable from a classification point of view. As an example, we list in Table 2 the high-frequency words of topic #147 together with the top 5 biological process GO terms that have the strongest association with the topic. Apparently, the LDA model detected the common presence of words from topic #147 in the documents associated with these GO terms, and thus returned high MI for their associations. If these documents are labeled with different classes corresponding to these GO terms and used to train text classifiers, even a state-of-the-art classifier may not perform well. This is not only because of the sparseness of training cases when the documents are labeled with highly specific classes, but also because documents with similar words (inseparable features) are labeled with different classes, which may confuse most learning algorithms.

C.6 Enhanced Text Classification with Semantic-Enriched Features
To investigate the value of the semantic analysis by the LDA model for text categorization tasks, we participated in the TREC 2005 Genomics Track evaluation (Lu et al., 2006; Zhai et al., 2005) and used this widely attended evaluation as a test platform. We are interested in this task because it is highly relevant to the automatic annotation task, which is commonly cast as a classification problem (Hirschman et al., 2005). Another goal of participating in TREC was to establish a real-world baseline of state-of-the-art performance that can be used for comparison with the methodology developed in this study. We initially concentrated on applying the support vector machine (SVM) algorithm, which is arguably one of the best text classifiers (Lewis et al., 2004), and our results demonstrated that representing text documents with semantic-enriched features, together with data augmentation, improves the text categorization performance of the SVM. The TREC 2005 training data consist of 5,837 full-text articles from 3 journals. There are 4 text categorization subtasks: (1) allele (A) mutation, 338 positive cases; (2) gene expression (E), 81 positive cases; (3) gene ontology (G), 462 positive cases; and (4) tumor (T), 36 positive cases. It can be seen that some tasks have very few positive training cases, especially the E and T subtasks. Again, these tasks exemplify the difficulties commonly encountered during automatic annotation: high dimensionality and sparse training cases. We address these problems with probabilistic semantic analysis, which significantly improves the performance of a state-of-the-art text classifier.

When used as input for a text categorization algorithm, a text is commonly represented as a vector whose dimension is the vocabulary size, a representation we refer to as VocRep. A text in VocRep is usually a sparse vector in a high-dimensional space, where data points tend to spread far apart. The ambiguities of natural language, i.e., polysemy and synonymy, further complicate the situation in that two documents with similar semantic content can be far apart in this space due to different preferences in word usage. Learning under such a scenario is prone to overfitting.


[Figure 5: three bar charts over subtasks A, E, G, and T (y-axes 0.00-1.00 for recall and utility, 0.00-0.40 for F-score), comparing VocRep and SemRep.]
Figure 5. Effect of SemRep on text classification. Panel A shows the effect on recall; Panel B shows the effect on F-score; and Panel C shows the effect on utility.


We tested the hypothesis that these problems can be alleviated by representing the text documents in the semantic topic space rather than in the vocabulary space. We performed semantic analysis on the corpus with LDA and extracted 400 topics. Then, each document was represented as a vector of topics, a representation referred to as SemRep. Thus, a document is represented in a vector space whose dimensionality is the number of semantic topics (in this case 400), where each element of the vector contains the number of words in the document belonging to the corresponding topic. A widely used implementation of the support vector machine, SVMlight (Joachims, 1998), was used to train classifiers on the data in both the VocRep and SemRep forms, and the performance was evaluated using common information retrieval metrics: (a) recall, the percentage of positive cases retrieved; (b) precision, the percentage of retrieved cases that are true positives; (c) the F-score, the harmonic mean of recall and precision; (d) utility, the sum of the utility scores for the true positive and false positive cases; and (e) the area under the receiver operating characteristic (ROC) curve (Bradley, 1997). Figure 5 shows that SemRep significantly improved the recall of the SVM on this data set, with some subtasks showing more than a one-fold increase. In addition, Figure 6 shows that the areas under the ROC curve increased by 6.19%, 20.86%, 17.85%, and 13.56% for subtasks A, E, G, and T, respectively. Thus, SemRep resulted in an across-the-board improvement in the overall text classification performance of SVMlight. Text classification in the SemRep space has the following key advantages: (1) text is classified according to its semantic content, which, after all, is the goal of text categorization; (2) projecting text into the semantic space allows documents that share few common words but have the same semantic content to appear closer in the vector space, which increases the sensitivity of the classifier; (3) the reduced dimensionality increases the generalizability of the trained classifiers.
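The construction of SemRep vectors and the classification step can be sketched as follows; scikit-learn's LinearSVC stands in for SVMlight here, and the tiny corpus and class labels are illustrative assumptions:

```python
# Minimal SemRep sketch: represent each document by counts of word-topic
# assignments, then train a linear SVM in the semantic space.
import numpy as np
from sklearn.svm import LinearSVC

def semrep_vectors(token_topics, K):
    """token_topics: per-document list of LDA topic labels, one per token."""
    X = np.zeros((len(token_topics), K))
    for d, zd in enumerate(token_topics):
        for t in zd:
            X[d, t] += 1          # words in document d assigned to topic t
    return X

token_topics = [[0, 0, 1], [1, 1, 2], [2, 2, 2], [0, 1, 0]]  # toy labels
y = np.array([1, 0, 0, 1])                                    # toy classes
X = semrep_vectors(token_topics, K=3)
clf = LinearSVC().fit(X, y)       # classify in the 3-dimensional topic space
print(clf.predict(X))
```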

We further addressed the sparse-training-case problem by performing data augmentation. The idea is to impute additional potential positive cases into the training data in order to train more robust classifiers. Often, the lack of positive training data is due to the limitations of manually labeling training cases. On the other hand, there is usually a large amount of unlabeled data potentially containing more positive cases which, if correctly identified, can be used to enhance the training of classifiers. We adopted a semi-supervised learning technique (Zhu et al., 2003) to identify potential positive cases from the unlabeled data. Indeed, this approach improved the performance of SVM classification, with the most significant improvements observed in the categories that have very few positive training cases (Lu et al., 2006). This experiment demonstrated that correctly grouping documents with similar features enhances classification performance in a scenario of sparse training data, which is exactly one of the difficulties facing the automatic protein annotation effort.
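A minimal sketch of the augmentation idea follows; this simple self-training loop is a stand-in for the graph-based semi-supervised method of Zhu et al. (2003), and the confidence threshold and toy data are illustrative assumptions:

```python
# Minimal self-training sketch: confidently scored unlabeled documents are
# imputed as positive training cases. Threshold and data are toy assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def augment_positives(X_lab, y_lab, X_unlab, threshold=0.9):
    clf = LogisticRegression().fit(X_lab, y_lab)
    prob = clf.predict_proba(X_unlab)[:, 1]      # P(positive) per document
    picked = prob > threshold                    # impute only confident cases
    X_aug = np.vstack([X_lab, X_unlab[picked]])
    y_aug = np.concatenate([y_lab, np.ones(int(picked.sum()))])
    return X_aug, y_aug

X_lab = np.array([[3.0, 0.0], [0.0, 3.0]]); y_lab = np.array([1, 0])
X_unlab = np.array([[2.9, 0.1], [0.2, 2.8], [1.5, 1.5]])
print(augment_positives(X_lab, y_lab, X_unlab)[1])   # augmented labels
```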

C.7 Mixture language models for gene/protein information retrieval
We also collaboratively participated in the ad hoc retrieval task of the 2005 Text Retrieval Conference (TREC) Genomics Track (Lu et al., 2006; Zhai et al., 2005) and took this opportunity to evaluate some of the proposed techniques for improving accuracy in retrieving biomedical abstracts/articles about a gene or protein. In particular, we evaluated several variants of mixture language models, and the results show that these advanced language models indeed help improve retrieval accuracy through more accurate modeling of the relevance of an article with respect to a query. Statistical language models have recently been shown to have many advantages over traditional retrieval methods, including superior empirical performance and a more solid foundation in statistics [Z5, Z6]. These preliminary results further confirm the effectiveness of using such approaches to retrieve biomedical literature.

The TREC 2005 Genomics Track ad hoc retrieval task is to retrieve relevant literature abstracts from a 10-year, 4.5-million-document subset of the MEDLINE bibliographic database for 50 queries that reflect real information needs of biologists [Z4]. We submitted two official runs (UIUCgAuto and UIUCgInt), both using mixture language models to expand a query with related words that occur frequently in the context of the query terms, an idea we will further explore in this project.


In comparison with the 59 runs submitted by 32 groups to the ad hoc retrieval task, our results are quite impressive: both of our runs rank among the top 10 (6th and 7th, respectively), and only three groups placed above us [Z4].

In our post-TREC experiments, we found that our query expansion method indeed outperforms the baseline language modeling method, which does not perform query expansion and represents a query only by a protein name. (See D.2.1 for an explanation of this baseline method.) Table 3 shows a detailed comparison of the mixture-model query expansion method and the no-expansion baseline in terms of three retrieval measures: (1) Mean Average Precision (MAP), which measures overall ranking accuracy; (2) precision at the top 10 documents (Prec@10), which measures how many relevant documents appear in the top 10; and (3) the total number of relevant documents retrieved in the top 1,000 documents (RelRet). We see that the expansion language model outperforms the baseline on all three measures, due to the incorporation of additional words mined from the top-ranked documents and an improved weighting of the terms in the expanded query.

These preliminary results show that our idea of using a mixture model to expand a query with additional context words is effective for improving retrieval performance on biomedical literature data. In D.2.1, we further propose to apply such a model to improve the weighting of additional gene synonyms extracted from an external resource such as GeneRef.

C.8 Hidden Markov models for relevant passage extraction
In order to extract supporting text fragments from the literature for a predicted GO annotation, we propose in D.2.4 to use hidden Markov models (HMMs) [Z3] to leverage the topics extracted with the LDA mixture model. In this subsection, we present some promising results of using a similar HMM to extract a relevant passage from a relevant document in retrieval. Specifically, the task here is to identify the part of a long relevant article most relevant to a query, and we construct an HMM with a single topic state to model the relevant content and several background states to model non-relevant content. In our previous work, we evaluated such an HMM on several TREC data sets, and the results show that the HMM approach can effectively extract variable-length relevant passages [Z2].
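The decoding idea can be sketched with a minimal two-state HMM (one relevant state, one background state); the topology, transition probability, and toy emission scores below are illustrative assumptions, simplified from the multi-background-state model:

```python
# Minimal Viterbi decoding for a two-state relevant/background HMM; the
# decoded run of 1s is the variable-length "relevant passage".
import numpy as np

def viterbi_passage(emis_rel, emis_bg, p_stay=0.9):
    """emis_*: per-position log-emission under each state; returns state path."""
    n = len(emis_rel)
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))   # 0 = background, 1 = relevant
    V = np.full((n, 2), -np.inf)
    back = np.zeros((n, 2), dtype=int)
    V[0] = [emis_bg[0], emis_rel[0]]                  # uniform start prior
    for i in range(1, n):
        for s, e in enumerate((emis_bg[i], emis_rel[i])):
            scores = V[i - 1] + logA[:, s]            # best way to enter state s
            back[i, s] = scores.argmax()
            V[i, s] = scores.max() + e
    path = [int(V[-1].argmax())]
    for i in range(n - 1, 0, -1):                     # trace back the best path
        path.append(int(back[i, path[-1]]))
    return path[::-1]

emis_rel = np.log([0.1, 0.1, 0.6, 0.7, 0.6, 0.1])     # toy per-sentence scores
emis_bg = np.log([0.6, 0.6, 0.1, 0.1, 0.1, 0.6])
print(viterbi_passage(emis_rel, emis_bg))             # e.g., [0, 0, 1, 1, 1, 0]
```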

In Table 4, we show results of using a variant of this HMM (HMM-cd) to extract relevant passages from relevant documents, in comparison with four baseline methods (BL-s, BL-win, BL-cos, BL-pivoted) that represent several state-of-the-art retrieval approaches. These baseline methods use a fixed-size sliding window and differ mainly in how they measure the similarity between the query and the candidate passage. We compare the HMM method with these methods in terms of precision, recall, and F1 (a combination of precision and recall) on two different TREC data sets (DOE and HARD04) [Z1]. From Table 4, we see that the HMM method outperforms all the baseline methods on all measures except precision on HARD04, where the HMM method is only slightly worse than the best baseline. These results show that our general idea of using HMMs to identify the most relevant supporting text fragments for a predicted GO annotation is reasonable and can be expected to be effective.



Table 3. Language Model Feedback on TREC 2005 Genomics Track Data

Method | MAP | Prec@10 | RelRet
Baseline Language Model | 0.2415 | 0.382 | 3340
Expansion Language Model | 0.2577 | 0.412 | 3476
Improvement | +6.7% | +7.9% | +4.1%

Table 4. Effectiveness of HMM for relevant passage extraction

Collection | Method | Precision | Recall | F1
DOE | BL-s | 0.869 | 0.591 | 0.632
DOE | BL-win | 0.779 | 0.777 | 0.730
DOE | BL-cos | 0.764 | 0.763 | 0.717
DOE | BL-pivoted | 0.749 | 0.745 | 0.701
DOE | HMM-cd | 0.941 | 0.858 | 0.862
HARD04 | BL-s | 0.670 | 0.909 | 0.666
HARD04 | BL-win | 0.668 | 0.759 | 0.621
HARD04 | BL-cos | 0.671 | 0.781 | 0.628
HARD04 | BL-pivoted | 0.672 | 0.783 | 0.629
HARD04 | HMM-cd | 0.671 | 0.969 | 0.706


C.9 Summary
In this section, we reported our preliminary results in several research areas that are directly related to the specific aims of this proposal. Overall, these methods can be used to (1) extract biological concepts from the biomedical literature; (2) associate the identified biological concepts with a controlled vocabulary; (3) enhance text categorization; (4) enhance retrieval of protein-related text documents; and (5) identify supporting evidence for automatic annotation. These methods lay the foundation for the methodologies proposed in this study, which will enhance their performance and thereby lead to better automatic annotation techniques.

D. RESEARCH DESIGNS AND METHODS
D.1 SPECIFIC AIM 1: Identify/extract Descriptive Biological Concepts from MEDLINE Documents
In this specific aim, we propose to develop tools to identify/extract common, descriptive biological concepts that reflect the current knowledge of proteins and can be used for concise and consistent protein annotation. Annotating proteins with these concepts or their corresponding GO terms allows us to address the data sparseness and feature-overlapping problems, thus laying a foundation for a future automatic annotation system. To achieve these goals, we will (1) fine-tune the LDA model to identify finer-grained biological concepts; (2) develop a mixture of LDA analyzers model to identify domains of biomedical knowledge and domain-specific concepts; (3) use information bottleneck methods to extract informative concepts and a corresponding set of descriptive GO terms; and (4) associate the identified topics with representative GO terms using manual and automatic approaches.

D.1.1 Training Corpora
GOA Corpus. We will continue to use the updated annotation data from the GOA project. Since GOA is an ongoing project, we anticipate that more well-annotated MEDLINE documents will become available during the period of this project, which we can exploit for training. This data set contains high-quality human annotations based on the biomedical literature. The key advantage of using this data set is that it provides examples of associations between GO terms and the literature, which allows us to study the relationship between semantic topics and GO annotations. In addition, the corpus consists of literature that is known to be pertinent to proteins; thus the semantic concepts from this corpus are more relevant to our task. We will also create an enhanced GOA corpus by retrieving additional MEDLINE documents deemed highly relevant to the proteins in the GOA database, using the information retrieval algorithms developed in Section D.3. This is of interest because, during a study creating a protein-semantic-topic network (Zheng and Lu, 2006), we observed that the number of papers linked to each protein remains relatively small, although many proteins are extensively studied and their functions well known. We conjecture that annotators may choose only the most representative literature as the evidence for annotation. Such enhanced document retrieval not only enriches the information content of the protein-related corpus, but also strengthens the confidence of automatic annotation by summarizing more references. We will also test using a self-training algorithm to impute potential GO labels for these additional records, so that they can be used for training the annotation algorithms.
Large Sample of MEDLINE. The rationale for collecting and experimenting on a large MEDLINE corpus is severalfold: (1) developing and evaluating probabilistic topic models capable of identifying semantic topics across various domains of biomedical knowledge, e.g., the mixture of LDA analyzers model; (2) identifying representative protein-relevant semantic topics as the basis for protein annotation; and (3) training and testing the information retrieval algorithms developed in this project. These goals lay a foundation for the future development of an automatic protein annotation system, because they correspond to the critical steps of the automatic annotation process: retrieving protein-related documents, extracting the semantic contents of the documents, and mapping them to controlled vocabularies. Initially, we will use the 10-year MEDLINE corpus from the TREC 2005 Genomics Track. In addition, we have subscribed to the NLM PubMed database, allowing us to download more MEDLINE documents if needed.
Full Text from PubMed Central. Our current experiments were performed with MEDLINE titles and abstracts only. As pointed out by Ray and Craven (Ray and Craven, 2005), the MEDLINE titles and abstracts in the GOA corpus are weak training data, in that some concepts used for annotation may not be observed in the title and abstract but only in the full text body.


Therefore, we will also experiment with full-text documents as training data and test the hypothesis that full-text semantic analysis will improve automatic annotation due to the richer information contained in the text. The full-text documents will be used for two main purposes: (1) training the annotation algorithms described in Section D.2; and (2) training the information retrieval, document segmentation, and evidence extraction algorithms described in Section D.3. For training annotation algorithms, we will create a full-text corpus by retrieving the available journal articles corresponding to the MEDLINE records in the GOA corpus from PubMed Central (http://www.pubmedcentral.nih.gov/) and the e-journals subscribed to by our institute. Thus, this corpus will have all the necessary information, i.e., free text and GO annotations, for training annotation algorithms. For training the information retrieval algorithms, we will initially utilize the TREC 2006 corpus, which contains a large number of full-text articles from 49 journals (TREC Genomics, http://ir.ohsu.edu/genomics/).
D.1.2 Extract Descriptive Biological Concepts by Fine-Tuning LDA Models
Rationale. The goal of this subtask is to fine-tune the LDA model in order to extract a set of descriptive biological concepts. We refer to such a set of concepts as descriptive because, due to the generative nature of the LDA model, the topics/concepts extracted by LDA-based models are the recurring common concepts used to describe the function of proteins. Thus, these topics concisely represent the major concepts within the corpus. This idea corroborates a similar practice by the GO Consortium: using a set of about 200 hand-picked high-level GO terms, referred to as "slim" GO terms, to represent the important areas of biological knowledge about proteins (GO Slim, http://www.geneontology.org/GO.slims.shtml). We believe that the descriptive concepts from LDA-based models are a potentially more natural representation of our knowledge than the hand-picked GO Slim terms, in that the biological topics extracted by LDA-based models reflect the recurring themes of the corpus.

Experiments. We will investigate the effects of fine-tuning the parameters of the LDA model, training with larger corpora, and using alternative priors on the quality of the identified semantic topics, and we will evaluate the models as follows:
1. Increase the training corpus size. The assumption behind this approach is that a larger corpus allows us to extract more specific concepts without overfitting the data. For this experiment, we will use the up-to-date GOA corpus and/or the enhanced GOA corpus to see if finer-grained topics can be extracted and modeled. We anticipate that, as the size of the training corpus increases, the overfitting problem will become less of an issue. Under such conditions, parameter settings that prefer finer granularity will generate more specific semantic concepts than the current ones. Note that, with the parameters fixed, Bayesian model selection will effectively avoid overfitting.

2. Incorporate newly developed algorithms. Recently, Blei and Lafferty reported an enhancement of the LDA model referred to as the correlated LDA model (Blei and Lafferty, 2005). In this model, the original Dirichlet prior distribution for the topic-content variable of the LDA model is replaced by a logistic normal distribution, such that the correlation among the topic contents of documents is captured more effectively. The report indicates that this modification enhances the model's ability to identify topics; the method thus adds another dimension for improving the semantic analysis of biomedical literature.

Evaluations. We will evaluate the results of the experiments as follows:
1. Evaluating goodness-of-fit. Since this experiment is related to language modeling, we will use a well-established metric, perplexity, to evaluate the goodness-of-fit of a model and to compare different models (the formula is given after this list). Perplexity is a measure from information theory, equivalent to the inverse geometric mean of the per-word likelihood; a lower perplexity value generally indicates better generalization of a model (Blei et al., 2003; Manning and Schutze, 1999). During evaluation, we will train models on a training set and calculate the perplexity of each model on a held-out test set.

2. Evaluating the biological relevance of the concepts. For the concepts extracted from the GOA corpus, we will evaluate their biological relevance according to the methods described in our previous report (Zheng et al., 2006), namely through both manual inspection and evaluation of the mutual information between the topics and the annotated GO terms. Note that, since the concepts extracted by the LDA model are readily understandable to biologists, it is feasible to inspect the topics (a few hundred) manually; see Section D.1.5 for the detailed plan of the manual evaluation.


3. Assigning representative GO terms to concepts. For the concepts deemed biologically relevant, we will assign the most representative GO term, determined by manual annotation or by employing the GO Categorizer algorithm (Joslyn et al., 2004). Given a set (cluster) of GO terms associated with a concept, the GO Categorizer algorithm returns a list of these GO terms ordered by their capability of representing the overall semantic meaning of the set, by studying the relationships among the GO terms on the GO graph and employing the discrete mathematics of finite partially ordered sets. The current stand-alone version of the program is publicly available.
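For reference, the goodness-of-fit comparison in item 1 will use the standard corpus-level perplexity (Blei et al., 2003), where w_d denotes the d-th held-out document and N_d its length:

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}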

In summary, these experiments will enable us to evaluate LDA models trained with different parameter settings and to identify biologically relevant concepts. We expect to identify a set of descriptive biological concepts from the GOA corpus that are relevant to protein function and useful for efficient annotation.
D.1.3 Extract Biological Concepts with a Mixture of LDA Analyzers
Rationale. The goal of this subtask is to extend the LDA model to a mixture of LDA analyzers, to identify domains of biomedical knowledge and domain-specific concepts. This algorithm captures the statistical structure within a corpus at both the word and document levels, by grouping words into semantic topics and clustering documents according to their topic contents. Biomedical literature from different biomedical fields constitutes domains of biomedical knowledge, and each domain may contain its own set of common semantic concepts. For example, the dental medicine literature may have a set of concepts distinct from those of the molecular biology literature, and very often the topics from different domains do not overlap. To deal with this diversity of concepts in a manual annotation setting, a natural approach is to assign experts specialized in different domains to index the documents from the domains of their expertise. This can be simulated by the statistical model of a mixture of LDA analyzers, in which each individual LDA analyzer fits a collection of documents from a specific domain. The advantage of this method is that the processes of capturing semantic topics and clustering documents are interleaved; thus information at both the word and document levels is preserved. Note that the Dirichlet prior of a flat LDA model assigns a non-zero prior probability to every topic in a document. Thus, a flat LDA model will entertain the possibility that all topics exist in a document, even those from different domains, and will assign words to them. This leads to deteriorated performance when the number of topics becomes large. The mixture of LDA analyzers model alleviates this problem by grouping documents with similar topic contents into clusters and modeling the documents within a cluster with a relatively small number of related topics. The overall topical diversity is captured by the different LDA analyzers in the mixture model. This approach has been successfully applied to various machine learning problems (Bishop et al., 2000; Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002).
Model specification and statistical inference. The inference algorithm for a mixture of LDA analyzers for text mining has not been reported previously. Here, we specify the model and briefly discuss our inference algorithm. As shown in Figure 7, the model looks quite similar to the flat LDA model except for the additional variables π and c and the multiple copies of the word distribution parameter β. More specifically, a cluster indicator variable c is added to each document, indicating which analyzer generated the document; a new cluster prior distribution, π, is a parameter vector for a multinomial distribution; finally, the topic-word distribution β consists of K copies of the topic-word matrix, one per analyzer. The other variables, w, z, and θ, are defined as in the flat LDA model. Intuitively, the process of "generating" a document with a mixture of LDA analyzers model is as follows: (1) select the domain/cluster c to which the paper belongs; (2) sample the topic distribution θ within this domain; (3) for each word wn, choose a topic zn according to the topic distribution θ, and generate the word according to the word distribution βzn(c).

We have derived the variational inference algorithm for the mixture of LDA analyzers model, as shown in Appendix 2. The inference algorithm for this model is very similar to that for the flat LDA model (Blei et al., 2003), except that we also need to estimate which domain likely generated each observed document. We have extensive experience with variational methods (Lu et al., 2004a), and we do not anticipate major difficulties with the implementation.
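The generative process just described can be sketched as follows; all dimensions and parameter values are illustrative assumptions, and this shows only the generative view, not the variational inference code:

```python
# Minimal generative sketch of the mixture of LDA analyzers: pick a cluster,
# sample topic proportions, then emit words from that cluster's topic-word
# distributions. All parameters below are toy assumptions.
import numpy as np

def generate_doc(pi, alpha, beta, n_words, rng=np.random.default_rng(0)):
    c = rng.choice(len(pi), p=pi)             # (1) pick a domain/cluster
    theta = rng.dirichlet(alpha)              # (2) sample topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)   # (3) pick a topic
        words.append(rng.choice(beta.shape[2], p=beta[c, z]))  # emit a word
    return c, words

# Toy usage: 2 clusters, 3 topics, 5-word vocabulary.
pi = np.array([0.5, 0.5]); alpha = np.ones(3)
beta = np.full((2, 3, 5), 0.2)                # K copies of topic-word matrix
print(generate_doc(pi, alpha, beta, n_words=4))
```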


[Figure 7: graphical model with plates |C|, Nd, T, and K over the variables w and z.]
Figure 7. Graphical representation of the mixture of LDA analyzers model. Note that there are K copies of the word distribution β.


Data. We will apply the mixture of LDA analyzers to both the GOA corpus and the large 10-year MEDLINE corpus. For the GOA corpus, we will test whether the mixture of LDA analyzers can capture biological concepts more specific than those of the flat LDA. For the 10-year MEDLINE corpus, we will test the feasibility of modeling documents across the full spectrum of biomedical knowledge, which may have impact on automatic annotation tasks beyond protein annotation, e.g., the Automatic Indexing Initiative of NLM.
Model Training. As discussed in Section D.1.2, we will employ both Bayesian model selection and cross-validation techniques to perform model selection. Like the Gibbs sampling approach, the variational method can also approximate the model evidence p(w | Mi), and it has been successfully applied to model selection in several applications, including our previous work (Bishop et al., 2000; Ghahramani and Beal, 1999; Lu et al., 2004a). The approximate evidence can be used for model selection as discussed in (Zheng et al., 2006). We have derived the algorithm, and we are experienced scientific programmers who have already implemented a basic LDA model; therefore, implementing the algorithm in C/C++ will be straightforward.
Evaluation. We will evaluate and compare the performance of the mixture of LDA analyzers model with that of the flat LDA model in the following respects:
1. Comparing the goodness-of-fit. We will first compare the goodness-of-fit of the flat LDA and the mixture of LDA analyzers with the same total number of topics, using perplexity as the metric. This comparison will demonstrate which model fits the data better, and it can readily be achieved by plotting the perplexities of the models with the same total number of topics.

2. Evaluating correct document cluster assignment. For the experiment on the 10-year MEDLINE corpus, we will determine whether the algorithm can correctly group documents from different domains. This can be achieved by checking the journal descriptor indices (Humphrey et al., 2000) of the document clusters to see whether any journal (domain) index is significantly enriched within a cluster. The journal descriptor index is a subset of MeSH terms used by NLM to index the publication type (Humphrey et al., 2000). We will use the Kappa coefficient or chi-square statistics (DeGroot and Schervish, 2002) to evaluate the statistical significance of the cluster assignments (see the sketch after this list).

3. Evaluating the biological relevance of extracted concepts. When a model is trained with the GOA corpus, the biological/protein relevance of the topics can also be determined by their mutual information with GO terms, as demonstrated in our previous study (Zheng et al., 2006). We will also manually inspect the extracted concepts from this corpus and assign biological relevance scores; see Section D.1.5 for details. With scores assigned to the concepts from both the flat LDA and the mixture of LDA analyzers, we will compare the score distributions to see whether the mixture model's scores are better than those of the flat LDA.
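To make the enrichment test in item 2 concrete, the following minimal sketch applies the chi-square test of independence to a hypothetical cluster-by-journal-index contingency table:

```python
# Minimal chi-square enrichment sketch; the contingency table is toy data.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[120, 10, 5],    # cluster 1: counts per journal index
                  [8, 95, 12],     # cluster 2
                  [6, 14, 88]])    # cluster 3
chi2, pval, dof, expected = chi2_contingency(table)
print(chi2, pval)                  # small p-value = indices enriched by cluster
```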

Potential pitfalls and alternative approaches. (1) When modeling the 10-year MEDLINE corpus, computational time may become a concern. Fortunately, the variational methods can be readily parallelized, and we have requested a Dell PowerEdge server with multiple CPUs and large memory to address this difficulty. We are familiar with the MatlabMPI (http://www.ll.mit.edu/MatlabMPI/) interface and can implement a parallelized version of the program for this intensive computation. (2) Another potential problem is that the models may become trapped at local maxima, which can be addressed by multiple random starts. We can also implement a Gibbs sampling version of the mixture of LDA analyzers model to overcome the potential local maxima problem. With the requested multi-CPU computer, we should be able to run the Gibbs sampling experiments within a reasonable time on the GOA corpus, which is relatively small in comparison to the 10-year MEDLINE corpus.

D.1.4 Identify Informative Concepts Using Information Bottleneck Methods
Rationale. Our task of learning automatic annotation based on classification and/or semantic concepts is confronted with two major difficulties. First, the documents associated with some GO terms are not statistically distinguishable, due to the fact that these GO terms may share very similar semantic meanings. Furthermore, the semantic similarity of GO terms potentially introduces human annotation inconsistencies, in which two articles with very similar semantic content can be annotated with semantically close yet different GO terms. To increase the consistency and accuracy of an annotation system, a sensible approach is to merge semantically similar GO terms. Second, not all concepts identified by the LDA model are biologically relevant. This is due to the fact that LDA-based models are generative models, and the training of the LDA model is unsupervised, without utilizing the available GO annotation information. To address these difficulties, we propose to apply the information bottleneck (IB) framework (Tishby et al., 1999) to perform double clustering: (a) cluster words into topics that maintain information with respect to the associated GO terms; and (b) group GO terms into clusters that maintain information with respect to the topics.


The double clustering approach will provide us with a set of descriptive biological concepts and a set of descriptive GO terms matching those concepts, in a unified framework.
Information Bottleneck Methods. The IB method (Tishby et al., 1999) provides a general, information-theoretic approach to identifying the statistical structure of data with respect to some relevance features. Most of the successful studies utilizing IB methods are related to text mining (Friedman et al., 2001; Slonim, 2003; Slonim et al., 2001b; Slonim and Tishby, 1999; Slonim and Tishby, 2000; Slonim and Weiss, 2002). In the IB methods, the task of finding statistical structure (patterns) within data is formulated as a compression task, which can be achieved by partitioning the data X into clusters C, with |C| < |X|. In our case, we want to obtain a soft partition of words into clusters, equivalent to the word distributions representing topics in the LDA model. Furthermore, the partition is constrained such that the original mutual information between the data X and a relevance variable Y, I(X; Y), is retained as much as possible by the new representation, in the form of I(C; Y). More specifically, let X be a set of D discrete variables; let C be a partition of X into K groups, where K and D are the cardinalities of the partition and of the original data, respectively, with K < D; and let Y be a set of discrete relevance variables associated with the instances of X. The IB method seeks a soft (stochastic) partition of X, parameterized by p(c | x), that satisfies the following objective:

    \mathcal{L}[p(c \mid x)] = I(C; X) - \beta\, I(C; Y)

where \mathcal{L}[p(c | x)] is the Lagrangian functional with respect to the partition p(c | x); I(C; X) and I(C; Y) are the mutual information between the corresponding variables; and β is a Lagrange multiplier that serves as the tradeoff parameter between compression and retained information. Intuitively, the equation can be explained as follows: the first term on the right-hand side reflects the degree of compression of X (the smaller the mutual information I(C; X), the denser the compression), and our goal is to minimize this quantity, i.e., to compress X as much as possible. The second term is the amount of information with respect to Y that is retained by replacing X with C; we want to maximize this quantity by minimizing its negation. The key idea of IB is that, as long as the information of X with respect to Y is preserved by C, we compress X as much as possible.
Double Clustering of GO Terms and Words. The IB principle also applies in the multivariate scenario, where one can compress multiple variables while retaining the information within the system (Elidan and Friedman, 2005; Friedman et al., 2001; Slonim et al., 2001b). In our case, we would like to group words into concepts and group semantically similar GO terms into GO clusters, a process of double clustering. Due to its explicit constraints, double clustering by IB should theoretically produce very desirable results: (1) the concepts identified in this manner will be more biologically relevant because they are constrained to retain information with respect to the GO terms; (2) the GO term clusters will maintain semantic information with respect to the extracted concepts, i.e., the merged GO terms are the ones with similar semantic meanings, thus reducing annotation discrepancy. With the reduced target GO set, the mapping from concepts to GO clusters will be more consistent and less likely to overfit the data; therefore the annotation of proteins will potentially be more accurate and consistent.
Methods. The multivariate IB framework uses Bayesian networks to specify how to compress and how to preserve information in the system (Elidan and Friedman, 2005; Friedman et al., 2001; Slonim et al., 2001b). For our double clustering task, we define the Bayesian networks as shown in Figure 8. In this figure, there are two directed acyclic graphs (DAGs), namely Gin and Gout, which specify how to compress and how to retain the information of the system, respectively. Gin specifies how we want our original data, word w and GO term g, to be compressed. As shown in the graph, the word w is compressed/partitioned into the word clusters cw, and the GO term g is compressed into the GO clusters cg. Gout specifies how information is to be retained by the compressed variables with respect to the original system.


[Figure 8: two DAGs, Gin and Gout, each over the nodes w, g, cw, and cg.]
Figure 8. The DAG representation of information flow in the double-clustering IB system. The shaded nodes represent observed variables.


In the Gout graph, we specify the word cluster cw as the parent of both the word w and the GO cluster cg, indicating that we want to preserve (maximize) the information among these variables; furthermore, we also specify a directed edge between the GO cluster cg and the observed GO term g, indicating that we retain this information as well. Thus, the graph Gout clearly spells out the fundamental assumption underlying this proposal: the semantic concepts (cw) determine the observed words (w) and the general GO category (cg) of a document, and the latter in turn determines the actual observed GO annotation (g)! The multivariate IB method seeks to compress the information contained in Gin while maximizing the information contained in Gout. Formally, the objective is to minimize the Lagrangian functional:

    \mathcal{L} = I^{G_{in}} - \beta\, I^{G_{out}}, \qquad I^{G} = \sum_{i} I(X_i; Pa^{G}(X_i))

In this equation, I^G denotes the total mutual information among the variables of the graph G, and I(Xi; Pa(Xi)) is the mutual information of a variable Xi with its parents, Pa(Xi). The optimization can be achieved by bottom-up (agglomerative) approaches, by top-down (soft/stochastic partition) approaches, and by a sequential multivariate IB algorithm, which is a variant of the agglomerative approach (Friedman et al., 2001; Slonim, 2003; Slonim et al., 2001a). We are well versed in Bayesian networks and information theory, and we do not anticipate significant difficulties in implementing the algorithms.
Experiments and evaluations. We will perform the double clustering experiment on the GOA corpus to cluster words and GO terms. Data preprocessing will be the same as for the LDA experiments. We will first perform agglomerative IB double clustering because its results are relatively easy to evaluate, especially for the clustering of GO terms. The algorithm will return a hierarchical tree of clustered GO terms and a collection of word-usage distributions corresponding to the topics of the LDA model. The returned results will be evaluated with the following approaches: (1) manual evaluation of the biological relevance of the topics and assignment of representative GO terms (see Section D.1.5 for details); (2) identification of the representative GO term among the clustered GO terms using the GO Categorizer algorithm.
Potential pitfalls and alternative approaches. There are several alternative inference algorithms for the multivariate IB methods, with complementary characteristics. Like most greedy hill-climbing algorithms, the IB methods can become trapped at local maxima, especially the sequential multivariate IB algorithm (Slonim, 2003); this will be addressed by multiple random initializations of the parameters.
D.1.5 Manual Evaluation of Identified Semantic Concepts
Rationale. The overall hypothesis of Specific Aim 1 is that the proposed methodologies enable us to identify more biologically coherent and specific semantic concepts. Testing this hypothesis requires rigorous evaluation of the semantic concepts extracted by these methods, followed by statistical comparison.

Methods. We will form an evaluation team consisting of the PI, Dr. Ashley Cowart, and Dr. David McLean to evaluate the topics. Drs. Cowart and McLean are sophisticated GO users in terms of applying GO annotation in various biological research settings, e.g., microarray data analysis. We will inspect the topics returned by the various algorithms, assess their biological relevance and assign a score, and finally assign a suitable GO term to annotate each concept, either by choosing from the candidate GO terms returned by the algorithms or by assigning a GO term from the GO graph. To ensure the objectivity of the evaluation, we will perform blind evaluation. This can be achieved by implementing a simple web-based application that randomly chooses a topic from the collection of topics returned by the various algorithms. The top 50 words and the candidate GO terms determined by either MI association or the IB methods will be presented to the reviewers. The web application will keep track of the assigned biological relevance scores and the representative GO terms selected by the annotators. Each topic will be reviewed by at least 2 annotators; the scores will be averaged, and disagreements in GO assignment will be resolved among the annotators.

Evaluation. With the scores assigned, we will study the distributions of the topic scores produced by the different algorithms. To compare the qualities of the topics produced by different algorithms, we will statistically compare the score distributions with the Kolmogorov-Smirnov test (DeGroot and Schervish, 2002) to determine whether the distributions produced by different algorithms differ significantly.
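As a concrete illustration of the planned comparison, the following minimal sketch applies the two-sample Kolmogorov-Smirnov test to two hypothetical sets of reviewer scores:

```python
# Minimal KS-test sketch; the score lists are hypothetical reviewer scores.
from scipy.stats import ks_2samp

scores_flat_lda = [3, 4, 2, 5, 3, 4]
scores_mixture = [4, 5, 4, 5, 3, 5]
stat, pval = ks_2samp(scores_flat_lda, scores_mixture)
print(stat, pval)   # small p-value = significantly different distributions
```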

D.1.6 Re-annotating the GOA Corpus with Descriptive GO Terms and Baseline Classification
Rationale. Manual inspection and the GO Categorizer will avail us of a collection of representative GO terms associated with biological concepts. This allows us to re-annotate the GOA training data set with these descriptive GO terms.


The key advantage of this approach is that it addresses the data sparseness and potential annotation inconsistency problems of the current GOA corpus. We will test the hypothesis that such re-annotation improves the accuracy and consistency of automatic annotation, whether by conventional text categorization or by the automatic annotation methods proposed in the next section.
Methods. In this experiment, we will replace the original GO terms in the GOA corpus with their corresponding descriptive GO terms. Then, the data set with the new GO annotations will be used as a training set to train binary classifiers to classify/annotate documents. Note that, since the number of biological topics is in the range of a few hundred to a thousand, this classification task will be much more complex than the ones in TREC 2005 and much closer to real-world annotation needs. We will use both the naïve Bayes (Nigam, 1998) and SVM text classification algorithms to establish a baseline classification performance. We will train two sets of binary classifiers, one trained with the original GO labels and another with the re-annotated GO terms as class labels. Then, the trained classifiers will be applied to the same left-out test set of documents.
Evaluations. We will track recall, precision, F-value, the overall classification error rate, and ROC curve analysis results for each set of classifiers. Since most of these metrics can be treated as a parameter of a binomial distribution (e.g., recall is the rate of correctly assigned annotations out of all original annotations), the significance of the difference between the metrics from the two experiments can be determined using the standard statistical comparison of two binomial distribution parameters (DeGroot and Schervish, 2002); a sketch of this test is given at the end of this section. Furthermore, the performance of these classifier sets can be used as a baseline against which to compare the performance of the algorithms to be developed in the next section.
Potential pitfalls and alternative approaches. The obvious limitation of this experiment is its computational cost. If the original GO annotation contains a total of G distinct terms, one needs to train G classifiers and apply all of them to each test case to perform annotation, which apparently does not scale well. For the purpose of evaluation, we will only train classifiers for a subset of the observed GO terms in the GOA corpus.
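As an illustration of the planned comparison of two binomial parameters, a minimal sketch of the pooled two-proportion z-test follows; the counts of correctly assigned annotations are hypothetical:

```python
# Minimal two-proportion z-test sketch; x/n counts are toy assumptions
# (e.g., correctly assigned annotations under original vs. re-annotated labels).
import math

def two_proportion_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se                       # compare to N(0, 1)

print(two_proportion_z(412, 1000, 455, 1000))
```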

D.2 SPECIFIC AIM 2: Develop Algorithms for Automatic Annotation
In this section, we will study several approaches for automatic annotation, i.e., predicting GO terms based on free text. Note that these methods potentially have broader impact beyond GO annotation, e.g., predicting MeSH terms. A characteristic of our approaches is that they are probabilistic approaches with sound statistical foundations, which allows them to deal with the uncertainty in natural language text more robustly than rule-based methods.
D.2.1 Query expansion and mixture language models for literature retrieval
Rationale. The first technical component of the proposed annotation process is to retrieve relevant literature abstracts or articles about a given protein. Extensive research in information retrieval has shown that retrieval accuracy is highly affected by two factors [Z7]: (1) which terms are included in the query and (2) how weights are assigned to these terms. Many TREC Genomics Track participants have also reported the significant influence of term weighting on retrieval performance [Z8, Z9, Z10]. Recent studies in information retrieval have shown that statistical language models have significant advantages over traditional retrieval approaches in terms of optimizing term weighting and other parameters [Z11, Z5, Z6]. In this experiment we will develop new statistical language modeling techniques to improve the accuracy of retrieving biomedical literature abstracts/articles about a protein, by expanding a query with additional useful terms and by using probabilistic mixture models to assign optimal weights to all the query terms. Given a short query with a protein name, if we match the query directly against documents, we may miss many relevant documents that mention the protein but use a different synonym. In order to increase recall, which is critical for improving annotation accuracy, we propose to expand the query in two ways. (1) Synonym Expansion: we will expand the query with synonyms from resources such as GeneRef. (2) Related Context Word Expansion: we will also expand a query with related words mined from the top-ranked retrieved articles based on co-occurrences. Although both kinds of expansion have been explored in existing work, notably in the context of the TREC Genomics Track [Z8, Z9, Z10], existing work has not adequately addressed the important issue of term weighting, which is especially important when we add extra terms to the query; indeed, without appropriate weighting, adding terms such as gene synonyms does not always help improve retrieval accuracy [Z4]. To illustrate the weighting problem, consider a query such as "activating transcription factor 2".


To illustrate the weighting problem, consider a query such as “activating transcription factor 2”. Suppose we expand it with the symbol “ATF2”. If we pool the new term “ATF2” together with the original query words, as done in most existing approaches, we would treat “ATF2” the same as any other query word, such as “factor”. Intuitively, however, we would like to give “ATF2” more weight than “factor”, because matching “ATF2” is equivalent to matching the whole phrase “activating transcription factor 2”. In general, since the original query words are more important than the expanded words, and not all the introduced new terms are equally reliable, assigning appropriate weights to the terms in the expanded query is extremely important and affects accuracy significantly. We propose mixture language models to learn optimal weights from top-ranked documents. A major innovation of our method is a principled way of combining the original terms, expanded gene synonyms, and related context words mined from the literature, and optimizing their weights through statistical estimation. As shown in the preliminary results (see C7), such an approach can be expected to effectively improve retrieval accuracy.

Basic Retrieval Method. Our basic retrieval method is the Kullback-Leibler divergence retrieval method developed by co-investigator Dr. Zhai as part of his dissertation research (Zhai, 2002). It has been shown to be quite effective in several different tasks (Zhai and Lafferty, 2001a; Zhai and Lafferty, 2001b; Zhai and Lafferty, 2002; Zhai and Lafferty, 2004; Zhai et al., 2003) and allows us to combine the two kinds of expansion naturally and to model the terms in an expanded query accurately, as discussed later. The basic idea of this approach is to estimate two language models, θ_D and θ_Q, for document D and query Q, respectively, and then score D with respect to Q using the KL-divergence of the two models (Zhai and Lafferty, 2001a). We will use the standard Dirichlet prior smoothing method (Zhai and Lafferty, 2001b) to estimate θ_D, but will propose new methods for estimating θ_Q to achieve the two kinds of query expansion: synonym expansion and related context word expansion. Existing studies of this family of models have shown that the accuracy of θ_Q affects retrieval accuracy significantly [Z5, Z6, Z12]. Our own previous work on the TREC Genomics retrieval tasks has also confirmed the importance of estimating an accurate θ_Q that models what information to retrieve (Zhai et al. 2003) [Z17].
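As a concrete illustration of this scoring scheme, the following is a minimal sketch of KL-divergence ranking with Dirichlet prior smoothing. It is our own simplified rendition, not the Lemur implementation; coll_prob stands for an assumed collection language model p(w|C):

    import math
    from collections import Counter

    def dirichlet_doc_model(doc_tokens, coll_prob, mu=2000.0):
        """p(w | theta_D) with Dirichlet prior smoothing:
        (c(w, D) + mu * p(w | C)) / (|D| + mu)."""
        counts, dlen = Counter(doc_tokens), len(doc_tokens)
        return lambda w: (counts[w] + mu * coll_prob(w)) / (dlen + mu)

    def kl_score(query_model, doc_tokens, coll_prob, mu=2000.0):
        """Rank-equivalent to -KL(theta_Q || theta_D):
        sum over w of p(w | theta_Q) * log p(w | theta_D)."""
        p_d = dirichlet_doc_model(doc_tokens, coll_prob, mu)
        return sum(p_q * math.log(p_d(w)) for w, p_q in query_model.items())

    # query_model maps terms to p(w | theta_Q); a uniform distribution over
    # the protein-name words is the unexpanded baseline, and the mixture-model
    # estimate developed below supplies the expanded, re-weighted version.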

Synonym Expansion and Related Context Word Expansion. Since the mention of a gene/protein in the literature often has many variations, our first idea is to leverage external resources such as GeneRef to expand a query with possible synonyms. As explained earlier, a major challenge in expanding a query in this way is to assign appropriate weights to the query terms so that all the terms contribute to the retrieval score in an optimal way. To solve this problem, we allow each query word to have a distinct weight and use the top-ranked documents to help assign appropriate weights. Intuitively, the top retrieved documents are reasonable approximations of relevant documents, so we can use the word distribution in such pseudo-relevant documents to estimate weights for the query terms. Specifically, if a query term is indeed important, we would expect it to have a relatively high frequency in these feedback documents, while an unimportant term should have a relatively low frequency. In addition to the synonyms of gene names, many related words that tend to occur in the relevant context of the query can also be very useful for retrieving additional relevant documents that may not match any gene name. We thus propose to further expand the query with related context words by exploiting co-occurrences in the top-ranked documents: we mine terms that co-occur with query terms in these documents and incorporate them into the query. These related terms can be expected to help retrieve more documents that are relevant to the query. Our overall method is therefore to expand a query with synonyms and with related context words that co-occur with the query words, and to assign optimal weights to all the terms by fitting a mixture model (described in detail below) to the top-ranked documents. As shown in the preliminary results, the mixture model approach is effective in introducing useful new terms and assigning optimal weights to terms.
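A schematic sketch of the two expansion steps follows; the synonym table and the simple document-frequency co-occurrence statistic are our own illustrative assumptions, and the crucial weighting of the collected terms is deferred to the mixture model described next:

    from collections import Counter

    def expand_query(query_terms, synonyms, top_docs, n_context=20):
        """Collect candidate expansion terms: (1) synonyms of the query terms
        from an external resource, and (2) words that co-occur with the query
        in the top-ranked (pseudo-relevant) documents."""
        candidates = set(query_terms)
        for t in query_terms:                       # synonym expansion
            candidates.update(synonyms.get(t, []))
        df = Counter()                              # co-occurrence mining
        for doc in top_docs:                        # doc = list of tokens
            if any(t in doc for t in query_terms):
                df.update(set(doc) - candidates)
        context_words = [w for w, _ in df.most_common(n_context)]
        return list(candidates) + context_words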

Mixture Language Models. To achieve optimal weighting of all the query terms and to incorporate related context words simultaneously, we assume that the top-ranked documents are generated by a mixture model with two components: θ_Q (for relevant content) and θ_B (for non-relevant background content). Since most documents in the collection are non-relevant, we use the whole collection to estimate θ_B and fix it. Suppose F = {D_1, …, D_n} are the top n documents from an initial retrieval (e.g., using the original unexpanded query). The log-likelihood of F given our mixture model is

log p(F | θ_Q) = Σ_{i=1}^{n} Σ_{w ∈ V} c(w, D_i) log[ (1 − λ) p(w | θ_Q) + λ p(w | θ_B) ]

where c(w, D_i) is the count of word w in document D_i, and λ is the mixture weight that indicates the amount of noise we assume. In our previous work, such a mixture model has been used to update a simple query language model to improve retrieval accuracy (Zhai & Lafferty 01c). Here we will impose appropriate priors on θ_Q and use the maximum a posteriori (MAP) estimator [Z16] to obtain an estimate of θ_Q that indicates the optimal weights of terms (i.e., p(w|θ_Q) indicates the weight on word w). This allows us to optimize the weighting of terms introduced through synonym expansion and related context word expansion simultaneously. Specifically, we can use two different priors to perform the two kinds of query expansion. First, with a prior that requires θ_Q to put all the probability mass on the query words (thus assigning zero probability to all non-query words), our MAP estimate of θ_Q would give a higher weight to a query term if the term is common in F but rare in the whole collection (i.e., has small probability according to the background model θ_B), which is reasonable. Second, to further expand any given query model θ_Q' with related context words from F, we may use θ_Q' to define a conjugate prior (i.e., a Dirichlet prior) for θ_Q, parameterized by θ_Q' and a confidence parameter μ for the prior. This prior favors a candidate query language model that is close to θ_Q'. With this prior, the MAP estimate of θ_Q gives us an interpolated query language model that combines θ_Q' with an additional word distribution learned from the feedback documents F. The parameter μ controls the weight on θ_Q' (i.e., the synonym-expanded query) and can be set through cross-validation. Clearly, the new estimate of θ_Q effectively combines the original query words, the expanded synonyms, and additional related context words mined from the top-ranked documents, with the weights of all these terms determined through statistical estimation.
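The estimation just described can be sketched as the following EM loop, in the spirit of the model-based feedback method of (Zhai & Lafferty 01c). Initialization, stopping, and the exact form of the prior pseudo-counts are our assumptions rather than the final design; p_bg stands for the fixed background model θ_B and prior_q for the synonym-expanded model θ_Q':

    from collections import Counter

    def estimate_query_model(feedback_docs, p_bg, prior_q,
                             lam=0.5, mu=500.0, iters=30):
        """MAP/EM estimate of p(w | theta_Q) under the mixture
        (1 - lam) * p(w | theta_Q) + lam * p(w | theta_B), with a Dirichlet
        prior built from the expanded query model prior_q (total mass mu)."""
        counts = Counter()
        for doc in feedback_docs:                   # doc = list of tokens
            counts.update(doc)
        vocab = list(counts)
        p_q = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
        for _ in range(iters):
            # E-step: probability that an occurrence of w came from theta_Q
            t = {w: (1 - lam) * p_q[w] /
                    ((1 - lam) * p_q[w] + lam * p_bg(w)) for w in vocab}
            # M-step: expected counts plus prior pseudo-counts mu * prior_q(w)
            mass = {w: counts[w] * t[w] + mu * prior_q.get(w, 0.0)
                    for w in vocab}
            total = sum(mass.values())
            p_q = {w: m / total for w, m in mass.items()}
        return p_q    # usable directly as query_model in the KL scoring above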

Experiments and Evaluation. We will use the TREC 2003 and 2004 Genomics test collections to evaluate our query expansion methods, leveraging the available resources [Z13, Z14]. The TREC 2003 and 2004 Genomics Track tasks are essentially the same as our task of retrieving relevant literature information about a protein. We will index these data with the Lemur information retrieval toolkit (http://www.lemurproject.org/), on which we will implement our retrieval methods. We will compare baseline runs with the proposed methods using primarily two standard retrieval measures: (1) Mean Average Precision (MAP): this measure is computed as the average of the precision at each rank where a new relevant document is retrieved, and thus reflects the overall ranking accuracy well. (2) Precision at top N documents (e.g., N=10, N=100): this measure directly indicates how many relevant documents are among the top N retrieved results. The overall performance of a method will be computed by averaging the performance over all the topics tested.
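For clarity, the two measures can be computed as follows (a straightforward sketch; relevant denotes the set of judged-relevant document ids for one topic):

    def average_precision(ranked_ids, relevant):
        """Average of the precision values at each rank where a relevant
        document is retrieved; MAP is the mean of this over all topics."""
        hits, total = 0, 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    def precision_at(ranked_ids, relevant, n=10):
        """Fraction of the top n retrieved documents that are relevant."""
        return sum(1 for d in ranked_ids[:n] if d in relevant) / n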

We will test two major hypotheses: (1) Adding weighted synonyms to the query improves retrieval accuracy over the original query. This hypothesis can be tested by comparing the estimated expanded query language model θ_Q' with the baseline query model estimated from the protein name alone. We will further analyze the benefit of estimating query term weights by using an additional baseline expanded query language model with uniform weighting; this baseline represents the state-of-the-art approaches that assign uniform importance weights to all the terms. Based on our TREC 2003 experiments (Zhai et al. 2003), we expect the weight estimation based on top-ranked documents to be effective. (2) Adding weighted related context words to the query further improves retrieval accuracy over the synonym-expanded query. This hypothesis can be tested by comparing the query language model further estimated with θ_Q' as the prior against θ_Q' itself. As shown in the preliminary results (C7), such mixture models have proven effective in the TREC 2005 Genomics Track ad hoc retrieval task; we thus expect to see positive results in the proposed experiments as well. We will vary the number of top-ranked documents used in the experiments to test the robustness of our methods.

Potential pitfalls and alternative approaches. Based on our experience participating in the TREC 2003 and TREC 2005 Genomics Tracks, we expect that the two proposed query expansion strategies will both be effective. One potential problem of the proposed mixture model is that we need to set the parameter μ, which indicates our relative weight on the synonym-expanded query in comparison with the new terms learned from the top-ranked documents. Although we may set this parameter through cross-validation on some training queries, the setting may not be optimal for the current query.


Thus, to further improve performance, we will also study how to set this parameter based on the current query. For example, we may adopt a more robust estimation method that the Co-PI has recently developed for estimating similar mixture models for retrieval [Z12]; the main idea is to start with high confidence in the query terms and gradually reduce the confidence as we pick up more terms from the feedback documents. This method has been shown to have the potential to remove the confidence parameter completely [Z12].

D.2.2 Map the Biological Concepts to GO Terms with Multivariate IB

Rationale. In this experiment we will develop an algorithm to assign GO terms from the most descriptive GO set learned in the previous section to documents. The advantages of the IB method are discussed in Section D.1.4. Note that the IB method by itself is not an inference engine, so it cannot be used to infer the biological concepts from a new document. We will develop algorithms that combine the results from the IB method with statistical learning approaches to estimate the missing parameters and perform such inference.

Methods. Given a document, we will extract the concepts (word clusters) that exist within the document and assign matching GO terms from the descriptive GO set to the document. The information flow and probabilistic relationships between the variables are represented as a DAG in Figure 9. To predict the GO cluster membership c_g of a document conditioned on the observed words w, we can execute the following probabilistic query:

p(c_g | w) = Σ_{c_w} p(c_w | w) p(c_g | c_w)

Note that the first term on the right-hand side of the equation, p(c_w | w), is available from the IB method. The key parameter that we need, but that is missing from the results of the IB, is the probability p(c_g | c_w), the parameter associated with the arc between c_w and c_g in Figure 9. Fortunately, the information necessary for learning this parameter is available during the training of the IB method. Here, we propose using an EM-like algorithm to estimate the missing parameter, so that it can be used to automatically assign GO cluster memberships to documents according to the equation above. The basic idea is to treat the word cluster c_w and the GO cluster c_g as two latent variables, as shown in graph Gout in Figure 8. After training, the IB double clustering returns two optimal distributions, p(c_w | w) and p(c_g | g). With these two distributions, and conditioning on the observed words w and GO terms g in the training data, we can estimate the expected instantiations of the variables c_w and c_g; this corresponds to the E step of the EM algorithm. Then, we can estimate the parameter p(c_g | c_w) from these expected instantiations according to the maximum likelihood principle; this corresponds to the M step. Combining the estimated parameter with p(c_w | w) from the IB, we can answer the query in the equation above.
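The E and M steps just described can be sketched as follows, under our reading of the model: p_cw_given_w and p_cg_given_g are the soft cluster-assignment distributions returned by IB training, and pairs are the (word, GO term) co-occurrences observed in the annotated training documents; all names here are illustrative:

    from collections import defaultdict

    def estimate_cg_given_cw(pairs, p_cw_given_w, p_cg_given_g, n_cw, n_cg):
        """Estimate p(c_g | c_w) from (word, GO-term) training pairs.
        E-step: accumulate expected joint counts of (c_w, c_g) using the
        cluster-assignment distributions returned by IB training.
        M-step: normalize per word cluster (maximum likelihood)."""
        joint = defaultdict(float)
        for w, g in pairs:
            for cw in range(n_cw):
                for cg in range(n_cg):
                    joint[(cw, cg)] += p_cw_given_w[w][cw] * p_cg_given_g[g][cg]
        row = defaultdict(float)
        for (cw, cg), v in joint.items():
            row[cw] += v
        return {(cw, cg): v / row[cw] for (cw, cg), v in joint.items()}

    # Prediction for a new document then follows the probabilistic query above:
    # p(c_g | w) = sum over c_w of p(c_w | w) * p(c_g | c_w)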

Experiments. We will use the GOA corpus to train the IB method. After double clustering of the corpus, each document will be annotated with its GO cluster membership or the descriptive GO terms. We will divide the data into training and test sets and estimate the parameters from the training set. Then, we will apply the learned parameters to assign GO terms to the test set documents. We will evaluate recall, precision, F-measure, and ROC curves for this approach, and compare these performance metrics with those of the classifier sets described in Section D.1.6. The implementation of such an algorithm is straightforward given our previous experience with implementing similar models.

Potential pitfalls and alternative approaches. The estimation of the parameters of the network is based on maximum likelihood estimation (MLE), which can potentially overfit the data when the numbers of word clusters and GO clusters are large. We can apply either the smoothing techniques commonly used in language modeling (Jurafsky and Martin, 2000) or a Bayesian approach to address this issue.

D.2.3 Automatic Annotation with Correspondence LDA model


Figure 9. The information flow from the observed word w, to the word cluster c_w, to the GO cluster c_g, and potentially to the GO terms g.


Rationale. The LDA and mixture-of-LDA-analyzers models are unsupervised learning models that do not utilize the GO annotation information available in the GOA corpus. In this section, we propose to extend and apply a fully probabilistic generative model, the correspondence LDA (Corr-LDA) model, to extract the semantic topics and learn the relationship between topics and annotations simultaneously. When given a new text document, the Corr-LDA model will extract the semantic concepts from the text and directly predict the annotation probabilistically. Thus, the Corr-LDA model tackles the tasks of extracting semantic concepts and mapping the concepts to annotations within a unified probabilistic framework. The Corr-LDA model and the IB method complement each other. The similarities and differences of the IB and probabilistic generative models have been pointed out in the references (Elidan and Friedman, 2005; Slonim and Weiss, 2002): the two approaches have different optimization objectives, one striving to maximize the likelihood of the observed data from a generative point of view, the other seeking to preserve information from a discriminative point of view. Thus, we will have multiple complementary approaches to solve the difficult problem of automatic annotation.

Model Specification. The Corr-LDA model was introduced by Blei and Jordan (Blei and Jordan, 2003), who used it to model the contents of images and their associated annotations. In that study, they demonstrated that the Corr-LDA model was capable of automatically annotating pictures based on their content with excellent performance. Our task is analogous in that we are interested in modeling the contents of text and their association with GO annotations. The graphical representation of the Corr-LDA model is shown in Figure 10. In addition to the variables of the original LDA model, the Corr-LDA model explicitly represents the observed GO terms (g) associated with the document. The basic assumption about the “generation” of a document and its associated annotations is as follows: (1) sample a topic distribution variable θ; (2) sample a topic z for each word; (3) sample a word w from the distribution conditioned on z; (4) associate each annotation g stochastically with a word indexed by the variable y; and (5) sample an annotation conditioned on the topic of the word indexed by y. In this representation, the Corr-LDA model embodies our fundamental assumption: it is the semantic concepts that determine the choice of both words and annotations. The Corr-LDA model resolves the ambiguity introduced by the many-to-many relationship between the multiple semantic topics and the GO annotations of a document. It uses an indicator variable, y, to stochastically map each annotation to the word that is responsible for its “generation,” shown in Figure 10 as the directed arc between the Nth word and the variable y. Thus, the Corr-LDA model explicitly constrains a GO annotation to be associated with one of the semantic topics present in a document.

Statistical Inference and Automatic Annotation. The original Corr-LDA model was developed for modeling images and their annotations. We have derived a variational inference algorithm for the text-based Corr-LDA model; the detailed derivation is available in Appendix 3. Given the observed words of a document, we would like to address the query: what is the probability that a document should be annotated with the GO term g, conditioned on its words?
This can be answered by first estimating the latent topic variables based on the observed words, as in the LDA model, and then calculating

p(g | w) ≈ (1/N) Σ_{n=1}^{N} Σ_{z} q(z_n = z) p(g | z)

where q(z_n) is the variational distribution over the latent variable z_n. We can then annotate the document with g if this probability is greater than a predefined threshold.

Experiments. We will implement the Corr-LDA model in C/C++. The original and re-annotated GOA corpora, which contain all the necessary information, will be used to train the Corr-LDA model. The data will be divided into training and test sets. After training, the algorithm will return the learned semantic topics and the corresponding parameters for mapping the semantic concepts to GO terms. We will apply the learned model to the test set to perform automatic annotation, and we will evaluate the recall, precision, F-measure, and ROC curve analysis of the models trained with the different corpora.
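A sketch of the prediction step follows, assuming the variational inference returns q as an N x T matrix of per-word topic distributions and gamma as the T x G topic-to-GO-term distributions (our notation, not the exact data structures of the planned C/C++ implementation):

    import numpy as np

    def annotation_probs(q, gamma):
        """p(g | w) ~= (1/N) * sum_n sum_z q(z_n = z) * p(g | z): average the
        per-word variational topic posteriors (q: N x T) and push them through
        the topic-to-annotation distributions (gamma: T x G)."""
        return q.mean(axis=0) @ gamma               # vector of length G

    def annotate(q, gamma, go_ids, threshold=0.01):
        """Annotate with every GO term whose probability exceeds the threshold."""
        p = annotation_probs(q, gamma)
        return [go_ids[i] for i in np.flatnonzero(p > threshold)]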

Figure 10. Graphical representation of the Corr-LDA model.

Potential pitfalls and alternative approaches. When trained with the original GOA corpus, the Corr-LDA model may fall prey to sparse training data, because a generative model is unlikely to capture GO annotations that appear only a few times in the corpus. We expect the re-annotated GOA corpus to alleviate this problem, and we will compare the performance of models trained with the different corpora to evaluate the benefit of re-annotation. Another potential pitfall is that the model depicted in Figure 10 attempts to associate GO annotations with individual words, which may result in overfitting. An obvious remedy is to change the granularity of model fitting, for example using sentences or phrases instead of words as the units matched to the annotations. Another approach is to perform more rigorous model-based segmentation of a single text document or a concatenated collection of documents. The latter not only addresses this potential problem but can also be used to extract the supporting evidence for an annotation, as described in the next subsection.

D.2.4 Extraction of Supporting Evidence for Predicted Annotations through Document Segmentation

Rationale. In this experiment we will develop new algorithms to extract relevant sentences from the literature to provide supporting evidence for predicted GO annotations. Since the results of future automatic predictions must be validated by human experts, it is important to identify the literature context that supports any predicted annotation. Specifically, we will develop algorithms combining the LDA model and a hidden Markov model (HMM) to capture the sequential information in the text. The rationale is two-fold: (1) relaxing the “bag of words” assumption of LDA to enhance the model fitting; and (2) extracting evidence sentences or passages for predicted GO annotations. Modeling the sequential information of a text captures the fact that, when writing an article, an author does not pile up words randomly (the underlying assumption of LDA) but organizes them into sentences or passages as units to convey concepts/topics. An HMM-based model can capture such sequential transitions of topics in the text. As a result, we can segment the text into sentences or passages related to given topics, which can serve as the evidence for predicted GO annotations.

HMM and text segmentation. An HMM is a probabilistic finite-state automaton that can serve as a generative (probabilistic) model for a sequence of symbols, in our case a sequence of words. It can be regarded as a stochastic “machine” with a finite number of states, which in our case correspond to topics, and we will use the two terms interchangeably. At any time, the machine is in precisely one topic and generates a word from that topic according to some output probability distribution. At the next time point, the machine stochastically moves to another state according to a state-transition probability distribution; the transition may go from the current state back to itself, in which case the machine stays in the same state as time proceeds. Thus, when the machine is in action, it stochastically generates a word at each time point, and mathematically the machine defines a probabilistic model for a sequence of words. Although one could train an HMM with a state/topic transition matrix containing all possible topics, this approach inevitably leads to overfitting, because the number of parameters to be estimated is T^2, where T is the total number of topics, which in our case is on the order of hundreds to thousands. Instead, we will train a document-specific HMM combined with the LDA model for each document, so that we only consider a transition matrix over the major topics of the document as inferred by the LDA model. More specifically, we construct an HMM with K+1 states: K states corresponding to the K strongest supporting topics extracted, which we denote as t_1, …, t_K, and another state corresponding to a background topic t_B (see Figure 11). The output (emission) distribution of word usage conditioned on a topic t_i is available from our LDA model; thus, if a topic is about apoptosis, its distribution would likely give high probabilities to words such as “apoptosis” and “caspase”. Another innovation of our method is the introduction of a background topic, whose word-usage distribution is estimated from the word counts in the whole collection.
When such an HMM is in action, we thus have a machine that can generate a mixture of words drawn from each of the topic distributions plus the background distribution. Our idea is to use such a machine to model the documents used to predict the GO annotation g. Intuitively, such a document would cover some of the supporting topics plus some non-relevant information, which we model using the background distribution.


Figure 11. A diagram of the state-transition automaton of a document-specific HMM, with topic states t_1, t_2, …, t_K and a background state t_B.


Although an HMM can also be regarded as a mixture model, it has the advantage over a simple mixture model of modeling the dependency between adjacent words; it is thus more appropriate for our purpose of extracting supporting fragments, because the words in each fragment are contiguous.

Methods. The key novelty of our approach is combining the two methods to overcome the shortcomings of both. We will train the LDA model as described in Section D.1. The topic-word distributions of LDA can be directly plugged into the HMM as the emission distribution matrices. When segmenting a new document, the main topics of the document are first inferred with the LDA model. Then, an HMM initialized with a random topic-transition matrix is applied to the document. Inference in this document-specific HMM can be carried out using the standard Baum-Welch algorithm (Rabiner, 1989), in which the latent topics of the words and the topic-transition matrix are iteratively updated. The most likely state path is then extracted from the trained HMM, and each contiguous group of words belonging to the same supporting topic is identified as a supporting fragment. The background topic naturally accommodates (or masks out) any segment not relevant to the major topics of the document. To ensure that supporting fragments are relatively long, we may impose a prior on the self-transition probabilities so that they are generally high, effectively forcing the HMM to stay in the same topic unless the words no longer match the topic well. In our previous work, a similar HMM combined with probabilistic latent semantic indexing (Hofmann, 1999b) was shown to effectively identify variable-length relevant passages for a query [Z15]. We believe that combination with the fully generative LDA or Corr-LDA will further enhance the method; in particular, combining Corr-LDA and the HMM will enable us not only to predict GO terms but also to return structured supporting evidence.
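A sketch of the decoding step is given below, assuming the emission matrix stacks the K strongest LDA topic distributions with the background distribution as the last row, and a simple "sticky" self-transition prior; the Baum-Welch re-estimation of the transition matrix is omitted for brevity:

    import numpy as np

    def viterbi_segments(word_ids, emissions, self_prob=0.95):
        """Most likely state path of the document-specific HMM.
        word_ids : the document as a sequence of vocabulary indices.
        emissions: (K+1) x V matrix; rows 0..K-1 are LDA topic distributions,
                   row K is the background distribution.
        A high self-transition probability encourages long, coherent fragments."""
        S = emissions.shape[0]
        trans = np.full((S, S), (1 - self_prob) / (S - 1))
        np.fill_diagonal(trans, self_prob)
        log_t = np.log(trans)
        log_e = np.log(emissions[:, word_ids] + 1e-12)      # S x N
        N = len(word_ids)
        delta = np.zeros((N, S))
        back = np.zeros((N, S), dtype=int)
        delta[0] = -np.log(S) + log_e[:, 0]                 # uniform start
        for n in range(1, N):
            scores = delta[n - 1][:, None] + log_t          # (from, to)
            back[n] = scores.argmax(axis=0)
            delta[n] = scores.max(axis=0) + log_e[:, n]
        path = [int(delta[-1].argmax())]
        for n in range(N - 1, 0, -1):
            path.append(int(back[n, path[-1]]))
        path.reverse()
        # contiguous runs of the same non-background state are the candidate
        # supporting fragments; the background state masks irrelevant spans
        return path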

Experiments and Evaluation. We will implement our HMM on top of the Lemur information retrieval toolkit (http://www.lemurproject.org/) and run all experiments using this toolkit. We will test the proposed HMM in two ways, corresponding to two distinct hypotheses: (1) the HMM improves accuracy in extracting relevant passages from full-text literature articles over the fixed-size window baseline; (2) the HMM improves accuracy in extracting supporting passages for a predicted GO term annotation over the fixed-size window baseline. To test the first hypothesis, we will exploit the TREC 2006 Genomics Track data, which comprises full-text articles, to evaluate the accuracy of our HMM in extracting passages relevant to biomedical queries. The TREC 2006 Genomics Track task is precisely to retrieve passages relevant to a query, so we can directly leverage the resources that the TREC organizers provide. The purpose of this experiment is to test the ability of the proposed HMM to identify relevant text fragments without distinguishing specific subtopics. We will compare the proposed HMM with the fixed-size sliding-window baseline approaches described in C8, using the passage retrieval measures defined by, and the relevance judgments created by, the TREC 2006 Genomics Track organizers. To test the second hypothesis, we will have human assessors judge the extracted supporting fragments, computing two measures: precision and recall. Specifically, we will identify a sample of 100 predictions and have human assessors read the top 100 most relevant supporting articles and manually identify the best supporting paragraphs/fragments. We will then use these judgments as a gold standard to compute the precision and recall of the fragments extracted by the proposed HMM and compare its performance with that of the baseline.
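The fragment-level scoring against the human judgments can be computed as follows (a simple sketch, assuming gold and extracted fragments are represented as sets of (doc_id, unit_index) pairs; this representation is our illustrative choice):

    def fragment_precision_recall(extracted, gold):
        """Precision and recall of extracted supporting fragments against the
        human judgments; both arguments are sets of (doc_id, unit_index) pairs."""
        overlap = len(extracted & gold)
        precision = overlap / len(extracted) if extracted else 0.0
        recall = overlap / len(gold) if gold else 0.0
        return precision, recall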

Potential pitfalls and alternative approaches. Based on the promising results of a similar HMM in our previous work [Z2, Z15], we expect the proposed HMM to perform well for extracting supporting passages. One potential problem with the HMM estimation is that the likelihood function may have multiple local maxima. Fixing the output distributions (based on LDA topics) should already provide significant regularization of the model to alleviate this problem, and in our experiments we will perform multiple runs to find a good local maximum. Should the local-maxima problem still affect the performance of our HMM, we will further regularize the model by imposing a prior on the transition probabilities, essentially “forcing” the HMM to stay in a relevant passage sufficiently long to extract a relatively long coherent passage rather than many short fragments, which could happen if the estimation procedure is trapped in a poor local maximum. The strength parameter of this prior could be set using cross-validation.

D.4.3 Timeline


In Table 3, we list our estimated timeline. In year 1, we will first collect training data and other resources that can potentially be exploited to improve annotation accuracy; we will then focus mainly on the fine-tuning and extension of the LDA models, and on the system side we will study query formulation and retrieval algorithms. In year 2, we will continue the study of the LDA models and start exploring the information bottleneck approach; on the system side, we will continue developing the search engine and build an initial framework for the ProtAnn system. In year 3, we will continue exploring the information bottleneck approach, start putting the LDA models into the ProtAnn system, study how to combine our methods with existing methods, and focus on evaluating all the models and the system; on the system side, we will integrate all the components and test the system by deploying it on the Web. Component evaluation will be done in each year.

Table 3. Estimated Timeline

Tasks                                   Year 1       Year 2       Year 3
Data collection & utility tools         xxxxxxxxxx
Aim 1: Biological concept extraction
  Fine-tuning of LDA                    xxxxxxx      xxxxxx
  Info. bottleneck model fitting                     xxxxxxxxxx   xxxxx
  Mixture of LDA analyzers                           xxxxxxxxxx   xxxxxx
  Evaluation                            xxx          xxxx         xxxxxx
Aim 2: Concept-to-GO mapping
  Info. bottleneck mapping                           xxxx         xxxxxxxxxxx
  Correspondence LDA                    xxxxx        xxxxxxxxxx
  Information retrieval development     xxxxxx       xxxxxxxx     xxxxx
  Supporting evidence extraction                     xxxxxxxxx    xxxxxxx
  Evaluation                            xxx          xxxxx        xxxxxx


E. HUMAN SUBJECTS
Not applicable.

F. VERTEBRATE ANIMALS
Not applicable.

G. LITERATURE CITED


I. RESOURCE SHARING

Sharing the results of this project is the primary goal of this study. The activities of this study not only test the hypotheses specified in Specific Aims 1 and 2 but also include the implementation of several text-mining algorithms and an integrated web application system. We wish to make the results of our research available to other researchers in the following formats:

1. National/International Conference or Journal Publications. We will share our research results at national and international conferences and in published journal papers. We anticipate that the project will produce two to five journal or conference papers per year. We will post our publications to the PubMed Central portal within a year of their publication to disseminate the knowledge to a broader audience.

2. Software Components and Source Code. In addition to the web-based system, we will make the implementations of the individual algorithms available as stand-alone applications. We will also make the source code available so that it can be used by other investigators.

3. Technical Reports. We will make the detailed mathematical derivations of the algorithms developed in this project publicly available as technical reports through our departmental and personal websites. These will be of particular educational value, because many published algorithms omit the details of their derivations due to page limits, while students and new statistical-learning practitioners often need this detailed information in order to learn the material.
