1
The BioText Project
Myers Seminar
Sept 22, 2003
Marti Hearst
Associate Professor
SIMS, UC Berkeley
Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech
2
BioText Project Goals
• Provide fast, flexible, intelligent access to information for use in biosciences applications.
• Focus on
  – Textual Information
  – Tightly integrated with other resources
    • Ontologies
    • Record-based databases
3
People
• Project Leaders
  – PI: Marti Hearst
  – Co-PI: Adam Arkin
• Computational Linguistics
  – Barbara Rosario
  – Preslav Nakov
• Database Research
  – Ariel Schwartz
  – Gaurav Bhalotia (graduated)
• User Interface / Information Retrieval
  – Kevin Li
  – Emilia Stoica
• Bioscience
  – Dr. TingTing Zhang
4
Outline
• Main Goals
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals
5
BioText: Main Goals
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
6
Recent Result (Schwartz & Hearst 03)
• Fast, simple algorithm for recognizing abbreviation definitions.
  – Simpler and faster than the rest
  – Higher precision and recall
  – Idea: Work backwards from the end
• Examples:
  – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  – Gcn5-related N-acetyltransferase (GNAT)
• Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
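The "work backwards from the end" idea can be sketched roughly as follows. This is a simplified illustration, not the published implementation: the function names `find_best_long_form` and `extract_pairs` and the window heuristic (at most min(|A| + 5, 2·|A|) words before the parenthesis, where |A| is the abbreviation length) are assumptions made for this sketch, and the filtering checks a full system would need are omitted.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` that contains every
    alphanumeric character of `short_form` in order, with the first
    character at the start of a word, or None if no match exists.
    """
    l = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Find "long form (SHORT)" patterns in one sentence."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        if not short:
            continue
        words = sentence[: m.start()].split()
        # Assumed window: only a few words before the parenthesis are searched.
        window = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Gcn5-related N-acetyltransferase (GNAT)"))
```

Scanning from the right keeps the match as short as possible, which is why "HSF" in the first example picks up "Heat Shock Transcription Factor" rather than the earlier "Heat Shock Response".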
7
BioText: A Two-Sided Approach
[Diagram. Resources shown: SwissProt, BLAST, MeSH, GO, WordNet, Medline, Journal Full Text]
Sophisticated Database Design & Algorithms
Empirical Computational Linguistics Algorithms
8
Apoptosis Network
[Diagram. Nodes shown: Death Receptors Signaling; Survival Factors Signaling; Ca++ Signaling; P53 pathway; ER Stress; Genotoxic Stress; Loss of Attachment, Cell Cycle stress, etc.; Initiator Caspases (8, 10); Caspase 12; Caspase 9; Effector Caspases (3, 6, 7); Apaf 1; IAPs; NFkB; Mitochondria; Cytochrome c; Smac; AIF; Bax, Bak; Bcl-2 like; BH3 only; Apoptosis]
Slide courtesy TingTing Zhang
9
The issues (courtesy TingTing Zhang):
• The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published.
• The supporting experimental data are gathered in different organs, tissues, cells using various techniques.
• There are various levels of uncertainty associated with different techniques used to answer certain questions.
• Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts.
• We need to keep track of ALL the information in order to understand the system better.
10
Simple cases:
• Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20).
• Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).
• Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts).
• BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).
• BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).
11
Computational Language Goals
• Recognizing and annotating entities within textual documents
• Identifying semantic relations among entities
• To (eventually) be used in tandem with semi-automated reasoning systems.
12
Main Ideas for NLP Approach
• Assign Semantics using
  – Statistics
  – Hierarchical Lexical Ontologies to generalize
  – Redundancy in the data
• Build up Layers of Representation
  – Syntactic and Semantic
  – Use these in a feedback loop
14
Recent Result: Descent of Hierarchy
• Idea:
  – Use the top levels of a lexical hierarchy to identify semantic relations
• Hypothesis:
  – A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.
15
Definition
• NC: Any sequence of nouns that itself functions as a noun
  – asthma hospitalizations
  – health care personnel hand wash
• Technical text is rich with NCs:
  – Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
16
NCs: Three tasks
• Identification
• Syntactic analysis (attachments)
  – [Baseline [headache frequency]]
  – [[Tension headache] patient]
• Our Goal: Semantic analysis
  – Headache treatment → treatment for headache
  – Corticosteroid treatment → treatment that uses corticosteroid
17
Main Idea:
• Top-level MeSH categories can be used to indicate which relations hold between the nouns in a noun compound
• headache recurrence
  – C23.888.592.612.441 C23.550.291.937
• headache pain
  – C23.888.592.612.441 G11.561.796.444
• breast cancer cells
  – A01.236 C04 A11
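As a rough illustration of how the tree numbers above reduce to top-level categories, the sketch below truncates each MeSH tree number to a chosen depth and pairs the results. The helper names `truncate` and `category_pair` are hypothetical, introduced only for this example; the tree numbers are the ones listed on the slide.

```python
def truncate(tree_number, depth=1):
    """Keep only the first `depth` dot-separated levels of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


def category_pair(modifier_code, head_code, depth=1):
    """Map a two-word noun compound to its pair of MeSH categories."""
    return truncate(modifier_code, depth), truncate(head_code, depth)


# headache recurrence: both nouns fall under C23
print(category_pair("C23.888.592.612.441", "C23.550.291.937"))
# headache pain: C23 paired with G11
print(category_pair("C23.888.592.612.441", "G11.561.796.444"))
# Descending one level on both sides gives a finer-grained pair
print(category_pair("C23.888.592.612.441", "G11.561.796.444", depth=2))
```

Under the hypothesis stated earlier, all two-word compounds that map to the same pair would be assigned the same semantic relation.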
18
Linguistic Motivation
• Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure.
  – (used-in): kitchen knife
  – (made-of): steel knife
  – (instrument-for): carving knife
  – (used-on): putty knife
  – (used-by): butcher’s knife
20
How Far to Descend?
• Anatomy: 250 CPs (category pairs)
  – 187 (75%) remain first level
  – 56 (22%) descend one level
  – 7 (3%) descend two levels
• Natural Science (H01): 21 CPs
  – 1 (4%) remain first level
  – 8 (39%) descend one level
  – 12 (57%) descend two levels
• Neoplasm (C04): 3 CPs
  – 3 (100%) descend one level
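One way to picture rules that sit at different depths is a lookup table keyed by category pairs, where some keys are top-level and others are one or two levels down. The sketch below is only an assumption about how such a table might be queried: the `RULES` entries and the relation labels are placeholders invented for illustration, not results from the project.

```python
from itertools import product

# Hypothetical rule table: category pair -> relation label (placeholders only).
RULES = {
    ("C23", "C23"): "relation-A",      # rule that stays at the first level
    ("C23", "G11"): "relation-B",      # rule that stays at the first level
    ("A01.236", "C04"): "relation-C",  # rule that descends one level on the left
}


def prefixes(tree_number, max_depth=3):
    """All truncations of a tree number, from the top level down to max_depth."""
    parts = tree_number.split(".")
    return [".".join(parts[:d]) for d in range(1, min(max_depth, len(parts)) + 1)]


def lookup_relation(modifier_code, head_code):
    """Return the first rule whose category pair is a prefix of the two codes."""
    for pair in product(prefixes(modifier_code), prefixes(head_code)):
        if pair in RULES:
            return RULES[pair]
    return None


print(lookup_relation("C23.888.592.612.441", "G11.561.796.444"))
print(lookup_relation("A01.236", "C04"))
```

A table like this only needs deeper keys where the top-level pair mixes several relations, which matches the counts above: most Anatomy pairs stay at the first level, while most Natural Science pairs need to descend.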
21
Evaluation
• Apply the rules to a test set
• Accuracy:
  – Anatomy: 91% accurate
  – Natural Science: 79%
  – Diseases: 100%
• Total:
  – 89.6% via intra-category averaging
  – 90.8% via extra-category averaging
22
Summary of NC Work
• Lexical hierarchy useful for inferring semantic relations
• Works because semantics are constrained and word sense ambiguity is not too much of a problem
• Can it be extended to other types of relations?
  – Preliminary results on one set of relations are promising.
23
Database Research Issues
• Efficiently and effectively combining
  – Relational databases & Text
  – Hierarchical Ontologies
  – Layers of Annotations
24
Interface Issues
• Create intuitive, appealing interfaces that are better than what’s currently out there.
• Start with existing assigned metadata
• As text analysis improves, incorporate the results into the interface.
32
Some Recent Work
• Organizing BioScience Journal Names
  – Currently there are > 3500
• Idea:
  – Group them into faceted hierarchies semi-automatically
  – Using clustering of title terms, synonym similarity via WordNet, and other techniques
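The title-term part of this idea can be sketched with a simple inverted index from terms to journal names. This is a minimal illustration under stated assumptions: the stopword list, the function names, and the four sample journal names are chosen for the example, and the WordNet synonym step and any real clustering are left out.

```python
from collections import defaultdict

# Assumed stopword list: generic words that carry no topical signal in a title.
STOPWORDS = {"journal", "of", "the", "and", "in", "for", "annals", "international"}


def title_terms(name):
    """Content-bearing terms of a journal name, lowercased."""
    return {w for w in name.lower().replace("&", " ").split() if w not in STOPWORDS}


def group_by_term(names):
    """Group journal names under each title term shared by at least two of them."""
    groups = defaultdict(list)
    for name in names:
        for term in sorted(title_terms(name)):
            groups[term].append(name)
    return {term: members for term, members in groups.items() if len(members) > 1}


journals = [
    "Journal of Cell Biology",
    "Cell",
    "Molecular Cell",
    "Journal of Molecular Biology",
]
for term, members in sorted(group_by_term(journals).items()):
    print(term, "->", members)
```

Because a journal can appear under several terms at once, the output behaves like facets rather than a single strict tree, which is in the spirit of the faceted hierarchies mentioned above.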
35
Summary
• BioText aims to improve access to bioscience information via
  – Sophisticated language analysis
  – Integration of results into
    • Annotated database
    • Flexible user interface
• Eventual goal
  – Semi-automated mining and discovery