Topic Extraction from Biology Literature: Prior, Labeling, and Switching
Qiaozhu Mei
A Sample Topic
Word distribution (language model):
filaments 0.0410238; muscle 0.0327107; actin 0.0287701; z 0.0221623; filament 0.0169888; myosin 0.0153909; thick 0.00968766; thin 0.00926895; sections 0.00924286; er 0.00890264; band 0.00802833; muscles 0.00789018; antibodies 0.00736094; myofibrils 0.00688588; flight 0.00670859; images 0.00649626
Meaningful labels:
actin filaments; flight muscle; flight muscles
Example documents:
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
Topic/Theme Extraction
• A theme/topic is represented with a multinomial distribution over words
• Unigram language models
– Easier to interpret
– Easy to add prior
– Easy for retrieval
• Assumption:
– K themes in a collection
– A document covers multiple themes
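The assumptions above can be sketched as a small mixture model. This is an illustrative sketch only: the two themes, their probabilities, the mixing weights, and the helper name `p_word_given_doc` are invented for the example and do not come from the slides.

```python
# Hypothetical toy setup: 2 themes over a 4-word vocabulary.
# Each theme is a multinomial (unigram) distribution p(w | theme).
themes = [
    {"foraging": 0.6, "nectar": 0.3, "gene": 0.05, "expression": 0.05},
    {"foraging": 0.05, "nectar": 0.05, "gene": 0.5, "expression": 0.4},
]

# A document covers multiple themes: mixing weights p(theme | d), summing to 1.
doc_weights = [0.7, 0.3]


def p_word_given_doc(word, weights, themes):
    """p(w | d) = sum over themes j of p(theme j | d) * p(w | theme j)."""
    return sum(pi * theme.get(word, 0.0) for pi, theme in zip(weights, themes))


# Word distribution of the document under the mixture.
p_w_given_d = {w: p_word_given_doc(w, doc_weights, themes) for w in themes[0]}
```

Because each theme is a plain word distribution, it can be read directly (interpretation), edited directly (priors), and used as a query model (retrieval).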
Topic Extraction vs. Clustering
• Topic extraction:
– Effective at revealing latent topics and finding the documents most relevant to a topic
– Better interpretation, worse accuracy
– Easy to add priors (control the topics)
• Clustering algorithms:
– Effective at assigning documents to non-overlapping clusters
– Better accuracy, worse interpretation
– Hard to control
Topic Extraction (Results)
corpora (0.0438967); allata (0.0315774); hormone (0.0249687); juvenile (0.0184049); insulin (0.0174549); embryos (0.0165997); neurosecretory (0.0127734); embryo (0.0124167); biosynthesis (0.0118067); cardiaca (0.00969471); sexta (0.0088941); medium (0.00865245); iran (0.00703376); mannose (0.00668768); volume (0.00661038); synapse (0.00652483); injected (0.00636151)
Related documents
44 biosis:199598006316: 44 biosis:200000292072: 44 biosis:199293065558: 44 biosis:199799595920: 44 biosis:199395062782:
stimulatory effect of octopamine on juvenile hormone biosynthesis in honey bees (apis mellifera): physiological and immunocytochemical evidence
• May want a more general topic
• How to tell the algorithm to find a more general topic, like “behavioral maturation”?
Topic Extraction (Results cont.)
pollen (0.467911); foraging (0.0373205); foragers (0.0365857); collected (0.0318249); grains (0.0314324); loads (0.025104); collection (0.0208903); nectar (0.0185726); sources (0.0113751); collecting (0.00999529); types (0.00978636); pellets (0.00942175); germination (0.00733012); load (0.00646375); stored (0.00599516); amount (0.00481306); trips (0.00478013)
Related Documents
13 biosis:200200039990: 13 biosis:199900297835: 13 biosis:200100318017: 13 biosis:199497516580: 13 biosis:200000045397:
the response of the stingless bee melipona beecheii to experimental pollen stress, worker loss and different levels of information input
• Biased towards “Pollen”
• Not precisely covering “foraging”
• How to tell the algorithm to focus on “foraging”?
Topic Extraction (Full Results)
• 100 topics from biosis-bee: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic.html
• 5 themes for query “food” in biosis-bee; 500 documents: http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-food-5-basic.html
Incorporating Topic Priors
• Either topic extraction or clustering:
– Cannot guarantee that the extracted themes are the expected ones
– A user exploring a collection usually has preferences
– E.g., wants one topic/cluster to be about foraging behavior
• Use a prior to guide the theme extraction
– Prior as a simple language model
– E.g., forage 0.2; foraging 0.3; food 0.05; etc.
Incorporating Topic Priors
Original EM:
EM with Prior:
Prior: language model; interpreted as pseudo counts
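One common way to realize "prior as pseudo counts" is to add `mu * p(w | prior)` to the expected word counts of a theme in the M-step, which corresponds to MAP estimation with a conjugate Dirichlet prior. The sketch below is a toy PLSA-style EM under that assumption. It is not necessarily the exact update shown on the slide, and the function name `plsa_em`, the strength parameter `mu`, and the toy documents are all hypothetical.

```python
import random


def plsa_em(docs, k, prior=None, mu=0.0, iters=50, seed=0):
    """Toy PLSA-style EM. docs: list of {word: count} dicts.
    prior: {theme_index: {word: prob}}; mu: prior strength (pseudo-count mass).
    Returns (themes, mix): p(w | theme j) and p(theme j | doc d)."""
    prior = prior or {}
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Random initial themes and uniform mixing weights.
    themes = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        total = sum(raw.values())
        themes.append({w: v / total for w, v in raw.items()})
    mix = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        counts = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_mix = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: p(theme j | d, w) is proportional to
                # p(theme j | d) * p(w | theme j).
                post = [mix[d][j] * themes[j][w] for j in range(k)]
                z = sum(post)
                if z == 0.0:
                    continue  # numerical guard for the toy example
                for j in range(k):
                    expected = c * post[j] / z
                    counts[j][w] += expected
                    new_mix[d][j] += expected
        # M-step: expected counts plus pseudo counts from the prior.
        # Prior words that never occur in the collection are ignored.
        for j in range(k):
            pseudo = {w: mu * prior.get(j, {}).get(w, 0.0) for w in vocab}
            total = sum(counts[j].values()) + sum(pseudo.values())
            if total > 0.0:
                themes[j] = {w: (counts[j][w] + pseudo[w]) / total for w in vocab}
        for d, row in enumerate(new_mix):
            s = sum(row)
            if s > 0.0:
                mix[d] = [v / s for v in row]
    return themes, mix


# Hypothetical toy collection; theme 0 is steered towards foraging.
docs = [
    {"foraging": 3, "nectar": 2, "food": 2},
    {"gene": 3, "expression": 2, "brain": 1},
    {"foraging": 2, "food": 1, "nectar": 1},
    {"gene": 2, "expression": 3},
]
themes, mix = plsa_em(docs, k=2, prior={0: {"foraging": 0.5, "food": 0.5}}, mu=20.0)
```

With `mu=0` or no prior, the M-step reduces to the ordinary maximum-likelihood update. A larger `mu` makes the theme follow the prior more closely, and a smaller one lets the data dominate.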
Incorporating Topic Priors (results)
foraging 0.0498044; food 0.0472535; foragers 0.0310718; dance 0.0266078; source 0.0254369; nectar 0.0162739; distance 0.0141869; forage 0.0141503; information 0.0129047; dances 0.012684; hive 0.0124987; landmarks 0.0119087; dancing 0.0109375; waggle 0.0101672; feeder 0.0101266; rate 0.0085641; sources 0.00825884; recruitment 0.00813717; forager 0.00796914
Prior:
forage 0.1; foraging 0.1; food 0.1; source 0.1
Incorporating Topic Priors (results: cont.)
age 0.0672687; division 0.0551497; labor 0.052136; colony 0.038305; foraging 0.0357817; foragers 0.0236658; workers 0.0191248; task 0.0190672; behavioral 0.0189017; behavior 0.0168805; older 0.0143466; tasks 0.013823; old 0.011839; individual 0.0114329; ages 0.0102134; young 0.00985875; genotypic 0.00963096; social 0.00883439
Prior:
labor 0.2; division 0.2
Incorporating Topic Priors (results: cont.)
gene 0.0648303; expression 0.0486273; sequence 0.0407999; sequences 0.0311126; brain 0.0233977; drosophila 0.020891; cdna 0.0186153; predict 0.0166939; expressed 0.0166521; amino 0.0126359; dna 0.010655; genome 0.0101629; conserved 0.0098135; bp 0.00908649; nucleotide 0.00906794; phylogenetic 0.00887771; encoding 0.00866418; melanogaster 0.00798409
Prior:
brain 0.1; predict 0.1; gene 0.1; expression 0.1
Incorporating Topic Priors (results: cont.)
behavioral 0.110674; age 0.0789419; maturation 0.057956; task 0.0318285; division 0.0312101; labor 0.0293371; workers 0.0222682; colony 0.0199028; social 0.0188699; behavior 0.0171008; performance 0.0117176; foragers 0.0110682; genotypic 0.0106029; differences 0.0103761; polyethism 0.00904816; older 0.00808171; plasticity 0.00804363; changes 0.00794045
Prior:
behavioral 0.2; maturation 0.2
Incorporating Topic Priors (Full results)
• 30 topics from biosis-bee (first 7 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior.html
• 30 topics from biosis-bee (first 2 topics w/ prior): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-30-prior3.html
Labeling a Topic
• Themes (topic models) can be hard to interpret
• Giving meaningful labels to a topic is hard
What is a Good Label?
• Suggesting the theme (relevance)
• Understandable – phrases?
• High coverage inside topic
– A theme is often a mixture of concepts
• Discriminative across topics
– A theme is usually in the context of k topics
• …
Our Method
• Guarantee understandability with a pre-processing step
– Use phrases as candidate topic labels
– Other possible choices: entities
• Satisfy relevance, coverage, and discriminability with a probabilistic framework
Good labels = Understandable + Relevant + High Coverage + Discriminative
Labeling a Topic: Candidate Labels
• Phrase generation:
– Statistically significant 2-grams
– Hypothesis testing
– T-test used; ranked by t-score
• Other choices?
– Entities?
– Behavior ontology?
– GO terms: hard to use, because they are not real phrases from the literature
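The phrase-generation step can be sketched with the textbook t-test for collocations, where the observed bigram probability is compared with the probability expected if the two words were independent. This is a minimal sketch under that assumption. The function name `bigram_t_scores`, the `min_count` cutoff, and the toy token list are hypothetical, and the slides do not say which variant of the test was used.

```python
import math
from collections import Counter


def bigram_t_scores(tokens, min_count=2):
    """Rank 2-grams by t-score: (observed - expected) / sqrt(observed / N).

    observed = c(w1 w2) / N, expected = p(w1) * p(w2) under independence,
    and the sample variance is approximated by the observed probability.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = max(n - 1, 1)
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue  # too rare to test reliably
        observed = c12 / n_bigrams
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        scores[(w1, w2)] = (observed - expected) / math.sqrt(observed / n_bigrams)
    # Highest t-score first: the strongest candidate labels.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy text.
tokens = ("flight muscle of the bee and flight muscle of the fly "
          "and the thick filament").split()
ranked = bigram_t_scores(tokens)
```

On a real collection the ranked list would be cut at a significance threshold or at a fixed number of top candidates.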
Labeling a Topic: Semantic Relevance
• Zero-order: use phrases which well cover the top words:
Latent topic (top words): clustering, dimensional, algorithm, birch, shape, body, …
Good label: “clustering algorithm”
Bad label: “body shape”
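One way to make the zero-order idea concrete is to score a candidate phrase by how much more likely its words are under the topic than under a background model. The sketch below assumes the words of a phrase are scored independently. The function name `zero_order_score`, the `floor` value for unseen words, and all probabilities are invented for illustration and are not taken from the slides.

```python
import math


def zero_order_score(label, topic, background, floor=1e-6):
    """Sum over the words of the phrase of log(p(w | topic) / p(w))."""
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in label.split()
    )


# Hypothetical topic and background distributions (toy numbers).
topic = {"clustering": 0.3, "dimensional": 0.2, "algorithm": 0.2,
         "birch": 0.1, "shape": 0.05, "body": 0.01}
background = {w: 0.05 for w in topic}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
```

A phrase built from the topic's high-probability words scores above one that only touches its tail, which matches the good and bad labels on the slide.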
Labeling a Topic: Semantic Relevance (cont.)
• First-order: use phrases with similar context:
Topic, P(w|θ): clustering, dimension, partition, algorithm, hash, …
Context collection: SIGMOD Proceedings
Compare P(w|θ) with the label context P(w|l) via D(θ||l)
Good label: “clustering algorithm” (context: clustering, hash, dimension, algorithm, partition, …)
Bad label: “hash join” (context: clustering, hash, dimension, join, algorithm, …)
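The first-order idea can be sketched by comparing the topic's word distribution with the distribution of words that occur around a candidate label, using the negative KL divergence as the score. This is a sketch under assumptions: the function name `first_order_score`, the `floor` smoothing for words missing from a label's context, and all numbers are hypothetical.

```python
import math


def first_order_score(topic, label_context, floor=1e-6):
    """Negative KL divergence -D(topic || label context); closer to 0 is better."""
    return -sum(
        p * math.log(p / max(label_context.get(w, 0.0), floor))
        for w, p in topic.items()
        if p > 0.0
    )


# Hypothetical distributions (toy numbers).
topic = {"clustering": 0.3, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.2, "hash": 0.1}
context_clustering_algorithm = {"clustering": 0.25, "hash": 0.15, "dimension": 0.2,
                                "algorithm": 0.2, "partition": 0.2}
context_hash_join = {"clustering": 0.05, "hash": 0.4, "dimension": 0.05,
                     "join": 0.4, "algorithm": 0.1}

good = first_order_score(topic, context_clustering_algorithm)
bad = first_order_score(topic, context_hash_join)
```

In practice the context distribution of a label would be estimated from the documents or windows in which the phrase occurs, and smoothed with a background model rather than a fixed floor.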
Labeling a Topic (results)
female (0.0892427); females (0.0856834); male (0.0854142); males (0.0812643); sex (0.0577668); reproductive (0.0214618); ratio (0.0142873); alleles (0.0133912); diploid (0.0125172); offspring (0.0120271); sexes (0.0116374); investment (0.0115359); mating (0.00902159); number (0.00823397); success (0.00785498); sexual (0.00751456); determination (0.00663546); size (0.00633002)
Labels:
sex ratio (2.49468) (32); male female (2.29508) (51); sex determination (2.16534) (21); female flowers (1.83686) (23); sex alleles (1.79415) (16); multiple mating (1.72684) (19)
Labeling a Topic (results cont.)
Labels:
juvenile hormone 2.44992 117; hormone jh 1.58432 49; larval instar 1.53676 20; worker larvae 1.52398 51; corpora allata 1.50391 34
hormone 0.0536175; jh 0.0518038; juvenile 0.0466941; development 0.0387031; larval 0.0276814; hemolymph 0.0216493; pupal 0.0189934; stage 0.0188286; glands 0.0173832; larvae 0.0169996; adult 0.0154695; instar 0.0149492; haemolymph 0.0140053; vitellogenin 0.0131076; caste 0.0124822; protein 0.0116558; glucose 0.0112673; corpora 0.0105111
Labeling a Topic (results)
Labels
food source -6.72378 107; nectar foraging -7.11784 28; nectar foragers -7.58965 47; nectar source -7.78975 16; food sources -7.8487 72; waggle dance -8.21514 31
foraging 0.0498044; food 0.0472535; foragers 0.0310718; dance 0.0266078; source 0.0254369; nectar 0.0162739; distance 0.0141869; forage 0.0141503; information 0.0129047; dances 0.012684; hive 0.0124987; landmarks 0.0119087; dancing 0.0109375; waggle 0.0101672; feeder 0.0101266; rate 0.0085641; recruitment 0.00813717; forager 0.00796914
Prior
0 forage 0.1; 0 foraging 0.1; 0 food 0.1; 0 source 0.1
Labeling a Topic (full results)
• 100 topics from biosis-bee (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/bee-100-basic-l.html
• 100 topics from biosis-fly-genetics (w/ labels): http://sifaka.cs.uiuc.edu/~qmei2/data/beespace/fly-100-l.html
Context Switching
• Utilize topic extraction for concept switching (two possible ways)
– Label the same topic model with phrases in another context
– Use the topic model from context A as prior to extract topics from context B
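The second option can be sketched as turning a topic from context A into pseudo counts for context B: keep the topic's top words that also occur in B's vocabulary, renormalize, and scale by a prior strength. The probabilities below are the top words of the bee foraging topic in this deck. The function name `topic_as_prior`, the parameters `top_n` and `mu`, and the fly vocabulary are hypothetical.

```python
def topic_as_prior(topic_a, vocab_b, top_n=10, mu=100.0):
    """Convert a topic from context A into pseudo counts usable in context B."""
    top = sorted(topic_a.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Words unknown in context B cannot receive pseudo counts there.
    shared = {w: p for w, p in top if w in vocab_b}
    total = sum(shared.values())
    if total == 0.0:
        return {}
    # Renormalize over the shared words, then scale to mu pseudo counts.
    return {w: mu * p / total for w, p in shared.items()}


# Top words of the bee foraging topic (values from the slide).
bee_topic = {"foraging": 0.142473, "foragers": 0.0582921, "forage": 0.0557498,
             "food": 0.0393453, "nectar": 0.03217, "colony": 0.019416}
# Hypothetical vocabulary of the fly collection.
fly_vocab = {"foraging", "forage", "food", "sucrose", "rover", "sitter", "larvae"}

pseudo_counts = topic_as_prior(bee_topic, fly_vocab)
```

The resulting pseudo counts would then be added to the expected counts of one theme when running EM on the context B collection.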
foraging 0.142473; foragers 0.0582921; forage 0.0557498; food 0.0393453; nectar 0.03217; colony 0.019416; source 0.0153349; hive 0.0151726; dance 0.013336; forager 0.0127668; information 0.0117961; feeder 0.010944; rate 0.0104752; recruitment 0.00870751; individual 0.0086414; reward 0.00810706; flower 0.00800705; dancing 0.00794827; behavior 0.00789228
Labels with bee context:
foraging trip 2.31174 21; nectar foragers 2.23428 47; tremble dance 2.21407 10; returning foragers 2.18954 16; food sources 2.14453 72; food source 2.13647 107; foraging strategy 2.101 14; individual foraging 2.08334 16; waggle dance 2.07836 31
Labels with fly context:
foraging behavior 2.45263 27; age related 2.29676 20; drosophila larvae 2.15361 67; feeding rate 1.99218 17; apis mellifera 1.9847 23; diptera drosophilidae 1.9 25
foraging 0.290076; nectar 0.114508; food 0.106655; forage 0.0734919; colony 0.0660329; pollen 0.0427706; flower 0.0400582; sucrose 0.0334728; source 0.0319787; behavior 0.0283774; individual 0.028029; rate 0.0242806; recruitment 0.0200597; time 0.0197362; reward 0.0196271; task 0.0182461; sitter 0.00604067; rover 0.00582791; rovers 0.00306051
foraging 0.142473; foragers 0.0582921; forage 0.0557498; food 0.0393453; nectar 0.03217; colony 0.019416; source 0.0153349; hive 0.0151726; dance 0.013336; forager 0.0127668; information 0.0117961; feeder 0.010944; rate 0.0104752; recruitment 0.00870751; individual 0.0086414; reward 0.00810706; flower 0.00800705; dancing 0.00794827; behavior 0.00789228
Speed of topic extraction
# documents    # themes    Running time
500            5           8.3 s
500            10          10.6 s
1000           5           17.6 s
10k            30          350 s
16k            150         4000 s