bibm11 b223 slides(2)
Post on 06-Apr-2018
8/3/2019 Bibm11 b223 Slides(2)
Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models
1 College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA
2 Dept. of Computer Science, Central China Normal University, Wuhan, China
3 Department of Computer Science, University of Vermont, Burlington, VT, USA

A genome can be thought of as the complete set of DNA sequences that code for the hereditary material that is passed on from generation to generation.
These DNA sequences include all of the genes (the functional and physical units of heredity passed from parent to offspring) and transcripts (which carry genetic information) included within the genome.
Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

In recent years we have seen the growth of GenBank and NCBI.

With the growth of GenBank and NCBI, many annotation algorithms compare sequences against a standard reference and attach meta-information to the sequences.

Backgrounds: meta-information
The annotated meta-information involves hierarchical data such as NCBI Taxonomy and Gene Ontology.

Challenges: Metagenomics
With fast-advancing sequencing techniques, large amounts of sequenced genomes and meta-genomes from uncultured microbial samples are available.
The goal of metagenomics is to study genome-wide gene expression in microbial communities (such as those in the human body) and understand the underlying biological processes.

What are the major research questions of our study?
We use our data mining framework to investigate the following questions:
1) Given metagenomic samples, what genomes are there?
Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by a homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).
2) What are the major functions of these genomes?
The answers to this question involve annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) at the genome level (a.k.a. functional analysis).
Our research objective: We aim to develop a new method that is able to analyze the genome-level composition of DNA sequences, in order to characterize a set of common genomic features shared by the same species and tell their functional roles.

Structural annotation and protein encoding regions
Homology-based functional analysis
Topic models

Structural annotation and protein encoding regions
Annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences.

Structural annotation and protein encoding regions (continued)
Standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available on each NCBI online query.

Functional analysis - overview
Functional analysis
Uncover the major gene functions related to the genomic sequences.
Requires explaining the biochemical activity (a.k.a. molecular function) of the gene product and identifying the biological process to which the gene or gene product contributes (including information about the gene).

Homology-based functional analysis (Richter and Huson, 2009)
A homology-based approach has recently been introduced to achieve functional annotation for metagenomic reads (Richter and Huson, 2009).
The framework begins with a homology-based BLASTX algorithm to compare the metagenomic fragments against the protein sequences in the NCBI database.
The BLASTX hits associate fragments with related protein IDs and gene names. After that, with the help of the Gene Ontology, the hits are mapped to GO terms, which provides an overview of gene functions and products for metagenomic fragments.

Homology-based functional analysis (Richter and Huson, 2009)
Figure: GO terms obtained from database identifier mapping (Richter and Huson, 2009).

Limitations with Homology-based Functional Analysis Methods
1. Homology-based approaches rely heavily on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORFs).
The BLAST-like local alignment may either return hundreds of hits or return no hits, depending on the E-value threshold used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, there is usually no proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations).
2. Homology-based approaches do not provide any insight about the major functional capabilities of genomes (like which gene functions are more commonly shared by strains from the same species), as there is no priority for the annotated terms.

Topic Modeling - Intuitive
Example text:
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
Topical words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image
Assume the data we see is generated by some parameterized random process.
Learn the parameters that best explain the data.
Use the model to predict (infer) new data, based on the observed data.

Word: Basic unit. Item from a vocabulary indexed by {1, . . . , V}.
Document: A sequence of N words, w = (w_1, w_2, . . . , w_N).
Collection: A total of D documents, denoted by D = {w_1, w_2, . . . , w_D}.
Topic: Denoted by z; the total number is K. Each topic has its unique word distribution p(w|z).

Background & Existing Techniques of Generative Latent Topic Models
The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001)
$$p(w_i \mid d_j) = \sum_{k} p(w_i \mid z_k)\, p(z_k \mid d_j)$$
in which $p(w_i \mid z_k)$ is the word-topic prior probability and $p(z_k \mid d_j)$ is the likelihood of topic $z_k$ in document $d_j$.
Assumption:
Each document has a mixture of k topics.
Fitting the model involves:
Estimating the topic-specific word distributions $p(w_i \mid z_k)$ and document-specific topic distributions $p(z_k \mid d_j)$ from the corpus via maximum likelihood estimation (MLE).

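Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
In the PLSI model, the topic mixtures $p(z_k \mid d_j)$ of documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated. Thus it is not scalable.
The LDA model treats the probability of latent topics for each document, $p(z \mid d)$, and the probability of words for each latent topic, $p(w \mid z)$, as latent random variables which are subject to change when a new document comes.
Generative model:
$\theta_d \sim \mathrm{Dir}(\alpha)$
$p(z_j \mid d) \sim \mathrm{Multi}(\theta_d)$
$p(w_i \mid z_j) \sim \mathrm{Multi}(\phi_j)$
$\phi_j \sim \mathrm{Dir}(\beta)$
Word-topic posterior probability:
$$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$
in which W is the vocabulary size, T is the number of topics, and the counts n exclude the current assignment of word $w_i$.
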
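LDA Model Estimation - Gibbs Sampling
Monte Carlo process (Griffiths, 2004)
$$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$$
in which
$$p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z_{w_i} = j, \phi_j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\phi_j = \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}$$
$$p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z_{w_i} = j \mid \theta_d)\, p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\theta_d = \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$
Since
$$p(w_i \mid z_{w_i} = j, \phi_j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \phi_{j, w_i}$$
$$p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j)\, p(\phi_j)$$
$$p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d)\, p(\theta_d)$$
and
$p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j) \sim \mathrm{Multi}(\phi_j)$, $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \sim \mathrm{Multi}(\theta_d)$, $p(\theta_d) \sim \mathrm{Dir}(\alpha)$ and $p(\phi_j) \sim \mathrm{Dir}(\beta)$,
it follows that
$$p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\beta + n^{(w_i)}_{-i,j}), \qquad p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\alpha + n^{(d_i)}_{-i,j})$$
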
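The Gibbs sampling update can be sketched in a few lines of pure Python. This is only a minimal illustration under stated assumptions, not the implementation used for the talk: documents are lists of integer word ids, the priors are symmetric, and the function name `gibbs_lda` and its default hyperparameters are hypothetical. The document-side denominator (the term with T times alpha) is dropped because it is the same for every topic and cancels when the weights are normalized.

```python
import random

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, W)."""
    rng = random.Random(seed)
    n_wt = [[0] * T for _ in range(W)]  # n_wt[w][j]: times word w is assigned to topic j
    n_t = [0] * T                       # n_t[j]: total number of words assigned to topic j
    n_dt = [[0] * T for _ in docs]      # n_dt[d][j]: words of document d assigned to topic j
    z = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            j = rng.randrange(T)
            zd.append(j)
            n_wt[w][j] += 1
            n_t[j] += 1
            n_dt[d][j] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove the current assignment to obtain the "-i" counts.
                n_wt[w][j] -= 1
                n_t[j] -= 1
                n_dt[d][j] -= 1
                # Unnormalized word-topic posterior for each candidate topic k.
                weights = [
                    (n_wt[w][k] + beta) / (n_t[k] + W * beta) * (n_dt[d][k] + alpha)
                    for k in range(T)
                ]
                # Draw the new topic in proportion to the weights.
                j = rng.choices(range(T), weights=weights)[0]
                z[d][i] = j
                n_wt[w][j] += 1
                n_t[j] += 1
                n_dt[d][j] += 1
    return z, n_wt, n_dt
```

After sampling, normalizing `n_dt[d]` (plus alpha) gives an estimate of the topic mixture of document d, and normalizing the columns of `n_wt` (plus beta) gives the word distribution of each topic.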
Given the word-topic posterior probability, the Monte Carlo process becomes straightforward: it resembles rolling a biased die (with a given probability for each facet to appear) to determine the assignment of topics to each word for the next round.
Given probability for each word: $p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$, $j = 1, \ldots, T$
New topic assignment for each word.

Experiments
Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models
Based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling.
Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples, the functional elements (including NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway mappings) bear an analogy with words.
Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.

Experimental Data Collection
In our experiment, we conduct a probabilistic topic modeling experiment to identify functional groups from human gut microbial community data. The data was generated by [Qin et al., 2010] and is openly accessible via http://gutmeta.genomics.org.cn/
The human gut microbial samples from [Qin et al., 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two groups: one with Crohn's disease (CD), and the other with ulcerative colitis (UC).
In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

Experimental Data Collection (continued)
According to [Qin et al., 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. After that, the Glimmer program was used to predict protein-encoding regions (CDs).
The predicted CDs sequences were then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.
Example record:
CDs_id: MH0001
Name: GL0006996_MH0001_[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:-
Length: 1206
COG/KO: COG4799 K01966
Pathway mapping: map00280,map00640
Taxonomic level: species - Eubacterium eligens
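As an illustration of how one catalogue record could be turned into the "words" used by the topic model, here is a small Python sketch. The dictionary layout, the field names and the helper `functional_elements` are assumptions made for this example; the slides do not describe the actual file format or parsing code.

```python
def functional_elements(record):
    """Turn one catalogue record (a dict of strings) into a list of word-like tokens."""
    tokens = []
    # Gene orthologous group / KO identifiers are separated by whitespace.
    tokens.extend(record.get("COG/KO", "").split())
    # KEGG pathway identifiers are separated by commas.
    tokens.extend(
        p.strip() for p in record.get("Pathway mapping", "").split(",") if p.strip()
    )
    # The taxonomic level is written as "<rank> - <name>"; join it into one token.
    taxon = record.get("Taxonomic level", "")
    if taxon:
        rank, _, name = taxon.partition(" - ")
        tokens.append(f"{rank.strip()}_{name.strip().replace(' ', '_')}")
    return tokens

# Hypothetical dict form of the example record shown on the slide.
record = {
    "CDs_id": "MH0001",
    "Length": "1206",
    "COG/KO": "COG4799 K01966",
    "Pathway mapping": "map00280,map00640",
    "Taxonomic level": "species - Eubacterium eligens",
}
print(functional_elements(record))
# ['COG4799', 'K01966', 'map00280', 'map00640', 'species_Eubacterium_eligens']
```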
Experimental Data Collection (continued)
In our experiment, three types of functional elements are derived from the non-redundant CDs catalog, i.e. the NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators.
The NCBI taxonomic level indicators are obtained by carrying out BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs is then determined by a lowest common ancestor (LCA) based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomic levels.
The assignments of gene orthologous group indicators and KEGG pathway indicators are achieved by BLASTP alignment of the amino-acid sequences from predicted CDs to the eggNOG database and KEGG database.
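Counting indicators per sample is the step that turns annotations into abundance data. The following Python sketch shows one way to do it; the sample names and indicator lists are made-up toy data, and `abundance` and `to_count_vector` are hypothetical helper names, not part of any tool mentioned in the talk.

```python
from collections import Counter

def abundance(indicators_by_sample):
    """Count how often each indicator occurs in each sample."""
    return {s: Counter(ind) for s, ind in indicators_by_sample.items()}

def to_count_vector(counts, vocabulary):
    """Lay the counts of one sample out along a fixed vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

# Toy input: taxonomic level indicators observed in two hypothetical samples.
samples = {
    "sample_A": ["genus_Bacteroides", "genus_Clostridium", "genus_Bacteroides"],
    "sample_B": ["genus_Clostridium", "class_Clostridia"],
}
counts = abundance(samples)
# The union of unique indicators defines the vocabulary.
vocabulary = sorted({term for ind in samples.values() for term in ind})
vector_a = to_count_vector(counts["sample_A"], vocabulary)
```

Each sample then plays the role of a document, and its count vector is the bag-of-words input to the topic model.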
Experimental Data Collection (continued)
NCBI Taxonomic Levels
Genus Clostridium
Genus Bacteroides
Class Clostridia
Genus Bacillus
Gene Orthologous Group Indicators
COG0642 : Signal transduction histidine kinase
COG1132 : "ABC-type multidrug transport system, ATPase and permease components"
COG0438 : Glycosyltransferase
KEGG Pathway Indicators
map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism
map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism
The union of unique functional elements jointly defines a fixed word vocabulary. The NCBI taxonomic level indicators have a vocabulary size of 748; there are a total of 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and there are 953,493 KEGG pathway indicators, with a vocabulary size of 237.

Groups of functional elements in microbial community
Given the non-redundant CDs catalog and the derived functional elements, the goal is to identify groups of functional elements (a.k.a. functional groups).

Generative process of proposed model
Commonly shared functional elements across samples may suggest functional similarity and biological relevance among samples. To cover such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the background topic.

Illustration of the background topic of gene OGs
Background Topic - Indicator of Gene OGs
Gene OGs Indicator | Description | Probability
COG0463 | Glycosyltransferases involved in cell wall biogenesis | 0.00813
COG0582 | Integrase | 0.00698
COG1132 | ABC-type multidrug transport system, ATPase and permease components | 0.00689
COG0438 | Glycosyltransferase | 0.00664
COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.00644
COG1396 | Predicted transcriptional regulators | 0.00595
COG0577 | ABC-type antimicrobial peptide transport system, permease component | 0.00594
COG2207 | AraC-type DNA-binding domain-containing proteins | 0.00389
COG3250 | Beta-galactosidase/beta-glucuronidase | 0.00344

Illustration of the background topic of KEGG pathway
Background Topic - KEGG Pathway Indicator
KEGG Pathway Indicator | Description | Probability
map00230 | Metabolism_Nucleotide Metabolism_Purine metabolism | 0.0333
[...] | Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism | [...]
map00500 | Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism | 0.0260
map00240 | Metabolism_Nucleotide Metabolism_Pyrimidine metabolism | 0.0222
map00350 | Metabolism_Amino Acid Metabolism_Tyrosine metabolism | 0.0221
map00260 | "Metabolism_Amino Acid Metabolism_Glycine, serine and threonine metabolism" | 0.0220
map00010 | Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis | 0.0190
map00620 | Metabolism_Carbohydrate Metabolism_Pyruvate metabolism | 0.0176
map00251 | Metabolism_Amino Acid Metabolism_Glutamate metabolism | 0.0169
map00550 | Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis | 0.0168

Uncovered latent topics with respect to taxa
Illustration of the most relevant latent topics with respect to different taxa
Taxon | Topic ID | MI Score | Topic ID | MI Score | Topic ID | MI Score
family_Enterobacteriaceae | Topic 48 | 0.02476 | Topic 121 | 0.00915 | Topic 31 | 0.00279
genus_Clostridium | Topic 50 | 0.01628 | Topic 153 | 0.01001 | Topic 95 | 0.00765
genus_Bacteroides | Topic 156 | 0.03030 | Topic 77 | 0.02018 | Topic 52 | 0.01661
phylum_Bacteroidetes | Topic 132 | 0.00476 | Topic 165 | 0.00260 | Topic 67 | 0.00257
phylum_Firmicutes | Topic 0 | 0.01256 | Topic 99 | 0.00550 | Topic 193 | 0.00212
The relevance between taxa and latent topics is ranked by the mutual information score (MI score). The MI serves as a relevance measurement between taxa and latent topics. It shows that phylum Firmicutes is most relevant to Topic 0, 99, 193; genus Clostridium is most relevant to Topic 50, 153, 95; and genus Bacteroides is most relevant to Topic 156, 77, 52.

Uncovered latent topics with respect to taxa
Illustration of top-ranked latent topics with respect to different microbial samples
MH0001 | p(topic|sample) | O2.UC-1 | p(topic|sample) | V1.CD-1 | p(topic|sample)
Topic 0 | 0.475 | Topic 0 | 0.363 | Topic 0 | 0.286
Topic 124 | 0.116 | Topic 95 | 0.101 | Topic 61 | 0.124
Topic 181 | 0.103 | Topic 143 | 0.062 | Topic 12 | 0.116
Topic 159 | 0.040 | Topic 83 | 0.059 | Topic 115 | 0.050
Topic 86 | 0.027 | Topic 65 | 0.056 | Topic 52 | 0.048
Topic 72 | 0.018 | Topic 139 | 0.034 | Topic 32 | 0.037
Topic 19 | 0.017 | Topic 59 | 0.033 | Topic 50 | 0.036
Discoveries: the probability of Topic 0 in healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 and 52 in sample O2.UC-1 and sample V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, correspondingly.

Our discoveries from the results are supported by the recent discoveries in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al., 2006], [Manichanh C. et al., 2006], [Walker A. et al., 2011].
It has been reported that there is a significant reduction in the proportion of phylum Firmicutes in CD samples, which is consistent with our results.
This can be explained by the fact that mucosal microbial diversity is reduced in IBDs, particularly in CD, which is associated with bacterial invasion of the mucosa, whereas in UC the inflammation is more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

Based on the functional elements derived from the non-redundant CDs catalogue, we have shown that the configuration of functional groups encoded in the gene-expression data of meta-genome samples can be inferred by applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue.
The latent topics estimated from human gut microbial samples are supported by the recent discoveries in fecal microbiota study, which demonstrates the effectiveness of the proposed method.

In the proposed model, the number of functional groups has to be specified in advance, or iteratively tuned by criteria such as log-likelihood and perplexity.
In future work, we plan to use non-parametric Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provides the flexibility of modeling microbial sequences with unknown functional group numbers.

Questions

Backup Slides
Mutual Information
After estimating the topic model and assigning a latent topic to each functional element, the relevance between functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) and latent topics can be obtained by calculating the mutual information (MI) between functional element indicators and obtained latent topics, based on the final latent topic assignments to functional elements.
$$MI(R_g, Z_t) = \sum_{R_g \in \{0,1\}} \sum_{Z_t \in \{0,1\}} p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g)\, p(Z_t)}$$
in which $R_g$ and $Z_t$ are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair $(R_g, Z_t)$ indicates whether a latent topic has been assigned to a specific functional element.
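The mutual information between two binary indicators can be computed from a 2x2 table of counts. The sketch below assumes the probabilities are estimated as relative frequencies over all topic assignments (for example, n11 is the number of tokens that are functional element g and are assigned to topic t); that estimation choice and the function name `mutual_information` are assumptions for illustration.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI in nats between binary indicators R_g and Z_t from a 2x2 count table.

    n11: R_g = 1 and Z_t = 1    n10: R_g = 1 and Z_t = 0
    n01: R_g = 0 and Z_t = 1    n00: R_g = 0 and Z_t = 0
    """
    total = n11 + n10 + n01 + n00
    cells = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    # Marginal probabilities of each indicator.
    p_r = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_z = {1: (n11 + n01) / total, 0: (n10 + n00) / total}
    mi = 0.0
    for (r, z), n in cells.items():
        if n == 0:
            continue  # a zero cell contributes nothing (0 * log 0 is taken as 0)
        p = n / total
        mi += p * math.log(p / (p_r[r] * p_z[z]))
    return mi
```

Independent indicators give an MI of zero, and the score grows as the functional element and the topic co-occur more (or less) often than chance would predict.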
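Likelihood Comparison
$$p(\mathbf{w} \mid \mathbf{z}) = \prod_{t=1}^{T} \int p(\mathbf{w}_{z_t} \mid z_t, \phi_{z_t})\, p(\phi_{z_t} \mid \beta)\, d\phi_{z_t}$$
$$= \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{t=1}^{T} \frac{\prod_{w} \Gamma\big(n_t^{(w)} + \beta\big)}{\Gamma\big(n_t^{(\cdot)} + W\beta\big)}$$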
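For comparing models, the likelihood is usually evaluated in log space. The following Python sketch computes log p(w | z) with the standard closed form for LDA with a symmetric Dirichlet prior on the topic-word distributions (Griffiths, 2004). It is an assumption that this matches the slide's equation exactly; in particular, any separate term for a background topic is not modeled here, and the name `log_likelihood` is hypothetical.

```python
import math

def log_likelihood(n_tw, beta):
    """log p(w | z) for symmetric Dirichlet(beta) priors on topic-word distributions.

    n_tw[t][w] is the number of times word w is assigned to topic t.
    """
    W = len(n_tw[0])  # vocabulary size
    total = 0.0
    for counts in n_tw:
        # log of Gamma(W*beta) / Gamma(beta)^W, the prior normalizer of one topic
        total += math.lgamma(W * beta) - W * math.lgamma(beta)
        # log of prod_w Gamma(n_t^(w) + beta)
        total += sum(math.lgamma(n + beta) for n in counts)
        # minus log of Gamma(n_t^(.) + W*beta)
        total -= math.lgamma(sum(counts) + W * beta)
    return total
```

Using `math.lgamma` avoids overflow of the Gamma function for large counts; a higher (less negative) value indicates a better fit to the assigned words.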
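Perplexity Comparison
The perplexity is calculated for held-out testing data. In our experiment we use a 50% subset of the functional elements as training data and the other 50% as testing data.
The functional elements from the same sample are equally split to both subsets. In practice, the perplexity is the inverse predicted model likelihood of data in the held-out testing set, normalized by the number of words.
A smaller perplexity value indicates better model fitting.
$$\mathrm{perplexity}(D_{test}) = \exp\left( - \frac{\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right)$$
in which $\mathbf{w}_j$ is the $j$-th held-out document (sample) and $N_j$ is its number of words.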
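A minimal Python sketch of the perplexity computation follows. It assumes the per-document log-likelihoods log p(w_j) of the held-out data have already been computed by the fitted model; the function name `perplexity` and its argument layout are illustrative choices.

```python
import math

def perplexity(log_probs, lengths):
    """exp(-sum_j log p(w_j) / sum_j N_j) over the held-out documents.

    log_probs[j] is log p(w_j) in nats; lengths[j] is the word count N_j.
    """
    return math.exp(-sum(log_probs) / sum(lengths))

# Sanity check: a model that spreads probability uniformly over 4 words
# assigns log(1/4) to every token, so its perplexity is 4.
uniform = perplexity([3 * math.log(0.25), 5 * math.log(0.25)], [3, 5])
```

The sanity check reflects the usual reading of perplexity as the effective number of equally likely word choices per token.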
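Dirichlet Process (DP) as a Non-Parametric Mixture Model
$G_0 \sim DP(\gamma, H)$, in which $\gamma$ is a concentration parameter and $H$ is a base measure defined on a sample space $\Theta$. By its definition, for any finite measurable partition of $\Theta$: $\{A_1, \ldots, A_r\}$,
$$(G_0(A_1), \ldots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \ldots, \gamma H(A_r))$$
The Dirichlet process can also be constructed by stick-breaking construction as follows:
$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \beta_k = \beta'_k \prod_{i=1}^{k-1} (1 - \beta'_i), \qquad \beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad \phi_k \sim H$$
Figures: Dirichlet process by its definition; Dirichlet process constructed by stick-breaking construction.
- Data sample $x_i$ is drawn from a distribution with associated parameters $\phi_k$.
The weights of mixture components $\beta = \{\beta_k\}$ $(k = 1, \ldots, \infty)$ are also referred to as $\beta \sim GEM(\gamma)$.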
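The stick-breaking construction is easy to simulate once it is truncated to a finite number of components. The Python sketch below draws the first K mixture weights of a GEM distribution; the truncation level K, the seed handling and the name `stick_breaking` are assumptions for illustration, since the true construction has infinitely many components.

```python
import random

def stick_breaking(gamma, K, seed=None):
    """Draw the first K weights of beta ~ GEM(gamma), plus the leftover stick mass."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(K):
        frac = rng.betavariate(1.0, gamma)  # beta'_k ~ Beta(1, gamma)
        weights.append(frac * remaining)    # beta_k = beta'_k * prod_{i<k} (1 - beta'_i)
        remaining *= 1.0 - frac
    return weights, remaining

weights, leftover = stick_breaking(gamma=2.0, K=20, seed=1)
```

A small concentration parameter puts most of the mass on the first few components, while a large one spreads it over many components.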
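Hierarchical Dirichlet Process (HDP)
$G_0 \sim DP(\gamma, H)$ serves as a global random probability measure across the corpora and defines a set of child random probability measures $G_j \sim DP(\alpha_0, G_0)$ for each document $j$, which leads to different document-level distributions over semantic mixture components:
$$(G_j(A_1), \ldots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))$$
Each $G_j$ can also be constructed by stick-breaking construction as:
$$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$$
in which $\pi_j = \{\pi_{jk}\}$ $(k = 1, \ldots, \infty)$ specifies the weights of mixture component indicator $k$.
Substituting the stick-breaking construction of $G_0$ and $G_j$, with $K_l = \{k : \phi_k \in A_l\}$:
$$\left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k \right)$$
Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it can be shown that:
$$\pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\ \alpha_0 \Big(1 - \sum_{l=1}^{k} \beta_l \Big) \right)$$
It then follows that $\pi_j \sim DP(\alpha_0, \beta)$.
Figure: Stick-breaking construction of hierarchical Dirichlet process.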
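The document-level weights can be simulated from the global weights with the Beta construction above. This Python sketch works on a truncated list of global weights; the guards for zero-valued Beta parameters, the truncation itself and the name `hdp_document_weights` are assumptions added so the example runs, not part of the slide's derivation.

```python
import random

def hdp_document_weights(beta, alpha0, seed=None):
    """Draw truncated document-level weights pi_j given global weights beta."""
    rng = random.Random(seed)
    pi = []
    remaining = 1.0  # unbroken part of the document-level stick
    tail = 1.0       # 1 - sum_{l <= k} beta_l, updated as k grows
    for b in beta:
        tail -= b
        a_param = alpha0 * b
        b_param = alpha0 * tail
        if a_param <= 0.0:
            frac = 0.0  # component carries no global mass
        elif b_param <= 1e-12:
            frac = 1.0  # no global mass left after this component
        else:
            frac = rng.betavariate(a_param, b_param)  # pi'_jk
        pi.append(frac * remaining)  # pi_jk = pi'_jk * prod_{l<k} (1 - pi'_jl)
        remaining *= 1.0 - frac
    return pi, remaining

pi_j, rest = hdp_document_weights([0.5, 0.3, 0.1], alpha0=5.0, seed=0)
```

A larger alpha0 keeps each document's weights close to the global weights, while a smaller alpha0 lets documents concentrate on different components.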