bibm11 b223 slides(2)

8/3/2019 Bibm11 b223 Slides(2)


Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models

1 2 1 1 3

, , , ,

1 College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA

2 Dept. of Computer Science at Central China Normal University, Wuhan, China

3 Department of Computer Science, University of Vermont, Burlington, VT, USA


A genome can be thought of as the complete set of DNA sequences that code for the hereditary material that is passed on from generation to generation.

These DNA sequences include all of the genes (the functional and physical unit of heredity passed from parent to offspring) and transcripts (... genetic information) included within the genome.

Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.


In recent years we have seen the growth of GenBank and NCBI with the ...


With the growth of GenBank and NCBI, a lot of annotation algorithms ... standard reference and attach meta-information to the sequences.


Backgrounds: meta-information

The annotated meta-information involves hierarchical data such as NCBI Taxonomy and Gene Ontology.


Challenges: Metagenomics

With fast-advancing sequencing techniques, large amounts of sequenced genomes and meta-genomes from uncultured microbial ...

The goal of metagenomics is to study the genome-wide gene expression, ... human body) and understand the underlying biological processes.


What are the major research questions of our study?

We use our data mining framework to investigate the following questions:

1) ... samples, what genomes are there?

Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).

2) What are the major functions of these genomes?

The answers to this question involve annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) at the genome level (a.k.a. functional analysis).

Our research objective: We aim to develop a new method that is able to analyze the genome-level composition of DNA sequences, in order to ... same species, tell their functional roles.


Structural annotation and protein encoding regions

Homology-based functional analysis

Topic models


Structural annotation and protein encoding regions

Annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences.


Structural annotation and protein encoding regions (continued)

Standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available on each NCBI online query.


Functional analysis - overview

Functional analysis

Uncover the major gene functions related to the genomic ...

Requires explaining the biochemical activity (a.k.a. molecular function) of the gene product, and identifying the biological process to which the gene or gene product contributes (including information ... gene).


Homology-based functional analysis (Richter and Huson, 2009)

A homology-based approach has recently been introduced to achieve functional annotation for metagenomic reads (Richter and Huson, 2009).

The framework begins with a homology-based BLASTX algorithm to ... in the NCBI database.

The BLASTX hits associate fragments with related protein IDs and gene names. After that, with the help of the Gene Ontology ... GO terms, thus providing an overview of gene functions and products for metagenomic fragments.
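As a rough illustration of the pipeline described on this slide, the sketch below attaches GO terms to reads through their BLASTX hits. It is only a minimal sketch under stated assumptions: the hit table, the protein-to-GO mapping, every identifier in them, and the E-value cutoff are hypothetical placeholders, not data or parameters from Richter and Huson (2009).

```python
from collections import Counter

# Hypothetical BLASTX output: read ID -> list of (protein ID, E-value) hits.
blastx_hits = {
    "read_1": [("prot_A", 1e-30), ("prot_B", 1e-5)],
    "read_2": [("prot_C", 1e-2)],
    "read_3": [],
}

# Hypothetical identifier mapping: protein ID -> GO terms (placeholder names).
protein_to_go = {
    "prot_A": ["GO:term_a", "GO:term_b"],
    "prot_B": ["GO:term_b"],
    "prot_C": ["GO:term_c"],
}


def annotate_reads(hits, mapping, max_evalue=1e-3):
    """Attach GO terms to each read via its hits at or below an E-value cutoff."""
    annotations = {}
    for read_id, read_hits in hits.items():
        terms = set()
        for protein_id, evalue in read_hits:
            if evalue <= max_evalue:
                terms.update(mapping.get(protein_id, []))
        annotations[read_id] = sorted(terms)
    return annotations


annotations = annotate_reads(blastx_hits, protein_to_go)
# Number of reads annotated with each GO term.
go_profile = Counter(term for terms in annotations.values() for term in terms)
```

Reads without any hit under the cutoff (here `read_2` and `read_3`) end up with an empty annotation list, which is the "no hits" case of such a pipeline.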


Homology-based functional analysis (Richter and Huson, 2009)

Figure: GO terms obtained from database identifier mapping (Richter and Huson, 2009)


Limitations with Homology-based Functional Analysis Methods

1. Homology-based approaches very much rely on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORFs). The BLAST-like local alignment may either return hundreds of hits or return no hits, depending on the E-value threshold used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, it usually lacks a proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations).

2. ... any insight about the major functional capabilities of genomes (like which gene functions are more commonly shared by strains from the same species), as there is no priority for the annotated terms.


Topic Modeling - Intuitive

Example document:

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

Highlighted words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image

Assume the data we ... some parameterized random process.

Learn the parameters that best explain the data.

Use the model to predict (infer) new data, based on data ...


Word: Basic unit. Item from a vocabulary indexed by {1, . . . , V}.

Document: A sequence of N words, w = (w_1, w_2, . . . , w_N).

Collection: A total of D documents, denoted by {w_1, w_2, . . . , w_D}.

Topic: Denoted by z; the total number is K. Each topic has its unique word distribution p(w|z).


Background & Existing Techniques of Generative Latent Topic Models

The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001)

Assumption:

Each document has a mixture of k topics.

Fitting the model involves:

Estimating the topic-specific word distributions p(w|z) and the document-specific topic distributions p(z_k|d_j) from the corpus via maximum likelihood estimation (MLE).

Figure labels: likelihood of topic z; word-topic prior probability.
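To make the fitting step concrete, here is a minimal sketch of maximum likelihood estimation for PLSI using expectation-maximization (EM). EM is one common way to carry out this MLE; the slide does not say which optimizer was used, and the function name `plsi_em`, the toy count matrix, and the iteration count are assumptions for illustration only.

```python
import random


def plsi_em(counts, num_topics, iterations=50, seed=0):
    """Fit PLSI by EM. counts[d][w] is the count of word w in document d."""
    rng = random.Random(seed)
    num_docs, vocab_size = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row) or 1.0  # guard against an all-zero row
        return [x / total for x in row]

    # Random initialization of p(w|z) and p(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(num_docs)]

    for _ in range(iterations):
        new_wz = [[0.0] * vocab_size for _ in range(num_topics)]
        new_zd = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(vocab_size):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior p(z|d,w) is proportional to p(w|z) p(z|d).
                post = normalize([p_w_z[k][w] * p_z_d[d][k]
                                  for k in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for k in range(num_topics):
                    new_wz[k][w] += counts[d][w] * post[k]
                    new_zd[d][k] += counts[d][w] * post[k]
        # M-step: renormalize expected counts into distributions.
        p_w_z = [normalize(row) for row in new_wz]
        p_z_d = [normalize(row) for row in new_zd]
    return p_w_z, p_z_d


# Toy corpus: 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
p_w_z, p_z_d = plsi_em(counts, num_topics=2)
```

Note that `p_z_d` holds one distribution per training document, so a document that was not in `counts` has no entry and the model has to be refitted to include it.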


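Latent Dirichlet Allocation (LDA) Model (Blei, 2003)

In the PLSI model, the topic mixtures p(z_k|d_j) of documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated. Thus it is not scalable.

The LDA model treats the probability of latent topics for each document, p(z|d), and the probability of words for each latent topic, p(w|z), as latent random variables which are subject to change when a new document comes.

$$\theta^{(d)} \sim \mathrm{Dir}(\alpha), \qquad p(z \mid d) \sim \mathrm{Multi}(\theta^{(d)})$$

$$\phi^{(j)} \sim \mathrm{Dir}(\beta), \qquad p(w \mid z = j) \sim \mathrm{Multi}(\phi^{(j)})$$

$$p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-i}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$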
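The generative view behind LDA can be sketched in a few lines: draw a word distribution per topic and a topic mixture per document from Dirichlet priors, then draw each word by first picking a topic. This is a minimal sketch, not the authors' code; the function names, the hyperparameter values, and the corpus sizes are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution phi_j ~ Dir(beta) per topic.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # One topic mixture theta_d ~ Dir(alpha) per document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # word
            doc.append(w)
        docs.append(doc)
    return docs, phi


docs, phi = generate_corpus(num_docs=5, doc_len=20, num_topics=3, vocab_size=10)
```

Because the topic mixture of each document is itself drawn from a prior, a new document only needs a new draw of its own mixture rather than a re-estimation of the whole model.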


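LDA Model Estimation - Gibbs Sampling

Monte Carlo process (Griffiths, 2004)

$$p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-i}) \propto p(w_i \mid z_i = j, \mathbf{w}_{-i}, \mathbf{z}_{-i}) \, p(z_i = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-i})$$

$$p(w_i \mid z_i = j, \mathbf{w}_{-i}, \mathbf{z}_{-i}) = \int p(w_i \mid z_i = j, \phi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-i}) \, p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \, d\phi^{(j)} = \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}$$

$$p(z_i = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) = \int p(z_i = j \mid \theta^{(d)}) \, p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \, d\theta^{(d)} = \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

in which

$$p(w_i \mid z_i = j, \phi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-i}) = \phi^{(j)}_{w_i}$$

Since

$$p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-i} \mid \phi^{(j)}) \, p(\phi^{(j)}), \qquad p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-i} \mid \theta^{(d)}) \, p(\theta^{(d)})$$

and

$$p(\mathbf{w}_{-i}, \mathbf{z}_{-i} \mid \phi^{(j)}) \sim \mathrm{Multi}(\phi^{(j)}), \qquad p(\mathbf{w}_{-i}, \mathbf{z}_{-i} \mid \theta^{(d)}) \sim \mathrm{Multi}(\theta^{(d)})$$

$$p(\phi^{(j)}) \sim \mathrm{Dir}(\beta), \qquad p(\theta^{(d)}) \sim \mathrm{Dir}(\alpha)$$

it follows that

$$p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \sim \mathrm{Dir}(\beta + n^{(w)}_{-i,j}), \qquad p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-i}) \sim \mathrm{Dir}(\alpha + n^{(d)}_{-i,j})$$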


Given the word-topic posterior probability, the Monte Carlo process becomes really straightforward, which ... each facet to appear) to determine the assignment of topics to each word for the next round.

Given the probability for each word:

$$p(z_i = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-i}), \quad j = 1, \ldots, T$$

New topic assignment for each word.
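The resampling loop described here can be sketched as a collapsed Gibbs sampler in the style of Griffiths and Steyvers: remove a token from the counts, compute the unnormalized conditional for every topic, draw a new topic, and put the token back. This is a minimal sketch under assumptions; the function name, the hyperparameter defaults, the number of sweeps, and the toy documents are illustrative and not taken from the slides.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_t = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of each token

    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(num_topics)
            z[d].append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current token from the counts (the "-i" terms).
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized p(z_i = j | ...). The document-length
                # denominator is the same for every j, so it is dropped.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
z, n_wt, n_dt = gibbs_lda(toy_docs, num_topics=2, vocab_size=5)
```

After the final sweep, the count tables `n_wt` and `n_dt` can be smoothed with beta and alpha to give point estimates of p(w|z) and p(z|d).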


Experiments


Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models

Based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling.

Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples, the functional elements (including NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway mappings) bear an analogy with words.

Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.


Experimental Data Collection

In our experiment, we conduct a probabilistic topic modeling experiment to identify functional groups from human gut microbial community data. The data were generated by [Qin, et al. 2010] and are openly accessible via http://gutmeta.genomics.org.cn/

The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two groups: one group with Crohn's disease (CD), and the other group with ulcerative colitis (UC).

In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.


Experimental Data Collection (continued)

According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. After that, the Glimmer program was used to predict protein-encoding ...

The predicted CDs sequences were then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.

Example catalog entry:

CDs_id: MH0001
Name: GL0006996_MH0001_[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:-
Length: 1206
COG/KO: COG4799 K01966
Pathway mapping: map00280,map00640
Taxonomic level: species - Eubacterium eligens


Experimental Data Collection (continued)

In our experiment, three types of functional elements are derived from the non-redundant CDs catalog, i.e. the NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators.

... obtained by carrying out BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs ... based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomic levels.

The assignments of gene orthologous group indicators and KEGG pathway indicators are achieved by BLASTP alignment of the amino-acid sequences from predicted CDs to the eggNOG database and KEGG database.


Experimental Data Collection (continued)

NCBI Taxonomic Levels:
Genus Clostridium
Genus Bacteroides
Class Clostridia
Genus Bacillus

Gene OGs Indicators:
COG0642 : Signal transduction histidine kinase
COG1132 : "ABC-type multidrug transport system, ATPase and permease components"
COG0438 : Glycosyltransferase

KEGG Pathway Indicators:
map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism
map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism

The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are ... taxonomic level indicators, with a vocabulary size of 748; there are a total of 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and there are 953,493 KEGG pathway indicators, with a vocabulary size of 237.
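The word analogy can be made concrete with a short sketch that turns per-sample lists of functional element indicators into a fixed vocabulary and count vectors, the usual input of a topic model. The sample names and the indicator lists are made up for illustration; only the COG identifiers are taken from the examples on this slide.

```python
from collections import Counter

# Hypothetical input: sample ID -> functional element indicators found in it.
samples = {
    "sample_a": ["COG0642", "COG1132", "COG0642", "COG0438"],
    "sample_b": ["COG0438", "COG0438", "COG1132"],
}

# The union of unique indicators defines the fixed vocabulary.
vocabulary = sorted({e for elements in samples.values() for e in elements})
word_id = {element: i for i, element in enumerate(vocabulary)}


def to_count_vector(elements):
    """Count how often each vocabulary entry occurs in one sample."""
    counts = Counter(elements)
    return [counts.get(element, 0) for element in vocabulary]


count_matrix = {name: to_count_vector(elements)
                for name, elements in samples.items()}
total_indicators = sum(len(elements) for elements in samples.values())
vocabulary_size = len(vocabulary)
```

Here `total_indicators` corresponds to the total number of indicators reported on the slide and `vocabulary_size` to the vocabulary size, computed per indicator type.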


Groups of functional elements in microbial community

Given the non-redundant CDs catalog and the derived functional elements, ... functional elements (a.k.a. functional groups).


Generative process of proposed model

Commonly shared functional elements across samples may suggest functional similarity and biological relevance among samples. To cover such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the ... 0.


Illustration of the background topic of gene OGs

Background Topic - Indicator of Gene OGs

Gene OGs Indicator | Description | Probability
COG0463 | Glycosyltransferases involved in cell wall biogenesis | 0.00813
COG0582 | Integrase | 0.00698
COG1132 | ABC-type multidrug transport system, ATPase and permease components | 0.00689
COG0438 | Glycosyltransferase | 0.00664
COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.00644
COG1396 | Predicted transcriptional regulators | 0.00595
COG0577 | ABC-type antimicrobial peptide transport system, permease component | 0.00594
COG2207 | AraC-type DNA-binding domain-containing proteins | 0.00389
COG3250 | Beta-galactosidase/beta-glucuronidase | 0.00344


Illustration of the background topic of KEGG pathway

Background Topic - KEGG Pathway Indicator

KEGG Pathway Indicator | Description | Probability
map00230 | Metabolism_Nucleotide Metabolism_Purine metabolism | 0.0333
... | Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism | ...
map00500 | Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism | 0.0260
map00240 | Metabolism_Nucleotide Metabolism_Pyrimidine metabolism | 0.0222
map00350 | Metabolism_Amino Acid Metabolism_Tyrosine metabolism | 0.0221
map00260 | "Metabolism_Amino Acid Metabolism_Glycine, serine and threonine metabolism" | 0.0220
map00010 | Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis | 0.0190
map00620 | Metabolism_Carbohydrate Metabolism_Pyruvate metabolism | 0.0176
map00251 | Metabolism_Amino Acid Metabolism_Glutamate metabolism | 0.0169
map00550 | Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis | 0.0168


Uncovered latent topics with respect to taxa

Illustration of the most relevant latent topics with respect to different taxa

Taxon | Topic ID | MI Score | Topic ID | MI Score | Topic ID | MI Score
family_Enterobacteriaceae | Topic 48 | 0.02476 | Topic 121 | 0.00915 | Topic 31 | 0.00279
genus_Clostridium | Topic 50 | 0.01628 | Topic 153 | 0.01001 | Topic 95 | 0.00765
genus_Bacteroides | Topic 156 | 0.03030 | Topic 77 | 0.02018 | Topic 52 | 0.01661
phylum_Bacteroidetes | Topic 132 | 0.00476 | Topic 165 | 0.00260 | Topic 67 | 0.00257
phylum_Firmicutes | Topic 0 | 0.01256 | Topic 99 | 0.00550 | Topic 193 | 0.00212

... mutual information score (MI score). The MI serves as a relevance measurement between taxa and latent topics. It shows that phylum Firmicutes is most relevant to Topic 0, 99, 193; genus Clostridium is most relevant to Topic 50, 153, 95; and genus Bacteroides is most relevant to Topic 156, 77, 52.



Illustration of top-ranked latent topics with respect to different microbial samples

MH0001 | p(topic|sample) | O2.UC-1 | p(topic|sample) | V1.CD-1 | p(topic|sample)
Topic 0 | 0.475 | Topic 0 | 0.363 | Topic 0 | 0.286







Discoveries: the probability of Topic 0 in Healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 and 52 in sample O2.UC-1 and sample V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, respectively.


Our discoveries from the results are supported by recent discoveries in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al. 2006], [Manichanh C. et al., 2006], [Walker A. et al. 2011].

It has been reported that there is a significant reduction in the ... samples, which is consistent with our results.

This can be explained by the fact that mucosal microbial diversity is reduced in IBDs, particularly in CD, which is associated with bacterial ... superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.


Based on the functional elements derived from the non-redundant CDs catalogue, we have shown that the configuration of functional groups encoded in the gene-... applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue.

The latent topics estimated from human gut microbial samples ... study, which demonstrates the effectiveness of the proposed method.


In the proposed model, the number of functional groups has to be specified in advance, or iteratively tuned by criteria such as log-likelihood and perplexity.

... Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provides the flexibility of modeling microbial sequences with an unknown number of functional groups.


Questions


Backup Slides


Mutual Information

After estimating the topic model and assigning a latent topic to each ..., ... functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between functional element indicators and the obtained latent topics, based on the final latent topic assignments to functional elements.

$$MI(R_g, Z_t) = \sum_{R_g \in \{0,1\}} \sum_{Z_t \in \{0,1\}} p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g) \, p(Z_t)}$$

in which R_g and Z_t are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair (R_g, Z_t) indicates whether a latent topic has been assigned to a specific functional element.
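A minimal sketch of this relevance score, assuming the standard definition of mutual information between two binary indicators and empirical probabilities estimated from the final topic assignments. The function names and the `(element, topic)` pair representation are assumptions for illustration, not the authors' implementation.

```python
import math


def mutual_information(pairs):
    """MI between two binary indicators from a list of (r, z) observations."""
    n = len(pairs)
    mi = 0.0
    for r in (0, 1):
        for z in (0, 1):
            p_rz = sum(1 for a, b in pairs if a == r and b == z) / n
            p_r = sum(1 for a, _ in pairs if a == r) / n
            p_z = sum(1 for _, b in pairs if b == z) / n
            if p_rz > 0:  # terms with zero joint probability contribute 0
                mi += p_rz * math.log(p_rz / (p_r * p_z))
    return mi


def topic_relevance(assignments, element, topic):
    """MI between 'token is this element' and 'token is assigned this topic'.

    assignments is a list of (element, topic) pairs, one per token.
    """
    pairs = [(int(e == element), int(t == topic)) for e, t in assignments]
    return mutual_information(pairs)


# Hypothetical final assignments of topics to functional elements.
assignments = [("COG0438", 0), ("COG0438", 0), ("COG0642", 1), ("COG0642", 1)]
score = topic_relevance(assignments, "COG0438", 0)
```

Ranking topics by this score for a fixed element or taxon gives a "most relevant topics" list; a score of zero means the element and the topic are assigned independently of each other.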


Likelihood Comparison

[Equation: model likelihood p(w | ...) written in terms of the counts n, the vocabulary size W and the number of topics T; the formula is not recoverable from the extracted text.]


Likelihood Comparison (continue)



Perplexity Comparison

The perplexity is calculated for held-out testing data. In our experiment we use a 50% subset of the functional elements as training data and the other 50% as testing data. ... from the same sample are equally split into both subsets. In practice, it is the inverse predicted model likelihood of data in the held-out testing ... A smaller perplexity value indicates better model fitting.

$$\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right\}$$
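The perplexity computation itself is short once the per-document held-out log-likelihoods are available. The sketch below assumes the standard definition (exponential of the negative average per-token log-likelihood); the function name and the toy numbers are illustrative, and how log p(w_j) is obtained from the fitted model is left out.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative total log-likelihood divided by the total token count."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


# Sanity check with a model that predicts every token uniformly over 4 words:
# each token then has log-likelihood log(1/4).
lengths = [3, 5]
log_likelihoods = [n * math.log(1 / 4) for n in lengths]
uniform_perplexity = perplexity(log_likelihoods, lengths)
```

For the uniform model the perplexity equals the vocabulary size, which is the usual reference point: a fitted model is useful only if its held-out perplexity is well below that.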


Perplexity Comparison (continue)

Dirichlet Process (DP) as a Non-Parametric Mixture Model

G0 ~ DP(γ, H), in which γ is a concentration parameter and H is a base measure defined on a sample space Θ. By its definition, for any finite measurable partition of Θ: {A1, ..., Ar},

$$(G_0(A_1), \ldots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \ldots, \gamma H(A_r))$$

The Dirichlet process can also be constructed by the stick-breaking construction as follows:

$$G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}, \qquad \beta_k = \beta'_k \prod_{i=1}^{k-1} (1 - \beta'_i)$$

in which β'_k ~ Beta(1, γ) and φ_k ~ H. The weights of mixture components β = {β_k} (k = 1, ..., ∞) are also referred to as β ~ GEM(γ).

Figure panels: Dirichlet process by its definition; Dirichlet process constructed by the stick-breaking construction.

- Data sample x_i drawn from a base distribution with associated parameters φ_k
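The stick-breaking construction is easy to simulate when truncated to a finite number of sticks. The sketch below draws approximate GEM weights; the truncation level, the concentration value, and the function name are assumptions for illustration, and a true draw has infinitely many components.

```python
import random


def stick_breaking(concentration, num_sticks, seed=0):
    """Truncated stick-breaking: returns (weights, leftover stick length)."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(num_sticks):
        # Break off a Beta(1, concentration) fraction of what is left.
        fraction = rng.betavariate(1.0, concentration)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights, remaining


weights, leftover = stick_breaking(concentration=2.0, num_sticks=50)
```

A smaller concentration puts most of the mass on the first few sticks, while a larger one spreads it over many components, which is how the model expresses uncertainty about the number of mixture components.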

Hierarchical Dirichlet Process (HDP)

G0 ~ DP(γ, H) ... measure across the corpora and defines a set of child random probability measures Gj ~ DP(α0, G0) for each document j, which leads to different document-level distributions over semantic mixture components:

$$(G_j(A_1), \ldots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))$$

Each Gj can also be constructed by the stick-breaking construction as:

$$G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k}$$

in which π_j = {π_jk} (k = 1, ..., ∞) specifies the weights of mixture component indicator k.

Substituting the stick-breaking construction of G0 and Gj:

$$\left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k \right)$$

Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it shows that:

$$\pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k, \; \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right)$$

It then follows that π_j ~ DP(α0, β).

Figure: Stick-breaking construction of hierarchical Dirichlet process
