beiko hpcs

(an example of)

Computing the Microbial World

Rob Beiko
June 25, 2014

Siddique et al. (2014) Front Microbiol

Lawley et al., PLoS Genet (2012)

The Breakfast Organisms
"Bacon Fields", Author: Michael DeForge

240M “pieces”, each 150 nucleotides long

3.6 x 10^10 nucleotides

~40 GB

Hundreds of “species”

Genomes between 1.5M – 6M nucleotides

150 nt x 150 nt

We know this And this

But not this

who is doing what?

Marker genes WHO

Environmental “Shotgun” WHAT

The challenge of METAGENOME CLASSIFICATION

Clues – Sequence similarity (homology)

150 nt x 150 nt

Reference genes

Take the WHOLE SEQUENCE

Best

Worst

Clues – composition

150 nt x 150 nt

Reference genome

k-mer profiles

Genome #1: 20% G & C, 30% A & T

Genome #2: 24% G & C, 26% A & T

Best

Worst

Take a K-MER FREQUENCY

DECOMPOSITION
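A k-mer frequency decomposition can be sketched in a few lines. This is a minimal illustration, not the code behind the slides; the function name `kmer_profile` and the choice to ignore k-mers containing ambiguous bases are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Return the relative frequency of every possible DNA k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Start from all 4**k unambiguous k-mers so absent ones get frequency 0.
    profile = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for kmer, n in counts.items():
        if kmer in profile:  # assumption: skip k-mers containing N etc.
            profile[kmer] = n
    total = sum(profile.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in profile.items()}


# k = 1 gives plain base composition; larger k gives richer profiles.
base_composition = kmer_profile("GGCTGGACCA", 1)
dinucleotides = kmer_profile("GGCTGGACCA", 2)
```

With k = 1 this reduces to the G/C/A/T percentages used to contrast Genome #1 and Genome #2; two reads or genomes can then be compared by comparing their profiles.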

Homology >> Composition

* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
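To make the homology comparison concrete, the query (*) can be scored against subjects 1 and 2 by counting mismatched positions. This is only a toy stand-in for a real alignment score, assuming equal-length, ungapped sequences; the helper name `mismatches` is made up for the example.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


query = "GGCTGGACCA"
subjects = {"1": "GACTGGACCA", "2": "GGCCGGACTA"}

# Fewer mismatches means a closer homolog under this simple measure.
scores = {name: mismatches(query, seq) for name, seq in subjects.items()}
best = min(scores, key=scores.get)
```

Subject 1 differs from the query at one position and subject 2 at two, so subject 1 is the closer match.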

But homology evidence can mislead or be absent

Homology + Composition > Homology alone

GGCTGGACCA

GCCTGGTCCAGCCAGGTGCAGCCTGTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Query:

Subject:

Exact string search? NO

BLAST? OK, but SLOW!

A compromise: UBLAST

• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search

• Typically, a query is compared against all candidate DB sequences, but most will return no hits

UBLAST:

(1) Query, DB sequences

Query: GGCTGGACCA

DB: GCCTGTCCANNNNNNNNNNNNNNNNNNNNGCCAGGTGCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCTGGTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

(2) k-mer table

(3) Rank DB based on k-mer matching

(4) Do detailed search until there is no point in continuing
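The four UBLAST steps can be sketched as follows. This is a simplified illustration of the idea (index k-mers, rank candidates, search in rank order, stop early), not the actual UBLAST implementation: the ungapped window comparison stands in for the real detailed search, and the names and parameters (`search`, `min_identity`, `max_rejects`) are assumptions made for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db, k):
    """Step 2: map each k-mer to the names of DB sequences containing it."""
    index = defaultdict(set)
    for name, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def rank_candidates(query, index, k):
    """Step 3: order DB sequences by number of k-mers shared with the query."""
    shared = defaultdict(int)
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            shared[name] += 1
    return sorted(shared, key=lambda n: (-shared[n], n))


def best_window_identity(query, subject):
    """Stand-in for the detailed search: best ungapped identity of any window."""
    best = 0.0
    for i in range(len(subject) - len(query) + 1):
        window = subject[i:i + len(query)]
        same = sum(a == b for a, b in zip(query, window))
        best = max(best, same / len(query))
    return best


def search(query, db, k=4, min_identity=0.8, max_rejects=2):
    """Step 4: examine candidates in rank order, stopping after too many rejects."""
    index = build_index(db, k)
    hits, rejects = [], 0
    for name in rank_candidates(query, index, k):
        identity = best_window_identity(query, db[name])
        if identity >= min_identity:
            hits.append((name, identity))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # no more point: lower-ranked candidates are unlikely hits
    return hits


# Step 1: a query and a toy database (sequences invented for the example).
db = {
    "close": "TTTTGGCTGGACCATTTT",
    "near": "AAAAGGCTGGTCCAAAAA",
    "far": "ACGTACGTACGTACGTAC",
}
hits = search("GGCTGGACCA", db)
```

Sequences that share no k-mer with the query (here "far") are never examined in detail, which is where the speed-up over comparing against every DB sequence comes from.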

Compositional models

• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies

• One model per genome

• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence

• Model construction can take a while

k = 4 k = 5 k = 6 k = 7

PhymmBL: Brady and Salzberg (2009) Nat Methods

An alternative: Naïve Bayes

• Just compute the frequency of each k-mer for a fixed length k

• Build one frequency model for each genome

• FAST

• Assumes conditional independence – may not matter

Probability of a query Fragment originating from genome Gi

For all k-mers in the fragment…

The frequency of that k-mer in Gi
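Read together, the three labels describe a product over the k-mers of the fragment of each k-mer's frequency in genome Gi. A minimal sketch of such a classifier follows, working in log space to avoid underflow. The pseudocount smoothing and all function names are assumptions made for the example, not details taken from the slides or from Parks et al.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(genomes, k, pseudocount=1.0):
    """Build one k-mer frequency model per genome."""
    models = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        # Assumption: add-one style smoothing so unseen k-mers are not zero.
        total = sum(counts.values()) + pseudocount * 4 ** k
        models[name] = (counts, total, pseudocount)
    return models


def log_likelihood(fragment, model, k):
    """Sum of log frequencies, in the genome model, of the fragment's k-mers."""
    counts, total, pseudocount = model
    return sum(
        math.log((counts[fragment[i:i + k]] + pseudocount) / total)
        for i in range(len(fragment) - k + 1)
    )


def classify(fragment, models, k):
    """Return the genome whose model gives the fragment the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], k))


# Toy genomes invented for the example.
genomes = {
    "GC_rich": "GCGCGGCCGCGGGCCCGCGC" * 5,
    "AT_rich": "ATATTAATATAAATTTATAT" * 5,
}
models = train(genomes, k=3)
gc_call = classify("GGCCGCGG", models, k=3)
at_call = classify("ATTAAATA", models, k=3)
```

Training is a single counting pass per genome, which is why this approach is fast compared with building interpolated Markov models.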

Parks et al. (2011) BMC Bioinformatics

RITA: Rapid Identification of Taxonomic Assignments

UBLAST filter

MacDonald et al. (2012) Nucleic Acids Res

Evaluation set

• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials

• Build reference models from other genomes – can leave close relatives out of reference model

• Leave out other strains within the same species – not so hard

• Leave out other classes in the same phylum – HARD
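The evaluation set-up can be sketched as below. This is an illustration under stated assumptions: genomes are chosen uniformly at random (not weighted by genome length), and `sample_fragments` and `leave_out` are hypothetical helper names, not functions from RITA.

```python
import random


def sample_fragments(genomes, length, n, seed=0):
    """Draw n random fragments of a fixed length, remembering their source."""
    rng = random.Random(seed)
    eligible = [name for name, seq in genomes.items() if len(seq) >= length]
    if not eligible:
        raise ValueError("no genome is long enough for this fragment length")
    fragments = []
    for _ in range(n):
        name = rng.choice(eligible)  # assumption: uniform choice of genome
        seq = genomes[name]
        start = rng.randrange(len(seq) - length + 1)
        fragments.append((name, seq[start:start + length]))
    return fragments


def leave_out(reference, excluded):
    """Drop the named genomes (e.g. close relatives) from the reference set."""
    return {name: seq for name, seq in reference.items() if name not in excluded}


# Toy random "genomes" invented for the example.
rng = random.Random(1)
genomes = {
    f"g{i}": "".join(rng.choice("ACGT") for _ in range(2000)) for i in range(3)
}

# One trial per fragment length, as in the evaluation design.
trials = {length: sample_fragments(genomes, length, 5) for length in (50, 100, 200, 1000)}
reference = leave_out(genomes, {"g0"})
```

Because each fragment keeps the name of the genome it came from, predictions can be scored against the truth, and the difficulty is controlled by which relatives are removed from the reference set.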

But does it work?

[Figure. Panels: “Predicting genus from different species”, “Predicting phylum from different class”. Series: Full RITA; Best class (homology and composition agree). x-axis: DNA sequence length]

Conclusions

• Careful attention needs to be paid to the choice of approach – simple is better

• RITA illustrates two key points in (microbial) bioinformatics:

1. Homology: How heuristic are you willing to go?

2. Naïve Bayes: Keep it simple until told otherwise

• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years
