beiko hpcs
DESCRIPTION
Presentation at HPCS 2014, Halifax
TRANSCRIPT
![Page 1: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/1.jpg)
(an example of)
Computing the Microbial World
Rob Beiko
June 25, 2014
![Page 2: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/2.jpg)
![Page 3: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/3.jpg)
Siddique et al. (2014) Front Microbiol
![Page 4: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/4.jpg)
Lawley et al., PLoS Genet (2012)
![Page 5: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/5.jpg)
![Page 6: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/6.jpg)
The Breakfast Organisms
"Bacon Fields" Author: Michael DeForge
![Page 7: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/7.jpg)
240M “pieces”, each 150 nucleotides long
3.6 x 10^10 nucleotides
~40 GB
Hundreds of “species”
Genomes between 1.5M – 6M nucleotides
![Page 8: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/8.jpg)
150 nt x 150 nt
We know this And this
But not this
![Page 9: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/9.jpg)
who is doing what?
Marker genes WHO
Environmental “Shotgun” WHAT
The challenge of METAGENOME CLASSIFICATION
![Page 10: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/10.jpg)
Clues – Sequence similarity (homology)
150 nt x 150 nt
Reference genes
Take the WHOLE SEQUENCE
Best
Worst
![Page 11: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/11.jpg)
Clues – composition
150 nt x 150 nt
Reference genome
k-mer profiles
Genome #1: 20% G & C, 30% A & T
Genome #2: 24% G & C, 26% A & T
Best
Worst
Take a K-MER FREQUENCY DECOMPOSITION
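As an illustration of the k-mer frequency decomposition named on this slide, here is a minimal sketch that counts overlapping k-mers in a sequence and normalizes them to relative frequencies. The function name `kmer_profile` is hypothetical, and the input is the 10-nt example sequence used elsewhere in the talk; with k = 1 the profile reduces to base composition (G+C versus A+T).

```python
from collections import Counter


def kmer_profile(seq, k):
    """Return the relative frequency of every overlapping k-mer in seq.

    Hypothetical helper for illustration; k-mers that never occur are
    simply absent from the returned dict.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# k = 1 gives base composition; larger k gives richer profiles.
base_profile = kmer_profile("GGCTGGACCA", 1)
gc_content = base_profile.get("G", 0) + base_profile.get("C", 0)
dimer_profile = kmer_profile("GGCTGGACCA", 2)
```

Short fragments yield sparse, noisy profiles, which is one reason composition alone is a weaker clue than homology for 150-nt reads.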
![Page 12: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/12.jpg)
Homology >> Composition
* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
But homology evidence can mislead or be absent
Homology + Composition > Homology alone
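To make the homology comparison on this slide concrete, the sketch below counts position-by-position mismatches between the starred sequence and the two numbered ones. Reading the starred sequence as the query and sequences 1 and 2 as references is an assumption about the slide layout, and `mismatches` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


query = "GGCTGGACCA"  # starred sequence on the slide (assumed query)
references = {"1": "GACTGGACCA", "2": "GGCCGGACTA"}

# Fewer mismatches means stronger homology evidence.
scores = {name: mismatches(query, ref) for name, ref in references.items()}
best = min(scores, key=scores.get)
```

Here reference 1 differs from the query at one position and reference 2 at two, so homology favours reference 1.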
![Page 13: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/13.jpg)
Query: GGCTGGACCA
Subject: GCCTGGTCCAGCCAGGTGCAGCCTGTCCANNNNNNNNNNNNNNNNNNNN...
Exact string search? NO
BLAST? OK, but SLOW!
![Page 14: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/14.jpg)
A compromise: UBLAST
• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search
• Typically, a query is compared against all candidate DB sequences, but most will return no hits
UBLAST:
(1) Query, DB sequences
Query: GGCTGGACCA
DB: GCCTGTCCANNN..., GCCAGGTGCANNN..., GCCTGGTCCANNN...
(2) k-mer table
(3) Rank DB based on k-mer matching
Ranked DB: GCCTGGTCCANNN..., GCCAGGTGCANNN..., GCCTGTCCANNN...
(4) Do detailed search until there is no more point
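The four steps on this slide (query and DB sequences, k-mer table, ranking by k-mer matches, detailed search with early stopping) can be sketched roughly as follows. This is a simplified illustration of the idea, not the actual UBLAST implementation: the function names, the toy ungapped `identity` score standing in for the detailed search, and the `min_identity` and `max_rejects` parameters are all assumptions, and the N padding shown on the slide is omitted from the DB sequences.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_table(db, k):
    """Step 2: map each k-mer to the names of DB sequences containing it."""
    table = defaultdict(set)
    for name, seq in db.items():
        for kmer in kmers(seq, k):
            table[kmer].add(name)
    return table


def rank_by_shared_kmers(query, table, k):
    """Step 3: order DB sequences by number of k-mers shared with the query.

    Sequences sharing no k-mer with the query never appear in the ranking.
    """
    shared = defaultdict(int)
    for kmer in kmers(query, k):
        for name in table.get(kmer, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))


def identity(a, b):
    """Toy stand-in for the detailed search: ungapped fraction of matches."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def search(query, db, k=3, min_identity=0.8, max_rejects=1):
    """Step 4: examine ranked candidates, stopping after max_rejects failures."""
    table = build_kmer_table(db, k)
    hits, examined, rejects = [], 0, 0
    for name in rank_by_shared_kmers(query, table, k):
        examined += 1
        if identity(query, db[name]) >= min_identity:
            hits.append(name)
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # "no more point" in searching further down the ranking
    return hits, examined


# Step 1: query and DB sequences (names db1..db3 are hypothetical labels).
db = {"db1": "GCCTGTCCA", "db2": "GCCAGGTGCA", "db3": "GCCTGGTCCA"}
query = "GGCTGGACCA"
ranked = rank_by_shared_kmers(query, build_kmer_table(db, 3), 3)
hits, examined = search(query, db)
```

The saving comes from the ranking: the expensive comparison runs only on the most promising candidates, and the rest of the database is never examined.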
![Page 15: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/15.jpg)
Compositional models
• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies
• One model per genome
• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence
• Model construction can take a while
k = 4 k = 5 k = 6 k = 7
PhymmBL: Brady and Salzberg (2009) Nat Methods
![Page 16: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/16.jpg)
An alternative: Naïve Bayes
• Just compute the frequency of each k-mer for a fixed length k
• Build one frequency model for each genome
• FAST
• Assumes conditional independence – may not matter
Probability of a query Fragment originating from genome Gi
For all k-mers in the fragment…
The frequency of that k-mer in Gi
Parks et al. (2011) BMC Bioinformatics
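The annotations on this slide describe a product over the k-mers in a fragment of each k-mer's frequency in genome Gi; in log form, log P(F | Gi) is the sum over k-mers w in F of log f_i(w). Below is a minimal sketch of such a classifier. It is not the implementation from Parks et al.: the function names, the add-one smoothing for unseen k-mers, the uniform prior over genomes, and the toy genomes are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(genome, k, alphabet="ACGT"):
    """Build one k-mer frequency model for a genome.

    Returns a function giving the log frequency of a k-mer, using add-one
    smoothing (an assumption) so unseen k-mers do not zero out the product.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    n_possible = len(alphabet) ** k

    def log_freq(kmer):
        return math.log((counts[kmer] + 1) / (total + n_possible))

    return log_freq


def log_likelihood(fragment, log_freq, k):
    """Sum of log k-mer frequencies, treating k-mers as independent."""
    return sum(log_freq(fragment[i:i + k])
               for i in range(len(fragment) - k + 1))


def classify(fragment, models, k):
    """Pick the genome with the highest log-likelihood (uniform prior)."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], k))


# Toy "genomes" of equal length, invented for this example.
k = 2
models = {
    "at_rich": train("ATATTAATTTAAATATTTAA", k),
    "gc_rich": train("GCGGCCGCGCCGGCGCGGCC", k),
}
prediction = classify("GCGCGGCC", models, k)
```

Training is a single counting pass per genome, which is why this approach is fast compared with building interpolated Markov models.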
![Page 17: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/17.jpg)
RITA: Rapid Identification of Taxonomic Assignments
UBLAST filter
MacDonald et al. (2012) Nucleic Acids Res
![Page 18: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/18.jpg)
Evaluation set
• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials
• Build reference models from other genomes – can leave close relatives out of reference model
  • Leave out other strains within the same species – not so hard
  • Leave out other classes in the same phylum - HARD
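The evaluation design above can be sketched as two small helpers: one that samples random fragments of a fixed length from a known genome, and one that removes close relatives from the reference set before models are built. Both function names and the `taxonomy` dictionary layout are hypothetical, and the exclusion rule (drop every reference sharing the source genome's value at the chosen rank) is one interpretation of the slide.

```python
import random


def sample_fragments(genome, length, n, rng):
    """Draw n random fragments of the given length from a genome string."""
    if length > len(genome):
        raise ValueError("fragment length exceeds genome length")
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(n)]
    return [genome[s:s + length] for s in starts]


def leave_out(references, taxonomy, source, rank):
    """Drop references that share `rank` with the source genome.

    The source genome itself is dropped too, since it shares every rank
    with itself. `taxonomy` maps genome name -> {rank: taxon name}.
    """
    excluded = taxonomy[source][rank]
    return {name: seq for name, seq in references.items()
            if taxonomy[name][rank] != excluded}


rng = random.Random(0)  # fixed seed so trials are repeatable
genome = "ACGT" * 300   # placeholder sequence, not real data
trials = {length: sample_fragments(genome, length, 3, rng)
          for length in (50, 100, 200, 1000)}
```

Calling `leave_out` with `rank="species"` corresponds to the easier setting on the slide, and `rank="class"` to the hard one, where only more distant relatives remain in the reference set.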
![Page 19: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/19.jpg)
![Page 20: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/20.jpg)
But does it work?
Full RITA
Best class (homology and composition agree)
DNA sequence length
Predicting genus from different species
Predicting phylum from different class
![Page 21: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/21.jpg)
Conclusions
• Careful attention needs to be paid to the choice of approach – simple is better
• RITA illustrates two key points in (microbial) bioinformatics:
1. Homology: How heuristic are you willing to go?
2. Naïve Bayes: Keep it simple until told otherwise
• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years
![Page 22: Beiko hpcs](https://reader036.vdocuments.us/reader036/viewer/2022062419/5591a32a1a28ab4c6d8b457d/html5/thumbnails/22.jpg)
FIN