HMMER 3 & Community Profiling
Post on 27-May-2015
Morgan Langille
UC Davis
HMMER 3 – What’s new?
Much faster: 100x HMMER 2, ≈ BLAST
More sensitive
What’s new?
Alignment column confidence
Each residue is given a posterior probability annotation
* = 95-100%, 9 = 85-95%, 8 = 75-85%, etc.
fn3 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewqevtvprtttsvtltgLepgteYefrVqavngagegp 84
saP ++ + ++ l ++W p + +gpi+gY++++++++++ + e+ vp+ s+ +++L++gt+Y++ + +n++gegp
7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTNGPIEGYRLRLSSSEGNA-TSEQLVPAGRGSYIFSQLQAGTNYTLALSMINKQGEGP 520
78999999999*****************************9998.**********************************9997 PP
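The PP line can be decoded mechanically. The following is a minimal sketch, not HMMER code: the function names are hypothetical, the ranges for digits below 8 are extrapolated from the "etc." in the legend (with 0 taken to mean 0-5%), and "." is assumed to mark a gap column, as it appears to in the example above.

```python
def pp_char_to_range(ch):
    """Map one PP annotation character to a (low, high) percent range.

    Returns None for '.', which is assumed here to mark a gap column.
    """
    if ch == "*":
        return (95, 100)
    if ch == ".":
        return None
    if ch.isdigit():
        d = int(ch)
        if d == 0:
            return (0, 5)  # assumption: lowest bin is half-width
        return (d * 10 - 5, d * 10 + 5)
    raise ValueError(f"unexpected PP character: {ch!r}")


def count_confident_columns(pp_line, min_low=85):
    """Count aligned columns whose lower probability bound is at least min_low."""
    count = 0
    for ch in pp_line:
        rng = pp_char_to_range(ch)
        if rng is not None and rng[0] >= min_low:
            count += 1
    return count
```

For instance, `count_confident_columns("789*.")` counts only the `9` and `*` columns with the default threshold.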
What’s new?
Sequence scores, not alignment scores
Scoring just a single best alignment can break down if it is a remote homolog
Scoring sequences by integrating over alignment uncertainty
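The difference between scoring one best alignment and integrating over all alignments can be illustrated on a toy hidden Markov model. This is only a sketch: the two states, all probabilities, and the function names are made up for illustration, and it does not reproduce HMMER's profile architecture or its log-odds scoring. Taking the maximum over state paths gives the single-best-path (Viterbi) probability; summing over them gives the total (Forward) probability.

```python
# Toy two-state HMM; every number below is invented for illustration.
STATES = ("M", "I")
START = {"M": 0.8, "I": 0.2}
TRANS = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
EMIT = {"M": {"A": 0.6, "C": 0.4}, "I": {"A": 0.5, "C": 0.5}}


def _run(seq, combine):
    """Dynamic programming over state paths.

    combine=max keeps only the best path (Viterbi);
    combine=sum adds up all paths (Forward).
    """
    prev = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for ch in seq[1:]:
        prev = {
            s: combine(prev[r] * TRANS[r][s] for r in STATES) * EMIT[s][ch]
            for s in STATES
        }
    return combine(prev.values())


def viterbi_prob(seq):
    """Probability of the single most likely state path."""
    return _run(seq, max)


def forward_prob(seq):
    """Probability of the sequence summed over all state paths."""
    return _run(seq, sum)
```

The Forward value is never smaller than the Viterbi value, and the gap grows when many alternative paths carry similar probability, which is the situation the slide describes for remote homologs.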
Single Sequence Queries
phmmer ≈ BLASTP
  Search a sequence against a sequence database.
jackhmmer ≈ PSI-BLAST
  Iteratively search a sequence against a sequence database.
Internally they produce a profile HMM from the query sequence, then run an HMM search.
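As a rough sketch of how the two programs are invoked, the helpers below build argument lists that could be passed to `subprocess.run`. The helper names and file names are hypothetical; `-N` is jackhmmer's option for the maximum number of iterations. Nothing is executed here, so HMMER does not need to be installed to try it.

```python
def phmmer_cmd(query_fasta, seq_db):
    """Argument list for a single-pass search: phmmer <query> <sequence db>."""
    return ["phmmer", query_fasta, seq_db]


def jackhmmer_cmd(query_fasta, seq_db, iterations=5):
    """Argument list for an iterative search with at most `iterations` rounds."""
    return ["jackhmmer", "-N", str(iterations), query_fasta, seq_db]


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    print(" ".join(phmmer_cmd("query.fa", "seqdb.fa")))
    print(" ".join(jackhmmer_cmd("query.fa", "seqdb.fa", iterations=3)))
```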
Small Changes
hmmpfam -> hmmscan
  Search a sequence against a profile HMM database
hmmcalibrate -> built into hmmbuild
hmmpress
  Creates binary HMM files so hmmscan is faster
  Similar idea to formatting BLAST databases using formatdb
New output format options
  --tblout (seq score, best domain score)
  --domtblout (seq score, all domain scores with coordinates)
  Gives a tabular (whitespace-delimited) output without alignments
  1/5 file size of regular output
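The tabular output is easy to post-process. Below is a minimal sketch of a `--tblout` reader, assuming the column order of released HMMER 3 versions (target name, target accession, query name, query accession, full-sequence E-value, score, bias, then best-domain E-value, score, bias, and so on) and that comment lines start with `#`. The function name and the example line are invented for illustration.

```python
def parse_tblout(lines):
    """Parse hmmscan/hmmsearch --tblout lines into a list of dicts.

    Fields are whitespace-delimited; the free-text description is the
    19th field and may itself contain spaces, hence maxsplit=18.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        f = line.rstrip("\n").split(None, 18)
        hits.append({
            "target": f[0],            # for hmmscan: the Pfam model name
            "query": f[2],             # for hmmscan: the sequence name
            "evalue": float(f[4]),     # full-sequence E-value
            "score": float(f[5]),      # full-sequence bit score
            "best_domain_score": float(f[8]),
        })
    return hits


if __name__ == "__main__":
    # Made-up example line, not real HMMER output.
    example = [
        "# target name  accession  query name ...",
        "FamA PF00000.1 seq1 - 1.2e-30 105.3 0.1 1.5e-30 104.9 0.1 "
        "1.0 1 0 0 1 1 1 1 Hypothetical family A",
    ]
    print(parse_tblout(example))
```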
Upcoming changes
Parallelization
  Multi-threaded, MPI (cluster), GPU
Translated comparisons
  BLASTX, TBLASTN, TBLASTX
More input sequence formats
  GenBank, EMBL, etc.
  Clustal format
Problems/Issues
hmmconvert
  Used to convert hmmer2 profiles into hmmer3 profiles
  Only converts file format
    ○ Good: get hmmer3 speedup
    ○ Bad: get hmmer2 sensitivity/specificity
  Should rebuild old HMMER2 HMMs using hmmbuild
Glocal vs local alignments
Local
  Any portion of the HMM can align to any portion of the sequence
Glocal
  The entire HMM is aligned to any portion of the sequence
HMMER2
  Had both, but local was not as sensitive as glocal
HMMER3
  Local was improved so that glocal was thought to be not needed (and was not included in HMMER3)
  However, some models do very poorly
  Short, extremely diverse seed alignments such as zinc finger transcription factors may be missed
Community Profiling
Phylogenetic profiling
C. hydrogenoformans
  Identified presence or absence of homologs in all other completely sequenced genomes
  Identified many hypothetical proteins that had the same profile as other sporulation proteins
Wu, et al., PLOS Genetics, 2005
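The idea of matching presence/absence profiles can be sketched in a few lines. This is an illustration only, not the method of the cited paper: the function name, gene names, and data are hypothetical, and each profile is simply a tuple of 0/1 flags, one per reference genome.

```python
from collections import defaultdict


def group_by_profile(presence):
    """Group genes that share an identical presence/absence profile.

    presence: dict mapping gene name -> sequence of 0/1 flags, one per genome.
    Returns a list of sorted gene groups containing more than one gene.
    """
    groups = defaultdict(list)
    for gene, profile in presence.items():
        groups[tuple(profile)].append(gene)
    return sorted(sorted(genes) for genes in groups.values() if len(genes) > 1)


if __name__ == "__main__":
    # Invented toy data: 1 = homolog present in that genome, 0 = absent.
    toy = {
        "known_sporulation_gene": (1, 0, 1, 1, 0),
        "hypothetical_protein_1": (1, 0, 1, 1, 0),
        "hypothetical_protein_2": (0, 1, 0, 0, 1),
    }
    print(group_by_profile(toy))
```

A hypothetical protein that lands in the same group as a characterized gene becomes a candidate for the same process.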
Community Profiling
KEGG, COG
Delong, et al., Science, 2006
Community Profiling
Look across multiple metagenomic samples
Gene families that have similar profiles may have similar function
  Similar to using co-expression to identify similarly functioning genes
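One way to quantify "similar profiles" across samples is a correlation coefficient, as is commonly done for co-expression. The sketch below uses Pearson correlation as an assumed similarity measure (the slides do not say which measure was used); the function names and family names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)


def most_similar(profiles, query):
    """Rank all other families by correlation with the query family."""
    ranked = [
        (name, pearson(profiles[query], prof))
        for name, prof in profiles.items()
        if name != query
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Here `profiles` maps each gene family to its counts per metagenomic sample, in a fixed sample order.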
So what have I done?
Downloaded the GOS peptide file
  41M sequences, 80 samples
  43GB -> 7GB, by removing extra information
  Split into ~100 smaller files
Downloaded HMMER 3 Pfams (email request)
  Containing 11098 Pfams
Ran hmmscan on genbeo
  4 days later
  12.5 M Pfam predictions
    ○ Some sequences contain >1 Pfam
  9643 Pfams
Used “cluster” to group genes and samples
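The step between the hmmscan predictions and the clustering is a Pfam-by-sample count matrix. The following is a minimal sketch of how such a matrix could be assembled; the function name, the `sample_of` lookup from sequence ID to sample, and the toy IDs are assumptions, since the slides do not show how sequences were assigned to samples.

```python
from collections import Counter, defaultdict


def count_matrix(hits, sample_of):
    """Build a Pfam x sample matrix of prediction counts.

    hits: iterable of (sequence_id, pfam_id) pairs, one per prediction.
    sample_of: dict mapping sequence_id -> sample name (hypothetical lookup).
    Returns (pfam_ids, sample_names, matrix) with rows and columns sorted.
    """
    counts = defaultdict(Counter)
    for seq_id, pfam in hits:
        counts[pfam][sample_of[seq_id]] += 1
    samples = sorted(set(sample_of.values()))
    pfams = sorted(counts)
    # Counter returns 0 for samples in which a Pfam was never predicted.
    matrix = [[counts[p][s] for s in samples] for p in pfams]
    return pfams, samples, matrix
```

The resulting rows (Pfams) and columns (samples) are what a clustering tool would then reorder.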
Results
[Figure: heatmap; axes: GOS Metagenomic Samples, Pfams]
Red = above avg. number of Pfams
Green = below avg. number of Pfams
Have not normalized for:
  Number of sequences per sample
  Number of Pfams
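The normalization that has not yet been applied could look something like the sketch below. It is an assumption, not the author's procedure: counts are divided by the number of sequences in each sample, and each cell is then labeled red or green relative to the average of its own Pfam row (the slides do not say over what the average is taken). The function names are hypothetical.

```python
def normalize_by_sample_size(matrix, seqs_per_sample):
    """Divide each count by the number of sequences in that sample (column)."""
    return [
        [count / n for count, n in zip(row, seqs_per_sample)]
        for row in matrix
    ]


def color_rows(matrix):
    """Label each cell relative to its row mean.

    'red' = above the row average, 'green' = below, 'black' = equal.
    """
    colored = []
    for row in matrix:
        mean = sum(row) / len(row)
        colored.append([
            "red" if v > mean else "green" if v < mean else "black"
            for v in row
        ])
    return colored
```

With raw counts `[[2, 1]]` the first sample looks enriched, but if that sample has 4 sequences and the second has 1, the normalized row is `[[0.5, 1.0]]` and the colors flip.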
Example of phage Pfams clustering together
Future Community Profiling
Include other (all) metagenomic samples
Try to group Pfams by GO category to see how strong the correlation is between branch length and function
Examine if some functional categories are more easily predicted by this profiling strategy (i.e. HGTs)
Identify novel gene families and sub-families
  Clustering genes, building HMMs, scanning, … repeat.
  Community profiling may help in annotation of these