Module 3 Sequence and Protein Analysis (Using web-based tools)
Working with Pathogen Genomes - Uruguay 2008
Primary DNA sequence
Dotter, BlastN, BlastX
Gene finders
tRNA scan
Repeats, pseudogenes, rRNA, CDSs, tRNA
Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
Pre-annotation, manual curation
Annotated sequence
Annotation of Protein-coding genes: (from gene model to protein function)
- Search programs: local (BLAST) and global (FASTA) alignments, EST hits
- Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)
- Transmembrane / signal peptide prediction (TMHMM, SignalP, Phobius)
- Base annotation on characterised proteins where possible (manually curated SWISSPROT entry)
- Read the literature (PUBMED)
Use several lines of evidence!
Annotation of non-protein-coding genes: (tRNAs, rRNAs, snRNAs, other ncRNAs)
- Initial searches:
  - BlastN, GC-plots
  - tRNA scan
  - sno scan
  - Others
- Search in specialised databases:
  - Rfam scan
  - microRNAdb etc.
- Comparative ncRNA prediction tools:
  - RNAZ
  - Evofold
  - QRNA etc.
- Structure prediction of ncRNAs:
  - MFOLD
  - Others
Use several lines of evidence!
Structural conservation of ncRNAs!
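A GC-plot, as listed under the initial searches, is a sliding-window measure of G+C content along the sequence. One common reason to use it for ncRNA hunting is that structural RNA genes can stand out as GC-richer regions in an AT-rich genome. The following is a minimal sketch; the function names, window size and step are illustrative assumptions, not the settings of any particular tool.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_plot(seq, window=120, step=20):
    """Return (window start, GC fraction) pairs along the sequence.

    window and step are hypothetical defaults chosen for illustration.
    """
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-only stretch followed by a GC-only stretch.
    toy = "AT" * 10 + "GC" * 10
    for start, gc in gc_plot(toy, window=10, step=10):
        print(start, gc)
```

Windows whose GC fraction is well above the genome average are candidates to inspect with the other lines of evidence listed above.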
Statistical significance of database hits: E-values (Expectation value)
E-value = number of alignments with an equivalent or higher score that you would expect to find by random chance.
An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to occur by random chance.
More reliable than the % ID
Caution: Repeat regions / non-curated protein sequences
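To make the scaling of an E-value concrete: for a normalised bit score S', the expected number of chance alignments is approximately E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below applies that relation directly. The function name and the example numbers are illustrative assumptions, and real BLAST output uses effective (corrected) lengths, so reported values will differ somewhat.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST itself applies effective-length corrections.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 100-residue query against 10 million residues.
    for score in (30, 40, 50):
        print(score, expected_chance_hits(score, 100, 10_000_000))
```

Two things follow from the formula: every additional bit of score halves the E-value, and the same alignment gets a larger E-value when searched against a larger database.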
Sequence similarity searching: BLAST (Basic Local Alignment Search Tool) analysis:
Nucleotide queries:
blastn: nucleotide query vs nucleotide database
blastx: nucleotide query translated in all 6 frames vs protein database
tblastx: translated nucleotide query vs translated nucleotide database (all 6 frames)
Protein queries:
blastp: protein query vs protein database
tblastn: protein query vs translated nucleotide database (all 6 frames)
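The choice of BLAST program depends only on the query type, the database type, and whether sequences are translated. The sketch below encodes that lookup and shows what "all 6 frames" means in practice: three reading-frame offsets on the forward strand and three on the reverse complement. It assumes the standard genetic code; all function and variable names are illustrative and not part of any BLAST distribution.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: CODE[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# (query type, database type) -> program, without double translation.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a program; translate_both selects tblastx for DNA vs DNA."""
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


def reverse_complement(dna):
    """Reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
    print(six_frame_translations("ATGGCCTAA"))
```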
FastA:
Provides sequence similarity and homology searching against nucleotide and protein
databases using the Fasta programs. Fasta can be very specific when identifying long
regions of low similarity especially for highly diverged sequences.
Orthologues and paralogues
Orthologues: human hemoglobin and mouse hemoglobin
Originate from speciation (evolution); similar functions
Paralogues: human hemoglobin and human myoglobin
Originate from gene duplication; diverged functions
Best tool to look for orthologues? Blast or FastA?
FastA!
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered as the simplest dynamic Bayesian network.
WHAAAAT???
HMMs
Aligned sequences:
..HMPLKHRLHP..
..RMPLKHRPHP..
..GMRLKHRHHP..
..PMGLKHAGHP..
Profile:
..-MPLKHR-HP..
HMM for the aligned motif that can be used to search databases for proteins containing this motif
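The step from aligned sequences to a searchable profile can be sketched in a few lines. The code below derives the consensus line from the four aligned sequences and builds a simple position-specific log-odds profile that can be slid along a candidate protein. This is a deliberately simplified stand-in for a profile HMM: it has no insert or delete states, and the uniform background, the pseudocount and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

ALIGNED = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_counts(aligned):
    """Residue counts for each column of an ungapped alignment."""
    return [Counter(column) for column in zip(*aligned)]


def consensus(aligned, min_fraction=0.5):
    """Most common residue per column, '-' if none reaches min_fraction."""
    result = []
    for counts in column_counts(aligned):
        residue, n = counts.most_common(1)[0]
        result.append(residue if n / len(aligned) >= min_fraction else "-")
    return "".join(result)


def log_odds_profile(aligned, pseudocount=1.0):
    """Per-column log2 odds against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    return [
        {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        for counts in column_counts(aligned)
    ]


def best_window(profile, sequence):
    """Slide the profile along the sequence; return (best score, start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids score 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    print(consensus(ALIGNED))
    profile = log_odds_profile(ALIGNED)
    print(best_window(profile, "AAAARMPLKHRFHPAAAA"))
```

A real HMM search (for example against Pfam) additionally models insertions and deletions and reports statistically calibrated scores, which is why the dedicated servers are used in the exercises.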
Remote homology detection
• FastA
• Blast
• Psi-blast
• HMM searches
• HMM-HMM comparison: HHPred server
http://toolkit.tuebingen.mpg.de/hhpred
• Psi-blast
• HMM searches
..-MPLKHR-HP..
Create HMM
Search database with HMM
..RMPLKHRFHP..
..PMPLKHRIHP..
..HMPLKHDVHP..
..YMDLKHELHP..
..-MPLKHR-HP..
• HMM-HMM comparison: HHPred server
http://toolkit.tuebingen.mpg.de/hhpred
Psi-blast
HMM building
HMM-HMM comparison
Alignment
Secondary structure prediction
Secondary structure comparison
Extremely sensitive remote homology detection
3D structure modelling
Input protein sequence
Module 3 Exercises:
Section A:
• Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
• BLAST and Fasta searches by cutting & pasting the sequence.
Section B:
Exercise 1 Part I:
• Search PROSITE server by cutting & pasting the cyclophilin sequence
Exercise 1 Part II:
• Pfam server
Exercise 1 Part III:
• SMART server
Exercise 1 Part IV:
• InterPro server
Exercise 2:
• Sequence retrieval of P. falciparum PFC0125w protein using SRS.
• TMHMM v2.0 server.
• SignalP v3.0 server.
Section C:
• Other web resources