Module 3 Sequence and Protein Analysis (Using web-based tools)

Working with Pathogen Genomes - Uruguay 2008

PSU Projects

Organism

Annotated genome

Finished genome

Database entry

Artemis & ACT

Annotation using Artemis: mapping domains in proteins

Primary DNA sequence

Pre-annotation: Dotter, BlastN, BlastX, gene finders, tRNA scan

Features found: repeats, pseudogenes, rRNA, CDSs, tRNA

Protein-level searches: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM

Manual curation

Annotated sequence: gene model annotation, protein function

Annotation of Protein-coding genes: (from gene model to protein function)

- Search programs: local (BLAST) and global (FASTA) alignments, EST hits

- Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)

- Transmembrane / signal peptide prediction (TMHMM, SignalP, Phobius)

- Base annotation on characterised proteins where possible (manually curated SWISSPROT entry)

- Read the literature (PUBMED)

Use several lines of evidence!

Annotation of non-protein-coding genes: (tRNAs, rRNAs, snRNAs, other ncRNAs)

- Initial searches: BlastN, GC-plots, tRNA scan, sno scan, others

- Search in specialised databases: Rfam scan, microRNAdb etc.

- Comparative ncRNA prediction tools: RNAZ, Evofold, QRNA etc.

- Structure prediction of ncRNAs: MFOLD, others

Use several lines of evidence!

Structural conservation of ncRNAs!

Statistical significance of database hits: E-values (Expectation value)

E-value = number of alignments with an equivalent or higher score that you would expect to find by random chance.

An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to occur by random chance.

The E-value is more reliable than the % ID.

Caution: Repeat regions / non-curated protein sequences
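The relationship between score, search space and E-value can be sketched numerically. The slides do not give a formula; the sketch below assumes the standard Karlin-Altschul relation used for BLAST bit scores, E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. It ignores the effective-length corrections that real BLAST applies, and the function name and the numbers are hypothetical.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a bit score in a search space of m * n.

    Assumes E = m * n * 2 ** (-S'), without edge-effect corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical query of 300 residues against a 10-million-residue database.
    for score in (20, 40, 60):
        print(score, expected_hits(score, 300, 10_000_000))
```

Under this relation the same bit score becomes less significant as the database grows: doubling the database length doubles the E-value.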

Sequence similarity searching: BLAST (Basic Local Alignment Search Tool) analysis:

Nucleotide query sequences:

blastn: nucleotide query vs nucleotide database

blastx: nucleotide query translated in all 6 frames vs protein database

tblastx: translated nucleotide query vs translated nucleotide database (all 6 frames)

Protein query sequences:

blastp: protein query vs protein database

tblastn: protein query vs translated nucleotide database (all 6 frames)
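The "all 6 frames" step used by the translated searches can be illustrated with a short sketch. It assumes the standard genetic code and an unambiguous DNA sequence; the function names and the example sequence are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Translate three frames on the forward strand and three on the reverse."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translations is then compared against the protein database, which is why a translated search can find a coding region without knowing its strand or reading frame in advance.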

FastA:

Provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.

FASTA (Global)

BLAST (Local)
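The difference between a local and a global alignment can be shown with the two classic dynamic-programming scores: Smith-Waterman (local) and Needleman-Wunsch (global). This is only a sketch of the concept. Neither BLAST nor Fasta is implemented this way (both rely on heuristics), and the scoring values, function name and example sequences are arbitrary assumptions.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score with a linear gap penalty.

    local=True  -> Smith-Waterman: best-scoring pair of sub-sequences.
    local=False -> Needleman-Wunsch: both sequences aligned end to end.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    long_seq, motif = "AAAAMPLKHRGGGG", "MPLKHR"
    print("local :", alignment_score(long_seq, motif, local=True))
    print("global:", alignment_score(long_seq, motif, local=False))
```

A short shared region embedded in otherwise unrelated sequence scores well locally but poorly globally, because the global score also pays for every unaligned residue.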

Orthologues and paralogues

Orthologues: Human hemoglobin and Mouse hemoglobin

- Originate from evolution (speciation)
- Similar functions

Paralogues: Human hemoglobin and Human myoglobin

- Originate from gene duplication
- Diverged functions

Best tool to look for orthologues? Blast or FastA?

FastA!

Functional assignment: alignments of modular proteins

[Diagram: three modular proteins with domain architectures A-B, A-B-C and A-B-C]

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered as the simplest dynamic Bayesian network.

WHAAAAT???

HMMs

Aligned sequences:

..HMPLKHRLHP..

..RMPLKHRPHP..

..GMRLKHRHHP..

..PMGLKHAGHP..

Profile:

..-MPLKHR-HP..

HMM for the aligned motif that can be used to search databases for proteins containing this motif
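How a profile is derived from the aligned sequences can be sketched in a few lines. The example below is a simplification: it builds a position-specific profile with match columns only, whereas a real profile HMM also models insertions and deletions. The 50% consensus cut-off, the pseudocount, the uniform background frequency of 1/20 and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# The four aligned sequences from the slide.
ALIGNED = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]


def column_counts(seqs):
    """Count residues in each alignment column."""
    return [Counter(col) for col in zip(*seqs)]


def consensus(seqs, min_fraction=0.5):
    """Most common residue per column, or '-' where no residue reaches the cut-off."""
    out = []
    for counts in column_counts(seqs):
        residue, n = counts.most_common(1)[0]
        out.append(residue if n / len(seqs) >= min_fraction else "-")
    return "".join(out)


def profile_score(seqs, candidate, pseudocount=1.0):
    """Log-odds score (bits) of a candidate against the column frequencies."""
    score = 0.0
    for counts, residue in zip(column_counts(seqs), candidate):
        freq = (counts[residue] + pseudocount) / (len(seqs) + 20 * pseudocount)
        score += math.log2(freq / (1 / 20))  # uniform background assumed
    return score


if __name__ == "__main__":
    print(consensus(ALIGNED))
    print(profile_score(ALIGNED, "RMPLKHRFHP"))
    print(profile_score(ALIGNED, "AAAAAAAAAA"))
```

With these settings the consensus reproduces the profile line shown on the slide, and a sequence that shares the conserved positions scores higher than an unrelated one.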

Remote homology detection

• FastA
• Blast
• Psi-blast
• HMM searches
• HMM-HMM comparison: HHPred server

http://toolkit.tuebingen.mpg.de/hhpred

• Psi-blast
• HMM searches

..-MPLKHR-HP..

Create HMM

Search database with HMM

..RMPLKHRFHP..

..PMPLKHRIHP..

..HMPLKHDVHP..

..YMDLKHELHP..

..-MPLKHR-HP..

• HMM-HMM comparison: HHPred server

http://toolkit.tuebingen.mpg.de/hhpred
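The loop of building a model, searching the database and rebuilding the model from the new hits can be sketched as a toy iterative search. This illustrates the idea behind Psi-blast-style searching only; it is not PSI-BLAST or an HMM search. It assumes ungapped sequences of equal length, a uniform background, and an arbitrary score threshold of 13 bits. The unrelated database entry and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

SEED = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]
# The four hits from the slide plus one hypothetical unrelated sequence.
DATABASE = ["RMPLKHRFHP", "PMPLKHRIHP", "HMPLKHDVHP", "YMDLKHELHP", "WWWWWWWWWW"]


def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores (bits) against a uniform background."""
    total = len(seqs) + 20 * pseudocount
    profile = []
    for col in zip(*seqs):
        counts = Counter(col)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total * 20)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum of the column scores for an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(profile, seq))


def iterative_search(seed, database, threshold, max_rounds=5):
    """Rebuild the profile from all members until no new hits are found."""
    members = list(seed)
    for _ in range(max_rounds):
        profile = build_profile(members)
        new_hits = [s for s in database
                    if s not in members and score(profile, s) >= threshold]
        if not new_hits:
            break
        members.extend(new_hits)
    return members


if __name__ == "__main__":
    print(iterative_search(SEED, DATABASE, threshold=13, max_rounds=1))
    print(iterative_search(SEED, DATABASE, threshold=13))
```

In this toy data set the most diverged sequence falls below the threshold in the first round and is only picked up after the profile has been rebuilt with the first-round hits, which is the reason iteration helps with remote homologues.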

Psi-blast

HMM building

HMM-HMM comparison

Alignment

Secondary structure prediction

Secondary structure comparison

Extremely sensitive remote homology detection

3D structure modelling

Input protein sequence

Module 3 Exercises:

Section A:

• Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
• BLAST and Fasta searches by cutting & pasting the sequence

Section B:

Exercise 1 Part I:

• Search PROSITE server by cutting & pasting the cyclophilin sequence

Exercise 1 Part II:

• Pfam server

Exercise 1 Part III:

• SMART server

Exercise 1 Part IV:

• InterPro server

Exercise 2:

• Sequence retrieval of P. falciparum PFC0125w protein using SRS
• TMHMM v2.0 server
• SignalP v3.0 server

Section C:

• Other web resources
