Making Sense of DNA and protein sequence analysis tools (course #2) http://www.ncbi.nlm.nih.gov/Class/minicourses/ Dave Baumler Genome Center of Wisconsin, UW-Madison [email protected]

Post on 20-Jan-2016


Page 1: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Making Sense of DNA and protein sequence analysis tools (course #2)

http://www.ncbi.nlm.nih.gov/Class/minicourses/

Dave Baumler

Genome Center of Wisconsin, UW-Madison

[email protected]

Page 2: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Today's session: an overview

You have been given a 5 kb piece of DNA sequence

GENSCAN: find any exons in the DNA sequence and generate a predicted protein sequence

ScanProsite: scan the protein sequence for domains/motifs/patterns found in the PROSITE database

BLASTP: run a BLASTP search against the Swiss-Prot database, find some of the best matches (hits), and copy each protein sequence into a Word document for the alignment

MultAlin: conduct protein sequence alignments from the BLASTP search

Page 3: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

In this session you will try out 4 different tools. Lots of other tools exist:

http://bioinformatics.ca/links_directory/

Page 4: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Where are the coding regions?

TCAGCGAAGATGAGATAGTTTTTAAAGGTGGGATTTCCCCACCTTTAAAAAGCGAGAAGTCCCGGTTTTAAAGAGGAGTAAAATCCTCTTTTTCTAGCCCACTCAGGTGGTTTTTTTGGTTTTCGCTCCTTGCCGCATCTTCTGTGCCTTTGATGGCGGCTGGTTGGGGTGAAAGGCTGCATATTCCAGAATTTCAGACAGTAGATTGTTTTTGAAATCTTCCGTTTTATCGTTGACGAACTTAACCATCCTGTTGAAATCATCTTCCTTTGATACACCTTCAGGAAATGCCTTAGGAACTGATGTTTGGCTATCCAAGGCATCTTGCAATATCTGCACGATCTCCGAATTCATTGATCGCCCATTGGCCTTTGCTCTGGCGGCAACTGCGTCACGCATACCGTCAGGCATCCTAACTGTAAATCTCTCAATGAAAGCTGGATCTTCTTTTTCAGTCATCATCTTAAACCATAAAAATTTATACAAAACACACTAGCATCATATTGACATTACCCACAATGACATCATAATGGTGTCAGGCATCAAAATGATGTCATCATGACAAGGGGAAAGTAAATGCAAGATGTTCTCTATACAGGTCGTAAGAACGACAGCTTTCAGCTTCGTCTGCCTGAGCGAATGAAAGAAGAGATCCGTCGCATGGCAGAGATGGACGGCATTTCGATTAATTCTGCAATCGTGCAGCGCCTTGCTAAAAGCTTGCGTGAGGAAAGAGTTAATGGGCAGTAAAAACAGCGAAGCCCGGAAGTGTGGGGACACTAACCGGGCTTCTAATGTCAGTTACCTAGCGGGAAACCAACAATGACCAGTATAGCAATCTTTGAAGCAGTAAACACTATCTCTCTTCCATTCCACGGACAGAAGATCATAACTGCGATGGTGGCGGGTGTGGCGTATGTGGCAATGAAGCCCATCGTGGAAAACATCGGTTTAGACTGGAAGAGCCAGTATGCCAAGCTCGTTAGTCAGCGTGAAAAGTTCGGGTGTGGTGATATCACCATACCTACCAAAGGTGGTGTTCAGCAGATGCTTTGCATCCCTTTGAAGAAACTGAATGGATGGCTCTTCAGCATTAACCCAGCAAAAGTACGTGATGCAGTTCGTGAAGGTTTAATTCGCTATCAAGAAGAGTGTTTTACAGCTTTGCACGATTACTGGAGCAAAGGTGTTGCAACGAATCCCCGGACACCGAAGAAACAGGAAGACAAAAAGTCACGCTATCACGTTCGCGTTATTGTCTATGACAACCTGTTTGGTGGATGCGTTGAATTTCAGGGGCGTGCGGATACGTTTCGGGGGATTGCATCGGGTGTAGCAACCGATATGGGATTTAAGCCAACAGGATTTATCGAGCAGCCTTACGCTGTTGAAAAAATGAGGAAGGTCTACTGATTGGCGTATTGGAAGGCGCAAAAAGAAAAGCCAGCAGATGGGCTGCTGGCATTCATTGGGTATATGAACTTTCGGAGAACATATGAAGTCAATTATCAAGCATTTTGAGTTTAAGTCAAGTGAAGGGCATGTAGTGAGCCTTGAGGCTGCAAGCTTTAAAGGCAAGCCAGTTTTTTTAGCAATTGATTTGGCTAAGGCTCTCGGGTACTCAAATCCGTCA

Page 5: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

GeneMark.hmm: a statistical model

Page 6: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Exon prediction in eukaryotic DNA using GENSCAN: the net result is a protein sequence

GENSCAN looks for start and stop codons, promoters, splice sites, and polyA signals, and provides statistics for coding potential
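
To make the idea of start and stop codons concrete, here is a minimal sketch that scans both strands for open reading frames (ATG through the next in-frame stop). It is not how the gene finder on this slide works: that tool uses a statistical model and also handles splice sites and promoters. The function names and the min_codons cutoff are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end) for each ATG...stop run.

    start and end are 0-based offsets on the strand being read, and end
    points just past the stop codon. min_codons (hypothetical cutoff)
    counts the stop codon as well.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
        print(orf)
```

A scan like this works reasonably for intron-free (prokaryotic) genes; eukaryotic exons need the extra signals listed above.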

Page 7: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

GENSCAN results

Page 8: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

GENSCAN results

Page 9: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,
Page 10: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

I have a protein sequence, now what?

-Amos Bairoch (creator of SWISS-PROT) created a collection of small, well-conserved segments (patterns) to classify and analyze new proteins

-PROSITE is the name he gave to this pattern database

-PROSITE also contains profiles, which describe every position of a protein family

-ScanProsite is a server that compares your protein to the PROSITE database

-If your protein contains a PROSITE pattern, the match can give you a good indication of its function, although short patterns can also match by chance

Page 11: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

ScanProsite: What does it look for in the protein sequence?

-profiles of protein families

-conserved patterns in the sequence ([RK]-x-[ST])

-cofactor binding motifs

-substrate binding motifs

Around the world there are ~8 other major collections of domains, such as InterProScan, CD server, or Pfscan
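
A pattern such as [RK]-x-[ST] reads "R or K, then any residue, then S or T". As a rough sketch of what a pattern scan does, the code below converts PROSITE-style pattern syntax into a regular expression and reports every match. The function names are assumptions for illustration, and the real ScanProsite server also matches profiles, which this sketch does not attempt.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as '[RK]-x-[ST]' to a regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                     # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # N- or C-terminal anchor
    return regex


def scan(protein, pattern):
    """Return (1-based position, matched text) for every match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("[RK]-x-[ST]"))
    print(scan("MKRTSAAKLT", "[RK]-x-[ST]"))
```

Note how a three-residue pattern matches several times in a ten-residue toy sequence, which is why very short patterns need careful interpretation.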

Page 12: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,
Page 13: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

ScanProsite results continued

Page 14: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Sequence Similarity Searches using BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

ATGGAACTGACTCCAAGAGAAAAAGACAAACTATTACTGTTTACCGCTGCACTGCTGGCAGAGCGTCGTCTGGCCCGCGGCCTGAAACTTAACTATCCCGAATCCGTGGCCCTGATTAGCGCTTTTATAATGGAGGGCGCTCGCGACGGCAAAAGCGTCGCTGCGCTGATGGAAGAAGGACGGCATGTCCTGAGTCGCGAGCAGGTCATGGAAGGCATACCAGAAATGATCCCCGATATCCAGGTCGAAGCCACCTTTCCGGACGGCTCCAAGCTGGTTACCGTCCATAATCCGATAATCTGA

-Useful when you have a region of sequenced DNA and want to know what the encoded protein does

-If you can find similar sequences you can say, “if something is true for that sequence, it is probably true for mine as well.”

-Characterizing a protein could take years in the lab; searching a database for similarity can take only seconds

The sequence above is an unknown gene sequence used in the next few slides

Page 15: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

BLAST = Basic Local Alignment Search Tool

“The most popular data mining tool ever”

The different types of BLAST

BLASTN: DNA sequence vs. DNA sequence database

BLASTP: protein sequence vs. protein sequence database

BLASTX: DNA sequence translated in 6 reading frames vs. protein sequence database

tBLASTX: DNA sequence translated in 6 reading frames vs. DNA sequence database translated in 6 frames
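
"Translated in 6 reading frames" means three frames on the given strand plus three on the reverse complement. The sketch below shows that step with the standard genetic code; the function names are assumptions, and incomplete or ambiguous codons are simply written as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGAACTGACTCCAAGA").items():
        print(label, protein)
```

BLASTX and tBLASTX then search with each of these six protein strings, so you do not need to know the correct frame or strand in advance.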

Page 16: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Steps to use BLAST

#1) Paste sequence here

#2) Choose search set

(Either nucleotide collection or Protein Data Bank)

#3) Select program to use

#4) Push the BLAST button
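
The same four choices (sequence, search set, program, submit) can also be expressed as a web request. The sketch below only builds the request URL and does not send it. It assumes the NCBI BLAST URL API parameter names CMD, PROGRAM, DATABASE and QUERY and the database name swissprot; check these against the current NCBI documentation before relying on them. The function name is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(sequence, program="blastp", database="swissprot"):
    """Build (but do not send) a BLAST submission URL.

    Mirrors the four steps on the slide:
    QUERY = pasted sequence, DATABASE = search set,
    PROGRAM = program to use, CMD=Put = pushing the BLAST button.
    """
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    }
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_blast_request("MELTPREKDKLLLFTAALLAERRLARGLKLNYPESVALISAF"))
```

For long sequences a POST request is the safer choice, and a submission of this kind returns a request ID that has to be polled for results, which this sketch leaves out.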

Page 17: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

BLAST output #1

This is the length of your query (in this case it was nucleotides)

The number of sequences in the database

The number of letters (base pairs) in the database

Red, pink, and green are good matches

Page 18: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

How good is your BLAST hit?

The bit score: a normalized measure of alignment quality (the higher the score the better; matches with scores below 50 are unreliable)

E-value: the number of matches with a score at least this good that you would expect to find by chance in a database of this size. The lower (closer to zero) the better; matches with E-values above 0.001 are close to the “twilight zone”

Click here next to get to this GenBank entry
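
The bit score and the E-value are linked by a simple relationship: E = m x n x 2^(-S'), where S' is the bit score, m is the query length and n is the total number of letters in the database. The sketch below is a worked version of that formula. BLAST itself applies length corrections, so the E-values it reports will differ somewhat; the function name and the example numbers are assumptions for illustration.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-bit_score).

    query_length (m) and database_length (n) are counted in letters.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical example: a 300-letter query against a database of 1e9 letters.
    for score in (30, 50, 100):
        print(score, expected_chance_hits(score, 300, 10**9))
```

Each extra bit of score halves the E-value, while doubling the database size doubles it. This is why the same alignment can look more or less significant depending on which database you search.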

Page 19: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

A GenBank file

Name of the gene (ureC)

Product

Function

Organism from which the sequence was characterized

List of annotated features

Structural annotation

Page 20: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

Once you find some protein sequences with BLAST, copy and paste them into Word or a text editor

Note: each sequence needs a FASTA header as its first line: a ">" followed by a name, such as the organism name
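
If you prefer to build the file with a script instead of pasting by hand, a FASTA record is just a ">" header line followed by the sequence. The sketch below writes records that way; the function name, the line width of 60 and the example names are assumptions for illustration.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text.

    Each record starts with a '>' header line, followed by the sequence
    wrapped to 'width' characters per line.
    """
    lines = []
    for name, sequence in records:
        lines.append(">" + name)
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names and toy sequences, not real BLAST hits.
    hits = [
        ("Organism_one", "MELTPREKDKLLLFTAALLAERRLARGLKLNYPESVALISAF"),
        ("Organism_two", "MELTPREKDKLLLFTAALLAERRLARGLKLNYPEAVALISAF"),
    ]
    print(to_fasta(hits), end="")
```

Keeping header names short and free of spaces makes the labels in the alignment output easier to read.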

Page 21: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

MultAlin: conduct protein sequence alignments from the BLASTP search

B Asx Aspartic acid or Asparagine

Z Glx Glutamine or Glutamic acid
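
Once the sequences are aligned, one simple way to summarize the result is a consensus line that keeps a residue only where most sequences agree. The sketch below does this for sequences that are already aligned (same length, "-" for gaps). It is not MultAlin's algorithm, and the function name and threshold are assumptions for illustration.

```python
from collections import Counter


def consensus(aligned, threshold=0.9):
    """Return a consensus string for already-aligned sequences.

    A column contributes its most common residue when that residue is not
    a gap and occurs in at least 'threshold' of the sequences; otherwise
    the column is written as '.'.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(aligned) >= threshold:
            result.append(residue)
        else:
            result.append(".")
    return "".join(result)


if __name__ == "__main__":
    # Toy alignment, not real data.
    print(consensus(["MELTP", "MELSP", "MKLTP"], threshold=1.0))
```

Columns that stay conserved across distant organisms are good candidates for functionally important residues, such as the motifs reported by ScanProsite.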

Page 22: Making Sense of DNA and protein sequence analysis tools (course #2)  Dave Baumler Genome Center of Wisconsin,

It's your turn

http://www.ncbi.nlm.nih.gov/Class/minicourses/

Choose Course #2: Making sense of DNA and protein sequences

Questions to consider as you work through these exercises:

#1) What aspects of the tools/resources are confusing or problematic? What questions do you think your students would have?

#2) How can we design similar exercises for our classes that are more compelling? How can we make the students more engaged, invested and motivated to learn?

#3) As a group, compile additional resources/websites that might be even better or more intuitive than the NCBI tools.