Theory and Application of Multiple Sequence Alignments
Brett Pickett, PhD
a.k.a What is a Multiple Sequence Alignment,
How to Make One, and What to Do With It
History
• Structure of DNA discovered (1953)
• First (phage) genome determined in 1977
• Human genome project begun in 1990
• First free-living organism (Haemophilus influenzae, H.i.) sequenced in 1995
• Human “Rough draft” completed in 2000
– NHGRI (public) vs. J. Craig Venter (private)
• Used “super” computer to put human genome together in right order
What is a Genome?
• Genetic material required for organism to replicate
– Eukaryotes (Humans): # chromosomes
– Prokaryotes (Bacteria): 1 chromosome
– Viruses: “what’s a chromosome?”
– Human (3.2 Gb): 10 trillion cells in human body × 2 m of DNA per cell
• 780,000 times around Earth
• 67.8 roundtrips to the sun
– Bacteria (580 kb- 10 Mb)
– Virus (3.5 kb – 1.3 Mb)
http://www.rsc.org/chemsoc/timeline/pages/2001.html
Why are Genomes so Important?
• Encode all organismal functions
– DNA -> RNA -> protein
• Unique to each organism
– Find differences (mutations) only by comparing genomes with each other
www.thednastore.com/images/cells/mrdna1.jpg
How are Sequences Made?
1. Make lots of copies of original sequence (PCR)
2. Put the copies into a machine to make even more copies
3. Fluorescent (glow-in-the-dark) bases get incorporated randomly into new DNA molecule
4. Laser detects glowing bases and tells the computer the order of bases = sequence
http://bjpsbiotech.edublogs.org/files/2007/12/electropherogram.jpg
What’s the Next Step?
• After sequence is determined, then what?
• Make sense of it by comparing with other related (homologous) sequences
– Multiple Sequence Alignment
What is an Alignment?
• Lining up related (homologous) positions
– Allows comparison
Unaligned
Aligned
Comparing Sequences (Genomes)
• All DNA contains a unique genetic “fingerprint”
• Similarity reveals
– Related function
– Shared evolutionary history
education.vetmed.vt.edu/.../FINGERPRINT.jpg
Aligning with Computational Methods
• Computers can’t “see” patterns
– Use math to find best alignment by assigning scores
– Match
– Mismatch
– Gap
• Internal – Insertion / deletion (indel)
• Terminal – Missing information?
What is a Gap?
• Allows bases to be lined up even if sequences are different lengths
– Insertions / deletions (indels)
• Impossible to tell which sequence has lost (gained) information
– Terminal gaps
• Sequence is either naturally shorter or artificially cut off
Mismatches Gaps
Nucleotide Alignment
• Custom Scores
– Match
– Mismatch
– Gap-opening penalty
• Penalized for not having letter (begin a gap)
• Why?
– Gap-extension penalty
• Little or no penalty for lengthening a gap
• Why?
– Scores balance between mismatch & gap
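To make the four custom scores concrete, here is a minimal sketch that adds them up for two sequences that are already lined up. The function name `score_alignment` and the default values (match 5, mismatch -2, gap opening -4, gap extension 0) are assumptions chosen for illustration, not the settings of any particular alignment program.

```python
def score_alignment(top, bottom, match=5, mismatch=-2, gap_open=-4, gap_extend=0):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_in = None  # which row holds the currently open gap (None = no open gap)
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First position of a gap pays gap_open, later positions pay gap_extend.
            total += gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            total += match if a == b else mismatch
            gap_in = None
    return total


print(score_alignment("AATC", "A-TC"))    # one gap of length 1
print(score_alignment("AATTC", "A--TC"))  # one gap of length 2
print(score_alignment("AATC", "ATCC"))    # no gaps, two mismatches
```

With these values the one-base gap and the two-base gap score the same, because lengthening a gap is free. One common rationale for the "Why?" questions is that a single insertion or deletion event can add or remove several bases at once, so starting a gap is treated as the costly step.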
Dynamic Programming
• Used to calculate alignment
– Breaks a very complicated process into smaller steps
– Helps computers to solve the problem faster
[Figure: scoring grid with Sequence 1 and Sequence 2 on its two axes; steps labeled "Math" and "Read"]
http://www.myspacepimper.com/images/232763/Disney-s-Goofy-Baking-a-Cake.htm
Manual Alignment
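Sequence 1 (A A T C) runs across the top, Sequence 2 (A T C) down the side.
Each cell lists the three candidate scores (from the left, from the diagonal, from above) and then the best one.

            A                  A                  T                   C
     0      0                  0                  0                   0
A    0      -4, 5, -4 → 5      1, 5, -4 → 5       1, -2, -4 → 1       -3, -2, -4 → -2
T    0      -4, -2, 1 → 1      -3, 3, 1 → 3       -1, 10, -3 → 10     6, -1, -6 → 6
C    0      -4, -2, -3 → -2    -6, -1, -1 → -1    -5, 1, 6 → 6        2, 15, 2 → 15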
Match = 5 Mismatch = -2 Gap Opening = -4 Gap Extension = 0
Traceback: Follow the highest scores back to the beginning
Up or sideways = gap, diagonal = homology (line up)
A A T C
A - T C
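The fill-then-traceback procedure can be sketched in a few lines of Python. This is a simplified sketch, not a production aligner: the function name `align` is hypothetical, every gap step costs -4 (which is what the numbers in the hand-filled grid appear to do), and the first row and column are zero so that gaps at the start of either sequence are free.

```python
def align(seq1, seq2, match=5, mismatch=-2, gap=-4):
    """Fill the score grid, then trace back one best alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    grid = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diag = grid[i - 1][j - 1] + s   # line the two letters up
            left = grid[i][j - 1] + gap     # gap in seq2
            up = grid[i - 1][j] + gap       # gap in seq1
            grid[i][j] = max(diag, left, up)

    # Traceback from the bottom-right corner, preferring the diagonal on ties.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 and j > 0:
        s = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if grid[i][j] == grid[i - 1][j - 1] + s:
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1]); i -= 1; j -= 1
        elif grid[i][j] == grid[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1]); i -= 1
    # Letters left over at the start are paired with free leading gaps.
    while j > 0:
        top.append(seq1[j - 1]); bottom.append("-"); j -= 1
    while i > 0:
        top.append("-"); bottom.append(seq2[i - 1]); i -= 1
    return grid, "".join(reversed(top)), "".join(reversed(bottom))


grid, top, bottom = align("AATC", "ATC")
for row in grid:
    print(row)
print(top)
print(bottom)
```

For AATC against ATC the bottom-right cell comes out as 15. This sketch places the gap opposite the first A, whereas the hand traceback draws it opposite the second A; the two A's are identical, so the sequences alone cannot say which one is the extra base.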
Computer-Generated Alignment
• Much faster than we are
– 2 GHz = about 2 billion clock cycles per second
– Don’t get tired, make mistakes, or get hand cramps
Alignment Process
Types of Alignment
• Global
– Aligns entire sequence
– Permits gaps
– Forced even if sequences not homologous
• Local
– Aligns longest region possible with minimal (no) gaps
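The difference between the two types shows up in a small change to the scoring grid: a local method never lets a cell drop below zero and starts the traceback at the best cell instead of the corner. The sketch below follows the Smith-Waterman idea under simplifying assumptions (hypothetical function name `local_align`, one flat gap cost, made-up example sequences).

```python
def local_align(seq1, seq2, match=5, mismatch=-2, gap=-4):
    """Best-scoring local region of two sequences (flat gap cost)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    grid = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            grid[i][j] = max(0, grid[i - 1][j - 1] + s,
                             grid[i][j - 1] + gap, grid[i - 1][j] + gap)
            if grid[i][j] > best:
                best, best_pos = grid[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and grid[i][j] > 0:
        s = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if grid[i][j] == grid[i - 1][j - 1] + s:
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1]); i -= 1; j -= 1
        elif grid[i][j] == grid[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(seq2[i - 1]); i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("GGGGACGTCCCC", "TTACGTAA"))
```

Here only the shared ACGT stretch is reported and the unrelated flanking letters are left out, which a global alignment would have been forced to line up.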
Beware!
• The computer is not always right
– Alignments
• Optimal: highest score
• True: evolutionarily correct
– Can be improved
• Hard for computer to accurately place indels (gaps)
– Apply prior knowledge: codons

Nucleotide Sequence vs. Amino Acid Sequence
AAA CCC      Lys Pro
AA- ACC C    ??? Thr ?    (AA- could be Asn or Lys)
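The codon example can be checked with a short sketch that translates DNA three bases at a time. The table is the standard genetic code written with one-letter amino acid symbols (K = Lys, P = Pro, T = Thr, N = Asn); the helper names `translate` and `possible_amino_acids` are made up for this illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, one-letter amino acids ('*' = stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons; a codon containing a gap becomes '?'."""
    protein = []
    for k in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[k:k + 3], "?"))
    return "".join(protein)


def possible_amino_acids(codon):
    """Amino acids a codon with one '-' could encode, trying every base."""
    return sorted({CODON_TABLE[codon.replace("-", b)] for b in BASES})


print(translate("AAACCC"))           # the intact sequence
print(translate("AA-ACCC"))          # a gap inside the first codon
print(possible_amino_acids("AA-"))   # what the damaged codon might have been
```

A gap that is not a multiple of three shifts every codon after it, which is why placing indels on codon boundaries is a useful piece of prior knowledge for protein-coding sequences.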
BLAST
• Basic Local Alignment Search Tool
– Most frequently used alignment tool
– Local alignment of 1 sequence (query) against all known sequences (subjects) in database
• Uses a “heuristic” to reduce number of sequences it actually has to align
– Like using “Google” to find most homologous sequences
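The idea of a heuristic that narrows down what needs aligning can be sketched with a lookup of short words shared between the query and each database sequence. This is a toy illustration of the filtering idea only: real BLAST is considerably more involved, and the function names, the word length of 4, the hit threshold, and the sequences below are all made up.

```python
from collections import defaultdict


def build_index(database, k=4):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidates(query, index, k=4, min_hits=2):
    """Names of sequences sharing at least `min_hits` words with the query."""
    hits = defaultdict(int)
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for name in index.get(word, ()):
            hits[name] += 1
    return sorted(name for name, n in hits.items() if n >= min_hits)


database = {  # made-up sequences for illustration
    "subject_1": "ATGGCGTACGTTAGC",
    "subject_2": "TTTTTTTTTTTTTTT",
    "subject_3": "CCATGGCGTACGAAA",
}
index = build_index(database)
print(candidates("GGCGTACG", index))
```

Only the sequences returned by `candidates` would then go through the slower, full local alignment; everything else is skipped without ever being aligned.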
BLAST Input
BLAST Output
How Does This Impact Me?
• Human Microbiome project
– Sequence all bacteria in intestines
• Millions of bacteria in each gram of excrement
– Which ones make us sick? How different is flora between people?
• Ocean Virus Metagenomics project
– Try to get an idea of virus diversity across the globe
• Boat goes around N.A. collecting samples
– Billions of viruses in each gallon of seawater
How Does This Impact Me (cont’d)?
• Used to take swabs, grow colonies on agar
– Antimicrobial resistance in turkeys
• Sequencing removes middle step
• How to quickly assign genus and species to new sequences?
– BLAST
• Project: New Phage from ponds
Other Uses for Alignments
SNP Detection
• Single Nucleotide Polymorphism
– Genetic changes occurring in at least one sequence
– May have biological significance
• Antibiotic resistance
• Changes could avoid detection by immune system
• Cause of genetic disease (cystic fibrosis, CF)
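Once sequences are aligned, finding candidate SNPs amounts to scanning the alignment column by column. The sketch below assumes the alignment is given as a dictionary of equal-length strings; the function name `find_snps` and the three sequences are invented for illustration.

```python
def find_snps(alignment):
    """Return {position: {name: base}} for columns where the bases differ.

    Columns containing a gap ('-') are skipped: those are indels, not SNPs.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    snps = {}
    for col in range(length):
        column = {name: alignment[name][col] for name in names}
        bases = set(column.values())
        if "-" not in bases and len(bases) > 1:
            snps[col + 1] = column  # report 1-based positions
    return snps


alignment = {  # made-up aligned sequences
    "seq_A": "ATGCCGTA",
    "seq_B": "ATGCTGTA",
    "seq_C": "ATG-CGTA",
}
for position, column in find_snps(alignment).items():
    print(position, column)
```

In this example position 5 is reported (seq_B carries a T where the others have C), while position 4 is skipped because seq_C has a gap there.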
Phylogenetic Trees
• Computer generated by:
– Examining alignment
– Looking for shared mutations
• Show relationship(s) between sequences
– History of sequences
• Where they came from
• Genetic changes that have occurred
[Figure: phylogenetic tree of sequences CY065067, CY061195, CY065107, GU562458, CY065059, CY098563, CY098130, CY065011, CY061578, with Clade, Node, Leaf, and Branch labeled. Source: iOS Phylogram App (Free)]
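One simple way a program can start turning an alignment into a tree is to measure how different each pair of sequences is and join the most similar pair first. The sketch below shows only that first step, using the fraction of differing positions as the distance. It is a distance-based shortcut, not the shared-mutation analysis itself, and the names `p_distance`, `closest_pair`, and the sequences are assumptions made up for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared columns (no gap in either) where the bases differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_pair(alignment):
    """The two sequences with the smallest distance, i.e. the first pair a
    distance-based method such as UPGMA would join."""
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]))


alignment = {  # made-up aligned sequences
    "seq_A": "ATGCCGTA",
    "seq_B": "ATGCTGTA",
    "seq_C": "TTGCTGAA",
}
for x, y in combinations(sorted(alignment), 2):
    print(x, y, p_distance(alignment[x], alignment[y]))
print(closest_pair(alignment))
```

Here seq_A and seq_B differ at one position in eight, so they would be grouped into a clade before seq_C is attached.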
Recombination
• Can occur in all types of organisms
– Eukaryotes
– Prokaryotes
– Viruses
• May change characteristic of organism
– Make you sick (or not)
– Not recognized by immune system
– Fast way of getting lots of genetic changes
[Figure labels: Breakpoint, RdRP, Genome 1, Genome 2, Daughter Sequence, Major Parent, Minor Parent]
Reassortment
• Chromosomes (segments) from one organism replace those from another
– May change characteristic of organism
• Make you sick (or not)
• Not recognized by immune system
• Fast way of getting lots of genetic changes
Other Analysis Options
• Align Sequences
• Look for genetic changes (genotype) that are associated with traits (phenotype)
– Host
– How sick it makes you
– Drug resistance
– Inherited disease
• Do any mutations consistently accompany the traits?
– Genome Wide Association Studies
http://lovestats.wordpress.com/dman/
How Does an Alignment Get a Score?
• Amino acids
– Identical >> Similar >> Dissimilar
Score Lookup Table (Matrix)
Symmetrical Positive Scores on Diagonal (Matches)
Some Mismatches get Negative Scores
Some Mismatches don’t
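A score lookup table of this kind can be sketched as a small dictionary. The numbers below are illustrative values chosen for this sketch; real work would use a published matrix such as BLOSUM62. The names `PAIR_SCORES`, `pair_score`, and `score_protein_alignment` are hypothetical.

```python
# Illustrative scores for four amino acids (L = Leu, I = Ile, D = Asp, K = Lys).
PAIR_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("K", "K"): 5,  # matches: diagonal
    ("L", "I"): 2,    # similar pair (both hydrophobic): mismatch with a positive score
    ("L", "D"): -4,   # dissimilar pair: mismatch with a negative score
    ("L", "K"): -2, ("I", "D"): -3, ("I", "K"): -3, ("D", "K"): -1,
}


def pair_score(a, b):
    """Symmetric lookup: the score of (a, b) equals the score of (b, a)."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]


def score_protein_alignment(top, bottom):
    """Sum the table scores over every aligned pair (no gaps in this sketch)."""
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


print(score_protein_alignment("LIDK", "LIDK"))  # identical
print(score_protein_alignment("LIDK", "ILDK"))  # similar substitutions
print(score_protein_alignment("LIDK", "DKLI"))  # dissimilar substitutions
```

The three totals fall in the order identical, similar, dissimilar, and only half the table needs storing because the lookup is symmetrical.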