Sequence Analysis
Hemant Kelkar
Center for Bioinformatics
University of North Carolina
Chapel Hill, NC 27599
Scope of Series
Talk I
• Overview and BLAST
Talk II
• Protein analysis/Sequence Alignment
Talk III
• Evolution
• Genomics and challenges
Bioinformatics
• Mathematical, statistical and computational methods used to solve biological problems
• Glue that holds the “omics” data together
Help …
• Is “my sequence” in the databases?
• Is it similar to any sequence in the DB?
• Does it have any known motifs/domains that can help in identification?
• Is there a structural homolog?
• Are there any polymorphisms?
• Genetic map location?
Bioinformatics TOOLS!
Bioinformatics Tools
• Genetic Code
• Protein Structure
• Protein Evolution
Similarity search e.g. BLAST, FASTA
http://restools.sdsc.edu/biotools/biotools9.html
e.g. CLUSTALW, T-COFFEE, Phylip
Primary Sequence Databases
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
• PIR (http://pir.georgetown.edu/)
• Swiss-Prot (http://us.expasy.org/sprot/)
Sequence information as generated in the laboratory
Derived Sequence Databases
• Pfam (http://www.sanger.ac.uk/Software/Pfam/) : Protein families based on hidden Markov models (HMMs)
• InterPro (http://www.ebi.ac.uk/interpro/) : Protein families and domains based on functional sites
• TRANSFAC (http://www.gene-regulation.com/) : Transcription factor database
• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)
Databases based on functional or phylogenetic analysis
Derived Sequence Databases
• Flybase (http://www.flybase.org/) : Fly Genome
• Wormbase (http://www.wormbase.org/) : C. elegans
• Genome Browser (http://genome.ucsc.edu/) : Human and Mouse
• MGI (http://www.informatics.jax.org/) : Mouse
• Microbial Genome Resource (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl)
Databases based on taxonomy
Sequence Alignments
• Provide a measure of relatedness between nucleotide or protein sequences
• This allows us to decipher:
Structural relationships
Functional relationships
Evolutionary relationships
Sequence Similarity Searches
• Information conserved evolutionarily
• DNA sequences NOT coding for proteins/rRNAs diverge rapidly
• When possible use protein sequences for similarity searches
• Non-homologous protein identification is much less reliable
• What is measured and what is inferred?
Similarity
• Is always based on an observable
• Usually expressed as % identity
• Quantifies the divergence of two sequences
• substitutions/insertions/deletions
• Residues crucial for structure and/or function
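As a rough illustration of % identity as an observable, the sketch below counts identical columns in an already-aligned pair. The function name `percent_identity` and the choice of denominator (all alignment columns, gaps included) are assumptions made for this example; other conventions divide by the shorter sequence or by ungapped columns only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if both rows carry the same residue;
    # a gap character ("-") never counts as a match.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


# Toy alignment: 4 identical columns out of 6 (one substitution, one gap).
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```

Note that the number describes similarity only; whether the two sequences are homologous is an inference drawn from it.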
Homology
• Homology always implies that the molecules share a common ancestor
• Absolute answer
• Molecules ARE or ARE NOT homologous
• No degrees
How to Find Similar Sequences
• Global Sequence Alignments
• Sequence comparison along entire length
• Homolog of similar length
• Local Sequence Alignments
• Similar regions in two sequences
• Regions outside the local alignment excluded
• Sequences of different length/similarity
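The difference between global and local alignment can be sketched with one dynamic-programming routine. The slides do not name an algorithm; this example assumes the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap cost, and the scoring values (+5, -4, -6) are illustrative choices, not values prescribed here.

```python
def align_score(a, b, match=5, mismatch=-4, gap=-6, local=False):
    """Best alignment score of a and b (global by default, local if requested)."""
    cols = len(b) + 1
    # First row: global alignment pays for leading gaps, local alignment does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    # Global: score of aligning both sequences end to end.
    # Local: best-scoring region found anywhere in the table.
    return best if local else prev[-1]


a, b = "TTTTACGTTTTT", "GGACGGG"
print(align_score(a, b, local=True))  # only the shared ACG region is scored
print(align_score(a, b))              # the unrelated flanks are penalised
```

For two sequences of different length that share only a short region, the local score stays positive while the global score is dragged down by the flanking mismatches and gaps.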
Dotplot
Scoring Matrices
• Empirical weighting schemes
• Considers important biology
• Side chain chemistry/structure/function
• Functional/Structural Conservation
• Ile/Val – small and hydrophobic
• Ser/Thr – both polar
• Size/Charge/Hydrophobicity
Nucleotide Matrix
     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5
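A small sketch of how the nucleotide matrix is applied: each aligned pair of bases is looked up in the matrix (+5 for a match, -4 for a mismatch) and the values are summed. The names `NUC_MATRIX` and `ungapped_score` are hypothetical and used only for this example.

```python
BASES = "ACGT"

# Build the 4 x 4 matrix from the slide: +5 on the diagonal, -4 elsewhere.
NUC_MATRIX = {
    x: {y: (5 if x == y else -4) for y in BASES}
    for x in BASES
}


def ungapped_score(seq1: str, seq2: str) -> int:
    """Sum matrix scores over two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(NUC_MATRIX[x][y] for x, y in zip(seq1, seq2))


print(ungapped_score("ACGT", "ACGT"))  # four matches
print(ungapped_score("ACGT", "ACGA"))  # three matches, one mismatch
```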
PAM Scoring Matrices
• Margaret Dayhoff (1978)
• Point accepted mutations (PAM)
• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments
• New side chains must function similarly
• 1 PAM = 1 AA change per 100 AA
• 1 PAM ~ 1 % Divergence
BLOSUM Matrices
• Henikoff and Henikoff (1992)
• Blocks Substitution Matrices
• Differences in conserved ungapped regions
• Directly calculated, no extrapolations
• Sensitive to structural/functional subs
• Generally perform better for local similarity searches
Scoring Matrix – BLOSUM62
BLOSUM n
• Calculated from sequences sharing no more than n% identity
• Sequences with more than n% identity are clustered and weighted to 1
• Reducing the value of “n” yields more divergent/distantly related sequences
• BLOSUM62 used as default by many of the online search sites
Matrices and more
PAM Matrices (Altschul, 1991)
PAM 40      Short alignments                   >70%
PAM 120                                        >50%
PAM 250     Longer, weaker local areas         >30%
BLOSUM Matrices (Henikoff, 1993)
BLOSUM 90   Short alignments                   >60%
BLOSUM 80                                      >50%
BLOSUM 62   Commonly used                      >35%
BLOSUM 30   Longer, weaker local alignments
Gaps
• Compensate for insertions and deletions
• Improve alignments
• Must be kept to a reasonably small number
• 1 per 20 residues is logical
• Need a different scoring scheme
Gap Penalties
• Penalty for gap introduction
• Penalty for Gap extension
Deductions for gap = G + Ln
where
                               Nuc   Prot
G = gap-opening penalty          5     11
L = gap-extension penalty        2      1
n = length of gap
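The gap deduction G + Ln can be written out directly. The sketch below reads the slide's values as G = 5, L = 2 for nucleotides and G = 11, L = 1 for proteins; the function name `gap_deduction` is hypothetical.

```python
def gap_deduction(n: int, gap_open: int, gap_extend: int) -> int:
    """Deduction for one gap of length n: G + L * n."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * n


# One gap of length 3 under each scheme.
print(gap_deduction(3, gap_open=5, gap_extend=2))   # nucleotide values
print(gap_deduction(3, gap_open=11, gap_extend=1))  # protein values
```

Because the opening penalty is paid once per gap, one long gap costs less than several short gaps of the same total length, which keeps the number of gaps small.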
BLAST
• Basic Local Alignment Search Tool
• Seeks high-scoring segment pair (HSP)
• Sequences that can be aligned w/o gaps
• have a maximal aggregate score
• score must be above score threshold S
• Many HSPs reported for ungapped BLAST
BLAST Algorithms
Program    Query                  Target
BLASTN     Nucleotide             Nucleotide
BLASTP     Protein                Protein
BLASTX     Nucleotide (6-frame)   Protein
TBLASTN    Protein                Nucleotide (6-frame)
TBLASTX    Nucleotide (6-frame)   Nucleotide (6-frame)
Neighborhood Words
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
Query Word (W = 3)
Neighborhood Score Threshold (T = 8)
STL   13   (= 4 + 5 + 4)
SAL    8
SNL    8
SVL    8
SBL    7
SCL    7
SDL    7
Etc.
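The neighborhood of the query word STL can be enumerated as a sketch. Only the three BLOSUM62 rows needed for S, T and L are embedded, typed from memory, so they should be checked against an authoritative copy of the matrix; keeping words that score at or above T (rather than strictly above) is also an assumption. All names here are hypothetical.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for S, T and L only (assumed values; verify before real use).
ROWS = {
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
}
SCORES = {q: dict(zip(AMINO, row)) for q, row in ROWS.items()}


def word_score(query_word: str, candidate: str) -> int:
    """Sum of pairwise substitution scores, position by position."""
    return sum(SCORES[q][c] for q, c in zip(query_word, candidate))


def neighborhood(query_word: str, threshold: int) -> dict:
    """All candidate words scoring at least `threshold` against the query word."""
    hits = {}
    for letters in product(AMINO, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


hood = neighborhood("STL", threshold=8)
for word in ("STL", "SAL", "SNL", "SVL"):
    print(word, hood[word])
print("SCL" in hood)  # scores 7, below T = 8
```

This sketch only works for query words built from S, T and L, since no other matrix rows are included. The ambiguity code B from the slide's list is left out because only the 20 standard residues are enumerated.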
High-Scoring Segment Pairs
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
Extension
Significance Decay
• Mismatches
• Gap penalties
[Figure: cumulative score plotted against extension, with X, S and T marked]
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS
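The extension step can be sketched as follows: starting from a seed hit, the alignment is extended one position at a time while the cumulative score is tracked, and extension stops once the score has decayed more than X below the best value seen. This is a minimal ungapped, rightward-only sketch using nucleotide-style scores (+5/-4) for brevity; the function names and the example sequences are made up for illustration.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Score one aligned pair of letters."""
    return match if a == b else mismatch


def extend_right(query, subject, q_end, s_end, seed_score, x_drop):
    """Extend a seed hit to the right without gaps.

    q_end and s_end are the positions just past the seed in each sequence.
    Returns (best cumulative score, number of positions kept in the extension).
    """
    score = best = seed_score
    best_len = 0
    i = 0
    while q_end + i < len(query) and s_end + i < len(subject):
        score += pair_score(query[q_end + i], subject[s_end + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            # Score has decayed more than X below the maximum: stop extending.
            break
    return best, best_len


query = "ACGTACGTTTTT"
subject = "ACGTACGAGGGG"
# Seed: the first four letters match exactly (4 x 5 = 20).
print(extend_right(query, subject, q_end=4, s_end=4, seed_score=20, x_drop=10))
```

The extension is trimmed back to the position of the maximum score, so the trailing mismatches that triggered the stop are not part of the reported segment.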
Karlin-Altschul Equation
E = kmN e^{-λs}
m    Number of letters in query
N    Number of letters in db
mN   Size of search space
λs   Normalized score
k    Minor constant
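The equation is easy to evaluate numerically. In the sketch below the values of λ and k are placeholders chosen only for illustration (real values depend on the scoring matrix and gap penalties), and the function name `expect_value` is hypothetical.

```python
import math


def expect_value(score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = k * m * N * exp(-lambda * s): expected number of chance hits."""
    return k * m * n * math.exp(-lam * score)


# Placeholder parameters, assumed for illustration only.
LAMBDA, K = 0.3, 0.1
query_len, db_len = 250, 1_000_000

for s in (30, 60, 90):
    print(s, expect_value(s, query_len, db_len, LAMBDA, K))
```

E grows linearly with the size of the search space mN and falls exponentially with the score, so the same score is less significant against a larger database.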
http://www.ncbi.nlm.nih.gov